Monitoring Linux Disks with Grafana & Prometheus

Why monitor disks?

Disks are among the most important components of a server, and they are a common bottleneck in production environments. Monitoring them helps you:

Detect a filling disk early, before it causes unexpected downtime.
Track read/write latency, a key metric for database servers.
Analyze I/O bottlenecks to tune application performance.
Raise alerts automatically when a dangerous threshold is crossed.

A widely used stack for this job is Prometheus + Node Exporter + Grafana, all of them open source and free.

Architecture overview

Before starting, it helps to understand the role of each component in the stack:

Component	Role	Default port
Node Exporter	Collects metrics from the Linux OS (CPU, RAM, disk, network)	9100
Prometheus	Scrapes (pulls) metrics from Node Exporter and stores them in a time-series DB	9090
Grafana	Connects to Prometheus and renders dashboards	3000
Alertmanager	Receives alerts from Prometheus and sends notifications via email/Slack/Telegram	9093
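
To make the table concrete, here is a small sketch that maps each component to a URL you could probe once it is running. The helper name health_url and the host argument are made up for illustration. The paths are the health endpoints these tools expose as far as I know (/-/healthy for Prometheus and Alertmanager, /api/health for Grafana, /metrics for Node Exporter), so verify them against the versions you install.

```shell
#!/bin/sh
# health_url is a hypothetical helper, not part of any of these tools.
# It prints the URL to probe for a component, using the default ports
# from the table above.
health_url() {
    component=$1
    host=${2:-localhost}
    case "$component" in
        node_exporter) echo "http://$host:9100/metrics" ;;
        prometheus)    echo "http://$host:9090/-/healthy" ;;
        grafana)       echo "http://$host:3000/api/health" ;;
        alertmanager)  echo "http://$host:9093/-/healthy" ;;
        *) echo "unknown component: $component" >&2; return 1 ;;
    esac
}

# Example loop (needs curl and running services, so it is left commented out):
# for c in node_exporter prometheus grafana alertmanager; do
#     curl -fsS -o /dev/null "$(health_url "$c")" && echo "$c: up" || echo "$c: down"
# done
```

Running the commented loop after finishing all the steps is a quick way to see which component is not answering on its port.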

Step 1: Install Node Exporter

Node Exporter must be installed on every server you want to monitor. Run the following commands:

# Download Node Exporter (v1.8.2 here; check the releases page for the current version)
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz

# Extract the archive
tar xvfz node_exporter-1.8.2.linux-amd64.tar.gz

# Copy the binary to /usr/local/bin
sudo cp node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/node_exporter

Create a systemd service for Node Exporter

Create a service file so that Node Exporter starts with the system:

sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nobody
Group=nogroup
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Check the status
sudo systemctl status node_exporter

Confirm that Node Exporter is running and exposing metrics:

curl -s http://localhost:9100/metrics | grep node_disk
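
The output of that command is plain text, one sample per line, so it is easy to pick values apart with standard tools. Below is a minimal sketch that extracts one metric for one device. The helper name metric_value and the sample values are invented for illustration; real values come from your own /metrics endpoint, and large numbers may be printed in scientific notation.

```shell
#!/bin/sh
# metric_value is a hypothetical helper: it reads Prometheus text exposition
# format on stdin and prints the value of one metric for one device.
metric_value() {
    metric=$1
    device=$2
    awk -v m="$metric" -v d="$device" '
        /^#/ { next }   # skip HELP/TYPE comment lines
        index($0, m "{") == 1 && index($0, "device=\"" d "\"") { print $NF; exit }
    '
}

# Illustrative sample only; on a real host, pipe in:
#   curl -s http://localhost:9100/metrics
sample='# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="sda"} 1048576
node_disk_read_bytes_total{device="sdb"} 2097152'

printf '%s\n' "$sample" | metric_value node_disk_read_bytes_total sdb   # prints 2097152
```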

Step 2: Install Prometheus

# Download Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xvfz prometheus-2.52.0.linux-amd64.tar.gz

# Create the user and directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Copy the binaries and console files
sudo cp prometheus-2.52.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.52.0.linux-amd64/promtool /usr/local/bin/
sudo cp -r prometheus-2.52.0.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-2.52.0.linux-amd64/console_libraries /etc/prometheus

Configure prometheus.yml

sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: 'server-01'
          env: 'production'

  # Add more servers here:
  # - targets: ['192.168.1.101:9100']
  #   labels:
  #     instance: 'server-02'
EOF
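
If you have more than a couple of servers, typing each target entry by hand gets tedious. The sketch below prints entries in the same shape as the commented example in the config, ready to paste under the node_exporter job's static_configs key. The helper name print_targets, the address=name argument format, and the second address are assumptions made for this example.

```shell
#!/bin/sh
# print_targets is a hypothetical helper: for each "address=instance" argument
# it prints one static_configs entry, indented to sit under the
# node_exporter job in prometheus.yml.
print_targets() {
    for pair in "$@"; do
        addr=${pair%%=*}    # text before the first "="
        name=${pair#*=}     # text after the first "="
        printf "      - targets: ['%s:9100']\n" "$addr"
        printf "        labels:\n"
        printf "          instance: '%s'\n" "$name"
    done
}

print_targets 192.168.1.101=server-02 192.168.1.102=server-03
```

After pasting the output into the file, running promtool check config /etc/prometheus/prometheus.yml is a cheap way to catch indentation mistakes before you start or restart Prometheus.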

Create a systemd service for Prometheus

sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus Monitoring
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --storage.tsdb.retention.time=30d

[Install]
WantedBy=multi-user.target
EOF

sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Step 3: Install Grafana

# Add the Grafana APT repository (Ubuntu/Debian)
sudo apt-get install -y apt-transport-https software-properties-common
wget -q -O - https://apt.grafana.com/gpg.key | sudo gpg --dearmor -o /usr/share/keyrings/grafana.gpg
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install Grafana
sudo apt-get update
sudo apt-get install grafana -y

# Enable and start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Open Grafana at http://<server-ip>:3000. The default account is admin / admin; Grafana asks you to set a new password on first login.

Step 4: Connect Prometheus to Grafana

Log in to Grafana and open Connections → Data sources.
Click Add data source and choose Prometheus.
In the Prometheus server URL field, enter: http://localhost:9090
Click Save & test. A green "Data source is working" message means the connection succeeded.
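
If you would rather not click through the UI on every new server, Grafana can also load data sources from provisioning files at startup. The following is a minimal sketch of such a file, not a complete setup: the helper name write_datasource is made up, and the scratch directory stands in for the real provisioning directory, which on a package install is typically /etc/grafana/provisioning/datasources.

```shell
#!/bin/sh
# write_datasource is a hypothetical helper that writes a Grafana data source
# provisioning file into the directory given as $1.
# $2 is the Prometheus URL (defaults to the one used in this step).
write_datasource() {
    dir=$1
    url=${2:-http://localhost:9090}
    mkdir -p "$dir"
    cat > "$dir/prometheus.yaml" << EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $url
    isDefault: true
EOF
}

# Try it against a scratch directory first; writing to the real
# provisioning directory needs root and a Grafana restart afterwards.
out=$(mktemp -d)
write_datasource "$out"
cat "$out/prometheus.yaml"
```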

Step 5: Import the "Node Exporter Full" dashboard

Instead of building a dashboard from scratch, import community dashboard ID 1860, one of the most popular dashboards with millions of downloads:

Go to Dashboards → New → Import.
Enter ID 1860 and click Load.
Select Prometheus as the data source.
Click Import. The dashboard appears immediately, with panels for CPU, RAM, disk, and network.

Key PromQL queries for disk monitoring

If you want to build custom panels, these are the most useful queries:

Disk space used (%)

100 - (
  (node_filesystem_avail_bytes{
    mountpoint="/",
    fstype!="tmpfs",
    device!~"/dev/loop.*"
  } * 100)
  / node_filesystem_size_bytes{
    mountpoint="/",
    fstype!="tmpfs"
  }
)
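
The arithmetic behind this query is simple enough to check by hand. Here is a small sketch with made-up byte counts; the helper name used_pct is illustrative. Because node_filesystem_avail_bytes counts only the space available to unprivileged users, the result treats root-reserved blocks as used and may read slightly higher than the Use% column of df.

```shell
#!/bin/sh
# used_pct mirrors the arithmetic of the query: 100 - avail * 100 / size.
# Both arguments are byte counts; the helper name is illustrative.
used_pct() {
    awk -v avail="$1" -v size="$2" 'BEGIN { printf "%.1f\n", 100 - (avail * 100) / size }'
}

# A 100 GiB filesystem with 25 GiB still available
used_pct 26843545600 107374182400    # prints 75.0
```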

Disk Read Throughput (MiB/s)

rate(node_disk_read_bytes_total{device!~"loop.*"}[5m]) / 1024 / 1024

Disk Write Throughput (MiB/s)

rate(node_disk_written_bytes_total{device!~"loop.*"}[5m]) / 1024 / 1024

Disk Read IOPS

rate(node_disk_reads_completed_total{device!~"loop.*"}[5m])

Disk Write IOPS

rate(node_disk_writes_completed_total{device!~"loop.*"}[5m])
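
All four throughput and IOPS queries rely on the same idea: take a counter that only goes up and divide its growth by the elapsed time. The sketch below shows that calculation on invented samples taken 300 seconds apart. It is a simplified stand-in for rate(), which also corrects for counter resets and extrapolates to the window edges. Dividing by 1024 twice yields MiB/s.

```shell
#!/bin/sh
# per_second: simplified stand-in for rate(): (later - earlier) / seconds.
# Unlike PromQL's rate(), it does not handle counter resets or extrapolation.
per_second() {
    awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.2f\n", (b - a) / dt }'
}

# Hypothetical counter samples taken 300 s (a 5m window) apart.
# Bytes read grew from 1 GiB to 4 GiB:
bytes_per_s=$(per_second 1073741824 4294967296 300)
awk -v r="$bytes_per_s" 'BEGIN { printf "%.2f MiB/s\n", r / 1024 / 1024 }'   # prints 10.24 MiB/s

# Reads completed grew from 10000 to 40000:
per_second 10000 40000 300   # prints 100.00 (IOPS)
```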

Disk I/O Latency – Read (ms)

rate(node_disk_read_time_seconds_total{device!~"loop.*"}[5m])
/ rate(node_disk_reads_completed_total{device!~"loop.*"}[5m])
* 1000

Disk I/O Latency – Write (ms)

rate(node_disk_write_time_seconds_total{device!~"loop.*"}[5m])
/ rate(node_disk_writes_completed_total{device!~"loop.*"}[5m])
* 1000
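
Both latency queries divide the time spent on I/O by the number of completed operations. Since the two rates use the same window, the window length cancels out and what remains is the average time per operation. A minimal sketch with invented deltas follows; the helper name avg_latency_ms is illustrative. Note the idle case: with no operations in the window the PromQL expression evaluates 0/0, which is NaN, and Grafana typically shows that as a gap in the graph.

```shell
#!/bin/sh
# avg_latency_ms: average time per completed operation over one window.
# $1 = growth of *_time_seconds_total, $2 = growth of *_completed_total.
avg_latency_ms() {
    awk -v t="$1" -v n="$2" 'BEGIN {
        if (n == 0) { print "NaN"; exit }   # idle disk: PromQL also yields NaN (0/0)
        printf "%.2f\n", t / n * 1000
    }'
}

avg_latency_ms 1.5 3000   # 1.5 s spent on 3000 reads, prints 0.50
avg_latency_ms 0 0        # no reads in the window, prints NaN
```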

Disk Utilization (%)

rate(node_disk_io_time_seconds_total{device!~"loop.*"}[5m]) * 100
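
It can help to see where this number comes from. As far as I understand, Node Exporter derives node_disk_io_time_seconds_total from the "milliseconds spent doing I/O" column of /proc/diskstats (field 13). The sketch below computes the same busy percentage from two snapshots of that file. The snapshot lines are invented, and util_pct is a hypothetical helper.

```shell
#!/bin/sh
# util_pct: disk busy percentage from two /proc/diskstats lines for the same
# device, taken dt seconds apart. Field 13 is milliseconds spent doing I/O.
util_pct() {
    first=$1; second=$2; dt=$3
    printf '%s\n%s\n' "$first" "$second" | awk -v dt="$dt" '
        NR == 1 { t1 = $13 }
        NR == 2 { printf "%.1f\n", ($13 - t1) / 1000 / dt * 100 }
    '
}

# Made-up snapshots of device sda, 10 s apart: I/O time grows by 2500 ms
s1="8 0 sda 1000 0 80000 400 2000 0 160000 900 0 5000 1300"
s2="8 0 sda 1200 0 96000 500 2600 0 208000 1200 0 7500 1800"
util_pct "$s1" "$s2" 10    # 2.5 s busy out of 10 s, prints 25.0
```

On a real host you could capture the two lines with grep ' sda ' /proc/diskstats before and after a sleep. Keep in mind that on devices that serve many requests in parallel (NVMe, RAID arrays), a value near 100% does not necessarily mean the device is saturated.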

Step 6: Set up disk-full alerts

Configure alert rules in Prometheus so that an alert fires automatically when disk usage exceeds 80%. Create the rule file:

sudo tee /etc/prometheus/alert_rules.yml << 'EOF'
groups:
  - name: disk_alerts
    rules:
      - alert: DiskSpaceWarning
        expr: |
          100 - ((node_filesystem_avail_bytes{fstype!="tmpfs",device!~"/dev/loop.*"} * 100)
          / node_filesystem_size_bytes{fstype!="tmpfs"}) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk filling up on {{ $labels.instance }}"
          description: "Partition {{ $labels.mountpoint }} is {{ $value | printf \"%.1f\" }}% used"

      - alert: DiskSpaceCritical
        expr: |
          100 - ((node_filesystem_avail_bytes{fstype!="tmpfs",device!~"/dev/loop.*"} * 100)
          / node_filesystem_size_bytes{fstype!="tmpfs"}) > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CRITICAL: disk almost full on {{ $labels.instance }}"
          description: "Partition {{ $labels.mountpoint }} is {{ $value | printf \"%.1f\" }}% used. Act now!"
EOF
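
To reason about which rule a given usage value triggers, here is a tiny sketch of the threshold logic. It is only an approximation of what Prometheus does: the helper name classify_usage is made up, the "for:" hold time is ignored, and above 90% both rules are active at once in Prometheus, whereas this sketch reports only the higher severity.

```shell
#!/bin/sh
# classify_usage: map a used-percentage to the highest severity the two rules
# would assign. It ignores the "for:" duration Prometheus waits before firing.
classify_usage() {
    awk -v p="$1" 'BEGIN {
        if (p > 90)      print "critical"
        else if (p > 80) print "warning"
        else             print "ok"
    }'
}

classify_usage 75.0   # ok
classify_usage 85.3   # warning
classify_usage 90.0   # warning (the rule uses > 90, not >=)
classify_usage 95.1   # critical
```

If receiving both a warning and a critical notification for the same partition is too noisy, Alertmanager's inhibit rules are the usual place to suppress the lower severity.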

Add the rule file to prometheus.yml as a top-level key, alongside global and scrape_configs:

rule_files:
  - "/etc/prometheus/alert_rules.yml"

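Validate the rules, then restart Prometheus to apply them. The unit file created earlier defines no ExecReload, so systemctl reload does not work with it:

promtool check rules /etc/prometheus/alert_rules.yml
sudo systemctl restart prometheus
# or reload without a restart via the HTTP API
# (works only if Prometheus was started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload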

Summary of key disk metrics

Metric	Meaning	Alert threshold
Disk Usage %	Percentage of capacity in use	Warning: 80% · Critical: 90%
Read/Write IOPS	Read/write operations per second	Depends on disk type (SSD/HDD)
Read/Write Latency	Response time per I/O operation	SSD: <5ms · HDD: <20ms
Disk Throughput	Read/write speed (MB/s)	Depends on hardware
Disk Utilization	% of time the disk is busy handling I/O	Warning: 70% · Critical: 90%

Optimization and security tips

Limit retention: Prometheus keeps 15 days of data by default. Use --storage.tsdb.retention.time=30d to adjust it to your needs.
Secure Node Exporter: Put it behind an Nginx reverse proxy with Basic Auth so that port 9100 is not exposed to the internet.
Monitor multiple servers: Install Node Exporter on each server, then add it as a target in prometheus.yml.
Use file_sd_configs: In large environments, file_sd_configs lets you add targets dynamically without restarting Prometheus.
Integrate Alertmanager: Pair Prometheus with Alertmanager to send alerts via Telegram, Slack, or email when a disk runs into trouble.
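
As a rough illustration of the file_sd_configs tip, the sketch below generates a target file in the JSON shape that file-based discovery reads: a list of objects with "targets" and optional "labels". The helper name write_file_sd, the file name nodes.json, and the addresses are assumptions. In prometheus.yml, the job would point at the file with something like file_sd_configs: - files: ['/etc/prometheus/targets/*.json'], where that directory is again just an example path.

```shell
#!/bin/sh
# write_file_sd is a hypothetical helper: it writes a file_sd target file
# listing every address passed after the output path, all on port 9100.
write_file_sd() {
    out=$1; shift
    {
        printf '[\n  {\n    "targets": ['
        sep=''
        for addr in "$@"; do
            printf '%s"%s:9100"' "$sep" "$addr"
            sep=', '
        done
        printf '],\n    "labels": { "env": "production" }\n  }\n]\n'
    } > "$out.tmp"
    mv "$out.tmp" "$out"    # rename so Prometheus never reads a half-written file
}

# Demo in a scratch directory
dir=$(mktemp -d)
write_file_sd "$dir/nodes.json" 192.168.1.101 192.168.1.102
cat "$dir/nodes.json"
```

Prometheus watches the listed files and picks up changes on its own, which is why no restart is needed when a server is added or removed this way.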

Conclusion

With just three open-source tools (Node Exporter, Prometheus, and Grafana) you now have a professional disk monitoring system that covers everything from dashboards to alerts. It is a solid foundation for extending monitoring to CPU, RAM, network, and other services across your server infrastructure. The whole stack is free and can scale to thousands of servers with no additional license costs.
