Check disk-io

Overview

Checks disk bandwidth over a period of time. The check tracks the maximum bandwidth observed per disk and alerts if the bandwidth over the last n measurements is above a certain percentage of that maximum (by default 80/90% over the last 5 measurements). This works similarly to Load5, but at the disk I/O level.

On Linux, the check plugin by default tries to find "important" disks automatically and returns perfdata only for those, so that it does not waste disk space in a time series database with unnecessary disk information (as earlier versions did). To do this, it looks for disks that are mounted on a directory.

The assumed maximum disk I/O always starts at 10 MiB/sec. The check stores the highest measured bandwidth and adjusts the RWmax/s value accordingly. For this reason, the check takes some time to warm up its (cached) readings: it will throw some warnings and criticals during the first major disk activity above 10 MiB/sec until the maximum bandwidth of the disk has been determined.
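The warm-up behaviour can be illustrated with a minimal sketch. It is not the plugin's actual code: the function names and the sample readings are hypothetical, and it assumes the saturation is computed against the maximum known before the current reading is taken into account.

```python
MIB = 1024 * 1024
INITIAL_RWMAX = 10 * MIB  # starting value named in the text: 10 MiB/sec


def update_rwmax(stored_max, current_total):
    """Return the highest bandwidth seen so far (bytes/sec)."""
    return max(stored_max, current_total)


def saturation_percent(current_total, rwmax):
    """Express a bandwidth reading as a percentage of the known maximum."""
    return current_total * 100.0 / rwmax


rwmax = INITIAL_RWMAX
# Hypothetical total bandwidth readings in bytes/sec, one per check run.
for reading in (2 * MIB, 9 * MIB, 40 * MIB, 38 * MIB):
    pct = saturation_percent(reading, rwmax)
    rwmax = update_rwmax(rwmax, reading)
    print(f'{reading / MIB:.1f} MiB/s is {pct:.0f}% of the previous max, '
          f'RWmax is now {rwmax / MIB:.1f} MiB/s')
```

In this sketch the third reading is far above the initial 10 MiB/sec, which is the kind of early spike that produces alerts until RWmax has caught up with the real capability of the disk.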

iowait (Linux only)

On Linux, the check also monitors the system-wide iowait percentage. iowait represents CPU time spent idle while waiting for I/O operations to complete. While technically a CPU metric, its diagnostic value is entirely in the disk I/O context, which is why it is part of this check rather than a separate one.

The raw iowait value is normalized by multiplying it with the number of logical CPUs, so that 100% always means one CPU core is fully I/O-saturated, regardless of the total number of CPUs. Values above 100% indicate that more than one core is waiting for I/O. This normalization approach is inspired by Glances, which uses 100 / N (where N = number of CPUs) as its critical threshold for raw iowait. The reason such thresholds appear low in Glances is that raw iowait is reported as a percentage of total CPU time across all cores: on a 4-core system, 25% raw iowait already means one entire core is doing nothing but waiting for I/O. By normalizing the value, the default thresholds (80/90%) work consistently across any hardware.
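The normalization described above is a single multiplication. The following sketch only shows the arithmetic; the function name is hypothetical, and `os.cpu_count()` is just one way to obtain the number of logical CPUs.

```python
import os


def normalize_iowait(raw_iowait_percent, logical_cpus):
    """Scale raw iowait so that 100% means one fully I/O-saturated core."""
    return raw_iowait_percent * logical_cpus


# 25% raw iowait on a 4-core system: one whole core is waiting for I/O.
print(normalize_iowait(25.0, 4))
# The same raw value on 16 cores means four cores are waiting.
print(normalize_iowait(25.0, 16))
# Normalizing a hypothetical raw reading of 3.0% for the local machine
# (os.cpu_count() may return None, hence the fallback to 1):
print(normalize_iowait(3.0, os.cpu_count() or 1))
```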

Like bandwidth alerts, iowait alerts only trigger after --count consecutive threshold violations, suppressing short spikes.
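One way to picture the "consecutive violations" rule is the sketch below. It is an assumption about the logic, not the plugin's implementation: the function name is hypothetical, and the history is taken to be a simple list of normalized iowait values, oldest first.

```python
def state_after_consecutive(values, warn, crit, count):
    """Return 'OK', 'WARN' or 'CRIT' based on the last `count` values.

    An alert is only raised if every one of the last `count` values
    violates the threshold, so a single spike is ignored.
    """
    if len(values) < count:
        return 'OK'  # not enough history yet
    recent = values[-count:]
    if all(v >= crit for v in recent):
        return 'CRIT'
    if all(v >= warn for v in recent):
        return 'WARN'
    return 'OK'


history = [95.0, 95.0, 10.0, 95.0, 95.0]  # hypothetical readings
print(state_after_consecutive(history, warn=80, crit=90, count=5))
```

Because the third reading drops below the warning threshold, the run of violations is interrupted and the sketch reports OK.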

Example

The (shortened) result of ./disk-io --count 5 --warning 80 --critical 90 could look like this:

iowait: 0.1%. /dev/dm-4: 0.0B/s read1, 48.7KiB/s write1, 48.7KiB/s total, 227.9MiB/s max

Name ! RWmax/s ! R1/s     ! W1/s     ! R5/s     ! W5/s     ! RW5/s              
-----+---------+----------+----------+----------+----------+--------------------
dm-0 ! 44.9MiB ! 42.8MiB  ! 17.2MiB  ! 23.1MiB  ! 18.6MiB  ! 36.3MiB [WARNING]  
dm-1 ! 10.0MiB ! 4.7KiB   ! 4.0KiB   ! 2.0KiB   ! 6.8KiB   ! 8.7KiB             
...

The first line shows the current iowait percentage, followed by the disk with the currently highest bandwidth usage.

The table columns mean:

  • RWmax: Here, a maximum bandwidth of 44.9 MiB/sec was determined.
  • R1, W1: The current bandwidth is 42.8 MiB/sec read and 17.2 MiB/sec write.
  • R5, W5: The mean bandwidth over the last 5 measured values is 23.1 MiB/sec read and 18.6 MiB/sec write.
  • First line in the table, RW5: Since a maximum bandwidth of 44.9 MiB/sec has been measured for this disk so far, a mean bandwidth (RW5) of 36.3 MiB/sec results in a warning (36.3 MiB/sec >= 44.9 MiB/sec * 80%). The current read value of 42.8 MiB/sec doesn't matter; it is only a peak. The check alerts because there is unusually high disk I/O over a certain amount of time.
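The arithmetic behind the dm-0 row can be checked with a short sketch. The function name is hypothetical and the comparison is an assumption based on the description above (mean bandwidth compared against a percentage of RWmax); values are in MiB/sec.

```python
def evaluate(rw_mean, rwmax, warn_pct=80, crit_pct=90):
    """Compare the mean bandwidth against percentages of the known maximum."""
    warn_at = rwmax * warn_pct / 100
    crit_at = rwmax * crit_pct / 100
    if rw_mean >= crit_at:
        return 'CRIT'
    if rw_mean >= warn_at:
        return 'WARN'
    return 'OK'


# dm-0 from the example: RWmax 44.9, RW5 36.3.
# The warning limit is 44.9 * 0.8 = 35.92, the critical limit 44.9 * 0.9 = 40.41.
print(evaluate(36.3, 44.9))
```

With these numbers the mean lies between the two limits, so the sketch yields a warning rather than a critical state.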

Hints:

  • --count=5 (the default) combined with a check interval of one minute means that the check reports an alert if any of your disks has been above a threshold over the last 5 minutes.
  • The check uses the SQLite database $TEMP/linuxfabrik-monitoring-plugins-disk-io.db to store its historical data.

Fact Sheet

Fact | Value
-----|------
Check Plugin Download | https://github.com/Linuxfabrik/monitoring-plugins/tree/main/check-plugins/disk-io
Check Interval Recommendation | Once a minute
Can be called without parameters | Yes
Compiled for Windows | Yes
3rd Party Python modules | psutil
Handles Periods | Yes
Uses SQLite DBs | $TEMP/linuxfabrik-monitoring-plugins-disk-io.db

Help

usage: disk-io [-h] [-V] [--always-ok] [--count COUNT] [--critical CRIT]
               [--iowait-critical IOWAIT_CRIT] [--iowait-warning IOWAIT_WARN]
               [--match MATCH] [--top TOP] [--warning WARN]

Checks disk I/O bandwidth over time and alerts on sustained saturation, not
short spikes. The check records per-disk read/write counters and then derives
current (R1/W1) and period averages (R{COUNT}/W{COUNT}). It compares the
period’s total bandwidth against the maximum ever observed for that disk
(RWmax). WARN/CRIT trigger if the period average exceeds the configured
percentage of RWmax for COUNT consecutive runs. On Linux, the check also
monitors the system-wide iowait percentage (CPU time spent waiting for I/O).
The raw iowait value is normalized by multiplying it with the number of
logical CPUs, so that 100% always means one CPU core is fully I/O-saturated,
regardless of the total number of CPUs. This makes the default thresholds
(80/90%) work consistently across different hardware. Like bandwidth alerts,
iowait alerts require COUNT consecutive threshold violations. Perfdata is
emitted for each disk (busy_time, read_bytes, read_time, write_bytes,
write_time) and for iowait, so you can graph trends. On Linux the check
automatically focuses on "real" block devices with mountpoints; on Windows it
uses psutil’s disk counters. Optionally, `--top` lists the processes that
generated the most I/O traffic (read/write totals) to help identify offenders.
This check is cross-platform and works on Linux, Windows, and all psutil-
supported systems. The check stores its short trend state locally in an SQLite
DB to evaluate sustained load across runs.

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  --always-ok           Always returns OK.
  --count COUNT         Number of times the value must exceed specified
                        thresholds before alerting. Default: 5
  --critical CRIT       Threshold for disk bandwidth saturation (over the last
                        `--count` measurements) as a percentage of the maximum
                        bandwidth the disk can support. Default: >= 90
  --iowait-critical IOWAIT_CRIT
                        Set the critical threshold for normalized iowait in
                        percent (Linux only). The iowait value is normalized
                        so that 100% means one CPU core is fully
                        I/O-saturated. Values above 100% indicate that more
                        than one core is waiting for I/O. Default: >= 90
  --iowait-warning IOWAIT_WARN
                        Set the warning threshold for normalized iowait in
                        percent (Linux only). The iowait value is normalized
                        so that 100% means one CPU core is fully
                        I/O-saturated. Values above 100% indicate that more
                        than one core is waiting for I/O. Default: >= 80
  --match MATCH         Match on disk names. Uses Python regular expressions
                        without any external flags like `re.IGNORECASE`. The
                        regular expression is applied to each line of the
                        output. Examples: `(?i)example` to match the word
                        "example" in a case-insensitive manner.
                        `^(?!.*example).*$` to match any string except
                        "example" (negative lookahead). `(?: ... )*` is a non-
                        capturing group that matches any sequence of
                        characters that satisfy the condition inside it, zero
                        or more times. Default:
  --top TOP             List x "Top processes that generated the most I/O
                        traffic". Use `--top=0` to disable this feature.
                        Default: 5
  --warning WARN        Threshold for disk bandwidth saturation (over the last
                        `--count` measurements) as a percentage of the maximum
                        bandwidth the disk can support. Default: >= 80

Usage Examples

Check only disk dm-0 (if listed as /dev/dm-0):

./disk-io --match='.*dm-0$'

Match all disks except vdc, vdh and vdz:

./disk-io --match='^(?:(?!.*vdc|.*vdh|.*vdz).)*$'
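To try out a `--match` expression before using it, the two patterns above can be tested against a list of device names. This is a sketch under assumptions: the device names are made up, and `re.search` is used here for illustration, which may differ from how the plugin applies the expression internally.

```python
import re

# Hypothetical device names as they might be listed by the check.
disks = ['/dev/dm-0', '/dev/dm-10', '/dev/vda1', '/dev/vdc', '/dev/vdh1', '/dev/vdz']


def matching(pattern, names):
    """Return the names for which the regular expression matches."""
    return [name for name in names if re.search(pattern, name)]


# Only dm-0; the trailing `$` keeps dm-10 out.
print(matching(r'.*dm-0$', disks))
# Everything except names containing vdc, vdh or vdz.
print(matching(r'^(?:(?!.*vdc|.*vdh|.*vdz).)*$', disks))
```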

Example Output:

iowait: 0.1%. /dev/dm-8: 5.6KiB/s read1, 2.2MiB/s write1, 2.2MiB/s total, 10.0MiB/s max

Name ! MntPnts        ! DvMppr           ! RWmax/s ! R1/s   ! W1/s    ! R5/s   ! W5/s    ! RW5/s   
-----+----------------+------------------+---------+--------+---------+--------+---------+---------
dm-0 ! /              ! rl-root          ! 10.0MiB ! 0.0B   ! 426.0B  ! 0.0B   ! 343.0B  ! 343.0B  
vda2 ! /boot          !                  ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
vda1 ! /boot/efi      !                  ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
dm-5 ! /var           ! rl-var           ! 10.0MiB ! 0.0B   ! 586.0B  ! 0.0B   ! 1.1KiB  ! 1.1KiB  
dm-8 ! /data          ! rl-lv_data       ! 10.0MiB ! 5.6KiB ! 2.2MiB  ! 8.3KiB ! 2.3MiB  ! 2.3MiB  
dm-6 ! /tmp           ! rl-tmp           ! 10.0MiB ! 0.0B   ! 4.8KiB  ! 0.0B   ! 7.1KiB  ! 7.1KiB  
dm-7 ! /home          ! rl-home          ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
dm-2 ! /var/tmp       ! rl-var_tmp       ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
dm-4 ! /var/log       ! rl-var_log       ! 10.0MiB ! 0.0B   ! 51.8KiB ! 0.0B   ! 51.2KiB ! 51.2KiB 
dm-3 ! /var/log/audit ! rl-var_log_audit ! 10.0MiB ! 0.0B   ! 918.0B  ! 0.0B   ! 876.0B  ! 876.0B  

Top 5 processes that generate the most I/O traffic (r/w):
1. nfsd: 149.2GiB/5.7TiB
2. systemd: 695.7GiB/169.9GiB
3. systemd-journald: 33.9MiB/124.4GiB
4. icinga2: 7.9GiB/4.9GiB
5. rsyslogd: 114.8MiB/4.1GiB

States

  • WARN or CRIT if the bandwidth over the last n measured values is above a certain percentage of the all-time maximum bandwidth of this drive.
  • WARN or CRIT if iowait exceeds the threshold for --count consecutive runs (Linux only).

Perfdata / Metrics

Global:

Name | Type | Description
-----|------|------------
iowait | Percentage | System-wide iowait (Linux only).

Per (matched) disk, where <disk> is the block device name:

Name | Type | Description
-----|------|------------
<disk>_busy_time | Continuous Counter | Time spent doing actual I/Os (in milliseconds).
<disk>_read_bytes | Continuous Counter | Number of bytes read.
<disk>_read_time | Continuous Counter | Time spent reading from disk (in milliseconds).
<disk>_write_bytes | Continuous Counter | Number of bytes written.
<disk>_write_time | Continuous Counter | Time spent writing to disk (in milliseconds).
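Since these metrics are continuous counters, a graphing tool has to turn two consecutive samples into a rate. The sketch below shows the usual calculation; the function names and sample values are hypothetical, and treating a decreasing counter as a reset is an assumption, not documented plugin behaviour.

```python
def rate_per_second(prev_counter, curr_counter, seconds):
    """Derive a per-second rate from two samples of a continuous counter."""
    if seconds <= 0:
        raise ValueError('seconds must be positive')
    delta = curr_counter - prev_counter
    if delta < 0:
        # The counter went backwards, e.g. after a reboot: no usable rate.
        return 0.0
    return delta / seconds


def busy_percent(prev_busy_ms, curr_busy_ms, seconds):
    """Share of the interval the disk spent doing I/O, from busy_time (ms)."""
    return rate_per_second(prev_busy_ms, curr_busy_ms, seconds) / 1000.0 * 100.0


# Two hypothetical <disk>_read_bytes samples taken 60 seconds apart.
print(rate_per_second(1_000_000, 7_000_000, 60))  # bytes per second
# Two hypothetical <disk>_busy_time samples taken 60 seconds apart.
print(busy_percent(10_000, 40_000, 60))           # percent of the interval
```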

Troubleshooting

psutil raised error "not sure how to interpret line '...'" or the check reports "Nothing checked."
On kernel 4.18 or later, this check needs the Python module psutil v5.7.0+. Update the psutil library. On RHEL 8+, use at least python38 and python38-psutil if installing via dnf.

Credits, License