Linux server high load average — triage and fix
How to triage and resolve high load average on Linux servers causing slow responses and timeouts.
Key facts
- Topic: Production error triage
- Stack: Linux
TL;DR
A high load average means more tasks are runnable or blocked in uninterruptible I/O wait than the system can serve concurrently. On a 4-core server, a sustained load average of 4.0 means roughly one task per core: if the load is CPU-bound, the CPUs are fully utilised. Above that, tasks queue and response times degrade.
Understanding load average
uptime
# 10:32:05 up 45 days, load average: 12.50, 8.20, 4.10
# Format: 1-minute, 5-minute, 15-minute averages
Compare load to your CPU count:
nproc
# If the load stays above nproc, tasks are queuing and the system is overloaded
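To make the comparison concrete, the following is a minimal sketch that divides a load figure by the core count. The helper name `load_per_core` is hypothetical, and the 4-core count in the example call is an assumption for illustration (it matches the 4-core server mentioned earlier, not a measured value).

```shell
#!/bin/sh
# load_per_core LOAD CORES: print LOAD divided by CORES, one decimal place.
# (Hypothetical helper; a result above 1.0 means tasks are queuing.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load / cores }'
}

# Example: the 1-minute load from the uptime sample, on an assumed 4-core box.
load_per_core 12.50 4    # prints 3.1

# On a live system, the first field of /proc/loadavg is the 1-minute average:
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A ratio that stays well above 1.0 across the 1-, 5-, and 15-minute figures suggests a sustained problem rather than a short spike.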
Diagnosis workflow
Determine whether the load is CPU-bound or I/O-bound:
top -bn1 | head -5
# Look at wa (I/O wait) on the %Cpu(s) line: if it is high, the bottleneck is disk, not CPU
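If you prefer numbers over reading top output by eye, the sketch below computes busy and I/O-wait percentages from two samples of the aggregate `cpu` line in /proc/stat. The helper name `cpu_breakdown` is hypothetical. It assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal). I/O wait figures may be less meaningful inside containers or virtual machines.

```shell
#!/bin/sh
# cpu_breakdown BEFORE AFTER: each argument is a "cpu ..." line from /proc/stat.
# Prints the share of time spent busy and in I/O wait between the two samples.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
        NR == 2 {
            for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
            if (total == 0) { print "no samples"; exit }
            # Field 5 is idle, field 6 is iowait; everything else counts as busy.
            printf "busy=%.0f%% iowait=%.0f%%\n",
                100 * (total - d[5] - d[6]) / total, 100 * d[6] / total
        }'
}

# Live usage: sample the counters one second apart.
if [ -r /proc/stat ]; then
    before=$(grep '^cpu ' /proc/stat)
    sleep 1
    after=$(grep '^cpu ' /proc/stat)
    cpu_breakdown "$before" "$after"
fi
```

High `busy` with low `iowait` points at a CPU-bound load. The reverse points at disk.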
Find the top consumers:
# By CPU
ps aux --sort=-%cpu | head -15
# By I/O (iotop is a separate package and needs root; iostat is part of sysstat)
iotop -oPa
iostat -x 1 5
Check for processes blocked on I/O:
# Processes in D state (uninterruptible sleep, usually waiting on I/O)
ps aux | awk '$8 ~ /^D/'
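When many processes are stuck in D state, it can help to see which commands they belong to. This is a minimal sketch; the helper name `d_state_summary` is hypothetical. It only uses the first word of each command name, so names containing spaces are truncated.

```shell
#!/bin/sh
# d_state_summary: read "STATE COMMAND" lines on stdin and print a count of
# D-state tasks per command, highest count first.
d_state_summary() {
    awk '$1 == "D" { n[$2]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Live usage: the trailing "=" on each column suppresses the ps header line.
ps -eo state=,comm= | d_state_summary
```

Empty output means no process was in uninterruptible sleep at the moment of the snapshot. D state is often brief, so repeat the check a few times before drawing conclusions.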
Common causes and fixes
CPU-bound: A runaway application process, aggressive cron job, or unoptimised query processing.
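# Set the process niceness to 10 so other work is scheduled ahead of it
renice +10 -p <pid>
# Cap the process at 50% of one core (cpulimit is a separate package)
cpulimit -p <pid> -l 50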
I/O-bound: Disk-heavy operations (database, logging, backups running during peak traffic).
iostat -x 1
# Look for high %util on a device
Reschedule backups and heavy batch jobs to off-peak hours. Move write-heavy logging to a separate disk or use tmpfs for temporary files.
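When a heavy job cannot be moved entirely off-peak, running it at low CPU and I/O priority is one way to limit its impact. The wrapper below is a sketch; the name `run_low_priority`, the cron schedule, and the script path in the comment are hypothetical. It assumes `ionice` from util-linux is available and falls back to `nice` alone if it is not. The idle I/O class only takes effect with I/O schedulers that honour priorities (for example BFQ).

```shell
#!/bin/sh
# run_low_priority CMD [ARGS...]: run CMD with the lowest CPU priority and,
# where ionice works, the idle I/O scheduling class.
run_low_priority() {
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        ionice -c 3 nice -n 19 "$@"
    else
        nice -n 19 "$@"
    fi
}

# Harmless demonstration; substitute the real backup or batch command.
run_low_priority echo "backup would run here"

# Hypothetical crontab entry that runs a backup script at 02:30 instead of peak hours:
# 30 2 * * * /usr/local/bin/backup.sh
```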
Too many workers: PHP-FPM, Gunicorn, or Node.js running more CPU-bound workers than the server has cores. As a starting point, reduce the worker count to the number of CPU cores minus one, which leaves headroom for the OS, then tune under real load.
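The cores-minus-one rule can be computed at deploy time rather than hard-coded. This is a small sketch; the helper name `worker_count` is hypothetical, and the `app:app` module in the commented Gunicorn call is a placeholder.

```shell
#!/bin/sh
# worker_count [CORES]: print CORES minus one, never less than 1.
# Defaults to the core count reported by nproc.
worker_count() {
    cores=${1:-$(nproc)}
    workers=$((cores - 1))
    if [ "$workers" -lt 1 ]; then
        workers=1
    fi
    echo "$workers"
}

worker_count 4    # prints 3
worker_count 1    # prints 1

# Example use with Gunicorn (placeholder application module):
# gunicorn --workers "$(worker_count)" app:app
```

Treat the result as a starting value. Workers that spend most of their time waiting on a database or network may tolerate a higher count.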
Where Reflex helps
Reflex tracks load average, CPU utilisation, and I/O wait continuously. When load crosses a threshold, it can identify the top consumers, reschedule non-critical jobs, restart overloaded services, and alert your team with the full diagnostic context. See How it works.