Linux server high load average — triage and fix

How to triage and resolve high load average on Linux servers causing slow responses and timeouts.

Key facts

Topic: Production error triage
Stack: Linux

TL;DR

A high load average means more tasks are runnable or blocked in uninterruptible (usually I/O) wait than the system can serve at once. On a 4-core server, a sustained load average of 4.0 corresponds to roughly one task per core, which means the CPUs are fully utilised if the load is all CPU demand. Above that, tasks queue and response times degrade.

Understanding load average

uptime
# 10:32:05 up 45 days, load average: 12.50, 8.20, 4.10
# Format: 1-minute, 5-minute, 15-minute averages

Compare load to your CPU count:

nproc
# If load stays above nproc, tasks are queuing and the system is overloaded
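
To put the two numbers side by side, the sketch below divides the 1-minute load by the core count. It is only an illustration: load_per_core is a made-up helper name, and the script assumes /proc/loadavg and nproc are available, as they are on most Linux systems.

```shell
# Hypothetical helper: print load divided by core count, to two decimals.
# Usage: load_per_core <load> <cores>
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Guarded so the function can be sourced on systems without /proc/loadavg.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    # The first three fields are the 1-, 5- and 15-minute averages.
    read -r one five fifteen rest < /proc/loadavg
    echo "1-min load per core: $(load_per_core "$one" "$(nproc)")"
fi
```

A value that stays above 1.00 suggests tasks are queuing. Because Linux also counts tasks in uninterruptible sleep, the figure alone does not say whether the queue is for CPU or for disk.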

Diagnosis workflow

Determine whether the load is CPU-bound or I/O-bound:

top -bn1 | head -5
# Look at %wa (I/O wait) — if high, the bottleneck is disk, not CPU
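
If top is not convenient to parse, roughly the same figure can be derived from /proc/stat. The sketch below is one possible approach, not a standard tool: iowait_pct is a hypothetical helper that takes two samples of the aggregate "cpu" line and reports the share of time spent in iowait between them.

```shell
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i; next }
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i - first[i]
            iow = $6 - first[6]
            if (total > 0) printf "%.1f\n", 100 * iow / total
            else print "0.0"
        }'
}

# Take two samples one second apart (guarded for systems without /proc/stat).
if [ -r /proc/stat ]; then
    before=$(grep '^cpu ' /proc/stat)
    sleep 1
    after=$(grep '^cpu ' /proc/stat)
    echo "iowait: $(iowait_pct "$before" "$after")%"
fi
```

A longer gap between the samples gives a steadier reading. What counts as "high" depends on the workload, so treat any threshold as a starting point.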

Find the top consumers:

# By CPU
ps aux --sort=-%cpu | head -15

# By I/O (iotop is its own package and usually needs root; iostat comes from sysstat)
iotop -oPa
iostat -x 1 5

Check for processes blocked on I/O:

# Processes in D state (uninterruptible sleep, usually waiting on I/O)
ps aux | awk '$8 ~ /^D/'
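
A variant of the same check can also show where each blocked task is waiting. This sketch assumes the procps version of ps, whose wchan column names the kernel function a task is sleeping in; d_state_only is a hypothetical filter name.

```shell
# Hypothetical filter: keep the header plus rows whose state starts with D.
# Expects ps output whose first column is the process state.
d_state_only() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# wchan:32 widens the wait-channel column so names are not truncated.
if command -v ps >/dev/null 2>&1; then
    ps -eo state,pid,wchan:32,comm | d_state_only
fi
```

Tasks enter and leave D state quickly on a healthy system, so it helps to run this a few times. The same PIDs appearing repeatedly is the more telling sign.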

Common causes and fixes

CPU-bound: A runaway application process, aggressive cron job, or unoptimised query processing.

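# Lower the scheduling priority of the process
renice +10 -p <pid>
# Cap it at 50% of one core (cpulimit is a separate package)
cpulimit -p <pid> -l 50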

I/O-bound: Disk-heavy operations (database, logging, backups running during peak traffic).

iostat -x 1
# Look for high %util on a device
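
To pick out busy devices without reading the whole table, the output can be filtered. This is a sketch under two assumptions: %util is the last column of iostat -x output, and the installed sysstat supports -y to skip the since-boot report. busy_devices is a hypothetical name, and 80 is an arbitrary threshold.

```shell
# Hypothetical filter: print devices whose %util is at or above a threshold.
# Assumes %util is the last column of `iostat -x` output.
# Usage: iostat -dx 1 2 | busy_devices 80
busy_devices() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device/ { in_table = 1; next }   # header row starts a report
        NF == 0        { in_table = 0; next }   # blank line ends it
        in_table && $NF + 0 >= limit + 0 { print $1, $NF }'
}

# LC_ALL=C keeps the decimal separator a dot so awk compares numbers correctly.
if command -v iostat >/dev/null 2>&1; then
    LC_ALL=C iostat -dxy 1 2 | busy_devices 80
fi
```

On devices that serve many requests in parallel, such as SSDs and RAID arrays, %util can sit near 100 without the device being saturated, so read it alongside the latency columns.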

Reschedule backups and heavy batch jobs to off-peak hours. Move write-heavy logging to a separate disk or use tmpfs for temporary files.
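
When a job cannot be moved, lowering its priority is a possible complement. The wrapper below is a minimal sketch: run_low_priority is a hypothetical name, the backup script path is a placeholder, and whether the idle I/O class has any effect depends on the active I/O scheduler.

```shell
# Hypothetical wrapper: run a command at the lowest CPU priority and, where
# ionice is installed, in the idle I/O scheduling class.
run_low_priority() {
    if command -v ionice >/dev/null 2>&1; then
        ionice -c 3 nice -n 19 "$@"
    else
        nice -n 19 "$@"
    fi
}

# Example (the script path is a placeholder):
# run_low_priority /usr/local/bin/nightly-backup.sh
```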

Too many workers: PHP-FPM, Gunicorn, or Node.js running more CPU-bound workers than there are cores. As a starting point, reduce the worker count to the number of available CPU cores minus one, which leaves headroom for the OS, then tune from there.
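
The cores-minus-one rule can be computed rather than hard-coded. In this sketch, worker_count is a hypothetical helper that keeps the result at one or more on single-core machines.

```shell
# Hypothetical helper: cores minus one, but never fewer than one worker.
# Usage: worker_count <cores>
worker_count() {
    cores="$1"
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}

if command -v nproc >/dev/null 2>&1; then
    echo "suggested workers: $(worker_count "$(nproc)")"
fi
```

The result can be passed to whichever setting controls the pool size, for example Gunicorn's --workers option or pm.max_children in a PHP-FPM pool.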

Where Reflex helps

Reflex tracks load average, CPU utilisation, and I/O wait continuously. When load crosses a threshold, it can identify the top consumers, reschedule non-critical jobs, restart overloaded services, and alert your team with the full diagnostic context. See How it works.