uWSGI worker dies under load — production fix
TL;DR
How to diagnose and fix uWSGI workers dying under traffic spikes due to harakiri timeouts, memory limits, and worker recycling issues.
Key facts
- Topic: Production error triage
- Stack: Python / Linux
TL;DR
uWSGI workers dying under load usually means the master process is killing workers that exceed configured limits — harakiri timeout, memory ceiling, or max-requests recycling. During traffic spikes, all workers can be simultaneously killed and respawned, leaving zero capacity and causing a cascade of 502 errors from nginx.
Common causes
- Harakiri timeout: the harakiri option kills any worker that takes longer than N seconds on a single request. Under load, slow database queries or external API calls push response times past this threshold, causing mass worker kills.
- Memory limits (reload-on-rss): workers exceeding the RSS threshold are recycled. If your application leaks memory or handles large payloads, every worker can hit the limit simultaneously under spike traffic.
- max-requests recycling: workers configured with max-requests restart after N requests. Under sustained load, all workers may hit this threshold at the same time, causing a thundering-herd restart.
- Cheaper misconfiguration: the adaptive worker scaling (cheaper algorithm) scales down too aggressively during quiet periods, leaving insufficient workers when traffic spikes.
Diagnosis workflow
Check uWSGI logs for the kill reason:
grep -iE "harakiri|SIGKILL|respawn|oom|RSS" /var/log/uwsgi/app.log | tail -30
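To see which reason dominates rather than reading raw lines, the same keywords can be tallied. The following is a minimal sketch, not part of uWSGI: `count_kill_reasons` is a hypothetical helper, and it only counts lines that mention each keyword, matched case-insensitively.

```python
import re
from collections import Counter

# Same keywords as the grep pattern, matched case-insensitively.
KEYWORDS = ("harakiri", "sigkill", "respawn", "oom", "rss")
PATTERN = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def count_kill_reasons(lines):
    """Count the lines that mention each keyword (one line can count for several)."""
    counts = Counter()
    for line in lines:
        # A set avoids counting a keyword twice when it repeats within one line.
        for keyword in {m.group(0).lower() for m in PATTERN.finditer(line)}:
            counts[keyword] += 1
    return counts
```

Passing an open file works because the function accepts any iterable of lines, for example `count_kill_reasons(open("/var/log/uwsgi/app.log")).most_common()`. A count dominated by one keyword suggests which of the fixes in the following sections to try first.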
Enable the stats server for real-time worker inspection:
; uwsgi.ini
stats = 127.0.0.1:1717
stats-http = true
Query the stats endpoint:
curl -s http://127.0.0.1:1717 | python3 -m json.tool
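Look at each worker's status field (idle, busy, cheap, pause, or sigN), requests count, and rss value (rss stays at 0 unless memory-report is enabled). A sigN status means the worker is running a uWSGI signal handler, not that it is being killed. To spot kills, watch harakiri_count and respawn_count climb between polls.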
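The stats output can also be summarized programmatically. This is a rough sketch: `summarize_workers` and its `rss_limit_mb` parameter are hypothetical names, and the code assumes the JSON has a top-level `workers` list whose entries carry `id`, `status`, and `rss` (in bytes).

```python
from collections import Counter


def summarize_workers(stats, rss_limit_mb=None):
    """Return (status counts, ids of workers at or above the RSS limit)."""
    workers = stats.get("workers", [])
    status_counts = Counter(w.get("status", "unknown") for w in workers)

    over_limit = []
    if rss_limit_mb is not None:
        # Assumption: the stats server reports rss in bytes.
        limit_bytes = rss_limit_mb * 1024 * 1024
        over_limit = [w["id"] for w in workers if w.get("rss", 0) >= limit_bytes]
    return status_counts, over_limit
```

Feed it the parsed response, for example `summarize_workers(json.loads(body), rss_limit_mb=512)` where `body` is the text returned by the curl command. If nearly every worker is busy and none are idle, the pool is saturated regardless of what is killing workers.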
Monitor from the OS level:
watch -n 1 'ps aux | grep uwsgi | grep -v grep | wc -l'
Fix harakiri settings
Set harakiri high enough for your slowest legitimate request, but low enough to kill genuinely stuck workers:
; uwsgi.ini
harakiri = 60
harakiri-verbose = true
The harakiri-verbose option logs extra detail about the killed worker (on Linux, including the system call it was blocked in) next to the harakiri log entry for the request, making it easier to identify which endpoint is slow.
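One way to pick the number is to derive it from measured request durations instead of guessing. The sketch below is a heuristic, not a uWSGI feature: `suggest_harakiri`, the percentile, and the 1.5x margin are all illustrative assumptions, and the durations would come from your own access logs or APM data.

```python
import math


def suggest_harakiri(durations_s, percentile=99.0, margin=1.5):
    """Suggest a harakiri value (whole seconds) from observed request durations.

    Takes the nearest-rank percentile of the durations and multiplies it by a
    safety margin, so legitimate slow requests finish but stuck ones are killed.
    """
    if not durations_s:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations_s)
    # Nearest-rank percentile: smallest value with >= percentile% of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return math.ceil(ordered[rank - 1] * margin)
```

Using `percentile=100` bases the value on the slowest request seen, which is the safer choice when a rare but legitimate endpoint (a report export, for instance) must never be killed.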
Fix memory recycling
Stagger worker recycling to prevent all workers restarting simultaneously:
; uwsgi.ini
reload-on-rss = 512
max-requests = 5000
max-requests-delta = 500
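The max-requests-delta option staggers the thresholds: uWSGI adds a per-worker multiple of the delta to each worker's max-requests value (worker_id × delta, per the option's help text), so successive workers recycle about 500 requests apart rather than all at the same time.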
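The effect of staggering can be checked with a small model. This is a sketch under two assumptions: every worker serves requests at roughly the same rate, and the per-worker offsets are passed in explicitly so the code does not depend on how uWSGI derives them. Both function names are hypothetical.

```python
def restart_points(base, offsets):
    """Request count at which each worker recycles: base threshold plus its offset."""
    return [base + offset for offset in offsets]


def max_simultaneous_restarts(thresholds, window):
    """Largest group of workers whose thresholds lie within `window` requests of each other.

    With equal per-worker request rates, such a group restarts at about the same time.
    """
    ordered = sorted(thresholds)
    best = 0
    start = 0
    for end in range(len(ordered)):
        # Shrink the sliding window until it spans at most `window` requests.
        while ordered[end] - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

With a base of 5000 and no offsets, four workers all fall into the same window and restart together. With illustrative offsets of 500, 1000, 1500, and 2000, at most one worker is restarting at any point, so the other three keep serving traffic.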
Configure adaptive scaling
; uwsgi.ini
master = true
processes = 8
cheaper = 2
cheaper-initial = 4
cheaper-step = 1
cheaper-algo = busyness
cheaper-overload = 5
cheaper-busyness-multiplier = 30
cheaper-busyness-min = 20
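This keeps a minimum of 2 workers, starts with 4, and scales up to 8 based on busyness. The cheaper-overload value sets the length of each busyness measurement cycle in seconds, which controls how quickly the master reacts to load changes. Below cheaper-busyness-min (20% here) the pool counts as idle, and cheaper-busyness-multiplier makes the master wait that many cycles before it removes a worker.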
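The asymmetry between scaling up and scaling down is easier to see in a toy model. The following is a deliberately simplified sketch and not uWSGI's actual implementation: `next_worker_count` is a hypothetical function, the `busy_max=50` threshold is an assumed value for cheaper-busyness-max (not set in the configuration above), and the real algorithm adjusts its multiplier dynamically.

```python
def next_worker_count(current, busyness_pct, idle_cycles, *,
                      minimum=2, maximum=8, step=1,
                      busy_max=50, busy_min=20, multiplier=30):
    """One decision cycle of a simplified busyness model.

    Returns (new worker count, updated count of consecutive idle cycles).
    """
    if busyness_pct > busy_max:
        # Overloaded: add `step` workers right away, capped at `maximum`.
        return min(maximum, current + step), 0
    if busyness_pct < busy_min:
        idle_cycles += 1
        if idle_cycles >= multiplier:
            # Idle for long enough: remove one worker, never going below `minimum`.
            return max(minimum, current - 1), 0
        return current, idle_cycles
    # In between the thresholds: hold steady and reset the idle streak.
    return current, 0
```

In this model a single busy cycle adds a worker, while removing one takes 30 consecutive idle cycles (150 seconds at 5 seconds per cycle). If spikes arrive faster than one cycle can absorb, raising cheaper (the minimum) or cheaper-step is the more direct lever.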
Where Reflex helps
Reflex monitors uWSGI worker states, harakiri events, and memory consumption across your fleet. When worker death rates spike, Reflex can adjust worker counts, restart the master process with tuned configuration, and verify that capacity recovers, providing a full incident timeline with the harakiri log details and memory figures.