uWSGI worker dies under load — production fix

TL;DR

How to diagnose and fix uWSGI workers dying under traffic spikes due to harakiri timeouts, memory limits, and worker recycling issues.

Key facts

Topic: Production error triage
Stack: Python / Linux

TL;DR

uWSGI workers dying under load usually means they are being killed or recycled for exceeding a configured limit: the harakiri timeout, the memory ceiling (reload-on-rss), or the max-requests count. During a traffic spike, every worker can hit its limit at about the same time, so all of them are killed and respawned together, leaving zero capacity and causing a cascade of 502 errors from nginx.

Common causes

  • Harakiri timeout — the harakiri option kills any worker that takes longer than N seconds on a single request. Under load, slow database queries or external API calls push response times past this threshold, causing mass worker kills
  • Memory limits (reload-on-rss) — workers exceeding the RSS threshold are recycled. If your application leaks memory or handles large payloads, every worker can hit the limit simultaneously under spike traffic
  • max-requests recycling — workers configured with max-requests restart after N requests. Under sustained load, all workers may hit this threshold at the same time, causing a thundering-herd restart
  • Cheap/cheaper misconfiguration — the adaptive worker scaling (cheaper algorithm) scales down too aggressively during quiet periods, leaving insufficient workers when traffic spikes

Diagnosis workflow

Check uWSGI logs for the kill reason:

grep -iE "harakiri|SIGKILL|signal 9|respawn|oom|rss" /var/log/uwsgi/app.log | tail -30
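To see which reason dominates rather than only reading the last 30 lines, the same keywords can be tallied with a short script. This is a minimal sketch: the keyword groups mirror the grep pattern, the extra "signal 9" match is an assumption about how kills may be worded, and exact uWSGI log wording varies by version, so treat the counts as a rough guide.

```python
import re
from collections import Counter

# Keyword groups follow the grep pattern used in this article.
# Matching is case-insensitive because log wording and casing can vary.
# "signal 9" is an assumed alternative spelling of a SIGKILL message.
REASONS = {
    "harakiri": re.compile(r"harakiri", re.IGNORECASE),
    "sigkill": re.compile(r"sigkill|signal 9", re.IGNORECASE),
    "memory": re.compile(r"\brss\b|\boom\b", re.IGNORECASE),
    "respawn": re.compile(r"respawn", re.IGNORECASE),
}


def count_kill_reasons(lines):
    """Count how many log lines mention each reason keyword."""
    counts = Counter()
    for line in lines:
        for reason, pattern in REASONS.items():
            if pattern.search(line):
                counts[reason] += 1
    return counts


def count_from_file(path="/var/log/uwsgi/app.log"):
    """Tally reasons for a whole log file (path taken from the grep example)."""
    with open(path, errors="replace") as fh:
        return count_kill_reasons(fh)
```

Calling count_from_file() returns a Counter keyed by reason. A high harakiri count points at slow requests, while a high memory count points at the RSS limit.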

Enable the stats server for real-time worker inspection:

; uwsgi.ini
stats = 127.0.0.1:1717
stats-http = true

Query the stats endpoint:

curl -s http://127.0.0.1:1717 | python3 -m json.tool

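Look at each worker's status field (typically idle, busy, or cheap), its requests count, and its rss value. A sig status means the worker is running a uWSGI signal handler rather than serving a request; it does not by itself mean the worker is being killed. To spot dying workers, watch for worker PIDs that keep changing and for rising respawn_count and harakiri_count values. If rss always reads 0, memory reporting (memory-report = true) is probably not enabled.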
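The JSON from the stats endpoint can also be summarized programmatically. The sketch below assumes the payload has a top-level "workers" list whose entries carry "id", "status", and "rss" keys; check the actual output of your uWSGI version before relying on those names. The function names are hypothetical.

```python
import json
import urllib.request
from collections import Counter


def fetch_stats(url="http://127.0.0.1:1717", timeout=2.0):
    """Fetch the stats JSON (requires stats-http = true as configured above)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_workers(stats):
    """Group workers by status and list the three largest by rss."""
    workers = stats.get("workers", [])
    by_status = Counter(w.get("status", "unknown") for w in workers)
    top_rss = sorted(workers, key=lambda w: w.get("rss", 0), reverse=True)[:3]
    return {
        "total": len(workers),
        "by_status": dict(by_status),
        "top_rss": [(w.get("id"), w.get("rss", 0)) for w in top_rss],
    }
```

Running summarize_workers(fetch_stats()) every few seconds during a spike shows whether workers are all busy (capacity problem) or disappearing and reappearing (kill or recycle problem).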

Monitor from the OS level:

watch -n 1 'ps aux | grep uwsgi | grep -v grep | wc -l'

Fix harakiri settings

Set harakiri high enough for your slowest legitimate request, but low enough to kill genuinely stuck workers:

; uwsgi.ini
harakiri = 60
harakiri-verbose = true

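The harakiri-verbose option logs additional detail about the worker's state when it is killed (not necessarily a Python traceback). Together with the request URI that uWSGI reports for the harakiri, this helps identify which endpoint is slow.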
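One way to pick a starting value is to derive it from observed request durations. The sketch below is a heuristic, not a uWSGI feature: it takes a high percentile of legitimate request times and multiplies it by a headroom factor. The function name, the 99th percentile, and the 2x headroom are all assumptions to tune for your workload.

```python
import math


def suggest_harakiri(durations_s, percentile=99.0, headroom=2.0):
    """Suggest a harakiri value (whole seconds) from observed request durations."""
    if not durations_s:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations_s)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    slowest_legit = ordered[rank - 1]
    # Leave headroom so normal slow requests are not killed under load.
    return math.ceil(slowest_legit * headroom)
```

For example, if 99% of requests finish within 12 seconds, this suggests harakiri = 24. Requests slower than that are then treated as stuck.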

Fix memory recycling

Stagger worker recycling to prevent all workers restarting simultaneously:

; uwsgi.ini
reload-on-rss = 512
max-requests = 5000
max-requests-delta = 500

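The reload-on-rss value is in megabytes. The max-requests-delta option offsets each worker's max-requests threshold by a different amount (per the option's help text, worker_id × delta rather than a random value), so with a delta of 500 the workers' limits sit 500 requests apart and they do not all restart at the same time.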
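The effect of staggering can be checked with a small simulation. This sketch assumes every worker receives requests at the same steady rate and takes the per-worker offsets as a plain list, however uWSGI derives them; the function names are hypothetical.

```python
def recycle_times(max_requests, offsets, reqs_per_sec):
    """Seconds until each worker reaches its threshold at a steady per-worker rate."""
    return [(max_requests + off) / reqs_per_sec for off in offsets]


def max_simultaneous(times, window_s):
    """Largest number of workers that recycle within any window of window_s seconds."""
    ordered = sorted(times)
    best = 0
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it spans at most window_s.
        while ordered[end] - ordered[start] > window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


# Four workers at 50 requests/second each, with and without offsets.
unstaggered = recycle_times(5000, [0, 0, 0, 0], 50)
staggered = recycle_times(5000, [0, 500, 1000, 1500], 50)
```

With no offsets all four workers recycle within the same 5-second window. With offsets 500 requests apart they recycle 10 seconds apart, so at most one is down at a time.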

Configure adaptive scaling

; uwsgi.ini
master = true
processes = 8
cheaper = 2
cheaper-initial = 4
cheaper-step = 1
cheaper-algo = busyness
cheaper-overload = 5
cheaper-busyness-multiplier = 30
cheaper-busyness-min = 20

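This keeps a minimum of 2 workers, starts with 4, and scales up to a maximum of 8 (processes) based on busyness. The cheaper-overload value is the length, in seconds, of the window over which busyness is measured, so it controls how quickly the master reacts to load changes. cheaper-busyness-min is the busyness percentage below which a cycle counts as idle, and cheaper-busyness-multiplier is how many idle cycles must pass before a worker is stopped. Raise the multiplier or the cheaper floor if workers are being removed too eagerly before spikes.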
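Before deploying a scaling configuration, it can help to sanity-check the relationships between the values. The rules below are assumptions about sensible settings (master enabled, cheaper below processes, cheaper-initial between the two), not a reimplementation of uWSGI's own validation; check_cheaper is a hypothetical helper.

```python
def check_cheaper(cfg):
    """Return a list of human-readable problems found in a cheaper config dict."""
    problems = []
    processes = cfg["processes"]
    cheaper = cfg["cheaper"]
    # If cheaper-initial is absent, assume uWSGI starts at the cheaper floor.
    initial = cfg.get("cheaper-initial", cheaper)

    if not cfg.get("master", False):
        problems.append("master should be enabled for adaptive scaling")
    if cheaper < 1:
        problems.append("cheaper should be at least 1")
    if cheaper >= processes:
        problems.append("cheaper should be lower than processes")
    if not (cheaper <= initial <= processes):
        problems.append("cheaper-initial should lie between cheaper and processes")
    return problems


# The values from the configuration above.
article_cfg = {"master": True, "processes": 8, "cheaper": 2, "cheaper-initial": 4}
```

check_cheaper(article_cfg) returns an empty list. A config with cheaper equal to processes, for example, would be flagged because it leaves no room to scale.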

Where Reflex helps

Reflex monitors uWSGI worker states, harakiri events, and memory consumption across your fleet. When worker death rates spike, Reflex can adjust worker counts, restart the master process with tuned configuration, and verify capacity recovers — providing a full incident timeline with the exact harakiri backtraces and memory figures. See How it works.