Rails Puma worker crash — production recovery

TL;DR

How to diagnose and recover from Puma worker crashes in production Rails applications.

Key facts

Topic: Production error triage
Stack: Ruby / Linux

TL;DR

Puma worker crashes in production cause dropped requests and 502 errors from the nginx reverse proxy in front of Puma. Unlike a clean shutdown, a crashed worker exits abnormally: the Puma master process detects the missing worker and forks a replacement, but in-flight requests on that worker are lost. If multiple workers crash simultaneously (common with OOM kills on memory-constrained servers), the entire application becomes unavailable until replacement workers have booted.
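The difference between a clean exit and a crash is visible to any parent process through the child's exit status. The following is a minimal sketch of that mechanism in plain Ruby, not Puma's actual implementation; `run_worker_once` is a hypothetical helper name.

```ruby
# Illustrative only: shows how a parent process can tell a crash from a
# clean exit. Puma's master does considerably more than this.

# Fork a child that runs the given block, wait for it, and describe how it ended.
def run_worker_once
  pid = fork { yield }
  _, status = Process.wait2(pid)

  if status.signaled?
    # Killed by a signal, e.g. SIGKILL from the OOM killer or SIGSEGV.
    { pid: pid, outcome: :crashed, signal: Signal.signame(status.termsig) }
  elsif status.success?
    { pid: pid, outcome: :clean_exit }
  else
    { pid: pid, outcome: :error_exit, exit_code: status.exitstatus }
  end
end
```

A child that sends SIGKILL to itself (`run_worker_once { Process.kill("KILL", Process.pid); sleep }`) is reported as crashed, which is the case where a supervisor would fork a replacement.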

Crash types

Segfaults from native extensions

A SIGSEGV (segmentation fault) in a Puma worker almost always originates from a native C extension — not from Ruby code. Common culprits:

  • nokogiri — parsing malformed HTML/XML
  • mysql2 / pg — driver-level crashes on unusual query results or connection state corruption
  • sassc / libsass — CSS compilation failures
  • image processing gems (mini_magick, vips) — corrupted image inputs

Diagnose from system logs:

dmesg | grep -i segfault
journalctl -u puma --since "1 hour ago" | grep -i "signal\|segfault\|abort"

Update the offending gem and its underlying native library. If the crash is reproducible, isolate it by running the suspect call outside Puma, substituting the input that triggers the crash:

bundle exec ruby -e "require 'nokogiri'; Nokogiri::HTML('<broken')"

OOM kills

The kernel OOM killer sends SIGKILL to the highest-scoring process — usually the largest Puma worker:

dmesg | grep -i "killed process"
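When triaging many hosts, it can help to classify kernel log lines programmatically instead of grepping by hand. The sketch below assumes the common Linux message shapes for segfaults and OOM kills; the exact wording varies between kernel versions, and `classify_kernel_line` and both regexes are hypothetical names, not part of any library.

```ruby
# Patterns for two common kernel log messages (assumed formats; adjust for
# your kernel version).
SEGFAULT_RE = /(?<comm>\S+)\[(?<pid>\d+)\]: segfault at \h+ .* in (?<object>[^\[\s]+)/
OOM_RE      = /Killed process (?<pid>\d+) \((?<comm>[^)]+)\)/i

# Return a hash describing the crash, or nil if the line is unrelated.
def classify_kernel_line(line)
  if (m = SEGFAULT_RE.match(line))
    # `object` is the shared library the fault occurred in, which usually
    # points at the native extension to update.
    { type: :segfault, pid: m[:pid].to_i, command: m[:comm], object: m[:object] }
  elsif (m = OOM_RE.match(line))
    { type: :oom_kill, pid: m[:pid].to_i, command: m[:comm] }
  end
end
```

Feeding it the output of `dmesg` line by line (for example with `IO.popen(["dmesg"]).each_line`) gives a list of crash events that can be counted per type or per shared object.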

See the Rails OOM error guide for detailed memory diagnosis and jemalloc configuration.

Deadlocks in threaded mode

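Puma runs multiple threads per worker. Deadlocks cause workers to hang (not crash), resulting in timeouts. To see where threads are blocked, request a thread backtrace dump:

# SIGINFO triggers a backtrace dump, but it exists only on BSD/macOS.
# On Linux, SIGURG is not a backtrace signal (Puma uses it for refork in fork_worker mode).
# Use the control server instead (Puma 5+, requires activate_control_app in config/puma.rb):
bundle exec pumactl -S tmp/pids/puma.state thread-backtraces

Check the returned backtraces to see where each thread is blocked.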
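A thread backtrace dump of this kind can be approximated in plain Ruby with `Thread.list`. This is a minimal sketch for use in a console or a diagnostic script, not what Puma itself runs; `thread_dump` is a hypothetical helper name.

```ruby
# Build a readable dump of every live thread's backtrace in this process.
def thread_dump(limit: 10)
  Thread.list.map do |thread|
    header = "Thread #{thread.name || thread.object_id} (#{thread.status})"
    # backtrace can be nil for a thread that has just finished.
    frames = (thread.backtrace || []).first(limit).map { |frame| "  #{frame}" }
    [header, *frames].join("\n")
  end.join("\n\n")
end
```

In a deadlock, several threads typically show status `sleep` with their top frames inside a lock or queue wait, which narrows down the contended resource.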

Configure Puma for resilience

lowlevel_error_handler

Catch errors that escape the Rails middleware stack:

# config/puma.rb
lowlevel_error_handler do |e, env, status|
  Rails.logger.error("Puma lowlevel error: #{e.class} - #{e.message}")
  Rails.logger.error(e.backtrace&.first(10)&.join("\n"))
  [500, { "Content-Type" => "text/plain" }, ["Internal Server Error\n"]]
end
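Because the handler is just a callable that returns a Rack response triple, it can be built and exercised outside Puma. The sketch below is one possible way to do that; `build_lowlevel_handler` is a hypothetical helper, and the logger is injected so the handler does not depend on `Rails.logger` being set.

```ruby
require "logger"

# Build a three-argument handler that logs the error and returns a plain 500.
def build_lowlevel_handler(logger)
  lambda do |error, _env, _status|
    logger.error("Puma lowlevel error: #{error.class} - #{error.message}")
    # An exception that was never raised has no backtrace, so guard for nil.
    backtrace = error.backtrace&.first(10)
    logger.error(backtrace.join("\n")) if backtrace
    [500, { "Content-Type" => "text/plain" }, ["Internal Server Error\n"]]
  end
end
```

To the best of my knowledge, `lowlevel_error_handler` also accepts a callable argument in place of a block, so `lowlevel_error_handler build_lowlevel_handler(Logger.new($stderr))` in config/puma.rb should be equivalent; treat that as an assumption to verify against your Puma version.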

Phased restart for recovery

Puma's phased restart (SIGUSR1) replaces workers one at a time without closing the listening socket. It requires cluster mode (one or more workers) and is not available when preload_app! is enabled:

# Restart workers without downtime
kill -USR1 $(cat /var/www/myapp/tmp/pids/server.pid)

# Or via pumactl
bundle exec pumactl -S tmp/pids/puma.state phased-restart
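The same signal can be sent from a Ruby script, for example from a deploy task. This is a minimal sketch; `request_phased_restart` is a hypothetical helper, and the PID file path is whatever your Puma configuration writes.

```ruby
# Send SIGUSR1 to the process whose PID is stored in the given PID file.
# Returns the PID that was signalled.
def request_phased_restart(pidfile)
  pid = Integer(File.read(pidfile).strip)
  Process.kill("USR1", pid)
  pid
rescue Errno::ENOENT
  raise "PID file not found: #{pidfile}"
rescue Errno::ESRCH
  # A stale PID file: the master it referred to is no longer running.
  raise "No running process for the PID recorded in #{pidfile}"
end
```

Sending the signal only requests the restart. It does not confirm that the new workers booted, so follow it with a health check.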

Systemd service configuration

[Unit]
Description=Puma Rails Server
After=network.target postgresql.service

[Service]
User=deploy
Group=deploy
WorkingDirectory=/var/www/myapp
ExecStart=/home/deploy/.rbenv/shims/bundle exec puma -C config/puma.rb
ExecReload=/bin/kill -USR1 $MAINPID
Restart=always
RestartSec=5
Environment=RAILS_ENV=production
EnvironmentFile=/var/www/myapp/.env
KillMode=mixed
TimeoutStopSec=30

[Install]
WantedBy=multi-user.target

The ExecReload directive maps systemctl reload puma to a phased restart. With KillMode=mixed, systemctl stop sends SIGTERM only to the Puma master; any processes still running in the unit's control group once TimeoutStopSec (30 seconds here) expires are sent SIGKILL.

Monitor with puma-status

Install the puma-status gem for a quick health overview:

gem install puma-status
puma-status tmp/pids/puma.state

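This shows per-worker thread utilisation, request backlog, and memory usage, which helps determine whether crashes correlate with load spikes. puma-status reads the control server address from the state file, so the Puma control app must be enabled in config/puma.rb.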
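The commands in this guide refer to a PID file and a state file without showing where they come from. A configuration along these lines would produce both; the worker and thread counts are assumptions to tune for your host, and the paths simply mirror the ones used above.

```ruby
# config/puma.rb (sketch; values are illustrative assumptions)
workers 2                             # cluster mode, needed for phased restarts
threads 1, 5                          # min, max threads per worker

pidfile    "tmp/pids/server.pid"      # read by the kill commands above
state_path "tmp/pids/puma.state"      # read by pumactl -S and puma-status

activate_control_app                  # control server used for stats and thread backtraces
```

Leave preload_app! out of this file if you rely on phased restarts.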

Quick recovery checklist

systemctl status puma
journalctl -u puma --since "10 minutes ago" --no-pager | tail -50
systemctl reload puma   # phased restart
sleep 5
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/up
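A single curl five seconds after the reload can miss a slow boot. The sketch below polls the health endpoint until it answers 200 or the attempts run out; `wait_until_healthy` is a hypothetical helper, and the URL, attempt count, and timeouts are assumptions.

```ruby
require "net/http"
require "uri"

# Poll a health endpoint; return true on the first 200, false if it never comes.
def wait_until_healthy(url, attempts: 10, delay: 1.0)
  uri = URI(url)
  attempts.times do |i|
    begin
      response = Net::HTTP.start(uri.host, uri.port, open_timeout: 2, read_timeout: 2) do |http|
        http.get(uri.request_uri)
      end
      return true if response.code == "200"
    rescue SystemCallError, Net::OpenTimeout, Net::ReadTimeout, EOFError
      # Connection refused, reset, or timed out: the server is not ready yet.
    end
    sleep delay unless i == attempts - 1
  end
  false
end
```

For example, `wait_until_healthy("http://localhost:3000/up", attempts: 15)` after `systemctl reload puma` gives the new workers up to roughly 15 seconds to come up before the check is treated as failed.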

Where Reflex helps

Reflex monitors Puma master and worker process health, restart frequency, and request error rates. When workers crash, Reflex can trigger a phased restart, verify the application responds to health checks, and correlate crashes with recent deployments or traffic spikes — providing your team with a full incident timeline and diagnostic context. See How it works.