Skip to main content

PM2 cluster mode failing — diagnosis and fix

TL;DR

How to diagnose and fix PM2 cluster mode failures where workers die immediately, ports conflict, or load balancing breaks.

Key facts

Topic
Production error triage
Stack
Node.js / PM2 / Linux

TL;DR

PM2 cluster mode uses Node.js's built-in cluster module to fork multiple worker processes sharing the same port. When cluster mode fails, you see workers restarting in a tight loop, port EADDRINUSE errors, uneven load distribution, or workers dying silently on startup. The root cause is almost always the application binding ports incorrectly or not handling the cluster lifecycle.

Common causes

  • Port conflicts — the application calls app.listen() in code that runs in both master and worker contexts, causing EADDRINUSE
  • Workers dying immediately — a startup error (missing env var, DB connection failure) kills every forked worker, triggering PM2's restart limit
  • Graceful shutdown not working — workers receive SIGINT but do not drain connections, causing dropped requests during restarts and deployments
  • NODE_APP_INSTANCE misuse — scheduled tasks or cron-like operations run in every worker instead of being isolated to instance 0
  • Serialisation errors — workers try to share non-serialisable state (database connections, file handles) across the cluster

Diagnosis workflow

Check the status of all cluster workers:

pm2 status
pm2 describe app-name

Look at the restart count and uptime. If uptime is seconds and restarts are high, workers are crash-looping. Check logs for the specific error:

pm2 logs app-name --lines 200 --err

Monitor resource usage in real time:

pm2 monit

Verify which processes are listening on your port:

ss -tlnp | grep :3000

In healthy cluster mode, you should see a single entry — PM2's master process sharing the port across workers via the cluster module.

Fix the ecosystem configuration

A robust cluster mode setup:

module.exports = {
  apps: [{
    name: 'app',
    script: 'dist/server.js',
    exec_mode: 'cluster',
    instances: 'max',
    max_memory_restart: '1500M',
    kill_timeout: 8000,
    listen_timeout: 10000,
    wait_ready: true,
    max_restarts: 15,
    min_uptime: '10s',
    restart_delay: 5000,
    exp_backoff_restart_delay: 100,
  }],
};

Setting wait_ready: true tells PM2 to wait for process.send('ready') from your application before routing traffic to the worker, preventing requests hitting a half-initialised instance.

Signal-based graceful shutdown

Your application must handle shutdown signals to drain connections:

process.on('SIGINT', () => {
  server.close(() => {
    console.log('Connections drained, exiting');
    process.exit(0);
  });
  setTimeout(() => process.exit(1), 8000);
});

Isolate scheduled work to a single instance

if (process.env.NODE_APP_INSTANCE === '0') {
  startScheduledJobs();
}

Without this guard, every worker runs the same cron-like task, causing duplicate emails, duplicate database writes, and race conditions.

Load balancing verification

PM2 uses round-robin scheduling by default on Linux. Verify traffic distributes evenly:

pm2 status

Check the cpu and mem columns — if one worker shows significantly higher usage, sticky sessions or a client keep-alive configuration may be routing disproportionate traffic.

Where Reflex helps

Reflex monitors PM2 cluster worker health, restart rates, and per-worker resource consumption. When workers enter a crash loop or load becomes unbalanced, Reflex can reset restart counters, scale worker count, execute a rolling restart, and verify all workers are healthy — alerting your team with the full incident context. See How it works.