Node.js memory leak in production — detection and resolution
How to detect, diagnose, and fix memory leaks in production Node.js applications.
Key facts
- Topic: Production error triage
- Stack: Node.js / Linux
TL;DR
A Node.js memory leak manifests as steadily increasing RSS (Resident Set Size) over hours or days until the process is killed by the OOM killer or PM2's max_memory_restart threshold. Unlike a single OOM spike, leaks are gradual and often go unnoticed until production traffic amplifies them.
Common leak patterns
- Event listeners accumulating — attaching listeners in a request handler without removing them
- Global caches without eviction — objects stored in module-level Maps or Sets that grow indefinitely
- Closures retaining large scopes — callbacks holding references to request/response objects
- Unreferenced timers — setInterval or setTimeout capturing outer variables
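The first pattern in the list can be reproduced in a few lines. This is a minimal sketch, not code from any real application: `bus`, `leakyHandler`, and `safeHandler` are hypothetical names, and `bus` stands in for any long-lived emitter shared across requests.

```javascript
'use strict';
const { EventEmitter } = require('node:events');

// Hypothetical long-lived emitter shared by every request.
const bus = new EventEmitter();

// Leaky: each call adds a listener that is never removed.
function leakyHandler() {
  bus.on('data', () => { /* process data */ });
}

// Safe: returns a cleanup function that detaches the listener.
function safeHandler() {
  const onData = () => { /* process data */ };
  bus.on('data', onData);
  return () => bus.off('data', onData);
}

// Simulate five requests hitting the leaky handler.
for (let i = 0; i < 5; i++) leakyHandler();
console.log(bus.listenerCount('data')); // 5
```

Logging `listenerCount` for a suspect emitter is a cheap first check. Node.js also prints a `MaxListenersExceededWarning` once more than 10 listeners (the default limit) are attached to a single event, which is often the earliest visible symptom of this pattern.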
Detection
Monitor heap usage over time. A healthy process has a sawtooth pattern (grows, GC clears, grows again). A leaking process trends upward:
pm2 monit
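If you prefer to track the trend from inside the process, the sketch below samples `process.memoryUsage().heapUsed` on a timer and fits a least-squares slope to the recent readings. The helper names (`slope`, `startHeapSampler`), the one-minute interval, and the 60-sample window are assumptions chosen for illustration; the window should span several GC cycles so the sawtooth averages out.

```javascript
'use strict';

// Least-squares slope of evenly spaced samples (units: bytes per sample).
function slope(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) ** 2;
  }
  return num / den;
}

// Hypothetical sampler: keeps the last `maxSamples` heapUsed readings and
// reports the current slope to `onSample` after every reading.
function startHeapSampler({ intervalMs = 60_000, maxSamples = 60, onSample } = {}) {
  const samples = [];
  const timer = setInterval(() => {
    samples.push(process.memoryUsage().heapUsed);
    if (samples.length > maxSamples) samples.shift();
    if (onSample) onSample(slope(samples), samples);
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return () => clearInterval(timer); // call to stop sampling
}
```

A slope that stays positive across many windows suggests a leak; a slope that hovers around zero is consistent with the healthy sawtooth. The threshold for alerting depends on your workload and is left to the caller.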
For precise diagnosis, take two heap snapshots 10–15 minutes apart:
node --inspect app.js
# In Chrome DevTools: Memory > Take heap snapshot (twice)
# Compare retained size differences between snapshots
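Attaching DevTools is not always possible in production. As one alternative, the sketch below writes a snapshot to disk when the process receives a signal, using the built-in `v8.writeHeapSnapshot()`. The function names, the file naming scheme, and the choice of `SIGUSR2` are assumptions; check that the signal is not already used by your process manager.

```javascript
'use strict';
const v8 = require('node:v8');
const os = require('node:os');
const path = require('node:path');

// Hypothetical naming scheme: heap-<pid>-<timestamp>.heapsnapshot
function snapshotPath(dir = os.tmpdir(), now = Date.now()) {
  return path.join(dir, `heap-${process.pid}-${now}.heapsnapshot`);
}

// Writes a snapshot synchronously and returns the file name.
// This blocks the event loop and needs extra memory while it runs.
function dumpHeap(dir) {
  return v8.writeHeapSnapshot(snapshotPath(dir));
}

// Registers a signal handler that dumps the heap on demand.
function installHeapDumpSignal(signal = 'SIGUSR2', dir) {
  process.on(signal, () => {
    const file = dumpHeap(dir);
    console.log(`heap snapshot written to ${file}`);
  });
}
```

Call `installHeapDumpSignal()` once at startup, send the signal twice 10–15 minutes apart (`kill -USR2 <pid>`), then load both files in the DevTools Memory tab for comparison. Node.js also ships a `--heapsnapshot-signal=<signal>` command-line flag that achieves the same result without code changes.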
Alternatively, use Clinic.js for a high-level overview:
npx clinic doctor -- node app.js
Common fixes
Remove listeners properly:
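// `stream` is assumed to be a long-lived readable shared across requests
function handler(req, res) {
  const onData = (chunk) => { /* process chunk */ };
  stream.on('data', onData);
  // Detach when the response closes so the listener does not outlive the request
  res.on('close', () => stream.removeListener('data', onData));
}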
Add eviction to in-process caches:
const { LRUCache } = require('lru-cache');
const cache = new LRUCache({ max: 500, ttl: 1000 * 60 * 15 });
Clear intervals when they are no longer needed and avoid storing request-scoped data at module level.
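Both pieces of advice can be sketched briefly. `Poller`, `tagRequest`, and `getRequestMeta` are hypothetical names used only for illustration. The first keeps the timer handle so it can be cleared later; the second uses a `WeakMap` so per-request data becomes collectable as soon as the request object itself is.

```javascript
'use strict';

// Keeps the interval handle so the timer can always be cleared.
class Poller {
  constructor(task, intervalMs) {
    this.task = task;
    this.intervalMs = intervalMs;
    this.timer = null;
  }

  start() {
    if (this.timer !== null) return; // avoid stacking a second interval
    this.timer = setInterval(this.task, this.intervalMs);
  }

  stop() {
    if (this.timer === null) return;
    clearInterval(this.timer);
    this.timer = null;
  }
}

// Request-scoped data keyed by the request object. A WeakMap does not keep
// its keys alive, so entries disappear once the request is garbage collected.
const requestMeta = new WeakMap();

function tagRequest(req, meta) {
  requestMeta.set(req, meta);
}

function getRequestMeta(req) {
  return requestMeta.get(req);
}
```

Call `stop()` from whatever teardown path owns the poller (connection close, shutdown hook, and so on). A plain `Map` keyed by request ID would need explicit deletion on every exit path, which is where this kind of leak usually starts.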
Production safety net
Configure PM2 to restart before the leak causes a crash:
{ max_memory_restart: '1500M' }
This buys time while you find and fix the root cause.
Where Reflex helps
Reflex tracks Node.js memory trends and detects leak patterns — steadily rising RSS without corresponding GC reclamation. When it identifies a leak, it can trigger a rolling restart across cluster workers, preserving availability while your team investigates the root cause. See How it works.