Node.js server monitoring — the complete 2026 guide

The Reflex Team · 9 min · 18 May 2026

Node.js is deceptively easy to deploy and deceptively hard to keep healthy. You run node server.js, traffic arrives, and everything looks fine — until the event loop blocks for 800ms on a synchronous JSON parse, memory creeps past 1.5 GB because nobody noticed a closure holding a reference to every request object, and your health check still returns 200 because it never actually tested anything meaningful.

This is the guide we wish existed when we started supporting Node.js servers on Reflex. It covers the tools, the signals, and the practices that separate "it works on my machine" from "it survives Tuesday at 2am."

PM2: the baseline process manager

If you are running Node.js in production without a process manager, you are one unhandled rejection away from silence. PM2 is the de facto standard and earns its place for good reasons: automatic restarts, cluster mode across CPU cores, log management, and a built-in monitoring dashboard.

A sensible ecosystem.config.js for a production API server:

module.exports = {
  apps: [{
    name: 'api',
    script: './dist/server.js',
    instances: 'max',
    exec_mode: 'cluster',
    max_memory_restart: '512M',
    env_production: {
      NODE_ENV: 'production',
      PORT: 3000,
    },
    exp_backoff_restart_delay: 100,
    max_restarts: 10,
    min_uptime: '10s',
  }],
};

Key settings worth understanding: max_memory_restart is your safety net against slow leaks — PM2 will gracefully restart a worker that crosses the threshold. exp_backoff_restart_delay prevents crash loops from hammering your server with instant restarts. min_uptime tells PM2 what counts as a "real" start versus an immediate crash.
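A restart is only graceful if the worker cooperates. The following is a minimal sketch, assuming PM2 signals the worker with SIGINT and force-kills it once kill_timeout elapses (1600 ms by default, as far as we know). The helper name installGracefulShutdown and its options are illustrative and not part of PM2.

```javascript
// Hypothetical helper: stop accepting connections when a restart is signalled,
// then exit once in-flight requests have finished or a deadline passes.
function installGracefulShutdown(
  server,
  { timeoutMs = 1500, exit = (code) => process.exit(code) } = {}
) {
  let shuttingDown = false;

  function shutdown() {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Safety net: exit non-zero if connections do not drain in time.
    const timer = setTimeout(() => exit(1), timeoutMs);
    timer.unref();

    // server.close() stops new connections and calls back once existing ones end.
    server.close(() => {
      clearTimeout(timer);
      exit(0);
    });
  }

  process.on('SIGINT', shutdown);
  process.on('SIGTERM', shutdown);
  return shutdown;
}
```

Call installGracefulShutdown(server) once after server.listen(), and keep timeoutMs below the configured kill_timeout so the process exits on its own terms.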

Run pm2 monit for a live terminal dashboard showing CPU, memory, loop delay, and request counts per worker. For remote visibility, pm2 plus offers a hosted dashboard, though many teams prefer shipping metrics to their own stack.

Memory leak detection

Node.js memory leaks are subtle. The garbage collector can mask a small leak for hours, until RSS crosses a threshold and the process is either OOM-killed or starts thrashing in back-to-back GC pauses.

Heap snapshots are the gold standard for finding leaks. In production, you cannot afford to take them on every request, but you can instrument strategically:

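const v8 = require('v8');

// Writes a heap snapshot to disk and returns the path of the generated file.
// v8.writeHeapSnapshot() is synchronous: it blocks the event loop while it
// runs and needs memory of about twice the heap size at the time of the dump.
function dumpHeap() {
  const filename = `/tmp/heap-${process.pid}-${Date.now()}.heapsnapshot`;
  // The return value is the filename that was written, not a stream.
  return v8.writeHeapSnapshot(filename);
}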

Trigger this via an admin endpoint (authenticated, rate-limited, never exposed publicly) or via a signal handler. Compare two snapshots taken minutes apart in Chrome DevTools — objects that grow between snapshots are your leak candidates.
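The signal-handler route could be sketched roughly as follows. createSnapshotTrigger is a hypothetical helper, the one-minute minimum interval is arbitrary, and SIGUSR2 is an assumption (SIGUSR1 is reserved by Node.js for activating the inspector). The write and now parameters exist only so the rate limiting can be exercised without writing real snapshots.

```javascript
const v8 = require('v8');

// Hypothetical helper: wraps heap snapshot writing with a minimum interval so
// repeated signals cannot stack up expensive, blocking dumps.
function createSnapshotTrigger({
  minIntervalMs = 60_000,
  write = (file) => v8.writeHeapSnapshot(file),
  now = Date.now,
} = {}) {
  let lastRun = -Infinity;

  return function trigger() {
    const ts = now();
    if (ts - lastRun < minIntervalMs) return null; // too soon, skip this request
    lastRun = ts;
    const file = `/tmp/heap-${process.pid}-${ts}.heapsnapshot`;
    return write(file); // resolves to the path that was written
  };
}

// Assumed wiring: `kill -USR2 <pid>` requests a snapshot.
const dumpOnSignal = createSnapshotTrigger();
process.on('SIGUSR2', () => {
  const file = dumpOnSignal();
  console.log(file ? `Heap snapshot written to ${file}` : 'Heap snapshot skipped (rate limit)');
});
```

Because the dump blocks the event loop, it is worth taking the instance out of load balancer rotation before sending the signal if that is an option.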

Common leak patterns in Node.js:

  • Event listeners that accumulate without removal — especially on long-lived objects like database pools or WebSocket connections
  • Closures capturing request-scoped data in module-level caches
  • Unbounded arrays used as in-memory queues without size limits
  • Global error handlers that store error objects with full stack traces
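
As an illustration of the third pattern, a queue can be given a hard cap so that a stalled consumer cannot grow it without limit. BoundedQueue is a hypothetical name, and dropping the oldest item is only one possible policy (rejecting new items is another).

```javascript
// Hypothetical bounded in-memory queue: once maxSize is reached, the oldest
// item is dropped so memory use stays flat even if the consumer stalls.
class BoundedQueue {
  constructor(maxSize) {
    if (!Number.isInteger(maxSize) || maxSize < 1) {
      throw new RangeError('maxSize must be a positive integer');
    }
    this.maxSize = maxSize;
    this.items = [];
    this.dropped = 0; // count of evicted items, worth exporting as a metric
  }

  push(item) {
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // evict the oldest entry
      this.dropped += 1;
    }
    this.items.push(item);
  }

  shift() {
    return this.items.shift();
  }

  get length() {
    return this.items.length;
  }
}
```

A rising dropped counter is a useful signal in its own right: it means the consumer is falling behind, which would otherwise have shown up later as memory growth.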

Monitor process.memoryUsage() and export rss, heapUsed, and heapTotal as metrics. A healthy Node.js server shows sawtooth memory patterns as GC runs. A leaking server shows a steady upward slope.
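
One rough way to tell a sawtooth from a slope is to fit a line through recent heapUsed samples. The sketch below assumes one sample per minute and a 60-sample window, and the 1 MB/min warning threshold is an arbitrary placeholder to tune per service. The function names are illustrative.

```javascript
const MB = 1024 * 1024;

// Snapshot of rss, heapUsed, and heapTotal, converted to megabytes.
function sampleMemory() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  return { rss: rss / MB, heapUsed: heapUsed / MB, heapTotal: heapTotal / MB };
}

// Least-squares slope of evenly spaced samples (units per sample interval).
function slope(values) {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  values.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

// Keep the last hour of one-minute samples and warn on sustained growth.
const history = [];
const sampler = setInterval(() => {
  history.push(sampleMemory().heapUsed);
  if (history.length > 60) history.shift();
  const growth = slope(history);
  if (history.length === 60 && growth > 1) {
    console.warn(`heapUsed growing ~${growth.toFixed(2)} MB/min over the last hour`);
  }
}, 60_000);
sampler.unref(); // do not keep the process alive just for sampling
```

In practice the same values would usually be exported to a metrics backend and the trend computed there. The in-process version is mainly useful as a last-resort warning.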

CPU profiling in production

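The V8 inspector protocol supports CPU profiling of a live process. To start a process with the inspector listening on the loopback interface:

node --inspect=127.0.0.1:9229 server.js

To attach to a process that is already running, without restarting it, send it SIGUSR1 (kill -USR1 <pid>, Linux and macOS only). Node.js then opens the inspector on 127.0.0.1:9229.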

Never bind the inspector to 0.0.0.0 in production without firewall rules — it grants full code execution. Use SSH tunnelling or bind to 127.0.0.1 and proxy through your bastion.

For non-interactive profiling, the --prof flag generates V8 tick logs you can process with --prof-process. More practical for production: use 0x or clinic.js in staging with production-like load to generate flamegraphs that expose hot functions.
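
Another non-interactive option is Node's built-in inspector module, which can record a profile from inside the process without opening a network port. The sketch below is one way to wrap it. captureCpuProfile and profileToFile are hypothetical helper names, and the output path and 30-second duration are placeholders. Profiling adds overhead, so keep the window short.

```javascript
const inspector = require('inspector');
const fs = require('fs');

// Records a CPU profile of the current process for durationMs and resolves
// with the profile object (the format Chrome DevTools loads as .cpuprofile).
function captureCpuProfile(durationMs) {
  return new Promise((resolve, reject) => {
    const session = new inspector.Session();
    session.connect();

    const fail = (err) => {
      session.disconnect();
      reject(err);
    };

    session.post('Profiler.enable', (enableErr) => {
      if (enableErr) return fail(enableErr);
      session.post('Profiler.start', (startErr) => {
        if (startErr) return fail(startErr);
        setTimeout(() => {
          session.post('Profiler.stop', (stopErr, result) => {
            if (stopErr) return fail(stopErr);
            session.disconnect();
            resolve(result.profile);
          });
        }, durationMs);
      });
    });
  });
}

// Example: profile for 30 seconds and write the result to a placeholder path.
async function profileToFile(file = `/tmp/cpu-${process.pid}-${Date.now()}.cpuprofile`) {
  const profile = await captureCpuProfile(30_000);
  await fs.promises.writeFile(file, JSON.stringify(profile));
  return file;
}
```

The resulting file can be loaded into the Chrome DevTools performance tooling. As with heap snapshots, any endpoint or signal that triggers this should be authenticated and rate-limited.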

Event loop lag is the single most important CPU-adjacent metric. When the loop blocks, every connection queued behind it stalls. Measure it:

const { monitorEventLoopDelay } = require('perf_hooks');
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const p99 = histogram.percentile(99) / 1e6;
  if (p99 > 100) {
    console.warn(`Event loop p99 lag: ${p99.toFixed(1)}ms`);
  }
  histogram.reset();
}, 10_000);

Anything above 100ms at p99 means your users are feeling it. Common culprits: synchronous file I/O, JSON serialisation of large objects, DNS resolution without caching, and CPU-bound crypto operations that should be offloaded to worker threads.
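
Offloading to a worker thread could look something like the sketch below. pbkdf2Sync stands in for any CPU-bound function, deriveKeyInWorker is a hypothetical name, and the worker source is inlined with eval: true only to keep the example self-contained. A real project would normally point Worker at a separate file and reuse a pool of workers rather than spawning one per call.

```javascript
const { Worker } = require('worker_threads');

// Inline worker source: runs the blocking call off the main thread and posts
// the result back as a hex string.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  const { pbkdf2Sync } = require('crypto');
  const { password, salt, iterations } = workerData;
  const key = pbkdf2Sync(password, salt, iterations, 32, 'sha256');
  parentPort.postMessage(key.toString('hex'));
`;

// Resolves with the derived key while the main event loop stays free.
function deriveKeyInWorker(password, salt, iterations = 100_000) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { password, salt, iterations },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      // No effect if the promise has already resolved.
      if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
    });
  });
}
```

Spawning a worker has a startup cost, so this pays off for work measured in tens of milliseconds or more, not for small tasks.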

Health checks that actually check health

A health endpoint that returns { "status": "ok" } without checking dependencies is a liar. Build health checks that verify what matters:

app.get('/health', async (req, res) => {
  const checks = {};

  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch {
    checks.database = 'fail';
  }

  try {
    await redis.ping();
    checks.cache = 'ok';
  } catch {
    checks.cache = 'fail';
  }

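  // Informational fields: not strings, so the pass/fail filter below skips them.
  checks.uptime = process.uptime();
  checks.memory = process.memoryUsage();
  // `histogram` is the monitorEventLoopDelay() instance from the previous section.
  checks.eventLoopLag = histogram.percentile(99) / 1e6;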

  const healthy = Object.values(checks)
    .filter(v => typeof v === 'string')
    .every(v => v === 'ok');

  res.status(healthy ? 200 : 503).json(checks);
});

Your load balancer should hit this endpoint. When it returns 503, traffic should route away from that instance before users notice. Set health check intervals to 10-15 seconds — fast enough to catch failures, slow enough to avoid unnecessary load.
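
One caveat with a handler like the one above: if the database or Redis call hangs instead of failing, the health check hangs with it. A minimal sketch of a per-check timeout follows. withTimeout and runCheck are hypothetical helpers, and the 2-second default is an arbitrary assumption that should sit well below the load balancer's own timeout.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleaned up.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Runs one dependency check and maps any error or timeout to 'fail'.
async function runCheck(name, fn, ms = 2000) {
  try {
    await withTimeout(fn(), ms, name);
    return 'ok';
  } catch {
    return 'fail';
  }
}
```

Inside the handler this would be used as checks.database = await runCheck('database', () => db.query('SELECT 1')). Note that the timeout only stops waiting. It does not cancel the underlying query.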

Log aggregation

Structured logging is non-negotiable. console.log with string concatenation is a debugging tool, not a production logging strategy.

Use pino for Node.js: it is one of the fastest structured loggers available and outputs newline-delimited JSON by default. Pipe logs to a transport that ships them to your aggregation stack (ELK, Loki, Datadog, or even CloudWatch):

const pino = require('pino');
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  redact: ['req.headers.authorization', 'req.headers.cookie'],
});

The redact option is critical — never log auth tokens or session cookies. Correlation IDs should flow through every log line so you can trace a request across services.
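
One way to make correlation IDs flow without passing a logger through every function is AsyncLocalStorage from Node's async_hooks module, combined with a child logger per request. The sketch below assumes an Express-style (req, res, next) middleware and any logger exposing a child() method, which pino does. The x-correlation-id header name and both function names are assumptions for illustration.

```javascript
const { AsyncLocalStorage } = require('async_hooks');
const { randomUUID } = require('crypto');

// Holds per-request context for everything that runs inside the request.
const requestContext = new AsyncLocalStorage();

// Middleware factory: reuses an incoming correlation ID or mints a new one,
// echoes it on the response, and binds it to a child logger.
function correlationMiddleware(baseLogger) {
  return (req, res, next) => {
    const correlationId = req.headers['x-correlation-id'] || randomUUID();
    res.setHeader('x-correlation-id', correlationId);
    const log = baseLogger.child({ correlationId });
    requestContext.run({ correlationId, log }, next);
  };
}

// Returns the request-scoped logger, or `fallback` outside of a request.
function currentLogger(fallback) {
  const store = requestContext.getStore();
  return store ? store.log : fallback;
}
```

With app.use(correlationMiddleware(logger)) in place, code deep in the call stack can call currentLogger(logger).info(...) and every line carries the same correlationId. Forward the same header on outbound requests so downstream services can join the trace.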

How Reflex handles Node.js servers

The reflexd agent monitors Node.js processes the same way it monitors any Linux service: process health via systemd or PM2, memory and CPU trends, disk pressure, and log pattern detection. When a Node.js worker crosses memory thresholds or stops responding to health checks, the Brain can trigger a graceful restart via PM2's API — not a blind kill -9.

For teams running Node.js alongside PHP, Python, or Go on the same infrastructure, Reflex provides a single pane of glass that understands each runtime's failure modes without requiring per-language monitoring tools.

The reflexd agent also watches PM2's process list directly, detecting when workers enter a crash loop (restarting faster than min_uptime allows) and alerting before the exponential backoff delay masks the problem from your uptime checker.

Key takeaways

Node.js monitoring is not one tool — it is a stack of practices: PM2 for process supervision, heap snapshots for leak hunting, event loop delay for CPU pressure, honest health checks for load balancer integration, and structured logging for post-incident diagnosis. Get the fundamentals right and the exotic failure modes become rare. Skip them and you will learn what "event loop starvation" means the hard way.