March 2026 — Reflex self-healing in production

These figures are aggregated and anonymised across Reflex workspaces that opted into aggregate telemetry. They illustrate how the Brain and Pipeline behave together in the wild — not a guarantee for any single deployment.

Headline metrics

Median time to first repair hypothesis: 42 seconds after an alert fired (down from 58 seconds in February after the async evaluate rollout stabilised).
Playbook success without human SSH: 78% of OOM-kill incidents closed with an automated worker recycle + memory advisory, verified by a post-check metric window.
Deploy-correlated rollbacks: 11 auto-rollbacks triggered when post-deploy health checks failed or when the Brain correlated a fresh release with a spike in application error rates. All completed within the deployment timeout budget.

What worked well

Queue worker death detection continues to be the highest-volume automated path. Teams that enabled Supervisor-aware playbooks saw fewer false positives when workers were intentionally restarted during deploys.

What we are tightening next

Hosts with aggressive log rotation sometimes starve the agent of enough context to score deploy correlation confidently. We are expanding the default log tail window for PHP-FPM pools on Studio+ tiers.

Reflex never includes customer hostnames, repository names, or raw env values in these public summaries.