On-call for agents — the 3am playbook — Production Harnesses — observability, recovery, the bill

The move. Write the playbook before you need it. The 3am version of yourself cannot write it.

The 3am version of yourself is cognitively impaired. This is not a metaphor.

Sleep deprivation produces measurable degradation in decision quality, working memory, and the ability to reason under uncertainty.

If your on-call playbook requires novel problem-solving under time pressure with partial information, your on-call playbook is not a playbook — it is a hope.

A real playbook eliminates the novel problem-solving from the early phases of an incident.

The first ten minutes are not about root cause. They are about three simpler questions: how bad is this, who else needs to know, and what stops it from getting worse.

These questions have predetermined answers in a good playbook; the on-call engineer fills in the blanks from the current situation rather than inventing the framework under pressure.

This discipline is not new. Google's SRE book documented it; PagerDuty and Incident.io have built product categories around it.

What is new is the application of incident management to autonomous agent systems, where the failure modes are different and the failure channels are unfamiliar to teams that have previously run stateless services.

The main difference: in a traditional service incident, the service is either up or down. The blast radius is bounded by what the service does.

In an autonomous agent incident, the system may be running — it is just running in the wrong direction. It may be spending money, taking actions, and accumulating state while the on-call engineer is trying to understand what is happening.

The urgency of "stop the bleeding" is higher for agents than for stateless services.

The picture

A flowchart with a timer. The on-call receives a page. Arrow to the first decision diamond: "User-impacting?" If yes, proceed to triage. If no, log and investigate at human pace.

In the triage box (0–2 minutes): three questions with hard time limits. Is this bounded (affects one user or session) or expanding (affects all users or all sessions)? Is it actively spending money or taking irreversible actions? Is there a kill switch that stops the damage?

Arrow to the contain box (2–5 minutes): if yes to kill switch, trigger it. If no, throttle what you can — reduce agent concurrency, disable the affected workflow via feature flag, notify the team that a deploy may be needed. Document what you did and when.

Arrow to the communicate box (5–10 minutes): internal Slack message in the incidents channel: "Incident active. [Brief description]. [Mitigation in progress or deployed]. [Next update in 20 minutes]." If customer-visible, update the status page.
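The bracketed message above can be turned into a fill-in-the-blanks template so nobody composes it from scratch at 3am. The following is a minimal sketch, not a real integration: the StatusUpdate class and format_status function are hypothetical names, and posting to Slack or a status page is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """Fields the on-call engineer fills in; all names are illustrative."""
    description: str                 # what is happening, one sentence
    mitigation: str                  # containment in progress or deployed
    next_update_minutes: int = 20    # cadence from the message template


def format_status(update: StatusUpdate) -> str:
    """Render the incident-channel message: status only, no diagnosis."""
    return (
        f"Incident active. {update.description} "
        f"{update.mitigation} "
        f"Next update in {update.next_update_minutes} minutes."
    )


if __name__ == "__main__":
    print(format_status(StatusUpdate(
        description="Agent runs are retrying in a loop.",
        mitigation="Affected workflow disabled via feature flag.",
    )))
```

Keeping the template to three fields is a design choice: it leaves no room for a root-cause theory, which is what the communicate step is meant to avoid.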

Arrow to the debug box (10+ minutes): now you use the traces. Replay the incident if possible. Form a hypothesis. Test it. Implement a fix. Verify with monitoring.

Arrow to the resolve box: restore service, release the kill switch if it is still engaged, schedule the postmortem within 48 hours.
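The timer in this flowchart can be written down as data. The sketch below assumes only the phase boundaries stated above; the names PHASES, expected_phase, and next_step are hypothetical, and the function tells you which box the clock says you should be in, not what to do there.

```python
# Phase boundaries in minutes since the page, taken from the flowchart.
# An end of None marks the open-ended debug phase.
PHASES = [
    ("triage", 0, 2),
    ("contain", 2, 5),
    ("communicate", 5, 10),
    ("debug", 10, None),
]


def expected_phase(minutes_since_page: float) -> str:
    """Return the phase the flowchart expects at this point in the incident."""
    if minutes_since_page < 0:
        raise ValueError("minutes_since_page must be non-negative")
    for name, _start, end in PHASES:
        if end is None or minutes_since_page < end:
            return name
    raise AssertionError("unreachable: the last phase is open-ended")


def next_step(user_impacting: bool, minutes_since_page: float) -> str:
    """Apply the first decision diamond, then the timer."""
    if not user_impacting:
        # Non-user-impacting pages leave the timed path entirely.
        return "log-and-investigate"
    return expected_phase(minutes_since_page)


if __name__ == "__main__":
    print(next_step(True, 3))
```

If the clock says "communicate" and you are still forming a root-cause theory, that gap is the signal to stop and contain.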

Why it matters now

The 2024–2026 wave of autonomous agent systems has brought classical on-call discipline back to product teams that had previously gotten by without it. Single-prompt AI features failed in visible, bounded ways; agents fail in subtler, continuous ways that require the full incident-management response.

The other driver is that teams shipping agents are often smaller than the teams that have traditionally run on-call rotations. A two-person team shipping an agent product cannot afford to reinvent incident management during each incident. They especially cannot afford the productivity loss of poorly-run incidents — the ones that resolve eventually but leave no institutional learning, or the ones that resolve the wrong way because a tired engineer made a decision that should have been made by a written playbook.

A source you should trust

Google's SRE book, specifically the chapters on being on-call and on managing incidents, is the canonical reference. Its framing treats on-call as a professional discipline with explicit playbooks, explicit escalation paths, and blameless postmortems, and it condenses decades of operational engineering practice into one document. Most of it applies directly to agent systems with vocabulary substitution.

PagerDuty's incident runbook documentation and Incident.io's runbook templates are the applied versions. They are more specific than the SRE book and easier to adapt to a small team's context. The Incident.io templates in particular are designed for teams without a dedicated SRE function, which describes most teams shipping agent products.

A recipe

A 3am playbook template for an autonomous agent system. Write this out for your specific system before launch; the template is the skeleton, not the playbook.

Triage (0–2 minutes). Severity? Expanding or bounded? User-impacting? Actively spending money or taking irreversible actions? Assign a severity level: P1 (expanding, active cost or irreversible actions, user-impacting), P2 (bounded, no active cost, user-impacting), P3 (non-user-impacting). The severity level determines the rest of the playbook.
Contain (2–5 minutes). For P1: trigger the appropriate kill switch (global, feature, user, or instance — per Lesson 5). For P2: throttle or disable the affected workflow via feature flag. For P3: log and schedule investigation. Document the containment action with a timestamp.
Verify the production environment (2–5 minutes, in parallel with contain). Confirm you are looking at the right production environment. What URL is the monitoring stack reporting on? Is the trace you are reading from the right system? This is the lesson from the polish-foundation-sprint deploy-mismatch: the single most important diagnostic question in the first five minutes is "are we looking at the right production?"
Communicate (5–10 minutes). Post in the incident channel: what is happening, what containment has been applied, what the current state is, when the next update will come. If customer-visible, update the status page with a brief statement. Do not diagnose in public; communicate containment status.
Debug (10+ minutes). Open the trace for the affected session. Replay if the vendor surface is available. Form a hypothesis based on the span that diverged. Test the hypothesis with a fix. Verify with monitoring that the fix resolved the issue.
Resolve and follow up. Restore service. Release any kill switches that are still engaged. Schedule the postmortem within 48 hours, not "when we have time," because "when we have time" never arrives. The postmortem should happen while the incident is fresh.
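The triage and contain steps above amount to a small decision table, sketched below. The template does not say how to rate a user-impacting incident that is bounded but actively spending, or expanding without active cost; this sketch assumes those mixed cases escalate to P1. That assumption, along with every name here (Severity, TriageAnswers, classify, CONTAINMENT), is illustrative and not part of the template.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


@dataclass(frozen=True)
class TriageAnswers:
    user_impacting: bool
    expanding: bool                    # False means bounded to one user or session
    active_cost_or_irreversible: bool  # spending money or taking irreversible actions


def classify(answers: TriageAnswers) -> Severity:
    """Map the triage answers to a severity level."""
    if not answers.user_impacting:
        # Follows the template literally: non-user-impacting is P3.
        return Severity.P3
    # ASSUMPTION: the template leaves mixed cases undefined; this sketch
    # escalates to P1 whenever a user-impacting incident is expanding
    # or actively costly.
    if answers.expanding or answers.active_cost_or_irreversible:
        return Severity.P1
    return Severity.P2


# Containment action per severity, quoted from the contain step.
CONTAINMENT = {
    Severity.P1: "trigger the appropriate kill switch",
    Severity.P2: "throttle or disable the affected workflow via feature flag",
    Severity.P3: "log and schedule investigation",
}


if __name__ == "__main__":
    level = classify(TriageAnswers(True, False, False))
    print(level.value, "->", CONTAINMENT[level])
```

Deciding the mixed cases in advance, whichever way your team decides them, is the point: the on-call engineer should look the answer up, not argue it out.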

One step in this template is specific to the way agent systems fail: verifying the environment. The template runs it alongside containment, but do it as early in triage as you can, because before you can debug an agent failure you must confirm you are debugging the right system. Which URL is the monitoring reporting on? Which deploy is in production? This sounds redundant; it is the step that saves hours.
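The environment check can be reduced to a comparison the on-call engineer runs rather than reasons about. This is a minimal sketch: it assumes you can obtain both URLs as strings (how you get them depends on your monitoring and deploy tooling), it compares hostnames only, and the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def monitoring_matches_deploy(monitoring_url: str, deploy_target_url: str) -> bool:
    """Return True when monitoring watches the host that deploys go to.

    Hostnames are compared case-insensitively (urlsplit lowercases them);
    paths, ports, and schemes are ignored in this sketch.
    """
    monitored = urlsplit(monitoring_url).hostname
    deployed = urlsplit(deploy_target_url).hostname
    if monitored is None or deployed is None:
        raise ValueError(
            "both URLs must include a scheme and host, e.g. https://app.example.com"
        )
    return monitored == deployed


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    ok = monitoring_matches_deploy(
        "https://old-app.example.com/health",
        "https://staging-app.example.com",
    )
    print("monitoring target matches deploy target:", ok)
```

A False here does not explain the incident, but it tells you that a green dashboard cannot be trusted until the target is corrected.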

The smell of it going wrong

No playbook. The on-call engineer receives the page and begins reasoning about the failure from first principles. The first ten minutes are spent forming a theory about root cause. In the eleventh minute, the issue expands to affect all users because containment was not applied.

The team has no kill switch and must deploy to contain. A P1 incident at 3am requires a production deploy under pressure. The deploy takes twenty minutes. During those twenty minutes, the agent is continuing to take actions. The deploy has a typo that introduces a second bug.

Communication is "the engineer will figure it out and tell us later." The customer success team does not know there is an incident. They learn about it from a customer who is asking why their data looks wrong. The customer success team's response is "we don't know anything about this yet" — which is worse than "we are investigating a known issue."

Postmortems are not scheduled. The incident resolved at 4am. Everyone went back to sleep. In the morning, the adrenaline had faded and the schedule was full. The postmortem never happened. Three months later, the same failure mode recurred.

A judgment call from real work

PL's polish-foundation-sprint had a series of regressions that surfaced over a single afternoon. Talvinder's QA reports were coming in describing broken flows in the staging application. The reports were being read as intermittent issues or user error, not as symptoms of a systemic failure, because the monitoring stack showed green.

The root cause was that the monitoring was pointed at the wrong production environment — the stale Vercel instance, not the Fly staging app. But nobody in the incident response thought to verify this because the playbook had no "verify which production we are watching" step.

When someone finally asked the question, the answer was immediately actionable. The deploy-dev workflow had been failing on lockfile drift for days. No code was reaching the Fly staging app. The staging app that users were testing was the previous version. The regressions were real, and the monitoring that reported "everything is fine" was watching the wrong URL.

The playbook change that followed was a single line in the triage section: "Confirm which URL the monitoring stack is pointing at. Confirm it matches the most recent deploy target." The fix takes thirty seconds. The absence of it had cost a day of confused debugging.

The generalizing principle: autonomous agent incidents frequently involve environmental state that is not captured in the agent's traces. The agent's traces show what the agent did; they do not show what the environment looked like when it did it. Triage must actively verify environmental state, not assume it matches the traces.
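One way to make "actively verify, do not assume" concrete is to record the environment facts next to the trace and refuse to treat blanks as known. The sketch below is illustrative only: the fact names in REQUIRED_FACTS are assumptions chosen to mirror the story above, not fields of any real tracing product.

```python
# Environment facts that triage must confirm by hand. The names are
# hypothetical; replace them with whatever your system actually needs.
REQUIRED_FACTS = (
    "monitoring_url",
    "deploy_target_url",
    "deployed_version",
    "last_successful_deploy_at",
)


def unverified_facts(snapshot: dict) -> list:
    """Return the required facts that are missing or empty in the snapshot."""
    return [fact for fact in REQUIRED_FACTS if not snapshot.get(fact)]


if __name__ == "__main__":
    snapshot = {
        "monitoring_url": "https://staging-app.example.com",  # hypothetical
        "deployed_version": "",  # nobody has checked yet
    }
    print("still unverified:", unverified_facts(snapshot))
```

An empty list does not prove the environment is healthy; it only shows that someone looked at each fact instead of inferring it from the traces.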

Rules from this lesson

Write the on-call playbook before launch; the 3am version of yourself cannot invent incident management under pressure.
Contain before you debug — stop the damage first, even if containment feels blunt; damage that continues while you investigate is damage that compounds.
Verify the production environment as the first step of triage; the monitoring may be pointed at the wrong thing.
Communicate early and with status, not diagnosis; silence in an active incident makes the situation worse for every stakeholder downstream.