Harness Engineering: the PM manual

Imagine an agent you leave running at 11 p.m. It has a task, a set of tools, and access to your production environment. By 5 a.m. it is either done, or it is still running, or it has spent four hundred dollars making no progress on step three of twelve. The question is not whether the agent was capable. The question is whether the team built the cage that makes capability safe to use unsupervised.

That cage is harness engineering. It is the unglamorous half of agent work — the part the demo never shows. Every team that has shipped a working autonomous system eventually discovers that the system itself is not the hard problem. The hard problem is the scaffolding: how do you know it is making progress? How do you know when it is stuck? How do you stop it before it costs you a month's inference budget in a single night? How do you audit what it did? These are engineering problems with a product shape, and they are what separate a working agent from a demo that worked once.

What the harness is

A harness is not one thing. It is five layers, and every agent that runs in production needs all five. Skipping a layer does not make the agent simpler — it makes the failure mode invisible.

The loop. The agent loop is the skeleton: the scaffold that sequences a model call, receives a result, decides whether to call a tool, acts on the tool output, and decides whether to continue. Most of the papers call this "ReAct" (Reason + Act). What matters for a shipping team is not the academic framing — it is whether the loop has a stopping condition, a cost budget, a wall-clock limit, and a graceful failure path. A loop without those four is a liability, not a feature. The course Agent Loop Anatomy walks every decision in the loop from the first model call to the last.
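
The four guards named above can be sketched in a few lines. This is a minimal illustration, not the loop from the Agent Loop Anatomy course: the names `run_guarded`, `StepResult`, and `RunOutcome` are hypothetical, the default limits are placeholders, and `step` stands in for whatever your system does on one model-call-plus-tool-call iteration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    done: bool        # True when the agent reports the task complete
    cost_usd: float   # spend attributed to this step (model plus tools)


@dataclass
class RunOutcome:
    status: str       # "completed", "max_steps", "budget_exceeded", "timeout", "error"
    steps: int        # number of steps that actually ran
    cost_usd: float
    detail: Optional[str] = None


def run_guarded(
    step: Callable[[int], StepResult],
    max_steps: int = 50,            # stopping condition (placeholder value)
    budget_usd: float = 5.0,        # cost budget (placeholder value)
    wall_clock_s: float = 600.0,    # wall-clock limit (placeholder value)
    clock: Callable[[], float] = time.monotonic,
) -> RunOutcome:
    """Run `step` until it reports done or one of the guards trips."""
    started = clock()
    spent = 0.0
    for i in range(max_steps):
        if clock() - started > wall_clock_s:
            return RunOutcome("timeout", i, spent)
        try:
            result = step(i)
        except Exception as exc:
            # Graceful failure path: report what happened instead of crashing.
            return RunOutcome("error", i, spent, detail=repr(exc))
        spent += result.cost_usd
        if result.done:
            return RunOutcome("completed", i + 1, spent)
        if spent >= budget_usd:
            return RunOutcome("budget_exceeded", i + 1, spent)
    return RunOutcome("max_steps", max_steps, spent)
```

Every exit returns a status the caller can log and alert on, so a run that stalls on step three of twelve ends as "budget_exceeded" or "timeout" rather than as a surprise on the invoice.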

Evals. The eval suite is the immune system of the harness. It is how you know the loop is producing the right behaviour today, and how you know a change you made last Tuesday did not break something you checked last month. Evals for agents are harder than evals for single-shot prompts because the right output is a trace, not a string — a sequence of tool calls and results, not a sentence you can diff. But they are not optional. A team without an eval suite cannot make safe changes to their agent system; they can only make changes and hope. The course Eval Suites for Agents covers golden traces, automated pass/fail on traces, and LLM-as-judge for agent-shaped tasks.
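
To make "the right output is a trace" concrete, here is one possible shape for an automated pass/fail check against a golden trace. It is a sketch under assumptions: the trace format (a list of dicts with "tool" and "args" keys) and the function name `check_trace` are hypothetical, and a strict in-order comparison is only one of several reasonable matching policies.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical trace shape: [{"tool": "search", "args": {"q": "refund policy"}}, ...]
Call = Dict[str, Any]


def check_trace(
    actual: List[Call],
    golden: List[Call],
    ignore_args: Tuple[str, ...] = (),
) -> Tuple[bool, List[str]]:
    """Pass only if the same tools are called in the golden order with matching args."""
    problems: List[str] = []
    if len(actual) != len(golden):
        problems.append(f"expected {len(golden)} tool calls, got {len(actual)}")
    for i, (got, want) in enumerate(zip(actual, golden)):
        if got["tool"] != want["tool"]:
            problems.append(
                f"step {i}: expected tool {want['tool']!r}, got {got['tool']!r}"
            )
            continue
        for key, value in want["args"].items():
            if key in ignore_args:
                continue  # e.g. timestamps or request ids that vary between runs
            if got["args"].get(key) != value:
                problems.append(
                    f"step {i}: arg {key!r} expected {value!r}, "
                    f"got {got['args'].get(key)!r}"
                )
    return (not problems, problems)
```

The list of problems matters as much as the boolean: when a change you made last Tuesday breaks a trace, the message should say which step diverged.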

Tools and memory. Tools are what give the agent power: the verbs it can use in the world. Memory is what gives the agent continuity: the nouns its verbs act on. The tool design problem is about contracts: the schema must be precise enough that the model calls the tool correctly, but flexible enough that the model can compose it with other tools. The memory design problem is about what to persist across steps and what to throw away. Both are PM problems as much as engineering problems, because every tool you add expands the blast radius and everything you persist adds a maintenance surface. The courses Tool Design for Agents and Memory and Context Management each take one half of this layer.
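
A tool contract can be as small as a declared parameter list that is checked before anything executes. The sketch below assumes a hypothetical `ToolSpec` shape and a hypothetical `refund` tool; real function-calling APIs use their own schema formats, and this stands in for none of them.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    params: Dict[str, type]            # parameter name -> expected Python type
    required: frozenset = frozenset()  # names that must be present
    side_effects: bool = False         # True means the tool changes the world


def validate_call(spec: ToolSpec, args: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means the call is valid."""
    errors: List[str] = []
    for name in sorted(spec.required - set(args)):
        errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in spec.params:
            errors.append(f"unknown argument: {name}")
        elif not isinstance(value, spec.params[name]):
            errors.append(
                f"{name}: expected {spec.params[name].__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


# Hypothetical example tool, for illustration only.
REFUND = ToolSpec(
    name="refund",
    description="Issue a refund for an order.",
    params={"order_id": str, "amount_cents": int},
    required=frozenset({"order_id", "amount_cents"}),
    side_effects=True,
)
```

The `side_effects` flag is the product half of the contract: it gives a reviewer one place to see which tools widen the blast radius.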

Orchestration. When you have more than one agent, you have an orchestration problem. Who drives? Who reports? How does context flow from one agent to the next without either exploding the cost or losing information? This is the layer where multi-agent architectures go wrong most often — not because the agents are bad, but because the coordinator either passes too much context (expensive, slow) or too little (the agents work at cross purposes without knowing it). The course Orchestration Patterns for Multi-Agent Systems covers the patterns that work and the ones that look good on whiteboards.

Production. The production layer is where observability, cost accounting, rate limiting, and human-in-the-loop gates live. This is the layer most teams build last and most regret not building first. The signals you need — trace-level logs, per-run cost, latency distribution, error classification — are not nice-to-have once you are in production. They are the only way to answer the question your on-call team will eventually ask: "what is it doing right now, and how do we stop it?" The course Production Observability for Agent Systems is the entrance fee for shipping to real users.
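
As a rough illustration of the signals listed above, the sketch below rolls per-run records into cost, latency, and error-class summaries. The `RunRecord` fields, the error class names, and the nearest-rank percentile are all assumptions made for the example; in practice these numbers usually come from whatever tracing backend you already run.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    run_id: str
    cost_usd: float
    latency_s: float
    error_class: Optional[str] = None  # e.g. "rate_limit"; None means success


def nearest_rank(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(runs: List[RunRecord]) -> Dict[str, object]:
    """Aggregate per-run cost, latency distribution, and error classification."""
    if not runs:
        raise ValueError("no runs to summarize")
    latencies = sorted(r.latency_s for r in runs)
    costliest = max(runs, key=lambda r: r.cost_usd)
    return {
        "runs": len(runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "costliest_run": costliest.run_id,
        "p50_latency_s": nearest_rank(latencies, 50),
        "p95_latency_s": nearest_rank(latencies, 95),
        "errors": dict(Counter(r.error_class for r in runs if r.error_class)),
    }
```

Keeping `costliest_run` in the summary is a deliberate choice: the on-call question is rarely "what is the average" and usually "which run is the one burning money".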

Why this is hard now (2025-2026)

The difficulty of harness engineering is not constant. In 2022, an agent ran for seconds. By 2025, Claude Code ran for minutes on a non-trivial engineering task. By 2026, the teams that know what they are doing are running agents for hours. That is not just a quantitative change; it is a qualitative one. A bug that costs you one wrong sentence in a two-second run costs you forty wrong tool calls in a four-hour run, and you will not find it until the trace is already three hundred steps long.

Cost follows the same curve. A two-second agent run costs cents. A four-hour run with tool calls costs dollars. An accidentally unguarded loop can cost hundreds before anyone notices. SWE-bench, the benchmark family the labs use to score coding agents (most often through its Verified subset), reports success rates that have climbed from 12% in early 2024 to above 50% in late 2025. The benchmark measures whether the agent solved the task. It does not measure whether it solved it cheaply, whether it was auditable, or whether it would behave the same way if you ran it again next week. METR's autonomy evaluations are the closest thing to a production-realism benchmark the field has: they measure multi-hour, multi-step tasks in realistic settings, and the honest finding is that even the best systems fail in ways that are hard to predict before the run.

The failure mode has shifted. In 2023, the question was "did the model give a wrong answer." In 2026, the question is "did the loop spend four hundred dollars making no progress." Cognition's published thinking on multi-agent systems — the essay sometimes cited as "don't build multi-agents" — is not a blanket warning against orchestration. It is a warning against orchestration you have not measured. Anthropic's multi-agent guidance takes the same position from the other side: multi-agent patterns work when the subagent boundaries map to actual task decomposition, not when they are chosen because two agents sounds more impressive than one.

The pattern that keeps appearing in the teams that ship: they start with one agent, they know exactly what it costs per run, they have an eval set that runs in under two minutes, and they add a second agent only when they can explain precisely what the second agent buys them that the first cannot.

The five rules

Reading path from here

The Harness Engineering path at /paths/harness-engineering sequences the five courses above in the right order. Read chapter 6 of this manual (Tool Use, Function Calling, Agents) first if you have not — it gives you the six-rung ladder that frames what kind of agent you are building and whether you need a harness at all.

If you have two hours: Read chapter 6 of the manual, then the Agent Loop Anatomy course introduction. The combination will tell you which rung your system sits on and whether harness-1 through harness-5 apply to you today or in six months.

If you have twenty hours: Work through all five courses in order. Each one ends with a field exercise that uses your actual system as the test case. By the end, you will have a stopping-condition spec, a twenty-example eval suite, a tool-schema review, an orchestration decision, and a production checklist — all calibrated to the thing you are actually building.

Anchor stories

The subagent-worktree pattern. PL runs parallel PR work using subagents in isolated git worktrees: one branch, one agent, one task, each running concurrently. The architecture is clean on paper. In practice, we ran into a 429 quota incident: twelve subagents launched within a minute, all hitting the same Anthropic API key, all burning from the same quota pool. The work looked fine, since each agent was running, but the rate-limited requests were silently retrying, burning tokens on retries, and some agents stalled mid-task without surfacing the failure clearly. The lesson was harness-1 in practice: cap-and-stagger. No more than three agents concurrent per quota pool, with a minimum sixty-second stagger on launch. The stopping condition was not only "task complete"; it was also "quota safe." Building that constraint into the orchestration layer before launch would have taken twenty minutes. Discovering it in production cost us two hours and a re-run of six branches.
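
Cap-and-stagger is small enough to sketch. What follows is an illustration of the idea using Python's asyncio, not PL's actual orchestration code: `run_capped` is a hypothetical name, each task stands in for launching one subagent, and the sketch does nothing about 429 retries themselves.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def run_capped(
    tasks: Sequence[Callable[[], Awaitable[T]]],
    max_concurrent: int = 3,     # cap per quota pool, as in the story above
    stagger_s: float = 60.0,     # minimum gap between launches
) -> List[T]:
    """Launch tasks in order, never more than `max_concurrent` at once."""
    slots = asyncio.Semaphore(max_concurrent)

    async def guarded(task: Callable[[], Awaitable[T]]) -> T:
        try:
            return await task()
        finally:
            slots.release()      # free the slot even if the agent fails

    running = []
    for i, task in enumerate(tasks):
        if i:
            await asyncio.sleep(stagger_s)   # stagger every launch after the first
        await slots.acquire()                # wait until the pool has room
        running.append(asyncio.create_task(guarded(task)))
    return await asyncio.gather(*running)
```

With twelve tasks and the defaults, the last launch happens at least eleven minutes after the first, which is the trade: slower starts in exchange for a quota pool that is never hit by twelve agents in the same minute.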

The Ostronaut eval suite as institutional memory. The retrieval system at Ostronaut went through two generations. The first was what we called vibes-RAG: retrieve some chunks, eyeball the results, declare it good enough. The second came after we built a metric called orphan_gap_pct — the fraction of content chunks that never appeared in any retrieval result across a thousand representative queries. The number was 36%. A third of the knowledge base was invisible to the system. We had been eyeballing good results because we were eyeballing the chunks the system liked, not the ones it was ignoring. The eval suite turned the question "is the retrieval good?" into a number we could track, gate on, and regress against. When we switched chunking strategies to reduce orphan_gap_pct, the suite was the thing that told us the new strategy worked — and the thing that caught, two iterations later, a subtle regression where heading-adjacent content was being split badly. The eval set did not prevent the regression. It caught it before production. That is what harness-2 means in practice: the eval suite is not a pre-launch check. It is the memory that outlives the original engineer's understanding of the system.
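
The metric is simple to state in code. This is a reconstruction from the description above, not Ostronaut's implementation: the function signature, the `retrieve` callable, and the chunk-id representation are assumptions.

```python
from typing import Callable, Iterable, List, Set


def orphan_gap_pct(
    all_chunk_ids: Iterable[str],
    queries: Iterable[str],
    retrieve: Callable[[str], List[str]],   # hypothetical: query -> retrieved chunk ids
) -> float:
    """Percentage of chunks that no query in the set ever retrieves."""
    corpus: Set[str] = set(all_chunk_ids)
    if not corpus:
        raise ValueError("corpus is empty")
    seen: Set[str] = set()
    for query in queries:
        seen.update(retrieve(query))
    orphans = corpus - seen   # chunks the system never surfaced
    return 100.0 * len(orphans) / len(corpus)
```

Run over a fixed, representative query set, the result becomes a number you can gate on in CI, for example by failing the build when it rises above the last accepted value.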

The PL polish-sprint monitoring incident. During a concentrated shipping sprint, PL ran Checkly synthetic monitors against every environment — dev, staging, prod. The monitors were set up correctly at the time. Three weeks later, after a URL restructuring, the monitors were still running, still green — against URLs that had been redirected or were no longer the canonical path. The monitors were not wrong; they were watching the wrong thing. The post-mortem lesson was not "set up better monitors." It was: the URL list that monitoring tools watch is a maintained artifact, not a one-time configuration. It needs the same change-management discipline as the code it watches. When a URL changes, two things change: the code and the monitor. Skipping the second is how you end up with a green dashboard the night a real regression sits undetected. This is harness-5 in its most concrete form: production observability is not the system you set up at launch. It is the system you maintain as the product changes.
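
One way to treat the URL list as a maintained artifact is to diff it against the canonical routes on every deploy. The sketch below assumes you can produce both lists as plain strings (for instance from a sitemap and from your monitor configuration); it does not call Checkly or any other monitoring API, and `monitor_drift` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def monitor_drift(
    canonical_urls: Iterable[str],
    monitored_urls: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare what the product serves with what the monitors watch."""
    canonical = set(canonical_urls)
    monitored = set(monitored_urls)
    return {
        # Live pages that no monitor is watching.
        "unmonitored": sorted(canonical - monitored),
        # Monitors pointed at URLs that are no longer canonical.
        "stale": sorted(monitored - canonical),
    }
```

A non-empty "stale" list is the green-dashboard failure from the story: the monitor passes, but against a path the product no longer treats as real.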