Cost meters — the unit economics of autonomy — Production Harnesses — observability, recovery, the bill

The move. Treat cost-per-run the way SaaS treats CAC and conversion — as a number on the dashboard you can recite from memory.

There is a recurring pattern in teams that ship their first autonomous agent.

For the first few weeks, the team is watching the product behavior — whether the agent is doing the right thing, whether users are happy, whether the failure modes are what they expected.

Cost is an afterthought because the usage volume is low and the billing portal shows a number that feels manageable.

Then the product gets traction. Or a test run goes long. Or a background job is misconfigured.

And the number in the billing portal is no longer manageable.

The pattern is predictable because the economics of autonomous agents differ from single-prompt features by an order of magnitude.

A single-prompt feature call costs somewhere between a fraction of a cent and a few cents. An autonomous agent session costs dollars. A six-hour coding agent run with frequent tool calls and multi-step planning can cost ten to a hundred dollars.

The difference is not incremental. It is a different mental model for what "this is getting expensive" means.

The fix is to treat cost as a first-class dashboard metric from day one, the same way a SaaS business treats customer acquisition cost and conversion rate.

These are numbers you know from memory because you look at them every day.

The picture

A cost dashboard with four panels.

Top left: cost-per-run, shown as a line chart over time with mean and p95 highlighted. A dotted alert threshold line.

Top right: cost-per-user-per-month, same treatment.

Bottom left: cost-per-task-type — a bar chart showing the mean cost for each workflow the agent performs. This is where cross-subsidization becomes visible: some task types cost ten cents, others cost five dollars.

Bottom right: cost trajectory, week-over-week percentage change. A week-over-week increase above 20% triggers a review.

Below the four panels: a ranked table of the ten most expensive runs this week, each row linking to the trace for that run.

This is the triage surface. When something is wrong, this table is where you find it.

Why it matters now

The economics of autonomous agents in 2026 differ from single-prompt features by an order of magnitude.

A six-hour autonomous run can cost ten to a hundred dollars. A science-fair project that runs fifty such sessions a day costs five hundred to five thousand dollars daily.

If cost is checked in the billing portal at month-end, the month-end number can be a shock.

The model-selection lever compounds this. Using a frontier model where a smaller model would have worked costs roughly ten times as much per token.

Teams that default to the largest available model out of habit — what the PL model-selection guide calls "Opus by habit" — are paying a premium that accumulates invisibly until the bill arrives.

A source you should trust

AI Manual Chapter 9 covers cost and latency as product constraints in the single-prompt context. The framing — cost as a design parameter, not an operational afterthought — extends directly to autonomous agents, with larger numbers.

Aider's cost-tracking writeups show what operator-grade cost discipline looks like in a shipping product. Aider tracks cost per session, surfaces it to users, and publishes postmortems on cost regressions caused by model changes and prompt changes. The discipline is visible in the product itself: users can see what their sessions cost in real time.

A recipe

A four-number cost dashboard every production agent should publish from day one:

Cost-per-run, mean and p95. The mean tells you the typical cost; the p95 tells you what the expensive outliers look like. A p95 that is ten times the mean suggests runaway cases you need to investigate.
Cost-per-user-per-month, mean and p95. This is your unit economics number. If your subscription price is ₹999/month and your p95 cost-per-user-per-month is ₹800, you have a margin problem.
Cost-per-task-type. Break the aggregate down by workflow. This is where you discover that the "summarize this document" workflow costs one cent and the "refactor this codebase" workflow costs four dollars. You cannot optimize what you cannot see.
Cost trajectory, week-over-week. A flat trajectory means the system is behaving consistently. An upward trajectory means something changed — more usage, longer sessions, a model upgrade, a prompt change that increased token consumption. Investigate any week-over-week change above 15%.

Set alert thresholds for each number before launch. The threshold for a single run should trigger a review, not a page. The threshold for daily total should page someone.

The smell of it going wrong

Cost is checked in the billing portal at month-end. The first time the team looks at cost for a given month is when the invoice arrives. By then, the session that caused the spike is three weeks old and the trace may have expired.

The team cannot recite cost-per-user-per-month from memory. This is the single most reliable signal that cost is not treated as a dashboard metric. Ask the team lead: what is your p95 cost-per-user-per-month? If the answer is a pause and a browser tab, cost is not on the dashboard.

A 10x cost incident is discovered when the invoice arrives. A test run was misconfigured to loop indefinitely. It ran for eighteen hours before someone noticed the process was still active. The cost for that run was visible in the billing portal the next morning, but nobody looked.

Cost is reported as an aggregate. The pipeline spent $200 this week. Which workflow? Which model? Which user? Unknown. Aggregate cost hides the signal; per-task-type cost surfaces it.

A judgment call from real work

PL's content-pipeline runs — course-content scoring, fit scoring, lesson enrichment, video generation — require explicit per-run budget caps. The model-selection guide in the global rules is the operational expression of this discipline: "default to mini-tier; escalate only when justified." The guide exists because escalation without justification was the default pattern before it was made explicit.

The cost incidents that motivated the rule were not dramatic. They were quiet. Over several weeks, a series of pipeline runs had been wired to use the frontier model because it was the model the team had tested with. Nobody revisited the model choice once the initial testing was done. The smaller model would have cleared the quality bar for most of the tasks; the frontier model was used because it was already in the configuration.

When someone ran the cost comparison, the gap was roughly 8x per run. The pipeline was running hundreds of times per week. The delta over a month was significant enough to fund a part-time contractor.

The lesson is not "always use the cheapest model." It is "the model choice is a cost decision, make it explicitly, and revisit it when usage scales." Mini-tier clears the bar for classification, summarization, and routine scoring. Frontier-tier earns its cost for tasks where the quality difference is measurable and product-critical. The distinction requires looking at the cost dashboard and the eval scores together.

The next lesson covers kill switches — the mechanism for stopping a cost spike mid-run rather than discovering it in the billing portal. Cost meters tell you when something is wrong; kill switches let you stop it. Both are necessary; neither is sufficient alone.

Rules from this lesson

Cost belongs on a dashboard, not in a billing portal; if you check it monthly, you will be surprised monthly.
Four numbers — per-run p95, per-user-month p95, per-task-type breakdown, week-over-week trajectory — cover the decisions that matter.
Default to the smallest model that clears the quality bar; escalation to a larger model requires an explicit decision backed by eval evidence, not habit.
Set alert thresholds before launch; discovering cost incidents via the invoice means the traces are probably gone.
Per-task-type breakdown is the diagnostic number; aggregate cost hides which workflow is the expense driver.