The Ostronaut batch pipeline — a worked example of a long-running agentic system — Production Harnesses — observability, recovery, the bill

The move. Read one real long-running pipeline end to end. The discipline is in the details.

Everything in this course has been building toward a working example.

The lessons on tracing, eval logging, cost meters, kill switches, checkpoints, replay, observability stacks, on-call playbooks, and postmortems are individually useful.

They are most useful when you can see them operating together in a system that ships behind a real product.

Ostronaut's batch pipeline is that system.

It is not a demo. It is not an architecture diagram from a conference talk. It is a long-running agentic pipeline that processes content, embeds it into named vector spaces, gates it on quality metrics, and promotes it to a production index used by real learners.

It has incidents in its history. It has a growing eval suite. It has postmortems that produced real code changes.

Reading it is the closest this course gets to watching an experienced operator work.

The picture

Six stages arranged in a pipeline diagram, each with three labels: what it does, what it emits, and what production failure it guards against.

Stage 1: Source ingest. Pulls content from the MDX source files, extracts frontmatter and body, normalizes to a consistent format. Emits: a structured content record per lesson with source hash, version, and content type. Guards against: content that changed upstream and was not detected, duplicate ingestion.
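
A minimal sketch of the ingest record and its change detection. It assumes a SHA-256 hash over the normalized body. The names ContentRecord, make_record, and needs_ingest are hypothetical illustrations, not Ostronaut's actual code.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentRecord:
    lesson_id: str
    version: int
    content_type: str
    source_hash: str
    body: str


def normalize(body: str) -> str:
    # Unify line endings and drop trailing whitespace so that cosmetic
    # edits do not change the hash.
    lines = [line.rstrip() for line in body.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip()


def make_record(lesson_id: str, version: int, content_type: str, body: str) -> ContentRecord:
    normalized = normalize(body)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return ContentRecord(lesson_id, version, content_type, digest, normalized)


def needs_ingest(record: ContentRecord, seen_hashes: dict) -> bool:
    # seen_hashes maps lesson_id -> hash of the last ingested body.
    # An unchanged hash means a duplicate; a different hash means the
    # content changed upstream and must be re-ingested.
    return seen_hashes.get(record.lesson_id) != record.source_hash
```

Comparing hashes instead of timestamps is the design choice here: a file that was touched but not changed is skipped, and a file that changed is never missed.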

Stage 2: Chunking. Splits content into semantic units appropriate for the embedding model's context window. The chunk boundary strategy affects retrieval quality significantly — cuts in the wrong place create orphaned sections. Emits: chunks with parent ID, position, and overlap metadata. Guards against: context-window overflow, orphaned chunks that can be embedded but never retrieved.
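
A sketch of the chunk record with parent ID, position, and overlap metadata. It uses fixed-size word windows as a stand-in for a real semantic boundary strategy, which the lesson does not specify. Chunk and chunk_words are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    parent_id: str
    position: int
    overlap: int  # words shared with the previous chunk
    text: str


def chunk_words(parent_id: str, text: str, size: int, overlap: int) -> list[Chunk]:
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for position, start in enumerate(range(0, len(words), step)):
        window = words[start:start + size]
        # The first chunk has no predecessor, so it shares nothing.
        shared = overlap if position else 0
        chunks.append(Chunk(parent_id, position, shared, " ".join(window)))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap parameter controls chunk count directly. A larger overlap means a smaller step, more chunks for the same text, and more embedding calls downstream.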

Stage 3: Embedding. Calls the embedding model per chunk. Records cost per chunk, model version, and embedding vector. Emits: embedding records into the staging vector store. Guards against: embedding model version drift (same text produces different vectors when the model changes).

Stage 4: Retrieval-quality gate. Runs the eval suite against the staging index. The primary metric is orphan_gap_pct: across a representative set of queries, the percentage of source content that should be retrieved but is not. The pipeline proceeds only when orphan_gap_pct is at or below 15%. Emits: a gate result (pass/fail) with the specific queries and the orphan sections for each failing query. Guards against: bad embedding quality shipping to production.
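
A sketch of the gate as a blocking check. It assumes orphan_gap_pct is pooled across all queries (total missed sections over total expected sections); the lesson does not say whether the real metric is pooled or averaged per query. GateResult and retrieval_quality_gate are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    orphan_gap_pct: float
    passed: bool
    orphans_by_query: dict = field(default_factory=dict)


def retrieval_quality_gate(expected: dict, retrieved: dict, threshold_pct: float = 15.0) -> GateResult:
    """expected and retrieved both map a query to a set of section IDs."""
    total_expected = 0
    total_missed = 0
    orphans_by_query = {}
    for query, should_retrieve in expected.items():
        missed = should_retrieve - retrieved.get(query, set())
        total_expected += len(should_retrieve)
        total_missed += len(missed)
        if missed:
            # Keep the failing queries and their orphan sections so the
            # gate result is diagnosable, not just a number.
            orphans_by_query[query] = sorted(missed)
    if total_expected == 0:
        raise ValueError("eval set has no expected sections")
    pct = 100.0 * total_missed / total_expected
    return GateResult(pct, pct <= threshold_pct, orphans_by_query)
```

The caller is expected to stop the pipeline when passed is False. Logging the result and continuing would turn the gate into an advisory metric.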

Stage 5: Eval-driven promotion. If the quality gate passes, the pipeline promotes the staging index to the production index. The promotion is atomic: the production index is swapped, not patched, so there is no state where some content is from the new batch and some from the old. Emits: a promotion event with version, timestamp, and gate scores. Guards against: partial deployments that leave the index in a mixed state.
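
A sketch of swap-not-patch promotion using an in-memory alias. IndexRegistry is a hypothetical stand-in; a real vector store would provide its own alias or swap mechanism, which the lesson does not name.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionEvent:
    version: str
    timestamp: float
    gate_scores: dict


class IndexRegistry:
    """Readers resolve the production index through a single alias."""

    def __init__(self):
        self._indexes = {}       # version -> fully built index
        self._production = None  # version currently serving traffic

    def register(self, version, index):
        if version in self._indexes:
            # Existing versions are never modified in place.
            raise ValueError(f"index {version!r} already exists; build a new version")
        self._indexes[version] = index

    def promote(self, version, gate_passed, gate_scores):
        if not gate_passed:
            raise RuntimeError("quality gate failed; refusing to promote")
        if version not in self._indexes:
            raise KeyError(version)
        # One reference reassignment: readers see the old index or the
        # new one, never a mixture of both.
        self._production = version
        return PromotionEvent(version, time.time(), dict(gate_scores))

    def production_index(self):
        if self._production is None:
            raise LookupError("nothing has been promoted yet")
        return self._indexes[self._production]
```

Because old versions stay registered, rolling back is the same operation as promoting: point the alias at the previous version.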

Stage 6: Continuous monitoring loop. After promotion, a sampling-based quality check runs on a schedule, pulling live query logs and comparing retrieval results against the golden eval set. An alert fires when the quality score degrades past the threshold. Emits: quality score per day. Guards against: silent drift caused by usage-pattern shifts or upstream data changes.

Why it matters now

Most published agentic systems are either toy examples (short single-prompt features in a notebook) or research systems (impressive capabilities, unclear production behavior).

The Ostronaut pipeline is neither.

It processes production content for a live product, it has a history of incidents, and it has evolved under real operational pressure.

Reading it is valuable because the design choices it makes are the choices that come from operating something in production, not from designing something in advance.

The quality gate exists because a batch without it shipped a 36% orphan rate to production.

The atomic promotion exists because a partial promotion caused a retrieval failure that took four hours to diagnose.

The cost tracking per stage exists because an early version of the pipeline obscured which stage was driving cost growth.

These are not decisions a team makes on day one. They are decisions a team makes after incidents.

A source you should trust

Ostronaut's internal architecture documentation describes the pipeline design at a level of abstraction suited to understanding why each component exists. The publicly shareable elements are the design principles: quality gates as blocking checks, atomic promotion, per-stage cost tracking, and eval-suite growth as institutional memory.

The Ostronaut commit history for the retrieval-quality eval set is the audit trail. Each new eval entry in the history corresponds to an incident or a near-miss. Reading the commit messages is reading the operational history of the system in compressed form.

A recipe

A long-running-pipeline checklist derived from Ostronaut's evolution:

Every stage emits structured traces. Not log lines — structured spans with stage name, input hash, output shape, latency, and cost. The ability to diagnose a pipeline failure depends on whether the failed stage is visible in the trace.
Every stage has a quality gate. A gate is a metric with a threshold and a pass/fail decision. If the output of a stage does not meet the threshold, the pipeline stops. A metric that is only advisory stops nothing, so bad output propagates to the next stage.
Every quality gate has a named owner. Not "the team." A person who reviews gate failures, updates thresholds when the product changes, and is responsible for the gate's health.
Every postmortem updates a quality gate. The gate threshold or the gate query set is the first thing to update after a retrieval incident. If the incident would have been caught by a tighter threshold or a new query, the gate should reflect that before the next batch runs.
Cost is tracked per stage, not per pipeline. An aggregate cost hides which stage is driving spend. When the pipeline's cost grows week-over-week, the per-stage breakdown is where the diagnosis starts.
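
The first and last items of the checklist can be sketched together: a structured span carrying the fields named above, and a per-stage cost rollup built from those spans. StageSpan, emit, and cost_by_stage are hypothetical names, and the JSON-lines output is an assumption about the trace format.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StageSpan:
    stage: str
    input_hash: str
    output_shape: str
    latency_s: float
    cost_usd: float


def emit(span: StageSpan) -> str:
    # One JSON object per span, so a failure can be found by filtering
    # on fields instead of reading free-text log lines.
    return json.dumps(asdict(span), sort_keys=True)


def cost_by_stage(spans) -> dict:
    # Sum cost per stage so growth can be traced to the stage causing it.
    totals = defaultdict(float)
    for span in spans:
        totals[span.stage] += span.cost_usd
    return dict(totals)
```

Because cost lives on the span, the per-stage breakdown needs no separate accounting system. It is a query over traces the pipeline already emits.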

The smell of it going wrong

Stages emit logs but not structured traces. The pipeline log is a stream of INFO messages with timestamps and stage names. When a quality failure occurs in stage 4, the team reads the log looking for the chunk that failed. This takes an hour when it should take five minutes.

Quality gates are advisory, not blocking. The orphan_gap_pct metric is logged. The threshold is documented. But the pipeline continues past the gate whether the metric passes or not, because "we want to see what happens." What happens is that bad output ships to production and users see a retrieval failure.

Quality-gate ownership is implicit. The gate was set up by the engineer who wrote the embedding pipeline. They left the team. The gate has not been updated in eight months. The threshold was correct for the content volume when it was set; the content volume has tripled since then, and the threshold is now too lenient.

Cost is aggregated across stages, hiding the expense driver. The pipeline spent $180 this week. Which stage? Unknown without querying the traces manually. Stage 3 (embedding) was responsible for $140 of it because the chunking strategy in stage 2 had been changed without adjusting the overlap parameter, producing twice as many chunks as intended.

A judgment call from real work

The Ostronaut pipeline's evolution from vibes-RAG to measurable quality gates happened over roughly four months and three P0 incidents.

The first state was functional but invisible: content was ingested, embedded, and promoted to production without any quality measurement. Retrieval worked well enough that nobody investigated further. The system had no concept of orphan sections, no quality gate, and no eval suite.

The first P0 incident: a course module was updated with a restructured heading hierarchy. The chunking strategy split the new structure at section boundaries that were different from the old structure. Several dozen sections were embedded but never appeared in retrieval results for reasonable queries. Users reported that questions about specific topics were not being answered well. The diagnosis took eight hours because there was no quality gate and no trace per stage — the team was reading embeddings vector-by-vector to identify which ones were orphaned.

The fix introduced the orphan_gap_pct metric and the quality gate. The threshold was set at 20% (lenient, to allow the current pipeline to pass while the chunking strategy was refined).

The second P0 incident: a model upgrade changed the embedding dimensions. The new embeddings were not comparable to the old ones in the existing index. The promotion was not atomic; it patched the existing index with new vectors for new content while leaving old vectors for unchanged content. The result was an index with mixed embedding generations, producing incoherent retrieval results. Diagnosis took four hours.

The fix introduced atomic promotion (full index swap, never patch) and embedding version tracking per chunk.
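
A sketch of the check that per-chunk version tracking makes possible: refuse any candidate index whose vectors come from more than one model version or dimension. EmbeddingRecord and check_single_generation are hypothetical names, and the model names in the usage are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingRecord:
    chunk_id: str
    model_version: str
    vector: tuple


def check_single_generation(records):
    """Return (model_version, dimension), or raise if generations are mixed."""
    generations = {(r.model_version, len(r.vector)) for r in records}
    if not generations:
        raise ValueError("no embedding records")
    if len(generations) > 1:
        raise ValueError(f"mixed embedding generations: {sorted(generations)}")
    return next(iter(generations))


# Usage: a patched index that kept old vectors fails the check.
patched = [
    EmbeddingRecord("c1", "model-a", (0.1, 0.2)),
    EmbeddingRecord("c2", "model-b", (0.1, 0.2, 0.3)),
]
try:
    check_single_generation(patched)
except ValueError as err:
    print(err)
```

Run before promotion, this turns the second incident from a four-hour diagnosis into an immediate, named failure.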

The third P0 incident: a cost spike. The pipeline ran on an unusually large batch. The per-pipeline cost was visible in the billing portal. The per-stage breakdown was not visible. Stage 3 (embedding) had been using a larger embedding model for three weeks because someone had updated the model name during a test and not reverted it. The cost was five times the expected amount. The fix introduced per-stage cost tracking and a stage-level cost alert.
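
A sketch of a stage-level cost alert. It assumes each stage has an expected cost per batch on record and alerts when actual spend exceeds a multiple of it. The function name, the default ratio of 2.0, and the dollar figures in the usage are all hypothetical.

```python
def stage_cost_alerts(actual: dict, expected: dict, max_ratio: float = 2.0) -> list:
    """actual and expected map stage name -> cost in USD for one batch."""
    alerts = []
    for stage, spent in sorted(actual.items()):
        budget = expected.get(stage)
        if budget is None:
            # A stage with no baseline is itself worth flagging.
            alerts.append(f"{stage}: no expected cost on record")
            continue
        if budget <= 0:
            raise ValueError(f"expected cost for {stage!r} must be positive")
        if spent > budget * max_ratio:
            alerts.append(
                f"{stage}: ${spent:.2f} is {spent / budget:.1f}x the expected ${budget:.2f}"
            )
    return alerts


# Usage: embedding spend at five times its baseline trips the alert.
alerts = stage_cost_alerts(
    actual={"chunking": 1.0, "embedding": 100.0},
    expected={"chunking": 1.0, "embedding": 20.0},
)
```

An alert of this kind would have fired on the first batch after the model name changed, not three weeks later.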

By the end of the third incident, the eval suite had grown from zero to nineteen entries, the orphan_gap_pct threshold had tightened to 15%, and per-stage cost tracking was a blocking requirement for any new pipeline stage. The institutional memory — stored in the eval suite — was doing its job.

Rules from this lesson

Long-running pipelines need structured tracing per stage; aggregate traces hide the failed stage and extend every diagnosis.
Quality gates need named owners and explicit thresholds, and they must block; an advisory metric lets failures propagate downstream.
Cost attribution per stage surfaces the expense driver; aggregate cost hides it, and the expense driver is almost never where you expect it.
Atomic promotion (full swap, never patch) prevents mixed-state indexes that produce incoherent behavior.