The Ostronaut named-vector retrieval harness: a worked example
Eval Harnesses: how you know your agent isn't lying to itself

Read a real harness end to end. Theory hardens fastest against working code.

The previous seven lessons built the vocabulary and the principles. This lesson applies them to a concrete production system. The Ostronaut retrieval harness is not a teaching example constructed to be clean — it is a system that was built wrong, broke, got measured, got rebuilt, and is now in production. The messiness is the point.

The picture

The Ostronaut harness architecture, component by component.

Ingest: content arrives as markdown documents — lesson prose, case studies, manual chapters. The ingestion step handles format normalization, metadata extraction, and routing to the chunker.

Chunking: the chunker splits documents into retrievable units. This is where the first production failure originated. Early chunking was paragraph-based with a fixed character limit. It worked well on short documents and broke on long-form content with nested headings, where a heading would end up in a different chunk than the content it introduced.

Named-vector embedding: each chunk is embedded into multiple vector spaces simultaneously. Named vectors (a Qdrant platform primitive) allow different retrieval queries to use different representation spaces for the same content. The Ostronaut harness uses three named spaces: a semantic space for meaning-based retrieval, a lexical space for exact-terminology lookup, and a domain-tagged space that weights India-context concepts differently than generic product-management concepts. Each space serves different query types; a single flat embedding space serves them all poorly.
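To make the idea of named spaces concrete, here is a minimal sketch in plain Python. It is a toy in-memory stand-in, not the Qdrant API and not the Ostronaut implementation: the class `NamedVectorIndex`, the space dimensions, and the example vectors are all hypothetical. It shows only the core property, that one chunk carries several vectors and each query chooses which space to search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NamedVectorIndex:
    """Toy index: every chunk stores one vector per named space (hypothetical)."""

    def __init__(self, spaces):
        self.spaces = dict(spaces)  # space name -> vector dimension
        self.points = {}            # chunk id -> {space name: vector}

    def upsert(self, chunk_id, vectors):
        for name, vec in vectors.items():
            if name not in self.spaces:
                raise KeyError(f"unknown space: {name}")
            if len(vec) != self.spaces[name]:
                raise ValueError(f"wrong dimension for space: {name}")
        self.points[chunk_id] = dict(vectors)

    def search(self, space, query, limit=3):
        """Rank chunks by similarity within a single named space."""
        scored = [
            (cosine(query, vecs[space]), chunk_id)
            for chunk_id, vecs in self.points.items()
            if space in vecs
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [chunk_id for _, chunk_id in scored[:limit]]


# Made-up vectors; a real system would get these from embedding models.
index = NamedVectorIndex({"semantic": 2, "lexical": 3, "domain": 2})
index.upsert("intro", {"semantic": [1.0, 0.0], "lexical": [0.0, 1.0, 0.0], "domain": [0.5, 0.5]})
index.upsert("glossary", {"semantic": [0.0, 1.0], "lexical": [1.0, 0.0, 0.0], "domain": [1.0, 0.0]})

semantic_hit = index.search("semantic", [0.9, 0.1], limit=1)
lexical_hit = index.search("lexical", [1.0, 0.0, 0.0], limit=1)
```

The same two chunks rank differently depending on the space queried, which is the behaviour a single flat embedding cannot offer.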

Retrieval: queries are routed to the appropriate named vector space or spaces, depending on query type. BM25 (sparse keyword matching) runs alongside dense retrieval for hybrid scoring. Reranking applies a cross-encoder pass to the top-k results before serving. Payload filters restrict results by content type, course, or recency where the query intent warrants it.
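The lesson does not say how the BM25 and dense results are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of the Ostronaut harness. The function name, the chunk ids, and the constant `k=60` (a conventional default) are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk; chunks that rank
    well in more than one list accumulate a higher fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk id for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical outputs of a sparse (BM25) and a dense retriever.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c5", "c3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A cross-encoder reranker would then rescore the top of the fused list.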

Grader: the harness measures two families of metrics. The first is retrieval quality: precision@k, recall@k, and mean reciprocal rank, each measured against a golden set of query/expected-result pairs. The second is orphan_gap_pct, the fraction of content chunks that contain load-bearing concepts but have no retrieval path back from any plausible query. A high orphan gap means the index is complete in the sense that every chunk is present, but incomplete in the sense that some chunks are unreachable from real queries.
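The three retrieval-quality metrics have standard definitions, sketched here for a single query and then averaged over a golden set. The function names and the example chunk ids are illustrative; the formulas are the usual ones.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(golden_runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not golden_runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in golden_runs) / len(golden_runs)


# Made-up example: one query's results and its expected chunks.
retrieved = ["c2", "c9", "c4", "c1", "c7"]
relevant = {"c4", "c1", "c8"}

p5 = precision_at_k(retrieved, relevant, 5)   # 2 of 5 returned are relevant
r5 = recall_at_k(retrieved, relevant, 5)      # 2 of 3 relevant were returned
mrr = mean_reciprocal_rank([(retrieved, relevant), (["c8"], {"c8"})])
```

Precision and recall answer different questions: precision asks how much of what was returned is useful, recall asks how much of what is useful was returned. Reporting both avoids mistaking one for the other.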

Each component is annotated with the specific failure mode it was designed to guard against. The grader runs on every indexing change and every retrieval-configuration change, not just on prompt changes.

Why it matters now

Generic RAG advice has saturated 2024–2026 discourse. The framing — embed your documents, store in a vector database, retrieve the top-k chunks, stuff them into the prompt — is well-understood and widely adopted.

The next layer is where production retrieval actually lives: what happens when a single embedding space serves multiple query intents poorly; what happens when chunking artifacts leave content unreachable; what happens when retrieval quality drifts because the index was updated but the grader was not run.

Named vectors (the practice of maintaining multiple named embedding spaces per document, each optimized for a different retrieval job) are a concrete answer to the first problem. Orphan-gap measurement is a concrete answer to the second. Wiring the grader into every indexing change is the answer to the third.

These are not research ideas. They are production patterns that emerged from specific production failures.

A source you should trust

Qdrant's named-vectors documentation. The platform primitive that makes multi-space embedding cheap to implement. The design decision documentation explains why named vectors exist, which is itself a lesson in what single-vector retrieval misses.
"BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (Thakur et al., 2021). The benchmark that demonstrated, systematically, that retrieval models trained on one task distribution transfer poorly to different distributions. This is the empirical grounding for why single-vector retrieval underperforms on mixed-intent query sets.
Anthropic's retrieval evaluation guidance in their prompt-engineering documentation. The framing of retrieval quality as a separate measurable axis from end-to-end task quality is an important design principle this harness embodies.

A recipe

The Ostronaut-style retrieval audit you can run on any RAG system, starting from scratch:

List your vector spaces. Most first-generation RAG systems have one. Production retrieval systems should have three to five, each optimized for a different retrieval job. If you have one, the first question is: what query intents does it serve poorly?
For each vector space, write its retrieval job-to-be-done (JTBD) in one sentence. "Semantic space: used for meaning-based queries where the user is asking about a concept without using exact terminology." "Lexical space: used for exact-terminology lookups, product names, proper nouns." If you cannot write the JTBD, the space is probably an accident rather than a design decision.
Measure per-space retrieval quality on a held-out golden set of query/expected-result pairs. Precision@5 is a useful starting metric: of the five chunks returned, how many are relevant? A score below 60% on any named space is a clear signal that the space is misconfigured for its intended JTBD.
Compute orphan_gap_pct. Walk the index. For each chunk, ask: given any plausible query that should surface this chunk, does the retrieval system actually return it in the top 20 results? The fraction of chunks that fail this test is the orphan gap. An orphan-gap above 15% means a significant fraction of your content is unreachable from real queries — it is in the index but invisible to users.
For the dominant failure mode — either low per-space recall or high orphan-gap — decide whether the fix is retrieval configuration (cheaper, faster) or chunking quality (more expensive, one-time). In our experience, orphan-gap problems are usually chunking artifacts; recall problems are usually embedding-space design. The grader tells you which battle to fight.
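The orphan-gap step of the recipe can be sketched as follows. This is one reading of the definition, stated as an assumption: a chunk counts as an orphan when none of its plausible queries returns it in the top 20. The `retrieve` callable, the `plausible_queries` mapping, and the fake retriever are hypothetical stand-ins for a real system and a hand-built query list.

```python
def orphan_gap(plausible_queries, retrieve, top_n=20):
    """Return (orphan_gap_pct, orphan_ids).

    plausible_queries: chunk id -> list of queries that should surface it.
    retrieve: callable (query, limit) -> ranked list of chunk ids.
    A chunk is an orphan if no plausible query returns it in the top_n.
    """
    if not plausible_queries:
        return 0.0, []
    orphans = [
        chunk_id
        for chunk_id, queries in plausible_queries.items()
        if not any(chunk_id in retrieve(query, top_n) for query in queries)
    ]
    return 100.0 * len(orphans) / len(plausible_queries), orphans


# Fake retriever with canned results, standing in for the real system.
fake_results = {
    "what is churn": ["c1", "c2"],
    "explain pricing": ["c2"],
    "pricing overview": ["c4"],
}


def fake_retrieve(query, limit):
    return fake_results.get(query, [])[:limit]


plausible = {
    "c1": ["what is churn"],
    "c2": ["explain pricing"],
    "c3": ["pricing overview"],  # never returned for its own query
    "c4": ["pricing overview"],
}

gap_pct, orphan_ids = orphan_gap(plausible, fake_retrieve)
needs_attention = gap_pct > 15.0  # threshold taken from the recipe above
```

The list of orphan ids is often more useful than the percentage itself: reading a handful of orphaned chunks usually shows whether they share a chunking artifact.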

The smell of it going wrong

The team measures end-to-end task pass rate (does the agent produce a good answer?) but never measures retrieval recall separately. End-to-end measurement conflates model quality with retrieval quality. A retrieval failure that the model patches through inference is invisible in end-to-end metrics and will eventually surface when the model cannot patch it anymore.
One vector space serves every query type. The team knows some queries underperform but cannot articulate which dimensions of query intent the single space serves poorly.
The chunker leaves markdown headings in orphaned chunks and the team does not know the orphan-gap percentage. This is the specific failure that bit the Ostronaut system. Headings without body content are unreachable from meaning-based queries and return as noise in keyword queries.
Retrieval configuration changes — new chunk sizes, updated embedding models, revised payload filters — ship without a retrieval-recall regression check. A retrieval change that looks harmless can collapse precision@k for specific query types while leaving aggregate metrics stable.
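The last smell, a change that collapses one query type while the aggregate stays flat, is easy to show with numbers. The sketch below compares precision@k per query type before and after a change. All figures are invented for illustration, and the `max_drop` tolerance of 0.05 is an arbitrary assumption.

```python
def regressed_query_types(baseline, candidate, max_drop=0.05):
    """Return query types whose score fell by more than max_drop.

    baseline / candidate: query type -> precision@k in [0, 1].
    A query type missing from the candidate is treated as 0.0.
    """
    regressions = {}
    for query_type, before in baseline.items():
        after = candidate.get(query_type, 0.0)
        if before - after > max_drop:
            regressions[query_type] = (before, after)
    return regressions


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Invented numbers: the aggregate is unchanged, one query type collapsed.
baseline = {"synthesis": 0.80, "terminology": 0.78, "factual": 0.70}
candidate = {"synthesis": 0.52, "terminology": 0.90, "factual": 0.86}

aggregate_shift = mean(candidate.values()) - mean(baseline.values())
regressions = regressed_query_types(baseline, candidate)
```

Here the aggregate shift is effectively zero, so a check on the mean alone would pass, while the per-type check flags `synthesis`. Gating a release on `regressions` being empty is one way to wire this into a change pipeline.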

A judgment call from real work

The Ostronaut retrieval system was built in three phases, and the harness that exists today would not exist without the failure that ended phase one.

Phase one: vibes-based RAG. The initial system used a single dense embedding space, paragraph-based chunking, and a manual quality check before each major content update. The manual check was one person running ten or fifteen representative queries and eyeballing the results. It worked well enough for the first few months. The failure was invisible: a slow accumulation of markdown-heading orphans as the content library grew and the chunker encountered increasingly complex document structures.

The incident that ended phase one began on a Tuesday afternoon. A chunker configuration change, intended to improve performance on long documents, modified how the system handled markdown heading levels. The change was reviewed, the manual quality check ran, and the ten test queries all returned plausible results. The chunker had correctly moved body content into appropriately sized chunks. What it had also done, silently, was detach a class of section introductions (paragraphs that synthesized a section's themes) from the chunks that contained the section's detailed content.

By Thursday, retrieval quality on synthesis-type queries had degraded noticeably. The degradation was slow enough that it did not trigger any alerting. It surfaced through an unusual pattern in user feedback — three separate users noted that answers to "explain X" queries felt less complete than they had been. The investigation took most of a day.

The orphan-gap metric was invented to explain this failure. The team walked the index manually, query type by query type, and discovered that 36% of section-introduction chunks had no retrieval path from synthesis queries. Those chunks were in the index. They were correctly embedded. They were just unreachable from the queries most likely to need them. That 36% was the first orphan_gap_pct measurement.

Phase two: first measurable harness. The team rebuilt the chunker to respect markdown heading hierarchy, keeping section introductions adjacent to their body content. They added the orphan-gap check to the post-index verification step. They built a golden set of 45 query/expected-result pairs, covering the main query types in production, and ran it against the rebuilt index. Retrieval quality went from unquantified in the first version to a measured 74% precision@5.

The orphan-gap dropped from 36% to 9%. The 9% residual was a known class of footnote-heavy content that the chunker handled suboptimally; the team decided it was not load-bearing enough to address immediately.
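A heading-aware chunker of the kind phase two describes could look roughly like this. It is a sketch under assumptions, not the team's chunker: it attaches the full heading path to every chunk, splits long sections on paragraph boundaries, and never emits a heading-only chunk. The name `chunk_markdown` and the `max_chars` limit are hypothetical, and the sketch ignores fenced code blocks, where a line starting with `#` would be misread as a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def chunk_markdown(text, max_chars=400):
    """Split markdown into chunks that each carry their heading path."""
    chunks = []
    path = []        # stack of (level, title) for the current section
    paragraphs = []  # finished paragraphs under the current heading
    lines = []       # lines of the paragraph currently being read

    def end_paragraph():
        if lines:
            paragraphs.append(" ".join(lines))
            lines.clear()

    def flush_section():
        end_paragraph()
        heading_path = " > ".join(title for _, title in path)
        current = ""
        for para in paragraphs:
            # Start a new chunk when adding this paragraph would overflow.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"heading_path": heading_path, "text": current})
                current = ""
            current = f"{current}\n\n{para}" if current else para
        if current:  # sections with no body produce no chunk
            chunks.append({"heading_path": heading_path, "text": current})
        paragraphs.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush_section()
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling and deeper sections
            path.append((level, match.group(2).strip()))
        elif line.strip():
            lines.append(line.strip())
        else:
            end_paragraph()
    flush_section()
    return chunks


doc = """# Pricing

Pricing intro paragraph.

## Tiers

Tier details here.
"""
chunks = chunk_markdown(doc)
```

Because the heading path travels with each chunk, a section introduction stays tied to its section even when the body below it lands in a separate chunk.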

Phase three: named vectors. Six months later, a new query type emerged as the content library grew: domain-specific terminology lookups. Users were searching for specific Indian product-management concepts — terms like "reverse channel conflict" in the context of Indian distribution chains — that the semantic embedding space handled poorly because it normalized away terminology in favor of meaning. The dense space would return semantically adjacent content, which was usually wrong for this query type.

The fix was a second named vector space: a lexical space trained to weight exact terminology. Queries were classified by intent and routed to the appropriate space. Precision@5 for terminology queries improved from 48% to 81%.
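The lesson does not describe how queries were classified by intent. As a purely illustrative assumption, the sketch below uses a glossary lookup: a query that contains a known term, or a quoted phrase, is routed to the lexical space, and everything else goes to the semantic space. The function names, the intent labels, and the glossary contents are hypothetical.

```python
# Hypothetical mapping from query intent to named vector space.
SPACE_FOR_INTENT = {"terminology": "lexical", "meaning": "semantic"}


def classify_intent(query, glossary_terms):
    """Crude heuristic: quoted phrases and glossary hits are terminology lookups."""
    lowered = query.lower()
    if '"' in query or any(term in lowered for term in glossary_terms):
        return "terminology"
    return "meaning"


def route(query, glossary_terms):
    """Pick the named space a query should be sent to."""
    return SPACE_FOR_INTENT[classify_intent(query, glossary_terms)]


glossary = {"reverse channel conflict"}

lexical_route = route("What is reverse channel conflict in Indian distribution?", glossary)
semantic_route = route("How should I think about prioritising a roadmap?", glossary)
```

A production router would more likely use a trained classifier, but whatever the classifier, its accuracy belongs in the grader too: a misrouted query fails in the same way a badly embedded one does.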

The third named space, the domain-tagged space with India-context weighting, came later still, as the team realized that the semantic space underweighted India-context signal because the underlying embedding model had been trained predominantly on US and European product-management corpora.

Each phase was driven by a metric failure. Each metric failure was visible because the harness was running.

The moment the harness became institutional memory. The Ostronaut team had a turnover event in early 2026. The engineer who had built the retrieval system moved to a different project. The person who took over had not been present for any of the three phases. They read the harness, ran the grader, and within a day had a working mental model of the system — because the harness documented not just the current state but the failure modes each component was guarding against.

The harness was not documentation. It was executable knowledge. When a new chunker candidate was proposed three months later, the new engineer ran the grader against it before writing a line of code to integrate it. The candidate failed on orphan-gap. The investigation led to a better candidate that passed. The original engineer was not consulted. The harness was sufficient.

That is what a mature eval harness looks like. It runs without its author.

Rules from this lesson

Measure retrieval quality separately from end-to-end task quality. Conflating them hides retrieval failures that a capable model temporarily patches — until it cannot.
Named vectors are a low-cost primitive that solves a real problem. One-vector-fits-all is a smell that will surface as poor performance on specific query types that only become visible as the query distribution grows.
Orphan-gap percentage is the cheapest "is your index complete?" signal available. Run it on every chunker change. A gap above 15% means significant content is unreachable from real queries.