RAG Architecture — the pm manual

RAG is mature enough in 2026 that getting it wrong is a craft failure, not a technology gap. The mistakes are well-documented. Most teams still make them.

Talvinder Singh, Pragmatic Leaders

By 2026, RAG is the default architecture for any AI product that needs to answer questions grounded in specific knowledge: documentation, company data, product catalogs, research libraries, customer records. If your feature needs to say something true about a specific thing in your data, rather than just generating plausible-sounding text, you are building a RAG system.

The concept is simple. The implementation surface is large. And most teams make the same three mistakes.

What RAG is and why it exists

A pre-trained LLM has knowledge baked into its weights at training time. That knowledge is frozen at the training cutoff. It contains everything the internet had to say about general topics, and nothing about your specific product, your documentation, or events that happened after training ended.

RAG solves this by changing the architecture: rather than relying on the model's baked-in knowledge, you retrieve relevant documents from an external store and inject them into the prompt at query time.

The full pipeline:

User query arrives ("How do I cancel my subscription?")
Query is embedded — converted into a vector representation using an embedding model
Retrieval — the vector store finds documents whose embedding is closest to the query embedding (semantic similarity)
Reranking (optional) — a second model reorders the retrieved chunks by relevance
Context injection — the retrieved chunks are added to the prompt alongside the query
Generation — the LLM generates an answer grounded in the retrieved context

The result: the model doesn't need to "know" your product documentation — it reads it at query time, just like you'd look something up before answering a question.

The chunking decision

Chunking is how you break your source documents into pieces before indexing them. It is the decision that most directly governs answer quality, and it is the one that most teams get wrong.

The core tension: smaller chunks retrieve more precisely (the matching is tighter), larger chunks provide more context to the model (reducing the risk that a useful sentence is separated from its context).

What happens when chunks are too small: The model retrieves a fragment that mentions the right topic but lacks the surrounding context to answer the question. "Subscription cancellation: contact support" — but the support email is in the previous paragraph, which wasn't retrieved.

What happens when chunks are too large: You retrieve an entire 2,000-word support article when the user asked a question that the first two paragraphs answer. You consume context window. The "lost in the middle" effect kicks in. Quality drops.

The practical baseline for 2026:

512-1024 tokens per chunk is the standard starting point for documentation
Sentence-window chunking indexes at the sentence level but retrieves the surrounding 2-3 sentences as context — gets the best of both precision and context
Semantic chunking uses an embedding model to find natural boundaries in the text before splitting — significantly better for heterogeneous documents (articles that mix procedure, context, and reference material)
Hierarchical chunking indexes both paragraph-level and section-level summaries, allowing retrieval at multiple granularities

The right chunk size depends on your documents. If your documents are dense and structured (legal contracts, technical specifications), smaller chunks with overlap work better. If your documents are narrative and contextual (knowledge base articles, research papers), semantic chunking reduces the risk of splitting a concept across chunk boundaries.

Overlap is the practice of repeating the last N tokens of one chunk at the beginning of the next, ensuring that content at chunk boundaries doesn't fall through the cracks. 10-20% overlap is typical. This increases your index size proportionally but meaningfully reduces boundary artifacts.

Dense retrieval, sparse retrieval, and hybrid

Dense retrieval uses vector embeddings. Both the query and the documents are converted to vectors; retrieval finds the nearest neighbors in vector space. Dense retrieval excels at semantic similarity — it can match "how do I cancel" with "subscription termination process" even though no words overlap.

The limitation: dense retrieval is bad at exact matching. If a user asks about a specific product model number, a SKU, a proper noun, or a technical term that is rare in the training data, the embedding may not capture the distinction. "GPT-4o" and "GPT-4o-mini" have similar embeddings; a dense retriever may not reliably distinguish between documents about them.

Sparse retrieval is traditional keyword-based search — BM25 is the dominant algorithm. It matches on exact and near-exact terms. It excels at named entities, product codes, technical jargon, and any case where the specific word matters. It fails at semantic similarity — "cancel subscription" and "end service" are unrelated in sparse retrieval.

Hybrid retrieval combines both, merging the results from a dense and sparse retriever using a ranking function. The standard method is Reciprocal Rank Fusion (RRF) — a simple algorithm that combines rank positions from multiple sources. Hybrid retrieval dominates pure dense and pure sparse across most real-world benchmark evaluations. In 2026, the default recommendation for any production RAG system is hybrid retrieval.

The operational cost of hybrid is modest — two retrieval calls instead of one, plus a merge step. The quality improvement is substantial and consistent across domains.

Named vectors and payload filtering

Standard vector search asks: "find me the most similar documents to this query." In a heterogeneous knowledge base, you often need to ask: "find me the most similar legal documents about subscription terms from 2024." That requires metadata-aware retrieval.

Payload filtering (also called metadata filtering or pre-filtering) applies structured filters before or alongside semantic search. Before ranking by embedding similarity, you filter to only documents where document_type = "policy" or date > 2024-01-01. This dramatically improves precision when your index contains heterogeneous content.

Named vectors — a concept first-class in Qdrant (the vector store PL's RAG infrastructure uses) — allow a single document to carry multiple embedding representations, each optimized for a different retrieval purpose. A product catalog item might have one embedding tuned for product description search and another tuned for customer review search. At query time, you choose which named vector to search against based on the query type.

Named vectors matter when:

Your documents serve multiple use cases with different relevance signals
You want to separate retrieval for different modalities (text, image, metadata)
You have multilingual content and want per-language embeddings

For most small-to-medium RAG systems, payload filtering is sufficient and named vectors are unnecessary. Once your knowledge base exceeds 100k documents across multiple content types, revisit the architecture.

When RAG fails

Knowing the failure modes saves you weeks of debugging.

Failure: retrieval misses the right document. The answer exists in your knowledge base but isn't being retrieved. Root causes: chunk size mismatch (the relevant text was split across chunks), poor embedding coverage (the embedding model doesn't represent your domain well), or metadata filtering that's too aggressive. Diagnostic: test retrieval quality independently of generation quality. Check whether the right chunk is in the top-5 results before asking whether the model answered correctly.

Failure: retrieved documents are relevant but not helpful. The right chunk is retrieved, but the model generates a poor answer. Root causes: chunk too small (missing context), multiple conflicting chunks retrieved, or the model generating a plausible-sounding answer that contradicts the retrieved text (a retrieval-grounding failure). Diagnostic: log the retrieved chunks alongside the answer. Ask: if a human read only these chunks, could they answer the question correctly?

Failure: the model ignores the retrieved context. You inject the right documents, and the model answers from its parametric memory anyway, often with a hallucinated or outdated answer. This happens when: (1) the retrieval chunks are relevant but low-quality or truncated, causing the model to discount them; (2) the model has strong prior beliefs about the topic from training; (3) the system prompt doesn't explicitly instruct the model to prefer the retrieved context. Fix: explicit grounding instructions in the system prompt ("Answer only from the provided context. If the answer is not in the context, say so."), plus checking chunk quality.

Failure: context window overflow. Your retrieved chunks exceed the model's context window, causing truncation or degraded attention. Fix: limit retrieved chunks (top-3 or top-5 is usually sufficient), rerank to keep only the highest-quality chunks, or use a model with a larger context window.

// learn the judgment

Your RAG-based internal knowledge assistant is returning hallucinated answers on product pricing questions. A user asks 'What's the price of the Enterprise plan?' and the model answers with a price that doesn't match either the retrieved documentation or the actual current price. Engineering says the right pricing page is being retrieved. What's happening and what do you investigate?

The call: Where is the failure, and what is your diagnostic sequence?

Your reasoning:

RAG vs. fine-tuning vs. long context

This question comes up on every AI product roadmap. The short answer:

Approach	Use when	Don't use when
RAG	Knowledge changes frequently, documents are large, you need citations, grounding, or access control	You need to teach the model a new capability or style (not new facts)
Fine-tuning	Consistent format/style requirement, narrow well-defined task, thousands of labeled examples, cost/latency pressure	Knowledge changes frequently, you need to update facts, you lack labeled data
Long context (no retrieval)	Few, known documents; one-shot analysis tasks; don't want retrieval latency	High query volume (every query pays full-context cost), documents larger than 200k tokens

In 2026, most production systems use RAG for knowledge grounding and optionally fine-tune a smaller model for a specific task layer. The idea of "just use long context instead of RAG" is appealing but doesn't survive cost analysis at scale: feeding 500k tokens per query at GPT-4o pricing is $0.50-2.00 per query. For a 100k-user product, that's prohibitive.

What to do this week

Map your knowledge base. List all the sources your AI feature needs to draw on. Note their format (PDFs, structured JSON, Markdown, HTML), update frequency, and access control requirements. This scoping determines whether you need payload filtering, hybrid retrieval, or just a simple vector search.
Run a retrieval quality check. Pick 20 queries from your golden set. Check whether the right document is in the top-5 retrieved results. If retrieval precision is below 70%, fix retrieval before touching the model.
Decide on chunk size empirically. Test three chunk sizes (256, 512, 1024 tokens) on a sample of your documents. For each, check a retrieval quality metric. Pick the one with the best precision/context tradeoff.

Where to go next

Eval Design — how to test your retrieval pipeline systematically
LLM Fundamentals — the attention and context-window mechanics that explain RAG's tradeoffs
Agent Design — when retrieval is one step in a multi-step agent loop