RAG, Fine-Tune, or Context Window?

RAG is the boring answer. Fine-tuning is the wrong answer 80% of the time. Long context is the lazy answer. The whole game is knowing which 20% you are in before you commit a quarter to the wrong one.

Talvinder Singh

There is a question every AI product team eventually argues about, usually three weeks in, usually after a demo did not go well. Do we put more in the prompt, build a retrieval system, or fine-tune the model? The argument is loud because each option is championed by a different role — the prompt engineer wants more context, the platform engineer wants RAG, and the ML person wants to fine-tune. The argument is loud because all three answers are sometimes right. The argument is expensive because if you pick wrong, you lose the quarter.

Chapter 1 was the gate. Chapter 2 picked the model. Chapter 3 wrote the prompt. Chapter 4 built the eval. Chapter 5 designed around hallucination. Chapter 6 walked the autonomy ladder. This chapter is the data-meets-model decision: how do you put your information in front of the model, and what does each path cost you in production?

The decision is not a religious one. It is an arithmetic one. The trouble is that very few teams do the arithmetic before they pick a side.

The three paths, plainly

Path 1 — Context window. Put your data directly in the prompt. Every query. No infrastructure, no index, no training. The model sees what you send it and nothing else. Modern frontier models accept context windows in the hundreds of thousands of tokens — Claude 3.5 and 4 at 200K, Gemini 1.5 and 2.0 at 1M-2M, GPT-4-class models at 128K-200K. That is enough to fit an employee handbook, a contract, a quarter of customer tickets, or a small codebase in a single call.

Path 2 — RAG (Retrieval-Augmented Generation). Index your data once into a vector store. At query time, search the index for the chunks most relevant to the user's question, inject those chunks into the prompt, and have the model answer using them. The model never sees the whole corpus; it sees the slice that mattered for this query. RAG is the workhorse architecture of 2026 — most "AI search," "AI assistant," "AI over your docs," and "talk to your knowledge base" products are RAG underneath, whether the marketing says so or not.

Path 3 — Fine-tuning. Train the model — or more precisely, adjust its weights with LoRA or full fine-tuning — on examples that teach it to behave the way you want. The information becomes part of the model. No retrieval, no big prompt. The fine-tuned model just answers, the way it learned to.

These are not steps you graduate through. They are a menu. The lazy reading is "start at 1, climb to 3 as you grow." That reading is wrong often enough that it deserves a Rule of its own. The right reading is: pick the path that matches the shape of your data and your cost of being wrong, and refuse to climb just because the next rung sounds more sophisticated.

The decision tree

Five questions, in order. If you answer them honestly, the architecture picks itself.

1. How big is the data the model needs to see for a typical query? If it is under 50K tokens (roughly 35,000-40,000 words of English, or one good-sized PDF), the context window is your first stop. Stuff it in, evaluate, ship. You do not need an index. You do not need embeddings. You need a system prompt with the data pasted in and an eval suite that tells you if it works. If it is 50K-200K tokens, the context window still works but the per-query economics start to bite (see below). If it is over 200K tokens, and certainly once it is millions of words across thousands of documents, a real corpus, you are in RAG territory whether you like it or not.

2. How often does the data change? Stable data (a published policy doc, a textbook, last quarter's filing) lives comfortably in the context window or in a periodically-rebuilt RAG index. Fast-changing data (this week's prices, today's open tickets, the user's current cart) belongs behind a tool call or a fresh retrieval — never in a fine-tune. The hard rule: fine-tunes are frozen at training time. The day you ship a fine-tuned model with prices baked in, you have shipped a bug with a release schedule.

3. Do you need to cite the answer? Regulated domains, customer-facing answers, anything a lawyer or a journalist might quote — you need to point at the source. RAG gives you citations almost for free: you know which chunks went into the prompt, so you know which sources to surface. Fine-tuning destroys this surface. The model can no longer tell you where the knowledge came from; it just knows. If "show your sources" is a product requirement, fine-tuning is off the table for the knowledge layer, full stop.

4. Is the gap between what the base model produces and what you need a knowledge gap or a behaviour gap? Knowledge gaps — "the model doesn't know our pricing," "the model can't reason about Indian tax codes," "the model doesn't have our policy" — are solved by showing the model the data, which is context or RAG. Behaviour gaps — "the model writes in the wrong tone," "the model returns the wrong JSON shape half the time," "the model uses American spelling," "the model can't match our house style" — are solved by teaching the model the pattern, which is fine-tuning (or, more often, a better prompt with examples).

The most common expensive mistake in AI product work is solving a knowledge gap with fine-tuning. The model "didn't know" something, the team fine-tuned it on the something, the data changed two months later, and now they are running a stale fine-tune they cannot easily update. They paid the fine-tune tax to solve a problem RAG would have solved for a fraction of the operational cost.

5. What is your per-query cost ceiling, and how many queries per day? This is the question that turns the philosophy into a spreadsheet. We come back to it below.

If under-50K + stable + no-citation-required + small volume, context window. If large or growing corpus + citation-required + freshness matters, RAG. If the gap is style or format or domain vocabulary, not facts, fine-tune. If the answer is "actually, we need fresh facts in our voice cited from our sources," RAG with a light style fine-tune on top. That last combination is where production-grade AI products tend to land. It is not the sexiest stack. It is the one that ships.
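The tree can be written down as a small function. What follows is a sketch, not a policy engine: the Workload fields, the pick_path name, and the way the first three questions are combined with a plain "or" are assumptions made for illustration, and question 5 is collapsed to a yes/no flag.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    tokens_per_query: int   # question 1: data the model must see per query
    changes_often: bool     # question 2: is the data fast-moving?
    needs_citations: bool   # question 3: must answers point at sources?
    gap: str                # question 4: "knowledge", "behaviour" or "both"
    high_volume: bool       # question 5, reduced to a flag for this sketch


def pick_path(w: Workload) -> str:
    parts = []
    if w.gap in ("knowledge", "both"):
        # Assumed combination rule: any one of these pushes you to RAG.
        needs_rag = (
            w.tokens_per_query > 200_000
            or w.needs_citations
            or w.changes_often
            or (w.tokens_per_query > 50_000 and w.high_volume)
        )
        parts.append("rag" if needs_rag else "context window")
    if w.gap in ("behaviour", "both"):
        # Style, format or vocabulary gaps: try prompt examples first,
        # then a small fine-tune.
        parts.append("fine-tune")
    return " + ".join(parts)


handbook = Workload(30_000, False, False, "knowledge", False)
print(pick_path(handbook))  # context window
```

The useful part is not the function but the discipline: every argument to it is something the team has to agree on before the architecture debate starts.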

The honest cost math

The three paths look comparable in a slide deck. They are not comparable in a billing dashboard.

Context window. Pay per token every query. At 2026 frontier-model prices (roughly $3-$15 per million input tokens depending on which model), a 100K-token prompt costs $0.30-$1.50 per call. Do that a hundred thousand times a month and the bill is $30K-$150K just for the input tokens. Output tokens are extra. Caching helps: Anthropic and OpenAI both ship prompt caching that cuts the cost of repeated prefixes by roughly 50-90%, depending on provider and model. But caching only works if the prefix is genuinely stable, which means the handbook you are stuffing in has to stay unchanged between calls. The day someone edits the handbook, the cache invalidates and your bill spikes.
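The context-window bill is easy to reproduce. This is a minimal sketch under simplifying assumptions: the function name and parameters are invented for illustration, output tokens are ignored, and cache-write surcharges and cache expiry are not modelled.

```python
def monthly_input_cost(prompt_tokens, queries_per_month, usd_per_million_tokens,
                       cached_fraction=0.0, cache_discount=0.0):
    """Monthly input-token bill in USD.

    cached_fraction: share of each prompt served from the cache (0.0-1.0)
    cache_discount:  price reduction on cached tokens (0.9 means 90% cheaper)
    """
    per_token = usd_per_million_tokens / 1_000_000
    full_price = prompt_tokens * (1 - cached_fraction)
    discounted = prompt_tokens * cached_fraction * (1 - cache_discount)
    return per_token * (full_price + discounted) * queries_per_month


low = monthly_input_cost(100_000, 100_000, 3.0)
high = monthly_input_cost(100_000, 100_000, 15.0)
cached = monthly_input_cost(100_000, 100_000, 3.0,
                            cached_fraction=0.95, cache_discount=0.9)
print(round(low), round(high), round(cached))  # 30000 150000 4350
```

Set cached_fraction back to zero to see what one handbook edit does to the month.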

RAG. One-time indexing cost (embed every document: usually pennies per million tokens, $5-$50 for a typical corporate corpus). Recurring storage cost (a hosted vector DB such as Pinecone, Qdrant Cloud, or Weaviate Cloud is $50-$500/month depending on scale, or near-zero on a self-hosted Qdrant or pgvector). Per-query cost is the embedding of the user's question (negligible) plus retrieval (negligible) plus the LLM call with maybe 5K-20K tokens of retrieved context (one fifth to one twentieth of the cost of a 100K-token full-context call). The catch is the engineering cost: RAG done badly is worse than no AI at all. (See the section on RAG as architecture.)

Fine-tuning. One-time training cost — for a small open-source model on a few thousand examples, $50-$500 of GPU time; for a closed-model fine-tune via OpenAI or Anthropic, anywhere from $50 to several thousand depending on dataset size and model class. Per-query cost is then lower than the base model, because you can often use a smaller fine-tune to do work a larger base model would have needed. The hidden cost is maintenance: every time the base model is deprecated, every time your domain data shifts meaningfully, every time your style guide changes, you re-train. You are taking on a training pipeline as a permanent operational cost. For most teams in India building their first AI feature, that pipeline does not pay for itself for at least 18 months.

The arithmetic that decides between context and RAG is roughly: at what query volume does (RAG indexing + storage + smaller per-query cost) become cheaper than (full context window every time)? For a 100K-token corpus, RAG starts winning around 10,000-50,000 queries per month, depending on caching effectiveness. Below that, context-window-with-cache is cheaper and operationally simpler. Above that, RAG wins and keeps winning.
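The break-even point falls out of one division. The sketch below assumes RAG's fixed monthly cost (storage plus amortised indexing, and engineering time if you count it) is known; the function name and the example numbers are illustrative, not measurements.

```python
def breakeven_queries_per_month(fixed_rag_usd_per_month,
                                context_cost_per_query,
                                rag_cost_per_query):
    """Monthly query volume above which RAG is cheaper than full context.

    Returns None when RAG saves nothing per query, in which case it
    never pays back on cost alone.
    """
    saving = context_cost_per_query - rag_cost_per_query
    if saving <= 0:
        return None
    return fixed_rag_usd_per_month / saving


# Illustrative inputs: $400/month fixed, $0.05 per cached full-context
# call, $0.03 per RAG call.
volume = breakeven_queries_per_month(400, 0.05, 0.03)
print(round(volume))  # 20000
```

Notice how sensitive the answer is to caching: if the cached full-context call costs the same as the RAG call, the function returns None and the operationally simpler option wins.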

The arithmetic that decides between RAG and fine-tuning is different. Fine-tuning lowers per-query cost but locks in behaviour. You only pay back the fine-tune investment when (volume × cost-savings-per-query) exceeds (training cost + maintenance cost over the life of the feature). For most teams that math does not close — they are at hundreds of queries a day, not millions, and the maintenance cost is real money in person-weeks.
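The payback condition in that paragraph can be checked in a few lines. This is a sketch with made-up inputs: the function name, the 30-day month, and every number in the example are assumptions.

```python
def finetune_pays_back(queries_per_day, saving_per_query_usd,
                       training_cost_usd, maintenance_usd_per_month,
                       feature_life_months):
    """True when lifetime per-query savings exceed training plus upkeep."""
    total_saving = (queries_per_day * 30 * feature_life_months
                    * saving_per_query_usd)
    total_cost = training_cost_usd + maintenance_usd_per_month * feature_life_months
    return total_saving > total_cost


# A few hundred queries a day does not cover the maintenance line.
print(finetune_pays_back(500, 0.01, 500, 2_000, 18))        # False
# Millions of queries a day does.
print(finetune_pays_back(1_000_000, 0.01, 500, 2_000, 18))  # True
```

The maintenance term is the one teams leave out, and it is usually the largest.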

RAG as architecture (the part nobody tells you)

"Just use RAG" is the second-most-expensive sentence in AI product strategy. It sounds simple. It is not. The architecture has five moving parts and each one can quietly ruin the whole thing.

Embeddings. The model that converts your text into vectors. OpenAI's text-embedding-3-large, Cohere's embed-v3, open-source options like bge-large or nomic-embed. Choice of embedding model affects retrieval quality more than most teams realise. A bad embedding model returns lexically-similar-but-semantically-wrong chunks; a good one finds the relevant chunk even when the user's wording is nothing like the document's wording.

Chunking. The choices that ruin retrieval. If you chunk too small (a sentence per chunk), the model gets fragments that need three other chunks to make sense. If you chunk too large (a whole document per chunk), retrieval is coarse and the model wastes context on irrelevant paragraphs. The pragmatic answer in 2026 is paragraph-level chunks with overlap, snapped to natural boundaries (headings, double-newlines), with 200-500 tokens per chunk and 50-100 tokens of overlap. Get this wrong and no amount of model upgrading will fix retrieval quality. (The orphan-heading and dangling-paragraph problems are the ones that show up most often in production; cosmetic but real.)
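A paragraph-level chunker with overlap is short enough to sketch. The version below makes several simplifying assumptions: whitespace-separated words stand in for tokens, boundaries are blank lines only (heading detection is left out), overlap is carried as whole trailing paragraphs, and a single paragraph longer than the budget is kept whole rather than split. The function name and defaults are illustrative.

```python
def chunk_paragraphs(text, max_tokens=400, overlap_tokens=75):
    """Group paragraphs into chunks of about max_tokens, with overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0

    for para in paragraphs:
        n = len(para.split())  # crude token estimate: word count
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraphs that fit in the overlap budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks only ever break between paragraphs, a sentence is never cut in half; the price is that chunk sizes vary, and a chunk can run slightly over budget when carried overlap plus the next paragraph exceeds it.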

Reranking. Embeddings find candidates fast and lossily. A reranker (Cohere Rerank, Voyage Rerank, or an LLM-as-reranker pattern) takes the top 20-50 candidates and scores them properly. This single step often moves a RAG system from "occasionally useful" to "actually good." Most teams skip it on day one and pay for it in eval scores forever.
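The two-stage shape is independent of any vendor. In the sketch below both scorers are passed in as plain functions; word_overlap is a toy stand-in for an embedding search, and careful_score is a placeholder for whatever reranker you actually call. None of these names come from a real library.

```python
def word_overlap(query, chunk):
    """Toy first-stage score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve_then_rerank(query, chunks, cheap_score, careful_score,
                         n_candidates=30, n_final=5):
    """Stage 1: cheap score over everything. Stage 2: careful score over few."""
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda c: careful_score(query, c),
                  reverse=True)[:n_final]
```

The design point is the asymmetry: the cheap scorer runs over the whole corpus, the careful one only over n_candidates, so you can afford a much slower and better second stage.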

Citation surfacing. The product UI that shows the user where the answer came from. Without it, users have no way to verify, and the trust cost is enormous on high-stakes queries. With it, the AI feature stops being a magic box and becomes a faster way to navigate the corpus the user already half-trusts. This is product work, not engineering work — and the product team usually has to push for it.
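Behind that UI sits a small piece of plumbing: number the retrieved chunks in the prompt, then map the numbers in the answer back to sources. The sketch below assumes each retrieved chunk is a dict with "text" and "source" keys and that the model cites with bracketed numbers like [2]; both conventions, and the function names, are assumptions for illustration.

```python
import re


def build_cited_prompt(question, retrieved):
    """Return (prompt, sources) where sources maps passage number to source."""
    lines, sources = [], {}
    for i, chunk in enumerate(retrieved, start=1):
        lines.append(f"[{i}] {chunk['text']}")
        sources[i] = chunk["source"]
    prompt = (
        "Answer using only the numbered passages. "
        "Cite passage numbers like [1].\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )
    return prompt, sources


def cited_sources(answer, sources):
    """Sources the answer actually cited, in order, without duplicates."""
    found = []
    for match in re.findall(r"\[(\d+)\]", answer):
        source = sources.get(int(match))  # ignore numbers we never sent
        if source is not None and source not in found:
            found.append(source)
    return found
```

Dropping citation numbers that were never in the prompt is deliberate: a model that cites passage [7] out of two is a signal for your eval suite, not something to show the user.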

Named vectors and payload filters. The infrastructure pattern most production RAG systems converge on: store multiple embeddings per chunk (one for the question-shaped embedding, one for the answer-shaped embedding, one for the title), and filter by structured metadata (tenant ID, document type, date range) at retrieval time so you are searching just the slice that matters, not the whole corpus. We treat this as out-of-scope here; the MCP and infrastructure chapter is the place for that depth.

The order the wins arrive in, for a team starting from a naive RAG: chunking fix is biggest, reranker is second, citation UI is third, embedding model upgrade is fourth, named-vectors-and-filters is the lever you pull when you scale past a few thousand documents per tenant.

When fine-tuning actually wins

The narrow set of cases where fine-tuning is the right answer and the other two are not:

House style at scale. You need every output to match a voice your prompt can describe but cannot reliably reproduce — a legal firm's drafting style, a brand's tone, a financial reporting house's formatting. You have hundreds to thousands of curated examples of the style. Few-shot examples in the prompt get you 80% there; a small fine-tune gets you to 95% and keeps you there.
Tight output formats. You need the model to return a specific JSON schema, a specific XML shape, a specific function-call signature, every single time, at high volume. Prompts get this right most of the time. Fine-tunes get it right almost always — and at lower per-query cost because you can use a smaller base.
Domain vocabulary. Medical coding, legal clause classification, GST-tax-category mapping, regional language work where the base model is shaky. The model needs to internalise a specific symbol system, not just be shown examples each call.
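For the tight-output-format case, it helps to measure how often the prompt-only version already complies before reaching for a fine-tune. A minimal sketch, assuming the outputs are raw strings collected from your eval runs and that "compliant" means a JSON object containing a set of required keys; the function name and that definition are illustrative.

```python
import json


def format_compliance(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object with all required keys."""
    if not outputs:
        return 0.0
    ok = 0
    for raw in outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all
        if isinstance(data, dict) and set(required_keys).issubset(data):
            ok += 1
    return ok / len(outputs)


samples = ['{"a": 1, "b": 2}', '{"a": 1}', "not json", "[1, 2]"]
print(format_compliance(samples, {"a", "b"}))  # 0.25
```

If this number is already high with a few-shot prompt, the fine-tune is buying you very little; if it plateaus well short of what the volume demands, that is the evidence the fine-tune needs.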

What fine-tuning is not the answer for: facts (use RAG), evolving information (use a tool call or RAG), citations (use RAG), capabilities the base model doesn't have at all (no fine-tune is going to make a small model reason like a large one — pick a better base instead).

If you are about to fine-tune to solve a problem that is fundamentally "the model doesn't know X," stop. RAG is what you want. Fine-tuning is for how the model says things, not what the model knows.

Three worked examples

Legal document search across a firm's 200,000-document archive. Citation-required, freshness matters (new filings every day), large corpus, high query volume across a small set of expert users. The arithmetic: context window is impossible (the corpus is 100M+ tokens), fine-tune is wrong (citation requirement, freshness, knowledge not behaviour). RAG wins, decisively. The real work is chunking strategy (paragraph + heading), reranking (Cohere Rerank or equivalent), and a citation UI that links back to the source PDF at the paragraph level. This is the textbook RAG case, and it is also the one whose engineering depth teams most often underestimate.

Code-style consistency in a large monorepo. A 5M-line codebase with house conventions — error handling, naming, structure — that the base model does not match out of the box. The data does not change every day (style guides are stable). Citations are not required (the user wants code, not a reference). The gap is behaviour, not knowledge. Fine-tuning wins. A small fine-tune on a few thousand examples of in-house code teaches the model the style; retrieval still helps for project-specific facts, but the consistency gain comes from the fine-tune. This is one of the rare cases where the "ML person was right" outcome lands.

Customer-support agent over 50K resolved tickets. Mix of knowledge (what is the product, what are the policies) and style (how does our support team write, what tone, what level of detail). Tickets keep accumulating; policies change quarterly; agents need to cite past resolutions and policy documents. RAG with a light style fine-tune wins. RAG over the ticket corpus + policy docs gives fresh, citable answers. A small fine-tune on the best agents' replies teaches the model the house tone. Neither solution alone would have been enough; together they are the production architecture that actually ships and stays good.

What to do on Monday morning

If you have an AI feature on your roadmap and the team is arguing about RAG vs fine-tuning, walk through the five questions in the decision tree. Most arguments end inside 30 minutes once everyone has answered the same five questions in the same order.

If the team has been pushing fine-tuning and your honest answer to question 4 was "knowledge gap, not behaviour gap" — stop the fine-tune. You are solving a RAG problem with a training pipeline.

If the team has been pushing RAG and your honest answer to question 1 was "under 50K tokens, stable, low volume" — stop the RAG project. You are building infrastructure for a problem the context window already solved.

If the team has been pushing context window and your bill last month surprised you — that is the signal to start the RAG migration. The math has tipped.

The next chapter (AI UX Patterns That Work) is about what happens on the user's screen once you have chosen your architecture. That conversation is easier when this one is settled.

Rules

Where to go next

Chapter 1 — When AI is the right answer: the gate before any of this matters. (When AI Is the Right Answer (and When It Isn't))
Chapter 2 — The model-selection ladder: the right model under the right architecture is the cheap win. (The Model-Selection Ladder)
Chapter 3 — Prompt design as product design: before you fine-tune, see how far the prompt with examples gets you. (Prompt Design as Product Design)
Chapter 4 — Eval before launch: the regression suite that tells you whether your chunking change made retrieval better or worse. (Eval Before Launch)
Chapter 5 — Hallucination as a product problem: RAG reduces hallucination; it does not eliminate it. (Hallucination as a Product Problem)
Chapter 6 — Tool use, function calling, agents: when "retrieve" should be a tool the model calls, not a step in your pipeline. (Tool Use, Function Calling, Agents — The Maturity Ladder)