RAG Architecture and Use Cases — Course 3: Retrieval-Augmented Generation (RAG) Fundamentals

RAG is the librarian who fetches the right books, allowing the student to craft accurate essays.

Talvinder Singh, from a Pragmatic Leaders GenAI session

You are building a customer support chatbot for a global tech company. Users ask complex questions like “How do I fix error code 0x803F7001 on Windows 11?” The base GPT-4 model hallucinates answers because it wasn’t trained on your internal documentation. Your actual job is to ground the AI in factual, up-to-date information without costly retraining.

Retrieval-Augmented Generation (RAG) is the architecture pattern that solves this. It combines a retrieval system that fetches relevant documents in real time with a language model that generates answers based on those documents. This lesson teaches you how to design RAG systems, choose the right components, and avoid common pitfalls — all grounded in real-world examples and Indian context where applicable.

RAG is the librarian guiding the student to the right books

Large language models (LLMs) are like students who have read a vast number of textbooks but cannot memorize every detail. They excel at generating fluent text but often hallucinate facts because their knowledge is frozen at training time.

RAG adds a librarian who fetches specific, relevant documents from a knowledge base in response to each query. The LLM then composes its answer from those documents, drastically reducing hallucinations.

Formally, RAG is a two-step process:

Retrieve: Search a database for documents relevant to the user’s question.
Generate: Use an LLM to synthesize a response grounded in those retrieved documents.
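
The two steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a production design: `retrieve` ranks documents by simple word overlap, `generate` is a stand-in that only builds the prompt an LLM would receive, and the function names and sample documents are hypothetical.

```python
def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Step 1: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate(question: str, context: list[str]) -> str:
    """Step 2 (stand-in): build the grounded prompt an LLM would receive."""
    joined = "\n".join(context)
    return f"Context: {joined}\nQuestion: {question}\nAnswer:"


# Hypothetical knowledge base and question
docs = [
    "Error 0x803F7001 means Windows activation failed",
    "How to reset BIOS settings",
]
question = "What does error 0x803F7001 mean"
prompt = generate(question, retrieve(question, docs))
print(prompt)
```

In a real system the overlap ranking would be replaced by one of the retrievers described in the next section, and the prompt would be sent to an LLM.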

For example, ChatGPT with Bing uses RAG to pull live web data, enabling accurate responses about recent events.

Why this matters: Salesforce reported a 60% reduction in hallucinations by adopting RAG in 2023. It also cuts training costs by reusing general-purpose LLMs rather than fine-tuning domain-specific models.

Retriever types shape what knowledge the AI accesses

The retriever is the backbone of a RAG system. It determines which documents reach the LLM and therefore what knowledge the AI can base its answers on.

There are three main types of retrievers:

Sparse retrievers (BM25): Work like keyword indexes. BM25 scores documents by matching query terms, building on TF-IDF (term frequency and inverse document frequency) with term-frequency saturation and document-length normalization. They are fast but brittle: synonyms and paraphrases often fail to match (e.g., “car” vs. “vehicle”).
Dense retrievers (FAISS): Use an embedding model to convert documents and queries into high-dimensional vectors that capture semantic meaning, then find conceptually similar documents even if exact keywords differ. FAISS, a library from Meta (formerly Facebook) AI Research, performs the fast similarity search over those vectors; it does not produce the embeddings itself.
Hybrid retrievers: Combine sparse and dense methods to improve recall. Google reported a 15% higher recall using hybrid retrieval in 2023.

Real-world example: IBM Watson Assistant uses hybrid retrieval to resolve 30% more support tickets in half the time. This is not abstract — Indian enterprises adopting RAG report similar gains in customer support efficiency.

Retriever Type	Key Feature	Strength	Limitation	Example
Sparse (BM25)	Keyword matching	Fast, interpretable	Misses synonyms, paraphrases	Early fintech chatbots with keyword search
Dense (FAISS)	Semantic vector search	Finds conceptually similar docs	Requires vector embeddings, higher compute	Postman’s knowledge base search
Hybrid	Combines keyword + semantic	Higher recall, balanced speed	More complex infrastructure	IBM Watson Assistant
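
One common way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not the method any of the products above is known to use; the document IDs and the two input rankings are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; higher fused score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the nearer the top it appears in each list
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of a sparse and a dense retriever for the same query
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever (doc_d) is still kept, which is where the recall gain of hybrid retrieval comes from.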

The generator crafts answers from retrieved documents

The LLM generator is like a chef who combines ingredients (retrieved documents) into a well-prepared dish (the answer). The quality of the retrieved documents directly impacts the quality of the final response.

Prompt engineering is critical: The generator must be instructed to only use the provided context, avoiding hallucination. A common prompt template is:

Context: {documents}
Question: {user query}
Answer:

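Placing the retrieved context directly above the question steers the model toward grounding its answer in that data. The template alone does not forbid outside knowledge, so state the "use only the provided context" restriction explicitly in the prompt.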
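
A minimal sketch of filling the template in code, with the "use only the provided context" instruction written out. The exact instruction wording, the `build_prompt` helper, and the document numbering are assumptions for illustration, not a prescribed format.

```python
GROUNDED_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say you do not know.\n\n"
    "Context:\n{documents}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, documents: list[str]) -> str:
    """Number each retrieved document so an answer can point back to its source."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return GROUNDED_TEMPLATE.format(documents=numbered, question=question)


prompt = build_prompt(
    "How do I fix error code 0x803F7001 on Windows 11?",
    ["Error 0x803F7001: Update Windows", "How to reset BIOS"],
)
print(prompt)
```

Telling the model what to do when the context lacks the answer gives it an alternative to guessing.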

Model choice depends on use case and cost: GPT-4 excels at complex queries, while smaller models like LLaMA-2 enable low-cost deployments.

Real-world RAG applications demonstrate tangible impact

Case Study 1: Shopify’s Product Search

Problem: Users struggled to find niche products like “vegan leather backpacks.”

Solution: Shopify built a RAG system combining:

A FAISS-based dense retriever indexing 10 million+ product descriptions.
GPT-3.5 as the generator to synthesize buying guides from retrieved products.

Result: Shopify saw a 25% increase in conversion rates and 40% fewer support tickets. This grounded search experience reduced user frustration and improved sales.

Case Study 2: Mayo Clinic’s Medical RAG

Problem: Doctors needed instant access to the latest COVID-19 research to make timely decisions.

Solution: Mayo Clinic implemented:

A hybrid retriever (BM25 + FAISS) over 500,000 PubMed articles.
BioBERT as the generator, simplifying medical jargon for patients.

Result: Reduced time-to-diagnosis in emergency rooms by 20%, improving patient outcomes.

These examples show RAG’s power in both ecommerce and healthcare, two sectors where Indian startups like Razorpay and PharmEasy are starting to explore similar architectures.

Ethical risks in RAG systems demand vigilant mitigation

Risk 1: Outdated or biased context

Imagine a RAG system retrieving drug guidelines from 2020, causing incorrect dosage recommendations. The AI’s answer is factually wrong and potentially dangerous.

Mitigations:

Automatic document expiry: Flag or remove documents older than a defined threshold (e.g., 6 months) to ensure currency.
Bias audits: Use tools like IBM AI Fairness 360 to scan retrieved content for stereotypes or harmful biases before feeding it to the generator.
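
Document expiry can be sketched as a filter that runs before indexing or before retrieval results reach the generator. The 180-day threshold approximates the 6-month example above; the `updated` field, the document IDs, and the helper name are hypothetical.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=180)  # roughly the 6-month threshold mentioned above


def split_by_freshness(documents: list[dict], today: date) -> tuple[list[dict], list[dict]]:
    """Return (fresh, expired) documents based on each document's 'updated' date."""
    fresh, expired = [], []
    for doc in documents:
        (fresh if today - doc["updated"] <= MAX_AGE else expired).append(doc)
    return fresh, expired


# Hypothetical documents with last-updated dates
docs = [
    {"id": "dosage-2020", "updated": date(2020, 3, 1)},
    {"id": "dosage-2024", "updated": date(2024, 5, 1)},
]
fresh, expired = split_by_freshness(docs, today=date(2024, 6, 1))
print([d["id"] for d in fresh], [d["id"] for d in expired])
```

Returning the expired documents instead of silently dropping them lets a reviewer decide whether to refresh or retire each one.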

Risk 2: Over-reliance on retrieval

An AI might ignore common knowledge not in the database, e.g., “water boils at 100°C,” leading to awkward or incorrect answers.

Mitigations:

Hybrid answers: Blend retrieved data with the LLM’s parametric knowledge to fill gaps.
Confidence scores: Tag answers with provenance labels like “Based on internal docs” or “General knowledge” to inform users about reliability.
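
Provenance labels can be attached with a simple rule on the retrieval scores. In this sketch the scores are assumed to be normalized to the range 0 to 1, and the 0.5 threshold is a hypothetical value that would need tuning on real data.

```python
def label_answer(answer: str, retrieval_scores: list[float], threshold: float = 0.5) -> str:
    """Prefix the answer with a provenance label based on the best retrieval score."""
    if retrieval_scores and max(retrieval_scores) >= threshold:
        label = "Based on internal docs"
    else:
        # Nothing relevant was retrieved, so the answer relies on the model's own knowledge
        label = "General knowledge"
    return f"[{label}] {answer}"


print(label_answer("Update Windows, then retry activation.", [0.82, 0.40]))
print(label_answer("Water boils at 100°C at sea level.", []))
```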

These guardrails are especially important in regulated Indian sectors like finance and healthcare, where trust and transparency are non-negotiable.

Technical deep dive: building blocks of a RAG system

Step 1: Build a sparse retriever with BM25

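BM25 is a classic keyword-based ranking function that extends TF-IDF weighting with term-frequency saturation and document-length normalization. Here is a minimal Python example using the rank_bm25 package:

from rank_bm25 import BM25Okapi

# Three documents, not two: with only two, BM25Okapi's IDF for a term found in
# exactly one of them is log(1.5 / 1.5) = 0, so every score would be zero.
corpus = [
    "Error 0x803F7001: Update Windows",
    "How to reset BIOS",
    "Change the desktop wallpaper",
]
# Lowercase so "Update" matches "update"; a real tokenizer would also strip punctuation
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "Fix Windows update error"
scores = bm25.get_scores(query.lower().split())
top_doc = corpus[scores.argmax()]  # Returns "Error 0x803F7001..."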

This simple retriever matches query keywords to documents and returns the highest scoring document.
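
To make the scoring behind BM25 concrete, here is a from-scratch sketch using only the standard library. It uses one common smoothed (non-negative) IDF variant, so its numbers will not necessarily match those of rank_bm25; `k1 = 1.5` and `b = 0.75` are typical defaults, and the function name is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], corpus: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF: terms that appear in fewer documents weigh more
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # k1 makes term frequency saturate; b normalizes by document length
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [doc.lower().split() for doc in ["Error 0x803F7001: Update Windows", "How to reset BIOS"]]
scores = bm25_scores("fix windows update error".split(), corpus)
print(scores)
```

The first document matches three query terms and scores above zero; the second matches none and scores exactly zero.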

Step 2: Vector search with FAISS

FAISS enables fast nearest neighbor search over dense vector embeddings. Example:

import faiss  
import numpy as np  

# Create 100 dummy 512-dimensional document embeddings  
doc_embeddings = np.random.rand(100, 512).astype('float32')  
index = faiss.IndexFlatL2(512)  
index.add(doc_embeddings)  

# Query embedding  
query_embedding = np.random.rand(1, 512).astype('float32')  
k = 3  
distances, indices = index.search(query_embedding, k)

This finds the top 3 documents closest in vector space to the query.
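
What a flat L2 index computes can be shown without FAISS: compare the query to every stored vector and keep the k smallest squared Euclidean distances. The sketch below is a plain-Python stand-in for that exhaustive search, with small 8-dimensional vectors and hypothetical helper names; FAISS performs the same comparison far faster.

```python
import random


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[float, int]]:
    """Exhaustive search: compare the query to every vector, keep the k closest."""
    distances = [(squared_l2(query, vec), idx) for idx, vec in enumerate(vectors)]
    return sorted(distances)[:k]


random.seed(0)
doc_vectors = [[random.random() for _ in range(8)] for _ in range(100)]
query_vector = doc_vectors[42]  # reuse a stored vector, so its nearest neighbor is itself

for distance, idx in flat_l2_search(query_vector, doc_vectors, k=3):
    print(idx, round(distance, 4))
```

The returned indices are positions in the stored vector list, so a real pipeline keeps a parallel list of document texts to map each index back to its document.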

Step 3: Integrate with LangChain

LangChain simplifies building retrieval-augmented pipelines by connecting retrievers and LLMs.

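from operator import itemgetter

# Import paths for recent LangChain releases (langchain-community, langchain-core,
# langchain-openai); older releases exposed these under langchain.retrievers and langchain.llms.
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

retriever = BM25Retriever.from_texts(["Doc 1", "Doc 2"])
llm = OpenAI(temperature=0)  # reads OPENAI_API_KEY from the environment
prompt = PromptTemplate.from_template(
    "Use only this context to answer.\nContext: {context}\nQuestion: {question}\nAnswer:"
)

rag_chain = {
    "context": itemgetter("question") | retriever,  # pass the question string to the retriever
    "question": itemgetter("question"),
} | prompt | llm

response = rag_chain.invoke({"question": "How to fix error X?"})

This snippet shows a minimal RAG pipeline: the retriever fetches documents for the question, the prompt template places them alongside the question, and the LLM generates an answer from that prompt.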

Hands-on practice: building your first RAG system

For non-technical learners

Research Microsoft’s Copilot RAG system. Write 300 words on:

How Copilot retrieves data from GitHub repositories and documentation.
One ethical challenge Copilot faces, such as risks of code plagiarism.

For technical learners

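Use Hugging Face’s RAG example script to build a question-answering system over your own dataset. The commands below assume a transformers 4.x checkout, where the script lives under examples/research_projects/rag/, and a tab-separated my_docs.csv with "title" and "text" columns. Run the script with --help to confirm the flags in your version:

git clone https://github.com/huggingface/transformers.git
cd transformers
pip install transformers torch datasets faiss-cpu
python examples/research_projects/rag/use_own_knowledge_dataset.py \
  --csv_path my_docs.csv \
  --output_dir my_knowledge_dataset \
  --rag_model_name facebook/rag-token-nq \
  --question "How do I fix error X?"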

Expected output: answers grounded in your custom documents rather than hallucinated text.

Key takeaways to remember

RAG combines retrieval and generation. Grounding LLMs in retrieved, up-to-date documents reduces hallucination; Salesforce reported a 60% reduction in 2023. Shopify’s RAG system improved conversion by 25% and cut support tickets by 40%.
Retriever choice matters. BM25 is fast for keywords; FAISS-backed dense retrieval captures semantics; hybrid retrieval improved recall by 15% in Google’s 2023 report. IBM Watson Assistant’s hybrid retriever resolves 30% more support tickets in half the time.
Ethical guardrails are essential. Prevent outdated or biased context with document expiry and bias audits. Avoid over-reliance on retrieval by blending with parametric knowledge and tagging confidence.
Prompt engineering optimizes generation. Instruct LLMs to only use provided context to avoid fabricating answers. Mayo Clinic’s BioBERT simplifies medical jargon for patients using this approach.
Start simple and scale smart. Begin with sparse retrieval and add dense or hybrid methods as your data and use cases grow. If more than 50% of the documents your retriever returns are irrelevant, treat that as a sign of chunking or embedding issues.
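
The 50% rule of thumb in the last takeaway can be checked with precision at k, the fraction of the top-k retrieved documents that are relevant. This is a minimal sketch: the document IDs and relevance labels are hypothetical and would come from human judgments on real queries.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(doc_id in relevant for doc_id in top_k) / len(top_k)


# Hypothetical labels: which documents a human judged relevant for one query
retrieved = ["doc_1", "doc_7", "doc_3", "doc_9"]
relevant = {"doc_1", "doc_3", "doc_5"}

p = precision_at_k(retrieved, relevant, k=4)
print(p)            # 0.5
print(1 - p > 0.5)  # False: not more than half of the results are irrelevant
```

Averaging this value over a set of representative queries gives a steadier signal than any single query.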

Where to go next

Deepen your retrieval skills: Advanced RAG Techniques — explore HyDE, FLARE, and RAFT methods.
Master prompt engineering for RAG: Prompt Engineering for RAG — learn to design prompts that maximize answer faithfulness.
Understand ethical AI practices: Ethical PM — frameworks for bias audits and transparency.
Build your AI product vision: Product Vision and Strategy — align RAG capabilities with user needs and business goals.
