Emerging RAG Techniques: Self-RAG, Multimodal, and Small Models — Course 4: Advanced RAG and Iterative Design

Self-RAG models don’t just answer — they question their own answers, revealing uncertainty rather than hiding it.

Talvinder Singh, from a Pragmatic Leaders session on advanced RAG

You are building a RAG system for a news aggregator app. Users want real-time, reliable answers about breaking stories, but your AI struggles with contradictory sources and cannot interpret multimedia like videos or tweets. Your actual job is to build RAG systems that self-correct their outputs, fuse multiple modalities, and run efficiently on devices ranging from data centers to smartphones.

This lesson teaches you three emerging RAG techniques that address these challenges: Self-RAG, which enables models to critique and revise their own answers; Multimodal RAG, which combines text, images, and video; and Small Language Models (SLMs), which are lightweight and cost-effective models suitable for edge deployment.

Self-RAG models critique their own answers to reduce errors

Traditional RAG pipelines retrieve documents and generate an answer. But what if the retrieved documents conflict? What if the answer is uncertain or incomplete?

Self-RAG adds a critical step: the model critiques its own output and revises it based on that critique. Think of it as a student who grades their own essay, circling weak points and noting missing citations before submitting it.

The process looks like this:

Retrieve relevant documents.
Generate an initial answer.
Critique the answer by asking, “Is this supported by the sources? Are there contradictions?”
Revise the answer, flagging uncertainty or conflicts explicitly.
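The four steps above can be sketched as a small control loop. This is a toy illustration, not the Self-RAG implementation: the `Doc` type, the keyword retriever, and the stance-based critique are hypothetical stand-ins for a real retriever, generator, and critic model.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    source: str
    text: str
    stance: str  # hypothetical pre-extracted claim label, e.g. "resigning"

def retrieve(query: str, corpus: list[Doc]) -> list[Doc]:
    # Step 1: toy keyword retriever; a real system would query a vector index.
    terms = set(query.lower().split())
    return [d for d in corpus if terms & set(d.text.lower().split())]

def generate(docs: list[Doc]) -> str:
    # Step 2: toy generator that repeats the first source's claim.
    first = docs[0]
    return f"The CEO is {first.stance} (Source: {first.source})."

def critique(answer: str, docs: list[Doc]) -> list[str]:
    # Step 3: is the answer supported, and do the sources contradict each other?
    issues = []
    if not any(d.stance in answer for d in docs):
        issues.append("Answer not supported by sources.")
    if len({d.stance for d in docs}) > 1:
        issues.append("Conflicting sources.")
    return issues

def revise(answer: str, docs: list[Doc], issues: list[str]) -> str:
    # Step 4: surface the problems instead of hiding them.
    if not issues:
        return answer
    others = [d for d in docs if d.stance not in answer]
    dissent = " ".join(f"However, {d.source} reports the CEO is {d.stance}." for d in others)
    return f"{answer} {dissent} [Self-Critique: {' '.join(issues)} Flagging uncertainty.]"

def self_rag(query: str, corpus: list[Doc]) -> str:
    docs = retrieve(query, corpus)
    if not docs:
        return "No sources found."
    draft = generate(docs)
    return revise(draft, docs, critique(draft, docs))

corpus = [
    Doc("NYT", "The CEO is resigning this week", "resigning"),
    Doc("TechCrunch", "The CEO says she is staying", "staying"),
]
final_answer = self_rag("Is the CEO resigning?", corpus)
print(final_answer)
```

In a production system each function would call a model or an index; the point here is only the order of the steps and the fact that the critique feeds the revision.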

This approach reduces factual errors by about 35%, according to a study on news summarization by Asai et al. (2023).

Consider this output from a Self-RAG system:

"The CEO is resigning (Source: NYT). However, TechCrunch claims she’s staying. [Self-Critique: Conflicting sources. Flagging uncertainty.]"

Instead of hiding contradictions, the model surfaces them transparently, building trust with users.

Microsoft’s Self-RAG application on Azure Docs

Microsoft faced a challenge: developers found conflicting advice scattered across 100,000+ Azure documentation pages. The support team was overwhelmed with tickets about inconsistent instructions.

They implemented a Self-RAG pipeline that not only generated answers but also assigned confidence scores and flagged conflicts in sources. Answers with low confidence were automatically routed for human review.
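A routing rule of the kind described here might look like the sketch below. The 0.7 threshold, the `Answer` fields, and the route names are assumptions for illustration; the source does not describe the actual implementation.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff; tune it on labelled examples

@dataclass
class Answer:
    text: str
    confidence: float   # 0.0 to 1.0, produced by the critique step
    has_conflict: bool  # True when the retrieved sources disagree

def route(answer: Answer) -> str:
    """Send risky answers to a person; publish the rest automatically."""
    if answer.has_conflict or answer.confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "auto_publish"

# Placeholder answers covering the three cases.
answers = [
    Answer("confident, consistent answer", 0.93, False),
    Answer("confident answer, conflicting pages", 0.88, True),
    Answer("low-confidence answer", 0.41, False),
]
routes = [route(a) for a in answers]
print(routes)
```

Treating a source conflict as an automatic trigger for review, regardless of the score, is a design choice: a model can be confidently wrong when its sources disagree.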

The result was striking: 50% fewer support tickets and 90% faster troubleshooting. This shows how Self-RAG can scale trust in complex information environments.

Multimodal RAG fuses text, images, audio, and video for richer answers

Many real-world questions require understanding beyond text. How do you build RAG systems that can interpret images, videos, or audio alongside documents?

Multimodal RAG retrieves and combines multiple data types into a unified answer. It enables AI to respond to queries like “What’s wrong with my car?” by analyzing photos of the engine and related repair manuals.

The technical foundation includes:

CLIP embeddings: A model that maps images and text into the same vector space so they can be compared directly. For example, a photo of a cat and the text "feline" have similar embeddings.
Fusion techniques: Cross-attention layers in transformer models combine modalities, allowing the model to jointly reason over text and images.
Tools like LLaVA (Large Language-and-Vision Assistant): These models answer visual questions by integrating language and vision capabilities.
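Once images and text share a vector space, retrieval reduces to ranking items by cosine similarity to the query. The sketch below uses hand-made three-dimensional vectors as stand-ins for real embeddings; the file names and values are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Assumed toy embeddings: images and text documents live in the same index.
index = {
    "cat_photo.jpg": [0.9, 0.1, 0.0],
    "engine_photo.jpg": [0.1, 0.9, 0.2],
    "repair_manual.txt": [0.0, 0.8, 0.4],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the text "feline"

# Rank every item, regardless of modality, by similarity to the query.
ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
print(ranked)
```

With real embeddings the index would be a vector database rather than a dictionary, but the ranking logic is the same.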

Mayo Clinic’s multimodal diagnosis system

Doctors often need to correlate patient X-rays with medical histories and research papers. Mayo Clinic developed a multimodal RAG system that:

Uses CLIP to retrieve similar X-rays from PACS databases.
Employs GPT-4V to synthesize reports linking image findings to symptoms and literature.

This system accelerated tumor detection by 25%, reducing diagnostic delays and improving patient outcomes.

Small Language Models (SLMs) democratize RAG with efficiency and cost savings

Large Language Models (LLMs) like GPT-4 deliver high accuracy but are costly and power-hungry. Not every application needs a cargo ship when a bicycle will do.

Small Language Models (SLMs) are lighter, cheaper, and faster alternatives designed for efficiency. They carry less “knowledge freight” but can outperform larger models on specific tasks, especially logic puzzles and reasoning.

Key developments include:

Phi-2, Microsoft’s 2.7-billion-parameter model, outperforms LLaMA-7B on logical reasoning benchmarks.
Quantization techniques reduce model size and computation by representing weights with fewer bits (e.g., 4-bit precision).
Tools like MLC-LLM enable running quantized SLMs directly on iPhones and other edge devices.
Cost per inference drops dramatically, from roughly $0.03 for GPT-4 to $0.0001 for SLMs.

This makes deploying RAG at scale to billions of users feasible, especially in cost-sensitive markets like India.
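To make the scale of that difference concrete, the per-inference prices quoted above can be turned into a daily bill. The traffic figure is an assumption chosen for illustration; the prices are the ones given in the text.

```python
# Prices from the text, held in micro-dollars so the arithmetic stays exact.
GPT4_COST_MICRO = 30_000     # $0.03 per inference
SLM_COST_MICRO = 100         # $0.0001 per inference
QUERIES_PER_DAY = 1_000_000  # assumed traffic, for illustration only

ratio = GPT4_COST_MICRO // SLM_COST_MICRO
gpt4_daily_usd = GPT4_COST_MICRO * QUERIES_PER_DAY // 1_000_000
slm_daily_usd = SLM_COST_MICRO * QUERIES_PER_DAY // 1_000_000

print(f"The SLM is {ratio}x cheaper per inference")
print(f"Daily cost: GPT-4 ${gpt4_daily_usd:,} vs SLM ${slm_daily_usd:,}")
```

On-device inference shifts the remaining cost to the user's hardware and battery, which this calculation does not capture.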

Deploying Phi-2 on iPhones

Quantize Phi-2 to a 4-bit format compatible with mobile hardware using MLC-LLM. Integrate the model into Swift apps to generate answers locally without cloud latency or cost.

This approach preserves privacy, reduces dependency on internet connectivity, and lowers operational expenses.
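The idea behind 4-bit quantization can be shown in a few lines. This is a simplified sketch, not what MLC-LLM does internally: it uses one scale and offset for the whole list, whereas real toolchains typically quantize small groups of weights separately. The sample weights are made up.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15, returning the scale and offset needed to undo it."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 16 levels means 15 steps; fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]

weights = [-0.42, -0.10, 0.0, 0.07, 0.31, 0.58]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Weight memory for a 2.7-billion-parameter model, ignoring scale and offset overhead.
params = 2_700_000_000
fp16_gb = params * 16 / 8 / 1e9
int4_gb = params * 4 / 8 / 1e9

print(codes, round(max_error, 4))
print(f"16-bit: {fp16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB")
```

The rounding error per weight is at most half a quantization step, which is the accuracy you trade for a model roughly a quarter of the size.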

Ethical risks require guardrails in emerging RAG systems

Risk 1: Over-trusting Self-RAG critique modules

Self-RAG systems depend on their critique step to flag uncertainty and conflicts. But if the critique model is trained on biased or incomplete feedback, it might downplay important risks or overstate confidence.

Example: A Self-RAG system underestimating climate change risks because it learned from overly optimistic critiques.

Mitigations:

Train critique modules on diverse, adversarial feedback, including prompts like “Is this answer too optimistic or dismissive?”
Maintain human oversight for low-confidence or flagged answers.
Set thresholds (e.g., if more than 20% of answers are uncertain, audit the retriever and source diversity).
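The threshold in the last mitigation can be monitored with a few lines of code. This sketch assumes each answer is logged with a boolean saying whether the critique step flagged it as uncertain; the function names and sample data are illustrative.

```python
UNCERTAINTY_ALERT_RATE = 0.20  # the 20% threshold from the text

def uncertainty_rate(flags):
    """flags holds one boolean per answer, True when the critique flagged uncertainty."""
    return sum(flags) / len(flags) if flags else 0.0

def needs_audit(flags, threshold=UNCERTAINTY_ALERT_RATE):
    """True when the share of uncertain answers exceeds the threshold."""
    return uncertainty_rate(flags) > threshold

# Ten recent answers, three of them flagged as uncertain.
recent_flags = [True, False, False, True, False, False, True, False, False, False]
rate = uncertainty_rate(recent_flags)
print(f"{rate:.0%} uncertain, audit retriever: {needs_audit(recent_flags)}")
```

In practice the rate would be computed over a rolling window, so that a burst of uncertain answers on one breaking story is visible quickly.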

Risk 2: Multimodal privacy leaks

Multimodal RAG systems ingest images and video, which may contain sensitive metadata (EXIF data with GPS, timestamps) or identifiable faces.

Example: A healthcare RAG inadvertently linking patient photos to diagnoses via leaked metadata.

Mitigations:

Sanitize image metadata before processing.
Perform on-device analysis when possible, using frameworks like Apple’s Core ML to avoid transmitting sensitive data.
Limit retention of visual data and enforce strict access controls.
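The first mitigation, metadata sanitization, can be sketched with the standard library alone. The function below walks the segments of a JPEG file and drops the Exif block, which is where GPS coordinates and timestamps are usually stored. It is a minimal sketch: it does not handle other metadata containers such as XMP, and the demo input is a synthetic byte string with JPEG-style segments, not a decodable image.

```python
import struct

def strip_exif(jpeg_bytes: bytes) -> bytes:
    """Return the JPEG with Exif APP1 segments removed; image data is left untouched."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    out = bytearray(jpeg_bytes[:2])
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError("unexpected byte where a segment marker should be")
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything from here on is compressed image data.
            out += jpeg_bytes[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        segment = jpeg_bytes[pos:pos + 2 + length]
        is_exif = marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += segment
        pos += 2 + length
    raise ValueError("no start-of-scan marker found")

def _segment(marker: int, payload: bytes) -> bytes:
    # Demo helper: build one marker segment with a correct length field.
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload

# Synthetic input: header segment, Exif segment with fake GPS text, scan data, end marker.
exif = _segment(0xE1, b"Exif\x00\x00GPS=12.97N,77.59E")
fake_jpeg = (b"\xff\xd8" + _segment(0xE0, b"JFIF\x00") + exif
             + b"\xff\xda" + b"scan-data" + b"\xff\xd9")
clean = strip_exif(fake_jpeg)
print(b"GPS" in fake_jpeg, b"GPS" in clean)
```

Running this step before images reach the retriever or the model means location data never enters the index in the first place.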

Technical deep dive for engineers

Implementing Self-RAG with custom critiques

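Using the released Self-RAG checkpoint with Hugging Face transformers. The library has no SelfRagModel class, so this sketch loads the checkpoint as an ordinary causal language model and builds the prompt in the format described in the Self-RAG project README. The critique is not a separate prompt: the model emits it inline as reflection tokens.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "selfrag/selfrag_llama2_7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

def format_prompt(query, paragraph=None):
    # Instruction block, optionally followed by one retrieved passage.
    prompt = f"### Instruction:\n{query}\n\n### Response:\n"
    if paragraph is not None:
        prompt += f"[Retrieval]<paragraph>{paragraph}</paragraph>"
    return prompt

documents = [
    "Doc1: CEO denies rumors...",
    "Doc2: NYT reports resignation...",
]

# Generate one candidate answer per document and keep the reflection tokens visible.
for doc in documents:
    inputs = tokenizer(format_prompt("Is the CEO resigning?", doc), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=False))

What to expect:

Each candidate typically carries reflection tokens such as [Fully supported] or [No support / Contradictory], plus a [Utility:N] rating; the exact text varies. When candidates built from different documents disagree, the application should surface the conflict to the user, for example:

Answer: Reports conflict (Doc1 vs. Doc2). CEO’s status is unconfirmed.
Self-Critique: Low confidence due to conflicting sources.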

Multimodal retrieval with CLIP embeddings

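import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Score one image against candidate text labels in CLIP's shared embedding space.
image = Image.open("xray.jpg")
labels = ["lung tumor", "broken bone"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**inputs)

# Softmax over the labels gives relative match scores with shape (1, len(labels)).
similarity = outputs.logits_per_image.softmax(dim=1)

# e.g., tensor([[0.92, 0.08]]) means the image matches "lung tumor" better than "broken bone".
# The scores only compare the supplied labels; they are not diagnostic probabilities.
print(similarity)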

Deploying Phi-2 on iPhone with MLC-LLM

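# Quantize the Phi-2 weights to 4-bit (q4f16_1) and generate the chat config.
# Paths are placeholders; check the MLC-LLM documentation for the flags in your installed version.
mlc_llm convert_weight ./dist/models/phi-2/ --quantization q4f16_1 -o ./dist/phi-2-q4f16_1-MLC
mlc_llm gen_config ./dist/models/phi-2/ --quantization q4f16_1 --conv-template phi-2 -o ./dist/phi-2-q4f16_1-MLC

In the Swift app (a sketch based on the MLCSwift package; class and method names may differ between versions, and modelPath and modelLib are placeholders supplied by the packaged app configuration):

import MLCSwift

// Must run inside an async context.
let engine = MLCEngine()
await engine.reload(modelPath: modelPath, modelLib: modelLib)
for await chunk in await engine.chat.completions.create(
    messages: [ChatCompletionMessage(role: .user, content: "Explain quantum computing.")]
) {
    print(chunk.choices[0].delta.content?.asText() ?? "", terminator: "")
}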

Homework: Hands-on practice

For non-technical learners

Research Twitter’s 2023 AI misinformation crisis where Self-RAG failed to flag fake election tweets. Write a 300-word report addressing:

Why did the self-critique mechanism fail?
Propose a transparency feature for users to identify uncertain AI answers.

For technical learners

Run multimodal QA with LLaVA:

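# Install LLaVA from its official source repository
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA && pip install -e .
# Ask one question about one image
python -m llava.eval.run_llava \
  --model-path liuhaotian/llava-v1.5-7b \
  --image-file "xray.jpg" \
  --query "Is there a tumor?"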

Example output (the wording will vary between runs and model versions):

"The X-ray shows a mass in the upper lobe (likely tumor). Consult a doctor."

Key takeaways

Self-RAG improves trustworthiness by enabling models to critique their own answers and flag conflicting sources, reducing factual errors by about 35% (Asai et al., 2023).
Multimodal RAG unlocks new applications by fusing text, images, and video with tools like CLIP and LLaVA, enabling faster, richer insights like Mayo Clinic’s tumor detection improvement.
Small Language Models democratize AI by delivering efficient, low-cost inference on edge devices such as iPhones, with models like Phi-2 outperforming larger counterparts on logic tasks.
Ethical risks demand guardrails including diverse training for critique modules and privacy safeguards for multimodal data using on-device processing and metadata sanitization.
Balance efficiency and accuracy by monitoring Self-RAG’s uncertainty rates and optimizing SLMs with quantization tools like MLC-LLM.

Test yourself: The RAG system design challenge

You are the PM at a Bangalore-based Series B startup building a news aggregator app. Users want accurate, real-time answers on breaking news, including multimedia sources like tweets and videos. Your engineering lead proposes deploying a Self-RAG pipeline with a small language model running on user devices to handle latency and cost. However, your data science lead warns about potential over-trust in the model's self-critiques and privacy risks from processing images. You have two weeks before the next board meeting.

The call: How do you balance trust, privacy, and cost in your RAG architecture? What is your recommendation to leadership?

Your reasoning:

Where to go next

If you want to improve RAG with user feedback and retraining: Iterative Feedback Loops: User Signals, Retraining, and A/B Testing
If you are focused on scaling RAG systems efficiently: Scalability and Cost Optimization: Vector Databases, Cold Starts, and Tradeoffs
If your team needs to master prompt engineering for RAG: Prompt Engineering for RAG
If you want to understand ethical AI design: Ethical PM
If you want to explore foundational RAG concepts: RAG Architecture and Use Cases
