RAG Demystified: Part 3

We have discussed the core components of RAG in Part 1 and explored similarity metrics, embedding processes, and chunking strategies in Part 2. In this third installment, we’ll dive deep into advanced retrieval techniques, hybrid approaches, and cutting-edge RAG architectures that are revolutionizing how we build intelligent systems.

Static vs Dynamic Embeddings:

When we embed chunks, we can choose between static and dynamic embedding approaches, each with distinct advantages and use cases.

Static embedding:

Static embeddings refer to fixed vector representations of words or tokens that remain constant regardless of the context in which they appear. These embeddings are typically pre-trained on large corpora using models like Word2Vec or GloVe and capture general semantic and syntactic relationships between words. In contrast to dynamic embeddings, which adapt based on surrounding context, static embeddings assign the same vector to a word in every instance, regardless of sentence structure or meaning[1].

Static Embedding Example:

The word “bank” gets the same vector in both sentences:

  • “He sat by the river bank” → vector for bank: [0.45, 0.12, 0.88, 0.36]
  • “She deposited money in the bank” → vector for bank: [0.45, 0.12, 0.88, 0.36]

With static embeddings, each word in the vocabulary has exactly one fixed vector.
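
To make this concrete, here is a minimal sketch using pre-trained GloVe vectors loaded through gensim. The model name below is one of gensim's downloadable options and is an assumption for illustration; any Word2Vec, GloVe, or FastText model behaves the same way, because the lookup never sees the sentence.

```python
import gensim.downloader as api

# Load a small pre-trained static model (downloads ~66 MB on first use)
glove = api.load("glove-wiki-gigaword-50")

# The two example sentences play no role in the lookup at all
sentence_1 = "He sat by the river bank"
sentence_2 = "She deposited money in the bank"

vec_1 = glove["bank"]   # vector for "bank", regardless of context
vec_2 = glove["bank"]   # exactly the same vector again

print((vec_1 == vec_2).all())   # True: identical vectors in both contexts
```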

Advantages:

Static embeddings offer several advantages, including computational efficiency, which enables faster inference and reduces memory requirements. Additionally, they provide consistent representations, making them reliable and stable across different tasks and inputs.

Limitations:

While static embeddings are computationally efficient and easy to use, they lack the ability to capture contextual nuances, making them less effective for complex or ambiguous queries. Modern Retrieval-Augmented Generation (RAG) systems increasingly favor contextual embeddings from transformer-based models for improved accuracy and relevance.

Popular static embedding models include Word2Vec, GloVe, and FastText.

Dynamic embedding:

While static embeddings proved to be strong baselines for measuring semantic similarity, they inherently failed to capture the rich contextual information of natural language. For example, a static model would produce the identical vector for the word “bank” in both “river bank” and “money bank,” despite their distinct meanings. Dynamic embeddings, on the other hand, generate vector representations of words or sentences that vary based on the surrounding context within the text.

Dynamic Embedding Example:

With a dynamic model, the word “bank” has different vectors based on context:

  • “He sat by the river bank” → vector for bank: [0.33, 0.55, 0.74, 0.21]
  • “She deposited money in the bank” → vector for bank: [0.91, 0.13, 0.48, 0.62]

Popular models for dynamic embeddings include BERT, SBERT, all-MiniLM, the E5 family, Cohere Embed, and OpenAI’s text embedding models.
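
As a small sketch of how a contextual model produces different vectors for the same word, the snippet below extracts the token-level embedding of "bank" from bert-base-uncased in the two example sentences and compares them. The exact numbers will differ from the illustrative vectors shown above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    # Run the full sentence through BERT and grab the hidden state of "bank"
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat by the river bank")
v_money = bank_vector("She deposited money in the bank")

# The two "bank" vectors are related but clearly not identical
print(torch.nn.functional.cosine_similarity(v_river, v_money, dim=0).item())
```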

Advantages:

Dynamic embeddings provide context-aware representations that improve the handling of ambiguity and deliver superior performance on complex tasks. They are adaptable to domain-specific contexts, making them highly versatile.

Limitations:

The downsides of dynamic embeddings are higher computational costs, slower inference, increased memory requirements, and greater implementation complexity.

The Need for Hybrid Approaches:

There is no single embedding model that works optimally across all industries. While general-purpose embeddings provide a good starting point for building Retrieval-Augmented Generation (RAG) systems, achieving high performance typically requires industry-specific specialization.

When working with domain-specific data, the embedding model may not have been trained on the specialized technical or uncommon terms in that domain. Modern RAG systems therefore often employ a hybrid approach in which such exact terms are retrieved using sparse techniques, while semantic understanding is handled by dense vector representations. This combination helps improve recall and relevance in domain-sensitive applications. One of the most effective sparse retrieval techniques is BM25.

Let’s explore how BM25 works.

BM25 Algorithm:

BM25 (Best Match 25) is a ranking function widely used in search engines and information retrieval systems. It builds on the TF-IDF model by introducing enhancements like term frequency saturation and document length normalization, offering a more accurate measure of a document’s relevance to a search query[2].

Why BM25 Matters

Unlike simple TF-IDF, BM25 addresses two critical issues[3]:

  • Term frequency saturation: Prevents documents with excessive keyword stuffing from dominating results
  • Document length normalization: Ensures fair comparison between short and long documents

Core Components of BM25:

1. Inverse Document Frequency (IDF):

IDF measures how informative a term is by penalizing common terms and rewarding rare ones.

Formula:
IDF(qi) = log(1 + (N - df(qi) + 0.5)/(df(qi) + 0.5))

Where:

  • N = total number of documents
  • df(qi) = number of documents containing term qi
  • The 0.5 constants smooth the ratio: they prevent a zero numerator when df(qi) = N and a zero denominator when df(qi) = 0, while the 1 added inside the log keeps the IDF from turning negative for very common terms.

2. Term Frequency Saturation (TF):

BM25 limits the influence of term repetition—more appearances help, but with diminishing returns.

Formula:

TF_component = (f(qi, d) × (k1 + 1)) / (f(qi, d) + k1 × (1 - b + b × |d|/avgdl))

Where:

  • f(qi, d) = term frequency of qi in document d
  • |d| = document length
  • avgdl = average document length
  • k1 = saturation parameter (typically 1.2-2.0)
  • b = length normalization (typically 0.75)

BM25 Parameter Tuning:

k1 (Term Frequency Saturation):

  • Higher values (1.5-2.0): Less saturation, term frequency matters more
  • Lower values (0.5-1.0): More saturation, diminishing returns kick in sooner

b (Length Normalization):

  • b = 1: Full length normalization
  • b = 0: No length normalization
  • b = 0.75: Balanced approach (most common)

3. Document Length Normalization:

Longer documents naturally have a higher probability of containing a query term and having higher term frequencies. The BM25 algorithm penalizes longer documents to prevent them from having an unfair advantage in relevance scoring. This is where document length normalization comes in. The parameter b in the term frequency formula controls the degree of this normalization. If b is set to 1, the document’s length fully normalizes the term frequency. If it’s set to 0, there is no normalization.

Full BM25 Formula

The total BM25 score of document D for query Q = {q1, …, qn}:

BM25(D,Q) = Σ IDF(qi) × TF_component(qi, D)

Example Query: “Quantum Physics”

Documents:

  • D1: Quantum entanglement is a phenomenon in quantum physics. (8 words)
  • D2: Einstein called quantum entanglement spooky action at a distance. (9 words)
  • D3: Quantum physics explores the strange world of entanglement. (8 words)

The documents above are indexed using an inverted index, which maps each term to the documents that contain it, along with the term's frequency in each document.

For instance, the posting list for the term "quantum" looks like:

Posting List: [
  (docID: D1, term_freq: 2),
  (docID: D2, term_freq: 1),
  (docID: D3, term_freq: 1)]

Query: quantum physics

BM25 Calculation

Step 1: avgdl = (8 + 9 + 8) / 3 = 8.33
Step 2: IDF values

  •  “quantum” in all 3 docs → IDF ≈ 0.133
  •  “physics” in 2 docs → IDF ≈ 0.47

Step 3: Term frequencies

  • D1: quantum=2, physics=1
  • D2: quantum=1, physics=0
  • D3: quantum=1, physics=1

Using k1 = 1.5, b = 0.75:

D1 (|d| = 8):

Score("quantum") = 0.133 × (2 × 2.5) / (2 + 1.5 × (1 − 0.75 + 0.75 × 8/8.33)) ≈ 0.133 × 5/3.455 ≈ 0.192
Score("physics") is computed with the same formula ≈ 0.479
Total: 0.192 + 0.479 = 0.671

D2 (|d| = 9):

Score(“quantum”) ≈ 0.128
Score(“physics”) = 0
Total: 0.128

D3 (|d| = 8):

Score(“quantum”) ≈ 0.135
Score(“physics”) ≈ 0.479
Total: 0.614

Final Ranking (Most Relevant First)

1. D1 – 0.671
2. D3 – 0.614
3. D2 – 0.128
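
The hand calculation above can be reproduced with a short script. This is a minimal implementation of the BM25 formula exactly as presented here (production systems such as Lucene, Elasticsearch, or the rank_bm25 package follow the same idea with extra details); the small differences in the third decimal place come from rounding the IDF values in the hand calculation.

```python
import math

docs = {
    "D1": "quantum entanglement is a phenomenon in quantum physics",
    "D2": "einstein called quantum entanglement spooky action at a distance",
    "D3": "quantum physics explores the strange world of entanglement",
}
query = ["quantum", "physics"]
k1, b = 1.5, 0.75

tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
N = len(tokenized)
avgdl = sum(len(words) for words in tokenized.values()) / N

def idf(term: str) -> float:
    df = sum(1 for words in tokenized.values() if term in words)
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

def bm25(doc_id: str) -> float:
    words = tokenized[doc_id]
    norm = k1 * (1 - b + b * len(words) / avgdl)   # length normalization
    return sum(
        idf(term) * (words.count(term) * (k1 + 1)) / (words.count(term) + norm)
        for term in query
    )

for doc_id in sorted(docs, key=bm25, reverse=True):
    print(doc_id, round(bm25(doc_id), 2))   # D1 0.67, D3 0.61, D2 0.13 (same ranking as above)
```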

We have seen how dense vectors work; retrieval does not have to be dense, and if your use case depends on sparse or static representations you can choose those instead. The next step is to combine the BM25 and dense-vector results into a single ranked list. Several popular fusion algorithms can do this; we’ll examine one of them: Reciprocal Rank Fusion.

Reciprocal Rank Fusion (RRF):

RRF is a ranking fusion technique that combines multiple ranked lists of documents from different retrieval systems or queries. Instead of relying on the retrieval scores, RRF assigns scores to documents based solely on their rank positions in each list by using the reciprocal of the rank. The final ranking is determined by summing these reciprocal rank scores across all lists. This approach emphasizes documents that appear near the top of any list, improving overall retrieval effectiveness without depending on the original retrieval scores[4].

Formula:
RRF_score = Σ (1 / (k + rank_i))

Where:

  • k = 60 (standard dampening constant)
  • rank_i = rank of the document in retrieval system i

Complete RRF Example

Step 1: BM25 rank

Doc ID | BM25 Score | BM25 Rank
D1     | 0.671      | 1
D3     | 0.614      | 2
D2     | 0.128      | 3

Step 2: Simulated Dense Retrieval (Cosine Similarity)

Let’s assume you used an embedding model like `all-MiniLM-L6-v2`, and the cosine similarities for the query “quantum physics” with each document are as follows:

Doc ID | Cosine Similarity | Dense Rank
D3     | 0.94              | 1
D1     | 0.91              | 2
D2     | 0.76              | 3

These scores are plausible: D3 has high semantic alignment with both terms, D1 is close behind, and D2 is more about “entanglement” and less about “physics.”

Step 3: Apply Reciprocal Rank Fusion (RRF)

Formula:

RRF_score = Σ (1 / (k + rank_i)), summed over the n retrieval systems

Where:

  • n = number of rankers (2: BM25 and Dense)
  • k = 60 (commonly used constant to dampen dominance)

For each document:

D1

  • BM25 rank: 1 → 1/(60 + 1) = 1/61 ≈ 0.01639
  • Dense rank: 2 → 1/(60 + 2) = 1/62 ≈ 0.01613
  • RRF Score: 0.01639 + 0.01613 = 0.03252

D2

  • BM25 rank: 3 → 1/(60 + 3) = 1/63 ≈ 0.01587
  • Dense rank: 3 → 1/(60 + 3) = 1/63 ≈ 0.01587
  • RRF Score: 0.01587 + 0.01587 = 0.03174

D3

  • BM25 rank: 2 → 1/(60 + 2) = 1/62 ≈ 0.01613
  • Dense rank: 1 → 1/(60 + 1) = 1/61 ≈ 0.01639
  • RRF Score: 0.01613 + 0.01639 = 0.03252

Step 4: Final Hybrid Ranking via RRF

Doc ID | RRF Score | Final Rank
D1     | 0.03252   | 1 (tie)
D3     | 0.03252   | 1 (tie)
D2     | 0.03174   | 3

D1 and D3 end up with identical RRF scores, so they share the top rank; such ties can be broken using the original retrieval scores or a downstream reranker.
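
RRF itself is only a few lines of code. Here is a minimal sketch, assuming each retriever returns its documents as an ordered list of IDs (best first):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists (best first) by summing reciprocal ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["D1", "D3", "D2"]    # from the BM25 example
dense_ranking = ["D3", "D1", "D2"]   # from the simulated cosine similarities

for doc_id, score in rrf([bm25_ranking, dense_ranking]):
    print(doc_id, round(score, 5))   # D1 0.03252, D3 0.03252, D2 0.03175
```

Note that D2's exact score is 2/63 ≈ 0.03175; the table shows 0.03174 because each reciprocal rank was rounded to 0.01587 before summing.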

On top of hybrid search, a reranker model can be used to reorder retrieved documents, further validating their relevance. This is especially useful in hybrid systems combining BM25 and dense vector retrieval. Popular reranking models include BGE, ColBERT, and LLM-based rerankers.

Advanced RAG Architectures

Modern RAG systems have evolved beyond simple retrieve-and-generate patterns. Let’s explore cutting-edge architectures that address specific challenges.

1. HyDE (Hypothetical Document Embeddings)

It is an advanced variant of Retrieval-Augmented Generation (RAG) that enhances retrieval quality by generating a hypothetical answer based on the user’s query. Instead of directly fetching documents from a knowledge base for a user query, this approach first uses a language model to generate a synthetic or “ideal” document that represents what a good answer might look like. This hypothetical document is then embedded and used as a query to retrieve the most relevant documents.

Workflow:

• The user provides a query.
• The model generates a synthetic document that hypothetically answers the query.
• The hypothetical document is embedded and used to retrieve documents from a knowledge base.
• The model generates a final answer using the retrieved documents, guided by the hypothetical context.
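
Here is a minimal sketch of this workflow. The llm_complete, embed, and vector_store callables are placeholders for whatever LLM client, embedding model, and vector database your stack uses; they are assumptions for illustration, not a specific library's API.

```python
def hyde_retrieve(query, llm_complete, embed, vector_store, top_k=5):
    # 1. Generate a hypothetical document that answers the query
    prompt = f"Write a short passage that plausibly answers this question:\n{query}"
    hypothetical_doc = llm_complete(prompt)

    # 2. Embed the hypothetical document instead of the raw query
    query_vector = embed(hypothetical_doc)

    # 3. Retrieve real documents whose embeddings are closest to it
    retrieved_docs = vector_store.search(query_vector, top_k=top_k)

    # 4. Answer using the retrieved (real) documents as context
    context = "\n\n".join(retrieved_docs)
    return llm_complete(f"Answer using this context:\n{context}\n\nQuestion: {query}")
```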

Advantages:

– Bridges query-document gap
– Improves retrieval for complex questions
– Integrates with existing retrieval systems, requiring minimal modifications

2. Corrective RAG:

CRAG implements a self-grading technique for retrieved documents to improve accuracy. Unlike the traditional RAG architecture, which does not validate or grade the retrieved documents, CRAG evaluates the quality of the information before moving to the generation phase. CRAG breaks down the retrieved documents into “knowledge strips” and grades each strip for relevance. When confidence in the retrieved documents is low, CRAG uses web search to fetch more suitable information to answer the user’s query[5].

Workflow:

The CRAG pipeline follows a conditional, corrective flow:
A. The user submits a query.
B. Documents are retrieved from a static corpus.
C. A lightweight retrieval evaluator (based on T5) scores the relevance of each document to the query.
D. Based on confidence scores, one of three actions is taken:
• Correct: If relevant documents are found, they are refined.
• Incorrect: If all documents are irrelevant, a web search is triggered.
• Ambiguous: If relevance is uncertain, both internal and external sources are used.
E. Retrieved documents are decomposed into knowledge strips (e.g., D1 → K11, K12, K13). Each strip is scored, irrelevant ones are filtered out, and relevant ones are recomposed into a refined knowledge base.
F. The refined knowledge is passed to a language model to generate the final response.
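
A rough sketch of this conditional flow is shown below. The retriever, evaluator, web_search, split_into_strips, and llm callables, and the confidence thresholds, are placeholders rather than the paper's exact components.

```python
def corrective_rag(query, retriever, evaluator, web_search,
                   split_into_strips, llm, upper=0.7, lower=0.3):
    docs = retriever(query)
    scores = [evaluator(query, doc) for doc in docs]
    best = max(scores, default=0.0)

    if best >= upper:        # Correct: the internal documents look relevant
        sources = docs
    elif best <= lower:      # Incorrect: fall back to web search
        sources = web_search(query)
    else:                    # Ambiguous: combine internal and external sources
        sources = docs + web_search(query)

    # Decompose into knowledge strips and keep only the relevant ones
    strips = [strip for doc in sources for strip in split_into_strips(doc)]
    refined = [strip for strip in strips if evaluator(query, strip) >= upper]

    context = "\n".join(refined)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```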

Advantages:

– Self-correcting mechanism
– Reduces hallucination

3. Self-RAG:

Self-RAG is a framework that enables language models to decide when to retrieve external knowledge and how to evaluate it. It uses special reflection tokens like ISREL, ISSUP, and ISUSE to assess the relevance, support, and usefulness of generated content. The model retrieves documents only when needed, generates candidate outputs, and critiques them to select the most factual and helpful response. This self-reflective process improves factual accuracy and citation quality in generation tasks[6].

Workflow:

1. User provides a query
2. Model decides whether retrieval is required
• The model predicts a special Retrieve token (Yes, No, or Continue) to determine if external knowledge is needed.
• This decision is based on the input and any previously generated content.
3. If retrieval is required
• The model adds the Retrieve=Yes token in the output.
• It then fetches documents from a knowledge base (e.g., Wikipedia) using a retriever like Contriever.
4. For each retrieved document
• The model generates a candidate output segment (e.g., a sentence).
• This is done in parallel for each document.
5. Each candidate output is evaluated using reflection tokens:
• ISREL: Is the document relevant to the query?
• ISSUP: Does the document support the generated output?
• ISUSE: How useful is the output overall (rated 1–5)?
6. Best output is selected
• A beam search ranks the outputs using a weighted score based on the reflection tokens.
• The best segment is chosen and added to the final response.
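
The selection step can be sketched roughly as follows. The model methods and the reflection-token scores are placeholders; in the actual framework the LM is trained to emit these tokens itself and the weights are tuned.

```python
def self_rag_step(query, model, retriever, weights=(1.0, 1.0, 0.5)):
    # 1-2. Let the model decide whether retrieval is needed at all
    if not model.predicts_retrieve(query):
        return model.generate(query)

    # 3-4. Generate one candidate segment per retrieved document
    candidates = []
    for doc in retriever(query):
        segment = model.generate(query, context=doc)

        # 5. Score the candidate with the reflection tokens
        isrel = model.score("ISREL", query=query, document=doc)      # relevance
        issup = model.score("ISSUP", output=segment, document=doc)   # support
        isuse = model.score("ISUSE", query=query, output=segment)    # usefulness
        total = weights[0] * isrel + weights[1] * issup + weights[2] * isuse
        candidates.append((total, segment))

    # 6. Keep the highest-scoring segment as the next piece of the answer
    return max(candidates, key=lambda c: c[0])[1]
```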

Advantages:

– Dynamic retrieval decisions
– Built-in fact-checking
– Improved citation quality

4. Agentic RAG:

Agentic RAG is an advanced paradigm that enhances traditional RAG systems by embedding autonomous AI agents into the retrieval and generation pipeline. These agents are capable of:
• Dynamic decision-making
• Iterative reasoning
• Tool use
• Multi-agent collaboration
This enables Agentic RAG systems to adaptively manage complex, multi-step tasks and deliver highly contextual, real-time, and accurate responses across diverse domains such as healthcare, finance, education, and legal analysis. The Agentic RAG workflow can vary depending on the architecture (single-agent, multi-agent)[7].

Workflow:

1. A user submits a query to the system.
2. A coordinator agent evaluates the query and delegates tasks to specialized agents based on the query type and complexity.
3. Specialized Retrieval Agents:
Agents retrieve data from:
• Structured databases (via SQL)
• Unstructured documents (via semantic search)
• Web sources (via APIs)
• Graph knowledge bases (for relational reasoning)
• Recommendation systems (for personalization)
Agents may use external tools (e.g., vector search, web search, APIs) to enhance retrieval and reasoning.
4. Agents reflect on intermediate outputs, critique results, and plan next steps iteratively to refine the response.
5. Retrieved and validated data is passed to a Large Language Model (LLM) for synthesis into a coherent, context-aware response.
6. The final output is generated and delivered to the user, often with citations or actionable insights.
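
One way to sketch a simple single-coordinator version of this loop is shown below; the classify function, the agents mapping, and the llm callable are illustrative assumptions, not a specific framework's API.

```python
def agentic_rag(query, classify, agents, llm, max_rounds=3):
    """classify(query) -> route name; agents maps route names to retrieval callables."""
    evidence = []
    route = classify(query)

    for _ in range(max_rounds):
        # Delegate retrieval to the specialized agent chosen for this round
        evidence.extend(agents[route](query))

        # Coordinator reflects on the evidence and plans the next step
        decision = llm(
            f"Question: {query}\nEvidence so far: {evidence}\n"
            f"Reply 'enough' if this suffices, otherwise name one of: {', '.join(agents)}"
        ).strip()
        if decision not in agents:   # 'enough' or anything unrecognized ends the loop
            break
        route = decision

    return llm(f"Answer using only this evidence:\n{evidence}\n\nQuestion: {query}")
```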

Advantages:

– Handles complex, multi-step queries
– Specialized expertise per domain
– Iterative improvement

5. Multi-step Reasoning:

Traditional RAG often performs retrieval and generation in a single step. Multi-step reasoning involves breaking down complex queries into a series of simpler sub-queries, performing retrieval and generation at each step. This enables the RAG system to handle questions that require logical reasoning, inference, or the synthesis of information from multiple sources[8].

Workflow:

1. Query Decomposition: Break down the complex query into simpler sub-queries.
2. Iterative Retrieval and Generation: For each sub-query, retrieve relevant information and generate intermediate reasoning steps.
3. Knowledge Fusion: Combine all intermediate responses and retrieved content.
4. Reasoning Chains: Use the fused knowledge to build a coherent chain of reasoning, which may loop back to guide further retrieval.
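
A compact sketch of this decompose, retrieve, and fuse loop, with llm and retriever as placeholder callables:

```python
def multi_step_rag(query, llm, retriever):
    # 1. Query decomposition
    plan = llm(f"Break this question into simpler sub-questions, one per line:\n{query}")
    sub_queries = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Iterative retrieval and generation
    notes = []
    for sub_query in sub_queries:
        evidence = retriever(sub_query)
        answer = llm(f"Answer '{sub_query}' using this evidence:\n{evidence}")
        notes.append(f"{sub_query} -> {answer}")

    # 3-4. Knowledge fusion and the final reasoning chain
    fused = "\n".join(notes)
    return llm(f"Intermediate findings:\n{fused}\n\nNow answer the original question: {query}")
```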

Advantages:

– Improves factual accuracy
– Handles complex analytical queries

6. Multi-Modal RAG:

Multimodal RAG is an advanced RAG framework that enhances the capabilities of large language models (LLMs) by enabling them to reason over multiple types of data: not just text, but also images, audio, video, and more. It extends the traditional RAG pipeline, which retrieves relevant textual documents to support answer generation, by incorporating multimodal content into both the retrieval and generation stages[9].

Workflow:

1. Data Extraction: Collect and ingest multimodal data from various sources:

  • Text: documents, manuals, web pages
  • Images: diagrams, screenshots, photos
  • Audio/Video: recordings, tutorials

2. Embedding Generation: Convert each modality into vector representations.

  • Text → via a text encoder (e.g., BERT, OpenAI)
  • Image → via a vision encoder (e.g., CLIP, BLIP)
  • Audio → via an audio encoder (e.g., Whisper)

3. Vector Store Setup: Choose how to organize and store embeddings

  • Single vector store: Combine all modalities (e.g., text + image summaries)
  • Multiple vector stores: Separate stores for each modality (e.g., one for text, one for images)

4. Query Embedding: Embed the user query using the same multimodal embedding model (or models) used during ingestion.

5. Similarity Search: Perform separate similarity searches in the vector stores, retrieve the top-k relevant text chunks and images, and combine the retrieved content into a unified context.

6. Answer Generation: Feed the combined context into a multimodal LLM to generate the final answer.
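
As a small illustration of the shared-embedding-space option, the sketch below indexes a couple of text snippets and an image with a CLIP model from sentence-transformers and searches both with one query. The file path and example texts are illustrative assumptions.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps text and images into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

texts = ["Wiring diagram for the control unit", "Safety checklist for operators"]
images = [Image.open("control_unit_diagram.png")]        # illustrative path

text_embeddings = model.encode(texts, convert_to_tensor=True)
image_embeddings = model.encode(images, convert_to_tensor=True)

# Embed the query once, then search the text and image stores separately
query_embedding = model.encode("How is the control unit wired?", convert_to_tensor=True)
print(util.cos_sim(query_embedding, text_embeddings))    # similarity to each text chunk
print(util.cos_sim(query_embedding, image_embeddings))   # similarity to each image

# The top-k text chunks and images are then combined and passed to a multimodal LLM.
```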

Advantages:

– Richer context for generation
– Handles diverse query types

7. Graph RAG:

GraphRAG represents an enhanced RAG framework that augments large language model (LLM) capabilities by incorporating graph-structured data into both retrieval and generation processes. In contrast to conventional RAG systems that depend on flat text or image embeddings, GraphRAG exploits the interconnected and multi-faceted characteristics of graph structures to obtain information that is more contextually meaningful and structurally pertinent[10].

Workflow:

1. Ingestion Phase:

This phase prepares the graph-structured knowledge base from raw data.
• Identify key entities or concepts from source documents.
• Detect and define relationships between entities (e.g., “works for”, “located in”).
• Construct a graph network where nodes represent entities/concepts and edges represent the connections between them.
• Encode each node into a vector using embedding models.
• Maintain the graph structure in a graph database or memory-efficient format for fast retrieval.

2. Query Phase:

This phase handles user queries and generates responses using the graph.
• Perform entity recognition and relation extraction from the user query.
• Encode the query into a vector using the same embedding model used during ingestion.
• Formulate a structured query (e.g., SPARQL, Cypher) or identify seed nodes for traversal.
• Traverse the graph (e.g., BFS, DFS, GNN-based) to retrieve relevant nodes, paths, or subgraphs.
• Prune, rerank, or verbalize the retrieved subgraph to prepare it for the LLM.
• Feed the organized context into an LLM or hybrid model to generate a grounded response.
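
A toy version of the query phase is sketched below using networkx and simple string-based entity linking; the graph contents are illustrative, and real systems would use a graph database plus proper entity and relation extraction.

```python
import networkx as nx

# Ingestion: a tiny knowledge graph of entities and typed relationships
graph = nx.DiGraph()
graph.add_edge("Alice", "Acme Corp", relation="works for")
graph.add_edge("Acme Corp", "Berlin", relation="located in")

def graph_retrieve(query, graph, hops=2):
    # Entity linking: naive string matching of node names against the query
    seeds = [node for node in graph.nodes if node.lower() in query.lower()]
    facts = set()
    undirected = graph.to_undirected(as_view=True)
    for seed in seeds:
        # Traverse the neighborhood with BFS up to `hops` edges away
        for u, v in nx.bfs_edges(undirected, seed, depth_limit=hops):
            if graph.has_edge(u, v):
                facts.add(f"{u} {graph[u][v]['relation']} {v}")
            else:
                facts.add(f"{v} {graph[v][u]['relation']} {u}")
    return sorted(facts)

context = graph_retrieve("Where is the company Alice works for located?", graph)
print(context)   # ['Acme Corp located in Berlin', 'Alice works for Acme Corp']
# These verbalized triples become the grounded context passed to the LLM.
```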

Advantages:

– Captures complex relationships
– Enables reasoning over connections
– Handles multi-hop queries
– Provides explainable retrieval paths

Looking Ahead: RAG Part 4 Preview

In the final installment of this RAG series, we’ll cover:

  • Fine-tuning Embedding Models
  • Popular RAG Tools and Frameworks

Stay tuned.

References:

[1] What is the difference between static and contextual embeddings.

[2] The probabilistic relevance framework: BM25 and beyond.

[3] Mastering BM25: A deep dive into the algorithm and application in Milvus.

[4] Re-ranking with Reciprocal Rank Fusion (RRF).

[5] Corrective Retrieval Augmented Generation. 

[6] Self-RAG: Self evaluating the retrieved information using special tokens.

[7] Agentic Retrieval-Augmented Generation.

[8] Multi-Step Reasoning Using Chain-of-Thought.

[9] Optimizing RAG with multimodal inputs for industrial applications.

[10] Retrieval-Augmented Generation with Graphs.

Author Details

Karthikeyan S

Karthikeyan is a Researcher at the Applied Research Center of Applied AI and an LLM Engineer with 3+ years of experience specializing in Retrieval-Augmented Generation systems. He has implemented production-grade RAG solutions across multiple industries and designed enterprise training programs on RAG technologies. His expertise spans the complete RAG pipeline, from data preparation and embedding optimization to deployment and performance tuning.

Chetana Amancharla

Chetana Amancharla leads the Applied Research Center for Advanced AI, bringing 24 years of IT industry expertise to cutting-edge innovation in agentic AI. She specializes in developing intelligent platforms that combine autonomous agents, symbolic-neural reasoning, and large language models to create self-directed automation and adaptive enterprise solutions. Chetana focuses on translating advanced AI research into scalable applications that drive real-world business transformation.

Bhumika Singhal

Bhumika Singhal, an AI/ML Technology Architect at Infosys, specializes in NLP and GenAI, with a profound focus on Agentic AI. Her 12-year career encompasses leading client projects in autonomous agent design and deployment, alongside developing semantic search platforms, leveraging her extensive Azure technology stack expertise.
