Part 23: RAG Architectures - Semantic Retrieval & Knowledge Systems

Master chunking strategies, embedding models, re-ranking (Cohere), hybrid search, contextual compression, and evaluation frameworks for RAG systems.

1. Why RAG in 2026?

RAG (Retrieval-Augmented Generation) is the backbone of production GenAI applications. Engineers with RAG expertise command 40-70% higher salaries in AI engineering roles.

Key Components

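  • Document Chunking: Splitting large documents into passages small enough to embed and retrieve
  • Embedding Models: Converting text to vectors
  • Vector Storage: Storing and searching embeddings
  • Re-ranking: Re-scoring retrieved candidates so the most relevant come first
  • Prompt Engineering: Structuring the LLM prompt around the retrieved context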

2. Document Chunking Strategies

Text Splitting Methods

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive character splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split documents
texts = text_splitter.split_text("Your large document text...")

# Token-based splitting
from langchain.text_splitter import TokenTextSplitter

token_splitter = TokenTextSplitter(
    chunk_size=256,
    chunk_overlap=32
)
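
The chunk_size and chunk_overlap settings above determine how far each window advances (the stride is chunk_size minus chunk_overlap) and therefore roughly how many chunks a document produces. The sketch below works through that arithmetic for a plain fixed-size window. It is an illustration only: RecursiveCharacterTextSplitter also honours separators and will not cut at exactly these offsets, and chunk_spans and expected_chunk_count are hypothetical helper names, not part of LangChain.

```python
import math


def chunk_spans(text_len, chunk_size, overlap):
    """Return (start, end) character offsets for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the last window already reaches the end of the text
        start += stride
    return spans


def expected_chunk_count(text_len, chunk_size, overlap):
    """Closed-form count matching chunk_spans: one chunk, plus one per extra stride."""
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    return 1 + math.ceil((text_len - chunk_size) / (chunk_size - overlap))


if __name__ == "__main__":
    print(chunk_spans(2500, 1000, 200))
    print(expected_chunk_count(2500, 1000, 200))
```

For a 2,500-character text with chunk_size=1000 and chunk_overlap=200, the stride is 800, the windows start at 0, 800 and 1600, and the result is three chunks.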

Chunking Strategies

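# Sliding window approach
def sliding_window_chunks(text, chunk_size, overlap):
    # Guard against a window that never advances, which would loop forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the final window already covers the end of the text
        start = end - overlap
    return chunks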

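# Semantic chunking: start a new chunk when adjacent sentences diverge.
# Assumes embedding_model exposes embed(text) -> vector and that a
# cosine_similarity(a, b) helper is in scope; neither is defined here.
def semantic_chunking(text, embedding_model, threshold=0.5):
    sentences = [s for s in text.split('. ') if s]  # naive sentence split
    chunks = []
    current_chunk = []
    prev_embedding = None
    
    for sentence in sentences:
        embedding = embedding_model.embed(sentence)
        # Use "is not None": truth-testing a NumPy array raises ValueError
        if prev_embedding is not None and cosine_similarity(embedding, prev_embedding) < threshold:
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentence]
        else:
            current_chunk.append(sentence)
        prev_embedding = embedding
    
    if current_chunk:
        chunks.append('. '.join(current_chunk))
    
    return chunks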
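
The semantic chunking function calls a cosine_similarity helper that this section does not define. One plausible definition, written with the standard library only, is sketched below. The zero-vector behaviour (returning 0.0) is a convention chosen for this example, not something the text above specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (sequences of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Values near 1 mean the two sentences point in the same direction in embedding space, values near 0 mean they are unrelated, so a threshold such as 0.5 marks a topic shift only when similarity drops well below that of neighbouring sentences.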

3. Embedding Models

Model Selection

from sentence_transformers import SentenceTransformer

# Choose embedding model based on use case
MODELS = {
    "general": "sentence-transformers/all-MiniLM-L6-v2",  # Fast, good quality
    "semantic": "sentence-transformers/all-MiniLM-L12-v2",  # Larger (12 layers), slower
    "multilingual": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    # Placeholder: this is the same general-purpose model as "general";
    # substitute a code-trained embedding model for real code search
    "code": "sentence-transformers/all-MiniLM-L6-v2"
}

# Load model
model = SentenceTransformer(MODELS["general"])

# Generate embeddings
embeddings = model.encode([
    "Machine learning is fascinating",
    "Data science workflows are important"
])

Embedding Optimization

# Batch processing for efficiency
def batch_embed(texts, model, batch_size=32):
    # SentenceTransformer.encode already batches internally, so pass
    # batch_size through rather than looping (which also draws one
    # progress bar per batch)
    return model.encode(texts, batch_size=batch_size, show_progress_bar=True)

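# Normalize embeddings for cosine similarity
import numpy as np

def normalize_embeddings(embeddings):
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as they are instead of dividing by zero
    return embeddings / norms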
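
Once vectors are scaled to unit length, their dot product equals their cosine similarity, which is why normalization is usually done once at indexing time. The sketch below shows a brute-force top-k lookup built on that property, using only the standard library. The names l2_normalize and top_k_by_dot are hypothetical, and a real system would hand this work to a vector database.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def top_k_by_dot(query, corpus, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to the query.

    Both sides are normalized, so each dot product is a cosine similarity.
    """
    q = l2_normalize(query)
    scored = (
        (i, sum(x * y for x, y in zip(q, l2_normalize(vec))))
        for i, vec in enumerate(corpus)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, score in top_k_by_dot([2.0, 0.0], corpus, k=2):
        print(index, round(score, 3))
```

In practice the corpus vectors would be normalized once and stored, so only the query is normalized per request.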

4. Hybrid Search Implementation

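class HybridSearcher:
    def __init__(self, vector_db, keyword_index, embed_fn):
        self.vector_db = vector_db
        self.keyword_index = keyword_index
        # embed_fn: callable mapping query text to an embedding vector
        self.get_embedding = embed_fn
    
    def search(self, query, top_k=5, vector_weight=0.7, keyword_weight=0.3):
        # Vector search (fetch 2x top_k so the merge has spare candidates)
        query_embedding = self.get_embedding(query)
        vector_results = self.vector_db.search(query_embedding, top_k * 2)
        
        # Keyword search
        keyword_results = self.keyword_index.search(query, top_k * 2)
        
        # Combine scores; the weights only make sense if both result sets
        # are on a comparable scale, so normalize raw BM25 scores first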
        combined_results = self.combine_results(
            vector_results, keyword_results,
            vector_weight, keyword_weight
        )
        
        return sorted(combined_results, key=lambda x: x['score'], reverse=True)[:top_k]
    
    def combine_results(self, vector_results, keyword_results, v_weight, k_weight):
        # Merge and weight scores
        all_results = {}
        
        for result in vector_results:
            all_results[result['id']] = {
                'id': result['id'],
                'score': result['score'] * v_weight,
                'data': result['data']
            }
        
        for result in keyword_results:
            if result['id'] in all_results:
                all_results[result['id']]['score'] += result['score'] * k_weight
            else:
                all_results[result['id']] = {
                    'id': result['id'],
                    'score': result['score'] * k_weight,
                    'data': result['data']
                }
        
        return list(all_results.values())
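
Weighted score merging, as in the class above, depends on vector and keyword scores being on comparable scales. A commonly used alternative that sidesteps this is Reciprocal Rank Fusion (RRF), which combines ranks rather than raw scores. The sketch below is a different technique from the weighted sum in HybridSearcher, shown only for comparison. The function name and the k=60 constant (a conventional default for RRF) are choices made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    """Fuse several rankings of document ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Raw retrieval scores are never used.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_n]


if __name__ == "__main__":
    vector_ranking = ["d1", "d2", "d3"]   # best first, from the vector index
    keyword_ranking = ["d3", "d1", "d4"]  # best first, from the keyword index
    fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
    print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear near the top of both lists (d1 and d3 here) rise above documents found by only one retriever.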

BM25 Integration

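from rank_bm25 import BM25Okapi

class BM25Retriever:
    def __init__(self, documents):
        self.documents = documents
        # Whitespace tokenization, lowercased so matching is case-insensitive
        self.tokenized_docs = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)
    
    def search(self, query, k=5):
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        top_indices = scores.argsort()[-k:][::-1]  # highest score first
        # Return the same result shape that HybridSearcher.combine_results reads
        return [
            {'id': int(i), 'score': float(scores[i]), 'data': self.documents[i]}
            for i in top_indices
        ]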
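
To make the scores returned by a BM25 index less of a black box, the sketch below computes BM25 by hand with the standard library. It assumes the common formulation with parameters k1 and b and a non-negative IDF term; the exact IDF variant and defaults inside rank_bm25 may differ, so the numbers are not expected to match that library.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a larger weight; the "1 +" keeps the IDF non-negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Longer-than-average documents are penalized through b
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [d.split() for d in ["the cat sat", "the dog barked", "cat and dog"]]
    print(bm25_scores(["cat"], docs))
```

Because these scores are unbounded and depend on corpus statistics, they should be rescaled (or replaced by ranks) before being mixed with cosine similarities.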

5. Re-ranking Techniques

Cross-Encoder Re-ranking

from sentence_transformers import CrossEncoder

# Load cross-encoder for re-ranking
re_ranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, candidates):
    # Create pairs for cross-encoder
    pairs = [[query, candidate['text']] for candidate in candidates]
    
    # Get re-ranking scores
    scores = re_ranker.predict(pairs)
    
    # Sort by re-ranking scores
    reranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    
    return [candidate for candidate, _ in reranked]

Cohere Reranking

import cohere

def cohere_rerank(documents, query, top_n=5):
    response = cohere.Client(api_key='YOUR_API_KEY').rerank(
        model='rerank-english-v3.0',  # check Cohere's docs for current model names
        query=query,
        documents=[doc['text'] for doc in documents],
        top_n=top_n
    )
    
    ranked_docs = []
    for result in response.results:
        ranked_docs.append({
            # result.index is the position of the document in the input list
            'document': documents[result.index],
            'relevance_score': result.relevance_score
        })
    
    return ranked_docs

6. Contextual Compression

Compress Retrieved Documents

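# LangChain ships a built-in equivalent (ContextualCompressionRetriever);
# the class below is a framework-free sketch of the same idea.
import re

class DocumentCompressor:
    def compress(self, documents, query):
        compressed = []
        for doc in documents:
            # Extract relevant parts
            relevant_text = self.extract_relevant_parts(doc['text'], query)
            compressed.append({
                'id': doc['id'],
                'text': relevant_text,
                'metadata': doc['metadata']
            })
        return compressed
    
    def extract_relevant_parts(self, text, query):
        # Simple keyword-based extraction: keep sentences that share at least
        # one whole word with the query (substring matching would let short
        # words such as "a" match almost every sentence)
        query_words = set(re.findall(r'\w+', query.lower()))
        sentences = text.split('. ')
        relevant = [s for s in sentences if query_words & set(re.findall(r'\w+', s.lower()))]
        return '. '.join(relevant[:3])  # First 3 matching sentences, in document order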

LLM-based Compression

def compress_with_llm(documents, query, llm):
    context = "\n\n".join([doc['text'] for doc in documents])
    
    prompt = f"""
    Given the following context and query, extract only the relevant information:
    
    Context: {context}
    
    Query: {query}
    
    Relevant Information:
    """
    
    response = llm(prompt)
    return response.strip()

7. RAG Evaluation Metrics

RAGAS Framework

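# ragas names these metrics answer_relevancy and context_precision
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from ragas import evaluate

# Evaluate RAG system
# `dataset` is assumed to be prepared beforehand and to hold, per sample,
# the question, generated answer, retrieved contexts, and ground truth
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)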

print(results)

Custom Evaluation

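# Assumes embed(text), cosine_similarity(a, b), and check_factuality(answer, context)
# are defined elsewhere; context is a list of retrieved passages.
def evaluate_rag(query, answer, context, expected):
    metrics = {}
    
    # Context relevance: best query-to-passage similarity
    query_embedding = embed(query)  # embed the query once, not once per passage
    context_scores = [
        cosine_similarity(query_embedding, embed(passage))
        for passage in context
    ]
    metrics['context_relevance'] = max(context_scores, default=0.0)
    
    # Answer relevance: similarity between the generated and the expected answer
    metrics['answer_relevance'] = cosine_similarity(
        embed(answer),
        embed(expected)
    )
    
    # Faithfulness (requires LLM judgment)
    metrics['faithfulness'] = check_factuality(answer, context)
    
    return metrics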
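
The metrics above judge the generated answer. When labelled relevant documents are available, the retrieval step can also be checked on its own. The sketch below shows two standard retrieval metrics, recall@k and mean reciprocal rank (MRR), using only the standard library. The function names and the input shapes (lists of document ids) are assumptions made for this example.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant & set(retrieved_ids[:k]))
    return hits / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


if __name__ == "__main__":
    print(recall_at_k(["a", "b", "c"], ["b", "d"], k=2))  # 0.5
    print(mean_reciprocal_rank([
        (["a", "b"], ["b"]),  # first relevant hit at rank 2
        (["x", "y"], ["x"]),  # first relevant hit at rank 1
    ]))  # 0.75
```

Tracking these alongside answer-level metrics helps separate retrieval failures (the right passage was never fetched) from generation failures (the passage was fetched but misused).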

8. Resource Directory: RAG Systems

Best Books

Book | Publisher | Price | Key Topics
Building RAG Applications | O'Reilly | Paid | RAG implementation
Practical RAG | Packt | Paid | Hands-on RAG
LLM Engineering | O'Reilly | Paid | Production LLMs
AI Engineering | O'Reilly | Paid | AI systems

Best Udemy Courses

Course | Instructor | Price (INR) | Key Topics
RAG Systems Course | Instructor | ₹1,999-2,999 | RAG implementation
LangChain RAG | Instructor | ₹1,499-2,299 | LangChain
Vector DBs for RAG | Instructor | ₹1,499-2,299 | Vector databases
Advanced RAG | Instructor | ₹1,999-2,999 | Advanced techniques

Best O'Reilly Resources

Resource | Publisher | Access
Building RAG Applications | O'Reilly | Paid
Learning RAG Systems | O'Reilly | Paid
Practical RAG | O'Reilly | Paid

Best LinkedIn Learning Courses

Course | Instructor | Access
RAG Systems | Instructor | Paid
Knowledge Graphs | Instructor | Paid
AI Search | Instructor | Paid

Free Resources

Resource | Platform | Link
RAG Tutorial | LangChain | python.langchain.com/docs/use_cases/qa_knowledge
RAGAS Docs | GitHub | github.com/explodinggradients/ragas
Awesome RAG | GitHub | github.com/maxime-recher/awesome-rag
RAG Paper | arXiv | arxiv.org/abs/2005.11401

9. Common RAG Interview Questions

Question | Answer
How to handle large documents in RAG? | Chunk documents, use sliding window, semantic chunking.
What is re-ranking? | Second-pass model that re-orders retrieved documents by relevance.
How to improve RAG accuracy? | Better chunking, re-ranking, prompt engineering, higher quality embeddings.
What is contextual compression? | Reduce document size while preserving relevant information.
How to evaluate RAG systems? | RAGAS metrics, human evaluation, domain-specific metrics.

10. Part Navigation

Previous Parts

Part 22: Vector Databases

Next Parts

Part 24: LangChain Foundations · Part 25: LangGraph

