Career Guide 29 May 2026 6 min read

Part 23: RAG Architectures - Semantic Retrieval & Knowledge Systems

Master chunking strategies, embedding models, re-ranking (Cohere), hybrid search, contextual compression, and evaluation frameworks for RAG systems.

By Chirag Singhal

1. Why RAG in 2026?

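RAG (Retrieval-Augmented Generation) retrieves relevant passages from a knowledge base at query time and passes them to the LLM as context, which is why it has become the backbone of production GenAI applications. Engineers with RAG expertise command 40-70% higher salaries in AI engineering roles.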

Key Components

Document Chunking: Splitting large documents
Embedding Models: Converting text to vectors
Vector Storage: Storing and searching embeddings
Re-ranking: Improving result relevance
Prompt Engineering: Structuring LLM prompts
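
To see how these components connect, here is a minimal, dependency-free sketch of the flow. It is an illustration only: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, re-ranking is omitted for brevity, and all function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def chunk(text, size=200, overlap=40):
    """Document chunking: split text into overlapping character windows."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, index, top_k=2):
    """Vector search: rank stored chunks by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:top_k]]

def build_prompt(query, passages):
    """Prompt engineering: place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

documents = [
    "BM25 is a keyword ranking function used in search engines",
    "Embeddings map text to dense vectors",
    "Re-ranking reorders retrieved passages by relevance",
]

# Vector storage: an in-memory list of chunk text plus its vector
index = [{"text": c, "vector": embed(c)} for doc in documents for c in chunk(doc)]

passages = retrieve("what is bm25 ranking", index, top_k=1)
prompt = build_prompt("what is bm25 ranking", passages)
print(prompt)
```

In a real system each function would be replaced by the corresponding tool from the sections that follow; the order of the steps stays the same.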

2. Document Chunking Strategies

Text Splitting Methods

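# Recent LangChain releases ship both splitters in the langchain-text-splitters
# package; older releases expose them as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter

# Recursive character splitter: tries each separator in order (paragraphs,
# lines, words, characters) until the pieces fit within chunk_size
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # measured by length_function, here characters
    chunk_overlap=200,    # characters shared between neighbouring chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split a raw string into a list of chunk strings
texts = text_splitter.split_text("Your large document text...")

# Token-based splitting: sizes are counted in tokens, not characters
# (requires the tiktoken package)
token_splitter = TokenTextSplitter(
    chunk_size=256,
    chunk_overlap=32
)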

Chunking Strategies

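# Sliding window approach (character-based)
def sliding_window_chunks(text, chunk_size, overlap):
    # The window must advance on every step, otherwise the loop never ends
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last window reached; avoids a redundant tail chunk
        start = end - overlap
    return chunks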

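# Semantic chunking: start a new chunk when adjacent sentences diverge
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def semantic_chunking(text, embedding_model, threshold=0.5):
    # Assumes embedding_model.embed(str) returns a 1-D vector; with
    # sentence-transformers, call embedding_model.encode(sentence) instead.
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    chunks = []
    current_chunk = []
    prev_embedding = None
    
    for sentence in sentences:
        embedding = embedding_model.embed(sentence)
        # "is not None" avoids the ambiguous truth value of a NumPy array
        if prev_embedding is not None and cosine_similarity(embedding, prev_embedding) < threshold:
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentence]
        else:
            current_chunk.append(sentence)
        prev_embedding = embedding
    
    if current_chunk:
        chunks.append('. '.join(current_chunk))
    
    return chunks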
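
When tuning `chunk_size` and `chunk_overlap`, it helps to estimate how many chunks a document will produce, since that drives embedding cost and index size. The sketch below is an assumption-laden estimate for a plain character-based sliding window that stops once a window reaches the end of the text; separator-aware splitters such as `RecursiveCharacterTextSplitter` will produce different counts. The function name `expected_chunk_count` is hypothetical.

```python
import math

def expected_chunk_count(text_length: int, chunk_size: int, overlap: int) -> int:
    """Estimate chunks produced by a character-based sliding window.

    Each new window advances by stride = chunk_size - overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if text_length <= 0:
        return 0
    stride = chunk_size - overlap
    # The first window covers chunk_size characters; every further window adds
    # `stride` new characters, which simplifies to the expression below.
    return max(1, math.ceil((text_length - overlap) / stride))

# A 10,000-character document with the settings used earlier in this section
print(expected_chunk_count(10_000, chunk_size=1000, overlap=200))  # 13
```

Raising the overlap shrinks the stride, so the chunk count (and the number of embeddings to compute and store) grows even though the document itself is unchanged.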

3. Embedding Models

Model Selection

from sentence_transformers import SentenceTransformer

# Choose embedding model based on use case
MODELS = {
    "general": "sentence-transformers/all-MiniLM-L6-v2",  # Fast, 384-dim, good baseline
    "semantic": "sentence-transformers/all-MiniLM-L12-v2",  # Deeper (12 layers), slower
    "multilingual": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    # General-purpose placeholder, not trained on code: swap in a
    # code-specific embedding model for serious code search
    "code": "sentence-transformers/all-MiniLM-L6-v2"
}

# Load model
model = SentenceTransformer(MODELS["general"])

# Generate embeddings
embeddings = model.encode([
    "Machine learning is fascinating",
    "Data science workflows are important"
])

Embedding Optimization

# Batch processing for efficiency
import numpy as np

def batch_embed(texts, model, batch_size=32):
    # encode() batches internally, so a single call shows one progress bar
    # and returns a (len(texts), dim) NumPy array
    return model.encode(texts, batch_size=batch_size, show_progress_bar=True)

# Normalize embeddings for cosine similarity
def normalize_embeddings(embeddings):
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors unchanged instead of dividing by zero
    return embeddings / norms
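
The reason normalization matters: once two vectors have unit length, their dot product equals their cosine similarity, so a vector index can use the cheaper dot product. A small pure-Python check of that identity follows; the helper names and the sample vectors are illustrative only.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

a = [3.0, 4.0]
b = [4.0, 3.0]

cosine_raw = cosine(a, b)                              # computed on the raw vectors
dot_normalized = dot(l2_normalize(a), l2_normalize(b)) # dot product after normalizing

# Both values agree up to floating-point rounding
print(cosine_raw, dot_normalized)
```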

4. Hybrid Search Implementation

Keyword + Vector Search

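class HybridSearcher:
    def __init__(self, vector_db, keyword_index, embed_fn):
        self.vector_db = vector_db
        self.keyword_index = keyword_index
        self.get_embedding = embed_fn  # callable: query string -> embedding vector
    
    def search(self, query, top_k=5, vector_weight=0.7, keyword_weight=0.3):
        # Over-fetch (top_k * 2) from both indexes so the merged list has enough
        # candidates. The weighted sum assumes both score sets share a scale
        # (e.g. 0-1): normalize raw BM25 scores before mixing them with cosine scores.
        # Vector search
        query_embedding = self.get_embedding(query)
        vector_results = self.vector_db.search(query_embedding, top_k * 2)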
        
        # Keyword search
        keyword_results = self.keyword_index.search(query, top_k * 2)
        
        # Combine scores
        combined_results = self.combine_results(
            vector_results, keyword_results,
            vector_weight, keyword_weight
        )
        
        return sorted(combined_results, key=lambda x: x['score'], reverse=True)[:top_k]
    
    def combine_results(self, vector_results, keyword_results, v_weight, k_weight):
        # Merge and weight scores
        all_results = {}
        
        for result in vector_results:
            all_results[result['id']] = {
                'id': result['id'],
                'score': result['score'] * v_weight,
                'data': result['data']
            }
        
        for result in keyword_results:
            if result['id'] in all_results:
                all_results[result['id']]['score'] += result['score'] * k_weight
            else:
                all_results[result['id']] = {
                    'id': result['id'],
                    'score': result['score'] * k_weight,
                    'data': result['data']
                }
        
        return list(all_results.values())
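
A weighted sum like the one above only works well when vector and keyword scores are on comparable scales. A common alternative, named here plainly as a different technique, is Reciprocal Rank Fusion (RRF), which ignores raw scores and combines results by rank position. The sketch below assumes each retriever returns a list of document ids ordered best-first; the function name and the sample ids are hypothetical, and `k=60` is the constant commonly used for RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several best-first id lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]

vector_ranking = ["a", "b", "c"]   # ids from the vector index, best first
keyword_ranking = ["b", "d", "a"]  # ids from the keyword index, best first

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)  # ['b', 'a', 'd', 'c']
```

Documents that appear near the top of both lists rise to the top of the fused list, and no score normalization is needed.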

BM25 Integration

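from rank_bm25 import BM25Okapi

class BM25Retriever:
    def __init__(self, documents):
        self.documents = documents
        # Lowercase so "RAG" and "rag" match; use a proper tokenizer in production
        self.tokenized_docs = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)
    
    def search(self, query, k=5):
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)  # one score per document
        top_indices = scores.argsort()[-k:][::-1]       # indices of the k best, descending
        # Same result shape that HybridSearcher.combine_results expects
        return [
            {'id': int(i), 'score': float(scores[i]), 'data': self.documents[i]}
            for i in top_indices
        ]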
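
To make the scores less of a black box, here is a dependency-free sketch of the standard BM25 formula. It is not the `rank_bm25` implementation: it uses the non-negative IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)`, so absolute values can differ from what the library returns. The function name `bm25_scores` and the sample documents are hypothetical; `k1=1.5` and `b=0.75` are commonly used defaults.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 formula."""
    if not tokenized_docs:
        return []
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates (k1) and long documents are penalized (b)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

docs = [
    "rag uses retrieval".split(),
    "vector search".split(),
    "retrieval and vector retrieval".split(),
]
scores = bm25_scores(["retrieval"], docs)
print(scores)
```

The third document mentions the query term twice, so it outscores the first despite being longer, and the second document scores zero because it never mentions the term.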

5. Re-ranking Techniques

Cross-Encoder Re-ranking

from sentence_transformers import CrossEncoder

# Load cross-encoder for re-ranking
re_ranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, candidates):
    # Create pairs for cross-encoder
    pairs = [[query, candidate['text']] for candidate in candidates]
    
    # Get re-ranking scores
    scores = re_ranker.predict(pairs)
    
    # Sort by re-ranking scores
    reranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    
    return [candidate for candidate, _ in reranked]

Cohere Reranking

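import os
import cohere

# Create the client once and read the key from the environment
# (COHERE_API_KEY is an assumed variable name)
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def cohere_rerank(documents, query, top_n=5):
    response = co.rerank(
        model='rerank-english-v2.0',  # check Cohere's docs for the current model name
        query=query,
        documents=[doc['text'] for doc in documents],
        top_n=top_n
    )
    
    ranked_docs = []
    for result in response.results:
        ranked_docs.append({
            # result.index is the position in the documents list passed above
            'document': documents[result.index],
            'relevance_score': result.relevance_score
        })
    
    return ranked_docs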

6. Contextual Compression

Compress Retrieved Documents

# LangChain provides ContextualCompressionRetriever for this pattern; the class
# below is a dependency-free sketch of the same idea.

class DocumentCompressor:
    def compress(self, documents, query):
        compressed = []
        for doc in documents:
            # Extract relevant parts
            relevant_text = self.extract_relevant_parts(doc['text'], query)
            compressed.append({
                'id': doc['id'],
                'text': relevant_text,
                'metadata': doc['metadata']
            })
        return compressed
    
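    def extract_relevant_parts(self, text, query):
        # Simple keyword-based extraction. This is a substring match, so short
        # or common query words (e.g. "a", "the") match almost every sentence;
        # filter stop words from the query in real use.
        sentences = text.split('. ')
        relevant = [s for s in sentences if any(word in s.lower() for word in query.lower().split())]
        return '. '.join(relevant[:3])  # First 3 matching sentences, in document order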

LLM-based Compression

def compress_with_llm(documents, query, llm):
    context = "\n\n".join([doc['text'] for doc in documents])
    
    prompt = f"""
    Given the following context and query, extract only the relevant information:
    
    Context: {context}
    
    Query: {query}
    
    Relevant Information:
    """
    
    response = llm(prompt)
    return response.strip()

7. RAG Evaluation Metrics

RAGAS Framework

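# Metric names and expected dataset columns have changed between ragas
# releases, so check them against your installed version.
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from ragas import evaluate

# Evaluate RAG system
# `dataset` is assumed to be a Hugging Face datasets.Dataset holding, per row,
# the question, the generated answer, the retrieved contexts, and a reference answer.
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)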

print(results)

Custom Evaluation

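# embed(), cosine_similarity() and check_factuality() are placeholders:
# plug in your embedding model, a vector similarity, and an LLM-based judge.
def evaluate_rag(query, answer, context, expected):
    metrics = {}
    
    # Context relevance: similarity of the best-matching retrieved passage
    query_embedding = embed(query)  # embed the query once, not once per passage
    context_scores = [
        cosine_similarity(query_embedding, embed(passage))
        for passage in context
    ]
    metrics['context_relevance'] = max(context_scores, default=0.0)
    
    # Answer relevance: similarity between the generated and the expected answer
    metrics['answer_relevance'] = cosine_similarity(
        embed(answer),
        embed(expected)
    )
    
    # Faithfulness (requires LLM judgment)
    metrics['faithfulness'] = check_factuality(answer, context)
    
    return metrics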
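
Embedding-based scores say little about whether the retriever found the right passages in the first place. When labeled relevant document ids are available, two standard retrieval metrics, recall@k and mean reciprocal rank (MRR), can be computed without any model. The sketch below assumes each evaluation run is a pair of (retrieved ids in rank order, set of relevant ids); the function names and sample ids are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)

runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d9", "d8"], {"d8"}),                    # first relevant hit at rank 2
    (["x"], {"y"}),                            # nothing relevant retrieved
]

print(recall_at_k(runs[0][0], runs[0][1], k=2))  # 0.5
print(mean_reciprocal_rank(runs))
```

Tracking these numbers while changing chunk size, embedding model, or re-ranker shows whether a change helped retrieval itself, separately from answer quality.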

8. Resource Directory: RAG Systems

Best Books

Book	Publisher	Price	Key Topics
Building RAG Applications	O'Reilly	Paid	RAG implementation
Practical RAG	Packt	Paid	Hands-on RAG
LLM Engineering	O'Reilly	Paid	Production LLMs
AI Engineering	O'Reilly	Paid	AI systems

Best Udemy Courses

Course	Instructor	Price (INR)	Key Topics
RAG Systems Course	Instructor	₹1,999-2,999	RAG implementation
LangChain RAG	Instructor	₹1,499-2,299	LangChain
Vector DBs for RAG	Instructor	₹1,499-2,299	Vector databases
Advanced RAG	Instructor	₹1,999-2,999	Advanced techniques

Best O'Reilly Resources

Resource	Publisher	Access
Building RAG Applications	O'Reilly	Paid
Learning RAG Systems	O'Reilly	Paid
Practical RAG	O'Reilly	Paid

Best LinkedIn Learning Courses

Course	Instructor	Access
RAG Systems	Instructor	Paid
Knowledge Graphs	Instructor	Paid
AI Search	Instructor	Paid

Free Resources

Resource	Platform	Link
RAG Tutorial	LangChain	python.langchain.com/docs/use_cases/qa_knowledge
RAGAS Docs	GitHub	github.com/explodinggradients/ragas
Awesome RAG	GitHub	github.com/maxime-recher/awesome-rag
RAG Paper	arXiv	arxiv.org/abs/2005.11401

9. Common RAG Interview Questions

Question	Answer
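How do you handle large documents in RAG?	Split them into chunks, using a sliding window with overlap or semantic chunking, so each chunk fits the embedding model and the LLM context window.
What is re-ranking?	A second-pass model (for example, a cross-encoder) that re-orders retrieved documents by relevance to the query.
How do you improve RAG accuracy?	Better chunking, hybrid search, re-ranking, higher-quality embeddings, and clearer prompts.
What is contextual compression?	Reducing retrieved documents to the parts relevant to the query before they reach the LLM.
How do you evaluate RAG systems?	RAGAS metrics (such as faithfulness and context recall), human evaluation, and domain-specific metrics.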

10. Part Navigation

Previous Parts

Part 22: Vector Databases

Next Parts

Part 24: LangChain Foundations · Part 25: LangGraph

Proceed to Part 24: LangChain Foundations →
