Part 23: RAG Architectures - Semantic Retrieval & Knowledge Systems
1. Why RAG in 2026?
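RAG (Retrieval-Augmented Generation) retrieves relevant documents at query time and passes them to the LLM as context, which makes it the backbone of many production GenAI applications. Engineers with RAG expertise reportedly command 40-70% higher salaries in AI engineering roles.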
Key Components
- Document Chunking: Splitting large documents
- Embedding Models: Converting text to vectors
- Vector Storage: Storing and searching embeddings
- Re-ranking: Improving result relevance
- Prompt Engineering: Structuring LLM prompts
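To show how the components above fit together, here is a minimal, dependency-free sketch. It is an illustration under stated assumptions, not a production design: the bag-of-words `embed` function stands in for a real embedding model, a plain Python list stands in for a vector store, the re-ranking step is omitted, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text):
    """Document chunking (simplified): one chunk per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed(text):
    """Toy embedding (assumption): sparse bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, store, top_k=2):
    """Vector search: rank stored (text, embedding) pairs by similarity."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(query, contexts):
    """Prompt engineering: place retrieved chunks ahead of the question."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {query}"

document = (
    "BM25 ranks documents by keyword overlap. "
    "Vector search ranks documents by embedding similarity. "
    "Re-ranking reorders the retrieved candidates with a stronger model."
)
store = [(c, embed(c)) for c in chunk(document)]  # stand-in for vector storage
question = "How does vector search rank documents?"
contexts = retrieve(question, store)
prompt = build_prompt(question, contexts)
print(prompt)
```

In a real system each function would be replaced by the corresponding component from the sections that follow.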
2. Document Chunking Strategies
Text Splitting Methods
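# Requires: pip install langchain-text-splitters tiktoken
# (older LangChain releases exposed these classes under langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter

# Recursive character splitter: tries each separator in order, coarsest first
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=200,    # characters shared between neighbouring chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)

# Split documents
texts = text_splitter.split_text("Your large document text...")

# Token-based splitting: sizes are measured in tokens, not characters
token_splitter = TokenTextSplitter(
    chunk_size=256,
    chunk_overlap=32,
)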
Chunking Strategies
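import numpy as np

# Sliding window approach
def sliding_window_chunks(text, chunk_size, overlap):
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # final window reached; avoid a redundant tail chunk
        start = end - overlap
    return chunks

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantic chunking: start a new chunk when adjacent sentences diverge.
# embedding_model is any object exposing embed(text) -> vector.
def semantic_chunking(text, embedding_model, threshold=0.5):
    sentences = [s for s in text.split('. ') if s]
    chunks = []
    current_chunk = []
    prev_embedding = None
    for sentence in sentences:
        embedding = embedding_model.embed(sentence)
        # "is not None" avoids the ambiguous truth value of a NumPy array
        if prev_embedding is not None and cosine_similarity(embedding, prev_embedding) < threshold:
            chunks.append('. '.join(current_chunk))
            current_chunk = [sentence]
        else:
            current_chunk.append(sentence)
        prev_embedding = embedding
    if current_chunk:
        chunks.append('. '.join(current_chunk))
    return chunks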
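When sizing an index it helps to know how many chunks a sliding window will produce. The sketch below restates the sliding-window splitter in a form that stops at the final window and checks it against a closed-form count. The helper name `expected_chunk_count` and the sample sizes are illustrative assumptions.

```python
import math

def sliding_window_chunks(text, chunk_size, overlap):
    """Fixed-size windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final window reached
        start += chunk_size - overlap
    return chunks

def expected_chunk_count(n_chars, chunk_size, overlap):
    """Closed-form chunk count for the splitter above (hypothetical helper)."""
    if n_chars == 0:
        return 0
    stride = chunk_size - overlap  # how far each window advances
    return max(1, math.ceil((n_chars - chunk_size) / stride) + 1)

# 1000 characters of repeating alphabet so that overlaps are easy to inspect
text = "".join(chr(97 + i % 26) for i in range(1000))
chunks = sliding_window_chunks(text, chunk_size=200, overlap=50)

print(len(chunks), expected_chunk_count(len(text), 200, 50))
print(chunks[0][-50:] == chunks[1][:50])  # neighbouring chunks share 50 characters
```

A larger overlap reduces the stride, so the chunk count (and the embedding and storage cost) grows as overlap approaches the chunk size.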
3. Embedding Models
Model Selection
from sentence_transformers import SentenceTransformer

# Choose embedding model based on use case
MODELS = {
    "general": "sentence-transformers/all-MiniLM-L6-v2",   # Fast, good quality
    "semantic": "sentence-transformers/all-MiniLM-L12-v2",  # Better quality, slower
    "multilingual": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    # Placeholder: this is the general-purpose model, not a code-specific one.
    # Swap in a model trained on code if code search quality matters.
    "code": "sentence-transformers/all-MiniLM-L6-v2",
}

# Load model
model = SentenceTransformer(MODELS["general"])

# Generate embeddings (one vector per input string)
embeddings = model.encode([
    "Machine learning is fascinating",
    "Data science workflows are important",
])
Embedding Optimization
import numpy as np

# Batch processing for efficiency
def batch_embed(texts, model, batch_size=32):
    # encode() batches internally, so pass batch_size instead of looping by hand
    return model.encode(texts, batch_size=batch_size, show_progress_bar=True)

# Normalize embeddings for cosine similarity
def normalize_embeddings(embeddings):
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
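The reason for normalizing is that, for unit-length vectors, the dot product equals the cosine similarity, so a vector store can use the cheaper inner-product metric. A small pure-Python check of that identity (the vectors are made-up examples):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

# After normalization the plain dot product matches cosine similarity
print(cosine(a, b), dot(ua, ub))
```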
4. Hybrid Search Implementation
Keyword + Vector Search
class HybridSearcher:
    def __init__(self, vector_db, keyword_index, embed_fn):
        self.vector_db = vector_db
        self.keyword_index = keyword_index
        self.embed_fn = embed_fn  # callable: query text -> embedding vector

    def search(self, query, top_k=5, vector_weight=0.7, keyword_weight=0.3):
        # Vector search (over-fetch so the merge has enough candidates)
        query_embedding = self.embed_fn(query)
        vector_results = self.vector_db.search(query_embedding, top_k * 2)
        # Keyword search
        keyword_results = self.keyword_index.search(query, top_k * 2)
        # Combine scores
        combined_results = self.combine_results(
            vector_results, keyword_results,
            vector_weight, keyword_weight
        )
        return sorted(combined_results, key=lambda x: x['score'], reverse=True)[:top_k]

    def combine_results(self, vector_results, keyword_results, v_weight, k_weight):
        # Merge and weight scores.
        # Assumes both score sets are on comparable scales (e.g. normalized to [0, 1]);
        # raw BM25 scores are unbounded and would otherwise dominate the sum.
        all_results = {}
        for result in vector_results:
            all_results[result['id']] = {
                'id': result['id'],
                'score': result['score'] * v_weight,
                'data': result['data']
            }
        for result in keyword_results:
            if result['id'] in all_results:
                all_results[result['id']]['score'] += result['score'] * k_weight
            else:
                all_results[result['id']] = {
                    'id': result['id'],
                    'score': result['score'] * k_weight,
                    'data': result['data']
                }
        return list(all_results.values())
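Weighted score sums only work when vector and keyword scores share a scale. A common alternative, shown here as a different technique rather than the method used above, is reciprocal rank fusion (RRF), which combines rank positions instead of raw scores. The function name, the constant `k=60` (a conventional default), and the document ids are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document ids.

    Each document earns 1 / (k + rank) from every list it appears in,
    so no score normalization across retrievers is needed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

vector_ranking = ["d2", "d1", "d3"]   # best-first ids from vector search
keyword_ranking = ["d1", "d4", "d2"]  # best-first ids from keyword search

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still keeps a place in the fused list.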
BM25 Integration
from rank_bm25 import BM25Okapi

class BM25Retriever:
    def __init__(self, documents):
        # Lowercased whitespace tokenization; use a real tokenizer in production
        self.tokenized_docs = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def search(self, query, k=5):
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        # Indices of the k highest scores, best first
        top_indices = scores.argsort()[-k:][::-1]
        return [(int(i), float(scores[i])) for i in top_indices]
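To make the scoring behind BM25 concrete, here is a from-scratch sketch of the standard Okapi BM25 formula. It is a teaching aid under stated assumptions, not a reimplementation of `rank_bm25`: the IDF variant used here (`log(1 + (N - n + 0.5) / (n + 0.5))`) and the `k1`/`b` defaults are common choices, and the library may handle IDF differently.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    docs = [doc.lower().split() for doc in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "vector search with embeddings",
]
scores = bm25_scores("cat mat", docs)
print(scores)
```

Note that "cats" in the second document does not match the query term "cat": BM25 is an exact-token method, which is one reason to pair it with vector search.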
5. Re-ranking Techniques
Cross-Encoder Re-ranking
from sentence_transformers import CrossEncoder

# Load cross-encoder for re-ranking
re_ranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, candidates):
    # Create (query, passage) pairs for the cross-encoder
    pairs = [[query, candidate['text']] for candidate in candidates]
    # Get re-ranking scores (higher means more relevant)
    scores = re_ranker.predict(pairs)
    # Sort by re-ranking scores
    reranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [candidate for candidate, _ in reranked]
Cohere Reranking
import cohere

# Create the client once and reuse it across calls
co = cohere.Client(api_key='YOUR_API_KEY')

def cohere_rerank(documents, query, top_n=5):
    response = co.rerank(
        model='rerank-english-v3.0',  # check Cohere's docs for current model names
        query=query,
        documents=[doc['text'] for doc in documents],
        top_n=top_n
    )
    ranked_docs = []
    for result in response.results:
        ranked_docs.append({
            # result.index is the position of the document in the input list
            'document': documents[result.index],
            'relevance_score': result.relevance_score
        })
    return ranked_docs
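Both re-rankers above follow the same two-stage pattern: retrieve a wide candidate set cheaply, then re-score each (query, text) pair with a stronger model and keep the best few. The sketch below isolates that pattern with a pluggable scorer. The `token_overlap` scorer is a toy stand-in (an assumption for illustration) for a cross-encoder or a hosted rerank API, and the candidate texts are invented.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score candidates with score_fn(query, text) and keep the top_n."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Return copies annotated with the new score, best first
    return [dict(c, rerank_score=s) for s, c in scored[:top_n]]

def token_overlap(query, text):
    """Toy scorer: Jaccard overlap between query and text tokens."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

candidates = [
    {"id": 1, "text": "chunk overlap settings for text splitters"},
    {"id": 2, "text": "how re-ranking improves retrieval relevance"},
    {"id": 3, "text": "re-ranking with a cross-encoder model"},
]
top = rerank("re-ranking with a cross-encoder", candidates, token_overlap, top_n=2)
print([c["id"] for c in top])
```

Swapping `token_overlap` for a function that calls a cross-encoder keeps the surrounding code unchanged.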
6. Contextual Compression
Compress Retrieved Documents
class DocumentCompressor:
    def compress(self, documents, query):
        compressed = []
        for doc in documents:
            # Extract relevant parts
            relevant_text = self.extract_relevant_parts(doc['text'], query)
            compressed.append({
                'id': doc['id'],
                'text': relevant_text,
                'metadata': doc.get('metadata', {})
            })
        return compressed

    def extract_relevant_parts(self, text, query):
        # Simple keyword-based extraction: keep sentences containing any query word.
        # Matching is by substring, so common words like "the" will match often.
        query_words = query.lower().split()
        sentences = text.split('. ')
        relevant = [s for s in sentences if any(word in s.lower() for word in query_words)]
        return '. '.join(relevant[:3])  # First 3 matching sentences
LLM-based Compression
def compress_with_llm(documents, query, llm):
    # llm is any callable that takes a prompt string and returns a string
    context = "\n\n".join(doc['text'] for doc in documents)
    prompt = f"""Given the following context and query, extract only the relevant information:
Context: {context}
Query: {query}
Relevant Information:
"""
    response = llm(prompt)
    return response.strip()
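Between first-match keyword extraction and a full LLM call there is a cheap middle ground: rank sentences by how many query terms they contain, keep the best few, and restore document order so the compressed text still reads naturally. The following is a sketch of that idea; the function name, the regex tokenizer, and the sample text are assumptions for illustration.

```python
import re

def compress(text, query, max_sentences=2):
    """Keep the sentences that share the most terms with the query."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        terms = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        overlap = len(terms & query_terms)
        if overlap:
            scored.append((overlap, position, sentence))
    # Highest overlap first (ties broken by position), then back to document order
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return " ".join(s for _, _, s in sorted(best, key=lambda item: item[1]))

text = (
    "RAG retrieves documents before generation. "
    "The office is closed on Fridays. "
    "Retrieved documents are compressed to fit the context window. "
    "Compression keeps sentences that mention query terms."
)
result = compress(text, "How are retrieved documents compressed?")
print(result)
```

Because matching is on exact tokens, "Compression" does not match "compressed"; stemming or embedding-based scoring would be needed to catch such variants.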
7. RAG Evaluation Metrics
RAGAS Framework
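# Metric names as exported by ragas 0.1.x; check the docs for your installed version
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from ragas import evaluate

# Evaluate RAG system.
# dataset is a Hugging Face datasets.Dataset holding, per sample, the question,
# the generated answer, the retrieved contexts, and the ground-truth answer
# (exact column names depend on the ragas version).
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ]
)
print(results)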
Custom Evaluation
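# Assumes these helpers are defined elsewhere:
#   embed(text) -> vector
#   cosine_similarity(a, b) -> float
#   check_factuality(answer, context) -> float (LLM-as-judge)
def evaluate_rag(query, answer, context, expected):
    metrics = {}
    # Context relevance: best similarity between the query and any retrieved chunk
    query_embedding = embed(query)
    context_scores = [cosine_similarity(query_embedding, embed(chunk)) for chunk in context]
    metrics['context_relevance'] = max(context_scores) if context_scores else 0.0
    # Answer relevance: similarity between the generated and the expected answer
    metrics['answer_relevance'] = cosine_similarity(
        embed(answer),
        embed(expected)
    )
    # Faithfulness (requires LLM judgment)
    metrics['faithfulness'] = check_factuality(answer, context)
    return metrics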
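Embedding-based scores say little about whether the retriever surfaced the right chunks. When labelled relevant document ids are available, classic retrieval metrics such as recall@k and mean reciprocal rank (MRR) are a useful complement. A minimal sketch follows; the function names and the sample ids are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Each run: (retrieved ids in ranked order, ids labelled relevant)
runs = [
    (["d3", "d1", "d7"], ["d1", "d2"]),
    (["d5", "d9", "d4"], ["d5"]),
]
print(recall_at_k(*runs[0], k=3))   # one of two relevant documents found
print(mean_reciprocal_rank(runs))   # first hit at rank 2, then at rank 1
```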
8. Resource Directory: RAG Systems
Best Books
| Book | Publisher | Price | Key Topics |
|---|---|---|---|
| Building RAG Applications | O'Reilly | Paid | RAG implementation |
| Practical RAG | Packt | Paid | Hands-on RAG |
| LLM Engineering | O'Reilly | Paid | Production LLMs |
| AI Engineering | O'Reilly | Paid | AI systems |
Best Udemy Courses
| Course | Instructor | Price (INR) | Key Topics |
|---|---|---|---|
| RAG Systems Course | Instructor | ₹1,999-2,999 | RAG implementation |
| LangChain RAG | Instructor | ₹1,499-2,299 | LangChain |
| Vector DBs for RAG | Instructor | ₹1,499-2,299 | Vector databases |
| Advanced RAG | Instructor | ₹1,999-2,999 | Advanced techniques |
Best O'Reilly Resources
| Resource | Publisher | Access |
|---|---|---|
| Building RAG Applications | O'Reilly | Paid |
| Learning RAG Systems | O'Reilly | Paid |
| Practical RAG | O'Reilly | Paid |
Best LinkedIn Learning Courses
| Course | Instructor | Access |
|---|---|---|
| RAG Systems | Instructor | Paid |
| Knowledge Graphs | Instructor | Paid |
| AI Search | Instructor | Paid |
Free Resources
| Resource | Platform | Link |
|---|---|---|
| RAG Tutorial | LangChain | python.langchain.com/docs/use_cases/qa_knowledge |
| RAGAS Docs | GitHub | github.com/explodinggradients/ragas |
| Awesome RAG | GitHub | github.com/maxime-recher/awesome-rag |
| RAG Paper | arXiv | arxiv.org/abs/2005.11401 |
9. Common RAG Interview Questions
| Question | Answer |
|---|---|
| How do you handle large documents in RAG? | Split them into chunks, using sliding windows with overlap or semantic chunking so that context is not cut mid-thought. |
| What is re-ranking? | A second-pass model that re-orders the retrieved documents by relevance to the query. |
| How do you improve RAG accuracy? | Better chunking, hybrid search, re-ranking, prompt engineering, and higher-quality embeddings. |
| What is contextual compression? | Reducing retrieved documents to the parts relevant to the query before they reach the LLM. |
| How do you evaluate RAG systems? | RAGAS metrics, human evaluation, and domain-specific metrics. |