Part 18: Retrieval-Augmented Generation (RAG) & Vector Databases

Master advanced Retrieval-Augmented Generation (RAG). Learn semantic chunking, dense vs. sparse vector retrieval, Reciprocal Rank Fusion (RRF), pgvector HNSW indexing, and reranking pipelines.

Part 18: Retrieval-Augmented Generation (RAG) & Vector Databases

← Back to Master Index


1. Deep-Dive Core Concepts: Semantic Ingestion, Search Indexing, and Hybrid Retrieval

An LLM's knowledge is frozen at the moment of its training completion. In production, enterprise applications must access dynamic, private datasets (e.g., proprietary code repositories, internal wikis, or database schemas). Fine-tuning a model on these documents is slow, expensive, and fails to handle daily updates.

The industry solution is Retrieval-Augmented Generation (RAG):

graph TD
    UserQuery["1. User Query"] --> EmbedQuery["2. Embed Query (Vector representation)"]
    EmbedQuery --> VectorDB["3. Vector Query (Retrieve relevant chunks)"]
    VectorDB --> HydratePrompt["4. Context Injection (Merge Query + Chunks)"]
    HydratePrompt --> LLM["5. LLM Synthesis"]
    LLM --> Response["6. Final Answer"]

RAG turns the LLM into an open-book reader. When a user asks a question, the system queries a search index to retrieve relevant text segments, inserts those segments into the LLM's prompt window, and instructs the model to synthesize an answer based only on the provided context.


Chunking Strategies: The Ingestion Pipeline

Before documents can be indexed in a vector space, they must be broken into smaller segments called chunks. If a chunk is too large, it dilutes semantic meaning and overflows the LLM's prompt context. If a chunk is too small, it loses critical context (such as section headings or references).

Ingestion Strategies

  1. Character Splitting: Splitting text strictly by character count (e.g., every 500 characters). This is easy to compute but frequently cuts words or sentences in half, disrupting the syntax.
  2. Token Splitting: Uses a model-specific tokenizer (like BPE) to split text by token boundaries (e.g., every 150 tokens). This guarantees compatibility with model input limits but ignores structural syntax.
  3. Recursive Character Splitting (Recommended Default): Splits text by a prioritized list of separators (e.g., paragraphs \n\n, sentences \n, spaces ). It attempts to keep paragraphs and sentences intact while limiting chunks to a maximum character size (with an overlap margin, e.g., 500 characters with 50 characters overlap).
  4. Semantic Chunking: Analyzes differences in embedding vectors between adjacent sentences. When the semantic similarity between Sentence NN and Sentence N+1N+1 falls below a calculated threshold, the system inserts a chunk boundary. This aligns chunks with actual shifts in topic.

Vector Database Indexing: HNSW vs. IVFFlat

Once chunks are generated and embedded, they are stored in a Vector Database. To query vectors quickly, database engines use Approximate Nearest Neighbor (ANN) search indexes rather than calculating cosine distances against every record (which is O(N)O(N) and degrades at scale).

Vector Table Space: 1,000,000 Vectors
[Flat Search]: Scans 1,000,000 vectors sequentially (O(N) - slow)
[IVFFlat Index]: Clusters vectors, searches target centroid clusters (O(log N))
[HNSW Index]: Builds multi-layer graph, navigates down to nearest neighbors (O(log N) - fast, high memory)

IVFFlat (Inverted File Index)

IVFFlat partitions the vector space into a user-defined number of clusters (KK) using k-means clustering.

  • The Index Structure: Each vector is assigned to its nearest cluster centroid.
  • Query Phase: During a search, the system identifies the nearest centroids to the query vector, then searches only the vectors within those clusters (controlled by the nprobe parameter).
  • Trade-Offs: Low memory footprint and fast build times. However, if the data distribution shifts or if you build the index on an empty table, query recall (accuracy) will degrade, requiring a REINDEX operation.

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph representation of the vector space, inspired by skip-lists.

  • The Index Structure: The top layer contains sparse nodes with long-range links, while lower layers contain increasingly dense nodes with shorter links.
  • Query Phase: The search start at the top layer, finding the nearest node to the query, then drops down to the next layer to continue a local search, repeating the process until it reaches the bottom layer to identify the exact nearest neighbors.
  • Trade-Offs: Extremely fast query times and high recall (98%+). However, it consumes significantly more RAM (the graph structure must reside in memory) and takes longer to build compared to IVFFlat.

Hybrid Search and Reciprocal Rank Fusion (RRF)

Standard vector search (Dense Retrieval) can miss exact keyword matches (e.g., searching for product serial numbers like "TX-90812" or function names like process_user_metadata()). To build robust search pipelines, engineers combine Dense Retrieval with legacy Sparse Retrieval (like BM25 keyword matching) to perform Hybrid Search.

graph TD
    Query["User Query"] --> DenseBranch["Dense Branch (Embedding Similarity)"]
    Query --> SparseBranch["Sparse Branch (BM25 Keyword Matching)"]
    DenseBranch --> DenseResults["Dense Rank list: [Doc A, Doc B, Doc C]"]
    SparseBranch --> SparseResults["Sparse Rank list: [Doc C, Doc A, Doc D]"]
    DenseResults --> RRF["RRF Algorithm Engine"]
    SparseResults --> RRF
    RRF --> UnifiedList["Unified Combined Rank: [Doc A, Doc C, Doc B, Doc D]"]

Reciprocal Rank Fusion (RRF)

To merge results from dense and sparse search runs (which return completely different scoring scales), systems use the RRF algorithm. RRF scores documents based on their position (rank) in both lists, rather than their raw similarity scores.

The RRF scoring equation for a document dd is:

RRF_Score(dD)=mM1k+rm(d)RRF\_Score(d \in D) = \sum_{m \in M} \frac{1}{k + r_m(d)}

Where:

  • MM is the set of retrieval systems (e.g., Sparse and Dense).
  • rm(d)r_m(d) is the rank of document dd in retrieval system mm (1-indexed).
  • kk is a constant parameter (usually set to 60), which prevents documents ranked highly by only one system from dominating the combined results.

Advanced RAG Patterns: Parent-Child, Metadata Filtering, and Reranking

Simple RAG pipelines suffer from retrieval inaccuracies. To scale RAG systems for enterprise production, engineers deploy advanced query and retrieval strategies.

[Parent-Child Ingestion]
Document: "Systems Engineering guidelines for PostgreSQL database administration..."
├── Parent Node: (Full Paragraph, 1000 characters) - Stored as Context
└── Child Leaf 1: (Sentence 1, 150 chars) - Embedded for Search
└── Child Leaf 2: (Sentence 2, 150 chars) - Embedded for Search

[Query Flow]
Query ---> Search Child Leaf 1 ---> Match! ---> Retrieve Parent Node ---> Send Parent to LLM

Parent-Child Chunking (Auto-Merging)

  • The Problem: Small chunks provide precise embeddings for similarity search, but may lack the surrounding context the LLM needs to synthesize an accurate answer.
  • The Strategy: Split documents hierarchically into large parent nodes and smaller child leaves. Embed only the child leaves for similarity search. When a search matches a child leaf, retrieve its parent node and pass it to the LLM.

Metadata Filtering

  • The Problem: Semantic search can retrieve documents that are irrelevant to the user's specific context (e.g., returning old documentation from 2021 when the user asks about the 2026 API).
  • The Strategy: Store attributes (e.g., user_id, created_at, status) alongside the vectors. During retrieval, apply a hard database-level filter to narrow down the search space before running the vector similarity calculation.

Reranking

  • The Problem: ANN indexes prioritize speed over precision, often placing the most relevant document outside the top-3 results.
  • The Strategy: Retrieve a larger set of documents (e.g., top-50) using vector search, then use a computationally heavier Cross-Encoder Model (like Cohere Rerank) to evaluate the exact semantic match between the query and each document. The reranker re-orders the results, placing the most relevant documents at the top of the context window.

2. Master Resource Directory: RAG & Vector Databases

Mastering RAG pipelines requires studying vector search database internals, metadata management, and post-retrieval optimizations. Below are the 7 definitive learning resources.


Resource 1: Vector Search Patent & Database Internals (Pinecone Developer Guides)

  • Why It Was Selected: Pinecone's technical blog and developer guides are highly regarded resources for understanding vector database concepts. For backend engineers who need to understand how indexes scale, this documentation is selected because it explains vector quantization, dimensional reduction, and index scaling (serverless vs pods) with clean diagrams and mathematical breakdowns.
  • Target Syllabus Modules/Chapters:
    • Core Concepts: Hierarchical Navigable Small World (HNSW), IVFFlat, and Quantization.
    • Architecture Guides: Metadata Filtering at Scale (Single-stage vs Two-stage filtering).
    • Algorithms: Dense vs Sparse Retrieval and Hybrid Search patterns.
  • Time Investment Required: 15 hours of self-directed reading.
    • Week 1: Search algorithms and Index structures (8 hours)
    • Week 2: Caching strategies and Metadata scaling (7 hours)
  • Value Assessment: Exceptional, free. Ideal for learning how vector engines manage memory and index files under the hood.
  • Actionable Study Strategy: Read the HNSW guide twice. Draw the node connections on paper to understand how graph routing works. Note how the insertion parameter ef_construction affects index construction speed and search recall.

Resource 2: pgvector Developer Documentation (github.com/pgvector/pgvector)

  • Why It Was Selected: Rather than managing a separate vector database (adding complexity to your infrastructure), many teams choose to store vectors directly in PostgreSQL using the pgvector extension. The official repository documentation is selected because it is the definitive guide to using PostgreSQL for semantic search, detailing HNSW/IVFFlat index creation, custom distance operators, and optimization parameters.
  • Target Syllabus Modules/Chapters:
    • Installation & Setup: Adding vector column types to SQL tables.
    • Index Tuning: Building HNSW and IVFFlat indexes for L2, Cosine, and Dot Product distance operators.
    • Query Execution: Writing standard SQL queries using <=> (cosine distance) and <-> (L2 distance).
  • Time Investment Required: 10 hours of hands-on database setups.
  • Value Assessment: Critical. Every platform engineer must know how to deploy and scale vector indexing inside relational databases.
  • Actionable Study Strategy: Set up a local PostgreSQL instance inside Docker. Install pgvector, insert 10,000 random vectors of 1536 dimensions, build an HNSW index using vector_cosine_ops, and run EXPLAIN ANALYZE on your search queries to verify that the index is being scanned correctly.

Resource 3: LangChain Retrievers Documentation (python.langchain.com)

  • Why It Was Selected: LangChain provides clean abstractions for building RAG pipelines. Their retriever documentation details advanced techniques like Self-Querying, Parent Document Retrieval, and Contextual Compression.
  • Target Syllabus Modules/Chapters:
    • Retrievers: Parent Document Retriever, Contextual Compression, and MultiQuery Retriever.
    • Vector Stores: LangChain integration interfaces.
  • Time Investment Required: 12 hours of pipeline building.
  • Value Assessment: High. Provides template patterns for implementing advanced retrieval strategies quickly.
  • Actionable Study Strategy: Build a local pipeline using the Parent Document Retriever. Document how the system splits input documents, stores leaf nodes in a vector database, and retains parent chunks in a key-value document store.

Resource 4: LlamaIndex Metadata Strategies & Node Parsing (docs.llamaindex.ai)

  • Why It Was Selected: LlamaIndex is built specifically for data ingestion and retrieval tasks. Their documentation is selected because it covers index structures, node schemas, metadata extraction, and hierarchical query routing.
  • Target Syllabus Modules/Chapters:
    • Node Parsers: HierarchicalNodeParser and SentenceWindowNodeParser.
    • Retrievers: AutoMergingRetriever and RouterRetriever.
    • Metadata Extraction: Programmatic metadata generation.
  • Time Investment Required: 15 hours.
  • Value Assessment: Critical. LlamaIndex is the preferred library for document parsing and metadata-driven retrieval.
  • Actionable Study Strategy: Go through the Auto-Merging Retriever tutorial. Build a pipeline that evaluates how retrieval accuracy shifts when retrieving parent nodes versus child nodes.

Resource 5: Cohere Rerank API Documentation & Guides (docs.cohere.com)

  • Why It Was Selected: Rerankers are critical for improving search recall. Cohere's guides explain how Cross-Encoder models assess semantic similarity and how to integrate rerank APIs to optimize retrieved context.
  • Target Syllabus Modules/Chapters:
    • Rerank Guides: How Rerank works and differences from Bi-Encoder models.
    • API Reference: Rerank endpoints and search integrations.
  • Time Investment Required: 6 hours.
  • Value Assessment: Critical for production-grade pipelines where raw ANN search results are noisy.
  • Actionable Study Strategy: Build a dual-stage search pipeline. Retrieve the top-30 chunks from a local vector store, pass them to Cohere's Rerank API, and measure how the relevance of the top-3 results improves.

Resource 6: ChromaDB Official Documentation (docs.trychroma.com)

  • Why It Was Selected: ChromaDB is a lightweight, developer-friendly open-source vector database. It is selected because it runs entirely in-memory or as a local server, making it the perfect tool for local development, unit testing, and prototyping RAG pipelines.
  • Target Syllabus Modules/Chapters:
    • Get Started: DB Initialization and Collection management.
    • Usage: Ingesting documents, metadata filtering, and embedding model configurations.
  • Time Investment Required: 8 hours.
  • Value Assessment: High. Simplifies the process of writing unit tests for RAG pipelines without requiring a fully managed database instance.
  • Actionable Study Strategy: Write a Python test suite that initializes a Chroma client in-memory, ingests mock document chunks, and asserts that queries return expected matches.

Resource 7: Vector Search & RAG Course (DeepLearning.AI)

  • Why It Was Selected: A series of short courses detailing vector search internals, semantic search, and RAG evaluation. It covers pgvector setups, Pinecone serverless features, and evaluation frameworks.
  • Target Syllabus Modules/Chapters:
    • Large-Scale Vector Search: Indexing and vector space partitioning.
    • RAG Evaluation: Measuring hallucination rates, faithfulness, and retrieve metrics.
  • Time Investment Required: 10 hours.
  • Value Assessment: High. Great for understanding how to evaluate and debug performance issues in production RAG systems.
  • Actionable Study Strategy: Watch the evaluation videos. Set up a simple project that runs retrieval validation checks on sample queries.

3. Hands-On Portfolio Lab Project: PostgreSQL pgvector Semantic Search Pipeline

To demonstrate your RAG engineering credentials, you will build a Semantic Search Pipeline using PostgreSQL, pgvector, and FastAPI. The application will accept text uploads, generate embeddings using a local embedding model, index them in PostgreSQL using an HNSW index, support metadata filtering, and handle semantic queries.

~/rag_pgvector/
├── app/
│   ├── __init__.py
│   ├── main.py             # FastAPI server and connection setup
│   ├── config.py           # Configuration properties
│   ├── db.py               # Asynchronous PostgreSQL connection pool
│   ├── schemas.py          # Input/Output data schemas
│   └── services/
│       ├── embedder.py     # Local Embedding Generator (HuggingFace/Ollama)
│       └── document.py     # Vector database queries and filters
├── tests/
│   ├── __init__.py
│   └── test_rag.py         # Integration tests
├── requirements.txt        # Package dependencies
└── run.sh                  # Database deployment and test script

Ingestion & Search Sequence Flow

The diagram below details the database ingestion and retrieval cycles:

sequenceDiagram
    participant User as User / Client
    participant API as FastAPI Router
    participant Embedder as Embedding Service
    participant Database as PostgreSQL (pgvector)

    Note over User, Database: Ingestion Cycle
    User->>API: POST /ingest (Text Content + Category metadata)
    API->>Embedder: generate_embeddings(content)
    Embedder-->>API: Return Vector (e.g. 384 dimensions)
    API->>Database: INSERT INTO documents (content, category, embedding)
    Database-->>API: Write OK

    Note over User, Database: Search Cycle
    User->>API: GET /search?query="FastAPI pgvector"&category="backend"
    API->>Embedder: generate_embeddings("FastAPI pgvector")
    Embedder-->>API: Return Search Vector
    API->>Database: SELECT content, 1 - (embedding <=> search_vector) WHERE category = 'backend'
    Database-->>API: Return Top-5 matches
    API-->>User: Return Semantic Search Results

Step 1: Initialize Project Directory and Dependencies

Create the project directory and file structures:

mkdir -p ~/rag_pgvector/app/services ~/rag_pgvector/tests
cd ~/rag_pgvector

File: ~/rag_pgvector/requirements.txt

Declares the required libraries for our vector search service.

fastapi>=0.110.0
uvicorn[standard]>=0.28.0
pydantic>=2.6.0
asyncpg>=0.29.0
numpy>=1.26.0
pytest>=8.0.0
pytest-asyncio>=0.23.0

Step 2: Implement Config and Database Client

File: ~/rag_pgvector/app/config.py

Defines application configuration settings.

from pydantic import Field
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    app_name: str = "pgvector Semantic Search Engine"
    # PostgreSQL connection string
    database_url: str = Field(
        default="postgresql://postgres:postgres@localhost:5432/vector_db",
        env="DATABASE_URL"
    )
    # Dimension size for sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
    vector_dimension: int = 384

settings = Settings()

File: ~/rag_pgvector/app/db.py

Handles connection pooling and table creation using asyncpg.

import logging
import asyncpg
from app.config import settings

logger = logging.getLogger(__name__)

class DatabaseManager:
    def __init__(self) -> None:
        self.pool: asyncpg.Pool | None = None

    async def connect(self) -> None:
        """Initializes PostgreSQL connection pool and registers pgvector schemas."""
        try:
            self.pool = await asyncpg.create_pool(dsn=settings.database_url)
            logger.info("Database connection pool initialized successfully.")
            
            # Setup database tables and vector extensions
            async with self.pool.acquire() as conn:
                await conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
                
                # Create table
                await conn.execute(f"""
                    CREATE TABLE IF NOT EXISTS documents (
                        id SERIAL PRIMARY KEY,
                        content TEXT NOT NULL,
                        category VARCHAR(50) NOT NULL,
                        embedding VECTOR({settings.vector_dimension}) NOT NULL
                    );
                """)
                
                # Create HNSW index for cosine similarity if it does not exist
                await conn.execute("""
                    CREATE INDEX IF NOT EXISTS documents_hnsw_idx 
                    ON documents USING hnsw (embedding vector_cosine_ops);
                """)
                logger.info("PostgreSQL database tables and HNSW vector index verified.")
        except Exception as e:
            logger.error(f"Database connection failed: {str(e)}")
            raise e

    async def disconnect(self) -> None:
        if self.pool:
            await self.pool.close()
            logger.info("Database connection pool closed.")

db_manager = DatabaseManager()

Step 3: Implement Embedding Generation and Schemas

File: ~/rag_pgvector/app/schemas.py

Defines data validation schemas.

from pydantic import BaseModel, Field
from typing import List

class DocumentIngest(BaseModel):
    content: str = Field(..., min_length=10, description="Raw text chunk content")
    category: str = Field(..., max_length=50, description="Metadata classification category")

class DocumentSearch(BaseModel):
    query: str = Field(..., min_length=3)
    category: str | None = Field(default=None)
    limit: int = Field(default=5, ge=1, le=20)

class SearchResult(BaseModel):
    id: int
    content: str
    category: str
    similarity: float

File: ~/rag_pgvector/app/services/embedder.py

Generates semantic vectors locally using a mock vectorizer.

import numpy as np
from app.config import settings

class EmbeddingService:
    def generate_embedding(self, text: str) -> List[float]:
        """Generates semantic vectors of size 384.
        Note: In production, substitute this with a real model run like
        sentence-transformers/all-MiniLM-L6-v2 or an external API call.
        """
        # Seed generator based on text value hash to ensure deterministic vectors
        hash_val = sum(ord(char) for char in text) % 1000
        rng = np.random.default_rng(seed=hash_val)
        
        # Generate raw random unit vector of target dimension size
        raw_vector = rng.random(size=settings.vector_dimension)
        normalized_vector = raw_vector / np.linalg.norm(raw_vector)
        
        return normalized_vector.tolist()

embedding_service = EmbeddingService()

Step 4: Implement Ingestion and Vector Search Queries

File: ~/rag_pgvector/app/services/document.py

Handles SQL inserts and vector query executions.

from typing import List
from app.db import db_manager
from app.schemas import SearchResult
from app.services.embedder import embedding_service

class DocumentService:
    async def ingest_document(self, content: str, category: str) -> None:
        """Saves a document chunk and its embedding vector to PostgreSQL."""
        embedding = embedding_service.generate_embedding(content)
        
        async with db_manager.pool.acquire() as conn:
            await conn.execute(
                """
                INSERT INTO documents (content, category, embedding) 
                VALUES ($1, $2, $3);
                """,
                content, category, embedding
            )

    async def search_documents(
        self, query: str, category: str | None = None, limit: int = 5
    ) -> List[SearchResult]:
        """Queries pgvector HNSW index using cosine similarity operator <=>."""
        query_vector = embedding_service.generate_embedding(query)

        async with db_manager.pool.acquire() as conn:
            # Query format for pgvector cosine distance: embedding <=> query_vector
            # Similarity is 1 - Cosine Distance
            if category:
                rows = await conn.fetch(
                    """
                    SELECT id, content, category, 1 - (embedding <=> $1) AS similarity 
                    FROM documents 
                    WHERE category = $2 
                    ORDER BY embedding <=> $1 
                    LIMIT $3;
                    """,
                    query_vector, category, limit
                )
            else:
                rows = await conn.fetch(
                    """
                    SELECT id, content, category, 1 - (embedding <=> $1) AS similarity 
                    FROM documents 
                    ORDER BY embedding <=> $1 
                    LIMIT $2;
                    """,
                    query_vector, limit
                )

            return [
                SearchResult(
                    id=row["id"],
                    content=row["content"],
                    category=row["category"],
                    similarity=float(row["similarity"])
                )
                for row in rows
            ]

document_service = DocumentService()

Step 5: Implement Main API Entrypoint

File: ~/rag_pgvector/app/main.py

FastAPI routes and application lifecycles.

from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, status
from app.db import db_manager
from app.schemas import DocumentIngest, SearchResult, DocumentSearch
from app.services.document import document_service

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Establish connection pool and setup tables
    await db_manager.connect()
    yield
    # Disconnect pool on exit
    await db_manager.disconnect()

app = FastAPI(
    title="pgvector Semantic Search Gateway",
    version="1.0.0",
    lifespan=lifespan
)

@app.post("/ingest", status_code=status.HTTP_201_CREATED)
async def ingest_document(doc: DocumentIngest) -> dict[str, str]:
    try:
        await document_service.ingest_document(doc.content, doc.category)
        return {"status": "success", "message": "Document ingested and indexed successfully."}
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Failed to ingest document: {str(e)}"
        )

@app.post("/search", response_model=list[SearchResult], status_code=status.HTTP_200_OK)
async def search_documents(request: DocumentSearch) -> list[SearchResult]:
    try:
        return await document_service.search_documents(
            query=request.query,
            category=request.category,
            limit=request.limit
        )
    except Exception as e:
        raise HTTPException(
            status_code=500,
            detail=f"Vector search failed: {str(e)}"
        )

@app.get("/health", status_code=200)
async def check_health() -> dict[str, str]:
    return {"status": "healthy"}

Step 6: Write Integration Tests

File: ~/rag_pgvector/tests/test_rag.py

Verifies database operations and search routers.

import pytest
from fastapi.testclient import TestClient
from app.main import app
from app.db import db_manager

client = TestClient(app)

def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy"}

@pytest.mark.asyncio
async def test_ingest_and_search_mocked(monkeypatch):
    # Mock database connections during tests
    async def mock_connect():
        pass
    async def mock_disconnect():
        pass
    
    # Mock document service operations to bypass PostgreSQL connection
    class MockDocumentService:
        def __init__(self):
            self.store = []

        async def ingest_document(self, content: str, category: str):
            self.store.append({
                "id": len(self.store) + 1,
                "content": content,
                "category": category,
                "similarity": 0.95
            })

        async def search_documents(self, query: str, category: str = None, limit: int = 5):
            results = []
            for item in self.store:
                if category and item["category"] != category:
                    continue
                results.append(item)
            return results[:limit]

    mock_service = MockDocumentService()
    
    # Override startup operations
    monkeypatch.setattr(db_manager, "connect", mock_connect)
    monkeypatch.setattr(db_manager, "disconnect", mock_disconnect)
    
    # Ingest document
    await mock_service.ingest_document("FastAPI development guides", "backend")
    
    # Search document
    results = await mock_service.search_documents("FastAPI API endpoints", "backend")
    assert len(results) == 1
    assert results[0]["content"] == "FastAPI development guides"
    assert results[0]["category"] == "backend"
    assert results[0]["similarity"] == 0.95

Step 7: Build and Run Setup Automation

File: ~/rag_pgvector/run.sh

Configures environment and tests the application.

#!/usr/bin/env bash

# Exit script on any execution error
set -euo pipefail

echo "=== Stage 1: Creating Virtual Environment ==="
python3 -m venv .venv
source .venv/bin/activate

echo "=== Stage 2: Installing Vector App Dependencies ==="
pip install --upgrade pip
pip install -r requirements.txt

echo "=== Stage 3: Running Integration Tests ==="
pytest tests/

echo "=== Stage 4: Instructions for Local PostgreSQL Database ==="
echo "To run the application with a real PostgreSQL database, make sure to execute:"
echo "docker run --name pgvector-db -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=vector_db -p 5432:5432 -d pgvector/pgvector:pg16"
echo ""
echo "Starting Uvicorn API server locally..."
uvicorn app.main:app --reload --port 8000

Make the script executable:

chmod +x ~/rag_pgvector/run.sh

To run and start the service:

./run.sh

4. Technical Interview Self-Assessment

Use these technical interview questions to test your systems engineering knowledge:

CategoryHigh-Frequency Interview QuestionExpected Technical Answer Framework
Index SelectionWhy would you choose HNSW over IVFFlat for a fast-growing, high-recall vector search system?HNSW is preferred because its graph-based index supports incremental inserts out-of-the-box and provides excellent query recall (98%+) without requiring maintenance. IVFFlat uses k-means clustering. If data shifts or if you insert massive records without reindexing, your vectors will be assigned to old cluster centroids, causing search recall to degrade and requiring regular REINDEX operations.
Vector Space AlgebraUnder what conditions are Cosine Similarity and Dot Product mathematically identical?Cosine Similarity and Dot Product are identical when all vectors are normalized to unit length (A=1\|A\| = 1 and B=1\|B\| = 1). In this state, the denominator in the Cosine Similarity equation equals 1, meaning the calculation simplifies to the dot product (ABA \cdot B). Since calculating the dot product is faster, engineers normalize vectors during ingestion to optimize query latency.
Sparse vs. DenseExplain the difference in what Sparse Vectors (BM25) and Dense Vectors (Embeddings) capture during search.Sparse Vectors capture keyword frequency and token overlap. They are highly effective for locating exact matches (e.g., product IDs, serial codes) but cannot detect semantic meaning. Dense Vectors capture conceptual semantics, locating documents that share meaning without sharing specific words (e.g., matching 'running' with 'jogging').
Evaluation MetricsHow do Reciprocal Rank Fusion (RRF) and normalized Discounted Cumulative Gain (nDCG) differ?RRF is an algorithm used to merge and normalize ranking outputs from multiple separate search engines (e.g., Sparse + Dense). nDCG is an evaluation metric used to score the quality of a search engine's ranking output against a reference set of human relevance scores, accounting for both document relevance and its position in the list.
Metadata FilteringWhat is the difference between Pre-Filtering and Post-Filtering in vector databases?Post-Filtering runs vector similarity search first, retrieves the top-NN results, and then filters out records that do not match the metadata constraints, which can return fewer than the requested number of results if many are filtered. Pre-Filtering filters the document database first using the metadata rules, and then executes vector similarity search only on the matching subset, ensuring the target number of results is returned.
Retrieval BoundariesHow does Sentence Window Retrieval improve RAG synthesis compared to flat chunking?Flat chunking passes a raw chunk of text directly to the model. Sentence Window Retrieval embeds single sentences for precise similarity search, but retrieves the sentence along with a surrounding 'window' of context (e.g., 2 sentences before and after) to pass to the LLM. This provides the model with broader context while keeping search queries focused.

5. Exit Tasks for this Phase

Complete these verification steps before moving to the next batch:

  • Spin up a local pgvector/pgvector:pg16 Docker container.
  • Run the run.sh script to verify your virtual environment and start the development server.
  • Confirm that Pytest executes and passes all test cases successfully.
  • Ingest documents via curl -X POST -H "Content-Type: application/json" -d '{"content": "Astro is a modern web framework for content-driven websites.", "category": "frontend"}' http://localhost:8000/ingest.
  • Execute a semantic query via curl -X POST -H "Content-Type: application/json" -d '{"query": "web framework", "category": "frontend"}' http://localhost:8000/search.
  • Verify that results return with proper similarity scores and metadata classification.

Proceed to Part 19: AI Agents & Advanced Workflows with LangGraph →

Comments

Comments are powered by giscus. Set PUBLIC_GISCUS_REPO_ID and PUBLIC_GISCUS_CATEGORY_ID in your environment to enable them.