Turn Your Old HP Laptop Into a Home Server — Part 4: Local AI, Agents, and Visual Builders

Run local LLMs on your 16GB home server. Part 4 covers CPU inference physics, Ollama setup, model comparison (Phi-4, Qwen, Hermes), agent frameworks (OpenClaw vs Hermes), coding agents (Aider, OpenHands), and visual workflow builders.

In Part 3 of this series, we secured our home server's ingress by configuring Cloudflare Tunnels for public traffic, Tailscale for administrative mesh networks, and Caddy as our lightweight reverse proxy.

With our infrastructure, memory management, and networking securely established, we are now ready to tackle one of the most exciting capabilities of a modern home server: Local AI and Autonomous Agents.

Typical budget cloud virtual machines with 2 GB or 4 GB of RAM cannot comfortably host a useful Large Language Model (LLM). Calling commercial models through their APIs (like OpenAI's GPT-4o or Anthropic's Claude 3.5 Sonnet) introduces ongoing API costs and sends your private data to a third party.

By leveraging the i5-1035G1 processor and 16 GB of RAM of our HP laptop, we can deploy a fully operational local AI stack. In this guide, we will dive deep into the physics of CPU-only LLM inference, configure Ollama inside Docker, compare recommended lightweight models, evaluate agent frameworks like OpenClaw and Hermes, assess AI coding assistants, and map out the memory footprint of visual builders like Dify and Flowise.


1. The Physics of Local LLM Inference on CPU

Most AI research and enterprise deployments assume you have access to dedicated graphics cards (GPUs) with tens of gigabytes of high-bandwidth video RAM (VRAM) like the NVIDIA H100 or RTX 4090. Running LLMs on a budget laptop CPU requires understanding the mathematical bottlenecks of inference.

The Memory Bandwidth Bottleneck

An LLM generates text token by token. To generate a single token, the processor must load the entire model's weight matrix from system memory (RAM), perform a series of matrix-vector multiplications with the input context, select the next token, and write it back.

Therefore, LLM generation speed is fundamentally bounded by memory bandwidth, not raw compute speed.

Let's look at the numbers for our HP 15s-du2077TU:

  • Processor: Intel Core i5-1035G1 supports dual-channel DDR4 memory.
  • Memory Speed: DDR4-2666 (2666 MT/s over a 64-bit bus) provides a theoretical maximum bandwidth of 21.3 GB/s per channel, or 42.6 GB/s in a dual-channel configuration.
  • Note: If your laptop has only a single stick of RAM installed, it is running in single-channel mode. Adding a second matching stick enables dual-channel mode, which doubles theoretical memory bandwidth and can roughly double local LLM generation speed.
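The per-channel figure follows from the transfer rate multiplied by the 64-bit (8-byte) bus width. A minimal sketch of that arithmetic, where the function name `ddr_bandwidth_gbs` is purely illustrative:

```python
def ddr_bandwidth_gbs(mega_transfers_per_sec: float, channels: int = 1,
                      bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_sec * 1e6 * bus_bytes * channels / 1e9


single = ddr_bandwidth_gbs(2666, channels=1)
dual = ddr_bandwidth_gbs(2666, channels=2)
print(f"single-channel: {single:.2f} GB/s, dual-channel: {dual:.2f} GB/s")
```

The unrounded results are 21.33 GB/s and 42.66 GB/s; the article works with the rounded 21.3 GB/s and 42.6 GB/s. Real-world sustained bandwidth is lower than this theoretical peak.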

Calculating Theoretical Token Generation Speed

Assume we want to run a 7-billion-parameter model. If the model weights are stored in standard FP16 (16-bit floating point), the model size is:

$$\text{Size} = 7 \text{ billion parameters} \times 2 \text{ bytes} = 14 \text{ GB}$$

To generate a single token, the CPU must read 14 GB of weights from memory. Using our dual-channel DDR4 bandwidth of 42.6 GB/s, the absolute physical limit of token speed is:

$$\text{Max Speed} = \frac{42.6 \text{ GB/s}}{14 \text{ GB/token}} \approx 3.04 \text{ tokens per second}$$

If we use a 4-bit quantized model (which compresses the weights to 4 bits or 0.5 bytes per parameter while maintaining most of the model's accuracy), the model size drops:

$$\text{Quantized Size} = 7 \text{ billion parameters} \times 0.5 \text{ bytes} \approx 3.5 \text{ GB}$$

Now, the theoretical speed limit increases significantly:

$$\text{Max Speed} = \frac{42.6 \text{ GB/s}}{3.5 \text{ GB/token}} \approx 12.17 \text{ tokens per second}$$

This mathematical reality explains why quantization is mandatory for consumer-grade CPU hosting.
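The same calculation can be reused for other model sizes and quantization levels. The sketch below assumes the simplified model used above (every weight is read once per token, nothing else competes for bandwidth); the helper names are illustrative.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bytes per parameter."""
    return params_billions * bits_per_weight / 8


def max_tokens_per_sec(bandwidth_gbs: float, size_gb: float) -> float:
    """Upper bound on generation speed when each token reads all weights once."""
    return bandwidth_gbs / size_gb


for label, bits in [("FP16", 16), ("Q4", 4)]:
    size = model_size_gb(7, bits)
    ceiling = max_tokens_per_sec(42.6, size)
    print(f"{label}: {size:.1f} GB -> {ceiling:.2f} tokens/s ceiling")
```

These are ceilings, not predictions: measured speeds will be lower because of prompt processing, KV-cache reads, and bandwidth that never reaches the theoretical peak.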

+-------------------------------------------------------------+
|                MEMORY BANDWIDTH INFERENCE PATH              |
|                                                             |
|   [ DDR4 RAM ] ----------> [ i5-1035G1 CPU ]                |
|   Bandwidth: ~42 GB/s      Processes Matrix Multiplications |
|   (Loads model weights)    (Calculates token probabilities) |
|           ^                         |                       |
|           | (FP16: 14GB/token)      v                       |
|           +------------------ [ 3 tokens/sec ]              |
|           | (Q4: 3.5GB/token)                               |
|           +------------------ [ 12 tokens/sec ]             |
+-------------------------------------------------------------+

CPU Threads Tuning: The Core Count Rule

When configuring your LLM runtime (like llama.cpp or Ollama), you can specify how many CPU threads to allocate.

  • The Trap: You might assume that because the i5-1035G1 has 8 logical threads (via Hyper-Threading), you should set the thread count to 8.
  • The Reality: Hyper-Threading works by scheduling two logical threads on a single physical core, sharing execution pipelines. Because matrix multiplication saturates the physical execution units completely, hyper-threading introduces thread scheduling overhead, cache thrashing, and actually slows down inference.
  • The Solution: Always set the thread count to match your physical core count (which is 4 for the i5-1035G1).
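One way to apply this per request is Ollama's `num_thread` option, passed in the `options` field of a generate call. A minimal sketch, assuming a hypothetical helper `generate_payload` and the 4 physical cores of the i5-1035G1:

```python
def generate_payload(model: str, prompt: str, physical_cores: int = 4) -> dict:
    """Build a JSON body for POST /api/generate that pins the thread count.

    physical_cores defaults to 4 (i5-1035G1); adjust it for other CPUs.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_thread": physical_cores},
    }


payload = generate_payload("phi4:mini", "Say hello in one sentence.")
print(payload["options"])
```

Sending the payload (for example with `requests.post`) and timing a few runs at different thread counts is the most reliable way to confirm the best value on your own hardware.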

2. Deploying Ollama in Docker

Ollama is a highly optimized, easy-to-use engine for running local LLMs. It packages llama.cpp's C++ execution engine and exposes a clean HTTP API (compatible with OpenAI's API schemas) on port 11434.

Docker Compose Configuration

Add Ollama to your /mnt/data/docker-compose.yml file. Since we do not have an external GPU, we will configure it for CPU execution and restrict its resource allocation:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama_ai
    restart: always
    ports:
      - "127.0.0.1:11434:11434" # Bind to localhost for security
    volumes:
      - /mnt/data/ollama/data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=1 # Avoid concurrent inference runs to prevent OOM
      - OLLAMA_KEEP_ALIVE=5m  # Keep models in RAM for 5 minutes of idle time
    deploy:
      resources:
        limits:
          memory: 4500M # Cap memory to prevent host OOM crashes; raise it (e.g. 7000M) before loading Hermes 3 8B (~5.8 GB)
          cpus: '3.00'  # Leave 1 core free for OS and network processes

Start the service:

docker compose up -d ollama

Verify the API is active:

curl http://127.0.0.1:11434/api/tags
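The same check can be scripted. The sketch below uses only the standard library and assumes the `/api/tags` response contains a `models` list whose entries carry a `name` field; the function names are illustrative.

```python
import json
from urllib.request import urlopen


def installed_models(tags: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [entry["name"] for entry in tags.get("models", [])]


def fetch_tags(base_url: str = "http://127.0.0.1:11434", timeout: float = 5.0) -> dict:
    """Fetch and parse /api/tags; raises URLError if Ollama is not reachable."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# On the server: print(installed_models(fetch_tags()))
```

An empty list simply means no models have been pulled yet; a connection error means the container is not running or the port binding differs from the compose file.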

3. Model Assessment: What Runs on 16 GB?

When picking an LLM, you must balance parameter count, RAM usage, context window, and accuracy. On a 16 GB server, we have roughly 12 GB of RAM available for the model after allocating space for the OS and other services.

Prefer the Q4_K_M (4-bit medium) quantization as the default. It is a practical sweet spot, shrinking the model to roughly a quarter to a third of its FP16 size while adding only a small perplexity penalty relative to the original model.

Let's evaluate the top models available in 2026:

LLM Model Comparison Table

| Model | Parameters | RAM Used | Speed (CPU) | Context Window | Target Use Case / Recommendation |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8 Billion | ~2.5 GB | ~9–11 t/s | 128K | Recommended for general tasks. Developed by Microsoft. Extremely fast on CPU. Has a massive 128k context window, allowing you to feed in long documents. |
| Qwen3 4B | 4.0 Billion | ~2.8 GB | ~8–10 t/s | 32K | Best for multilingual support. Excellent reasoning capabilities for its size, especially in coding, math, and translation tasks. |
| Gemma 3 4B | 4.0 Billion | ~2.8 GB | ~8–10 t/s | 8K | Solid conversational model. Developed by Google. High quality but smaller context window. |
| Hermes 3 8B | 8.0 Billion | ~5.8 GB | ~4–5 t/s | 32K | Recommended for agentic loops. Fine-tuned by Nous Research. Excellent instruction-following, tool use, and cognitive depth. Slow but powerful. |
| Devstral 7B | 7.0 Billion | ~5.0 GB | ~4–6 t/s | 32K | Best for code completion. Fine-tuned for software development, writing code blocks, and resolving bugs. |

The Model Swapping Strategy

Because loading two models simultaneously (e.g., Phi-4 Mini and Hermes 3 8B) would exceed our memory limits, we must configure Ollama to keep only one model loaded at a time.

  • OLLAMA_MAX_LOADED_MODELS=1, set in the container's environment block, caps the number of models resident in memory at one, so requesting a different model unloads the current one first.
  • OLLAMA_NUM_PARALLEL=1 makes Ollama process requests to the loaded model one at a time rather than concurrently.
  • OLLAMA_KEEP_ALIVE=5m keeps a model in RAM for 5 minutes after its last request. If no new requests arrive, Ollama automatically unloads the model, freeing up RAM for your databases and other containers.
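The keep-alive window can also be overridden per request through the `keep_alive` field of Ollama's generate API, and a request with `keep_alive` set to 0 and no prompt asks Ollama to unload the model immediately. A small sketch with hypothetical helper names:

```python
def generate_request(model: str, prompt: str, keep_alive: str = "5m") -> dict:
    """Body for POST /api/generate that keeps the model loaded for keep_alive."""
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep_alive}


def unload_request(model: str) -> dict:
    """Body for POST /api/generate that unloads the model right away."""
    return {"model": model, "keep_alive": 0}


# Example: free RAM held by the 8B model before switching to a lighter one.
print(unload_request("hermes3:8b"))
print(generate_request("phi4:mini", "Summarize today's backup log.", keep_alive="10m"))
```

The model tag `hermes3:8b` is an assumption for illustration; use whatever tag `ollama list` reports on your server.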

4. Agent Frameworks: OpenClaw vs. Hermes Agent

Once you can execute local models, you can build AI Agents. An agent is an LLM wrapper that operates in a continuous loop: it parses input, calls external tools (like database queries, web searches, or shell executions), inspects the results, and acts autonomously.
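The loop described above can be sketched in a few lines. This is a generic illustration, not the implementation of any framework discussed below: the `FINAL:` convention, the tool-call format, and the scripted `fake_llm` stand-in are all assumptions made so the example runs without a model.

```python
from typing import Callable


def run_agent(task: str, llm: Callable[[str], str],
              tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Ask the LLM what to do, run the chosen tool, feed the result back, repeat."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        reply = llm(transcript)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        tool_name, _, arg = reply.partition(" ")
        tool = tools.get(tool_name)
        observation = tool(arg) if tool else f"unknown tool {tool_name!r}"
        transcript += f"\nAction: {reply}\nObservation: {observation}"
    return "stopped: step limit reached"


def fake_llm(transcript: str) -> str:
    """Scripted stand-in for a model: call one tool, then answer."""
    if "Observation:" not in transcript:
        return "uptime now"
    return "FINAL: The server reports it has been up 3 days."


answer = run_agent("How long has the server been up?", fake_llm,
                   {"uptime": lambda _arg: "up 3 days"})
print(answer)
```

The `max_steps` cap matters on CPU-only hosts: every iteration is a full LLM call, so an unbounded loop can occupy the machine for minutes.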

Let's analyze two prominent open-source agent paradigms: OpenClaw and Hermes Agent.

+------------------------------------------+   +------------------------------------------+
|                 OPENCLAW                 |   |               HERMES AGENT               |
|                                          |   |                                          |
|  +------------------+ +---------------+  |   |  +------------------------------------+  |
|  | WhatsApp Bridge  | | Slack Bridge  |  |   |  |        Self-Reflective Loop        |  |
|  +------------------+ +---------------+  |   |  |      "Did I solve the task?"       |  |
|  +------------------------------------+  |   |  +------------------------------------+  |
|  |           ClawHub Gateway          |  |   |  +------------------------------------+  |
|  +------------------------------------+  |   |  |         Long-Term Memory           |  |
|  |           Tool Integrations        |  |   |  |            (Vector DB)             |  |
|  +------------------------------------+  |   |  +------------------------------------+  |
+------------------------------------------+   +------------------------------------------+

OpenClaw

OpenClaw is a gateway-first, connector-heavy framework. It acts as an integration router, bridging local LLMs to communications platforms.

  • Key Features:
    • ClawHub: A shared marketplace of connectors and tools.
    • Built-in bridges for Telegram, WhatsApp, Slack, and Discord.
    • Designed for broad integration: you can hook your local bot up to search tools and document databases quickly.
  • Operational Cost: Low memory overhead (~80 MB–150 MB for the agent daemon).

Hermes Agent

Hermes Agent is a cognitive-first, learning-centric framework designed specifically to exploit the advanced reasoning capabilities of the Nous Hermes model series.

  • Key Features:
    • Self-Reflective Loops: The agent runs internal critiques ("Did my previous tool call return correct data? If not, how do I adjust my strategy?").
    • Persistent Episodic Memory: Stores historical task execution logs in a local vector database to learn from past failures.
    • Deep tool-use: optimized for executing complex code blocks.
  • Operational Cost: High compute load. Because of the self-reflective loops, generating a response takes multiple LLM runs, which can take 1-2 minutes on CPU.

Feature Matrix: OpenClaw vs. Hermes Agent

| Dimension | OpenClaw | Hermes Agent |
|---|---|---|
| Primary Focus | Connectivity & Integrations | Reasoning & Cognitive Depth |
| Messaging Support | Native (Telegram, WhatsApp) | Custom/API only |
| Memory System | Simple Context Window | Vector-based Episodic Memory |
| Looping Architecture | Linear tool-use | Self-correcting reflection loops |
| Model Optimization | Agnostic (any model) | Tuned for Hermes-3-8B |
| RAM Footprint | ~100 MB | ~250 MB (excluding LLM) |
| Ideal For | Custom automation bots, notification routers | Autonomous research tasks, code synthesis |

Step 4.2: Cognitive Execution Traces: Linear vs. Cyclic Loops

To understand the practical difference between these two systems under local CPU constraints, let's trace how each agent processes a typical user request.

Scenario A: Routing a notification when a build fails (OpenClaw)

OpenClaw is designed for speed and direct routing. It relies on a linear execution path:

  1. Event Trigger: A GitHub webhook notifies OpenClaw that a CI/CD build has failed.
  2. Payload Parsing: OpenClaw's internal router parses the repository name and error log.
  3. LLM Call: OpenClaw calls Ollama with the prompt: Summarize this build error in one sentence: [log snippet]
  4. Inference: Ollama running Phi-4 Mini generates the summary: Build failed due to missing semicolon on line 42 of auth.ts.
  5. Output Routing: OpenClaw's Telegram connector takes the output and pushes a message to the user's admin channel.
  • Result: With the model already loaded, the process typically completes within a few seconds. It is direct, uses minimal resources, and makes exactly one LLM call.

Scenario B: Writing a server temperature monitoring script (Hermes Agent)

Hermes Agent is designed for problem-solving. It uses a cyclic reflection loop:

  1. User Input: User requests: Write a script to check CPU temperature on Ubuntu.
  2. Memory Recall: The agent queries its local vector store to check for past thermal tasks or scripts.
  3. Tool Call 1 (Write file): The agent decides to write an initial script using lm-sensors:
    #!/bin/bash
    sensors | grep -i "Core 0"
    
  4. Tool Call 2 (Validation execution): The agent runs the script inside a local shell sandboxed process to verify its correctness.
  5. Error Interception: The sandbox returns: line 2: sensors: command not found.
  6. Self-Correction Reflection: The agent does not return the broken script to the user. Instead, it feeds the error back into the LLM: Thoughts: "The script failed because lm-sensors is not installed on this system. I must install the package first, or find a fallback method like reading /sys/class/thermal/thermal_zone0/temp."
  7. Tool Call 3 (Fallback implementation): The agent rewrites the script using sysfs path mapping:
    #!/bin/bash
    cat /sys/class/thermal/thermal_zone0/temp
    
  8. Tool Call 4 (Re-evaluation execution): The agent executes the new script in the sandbox. It succeeds, returning 42000 (meaning 42°C).
  9. Formatting Output: The agent formats the final code, explaining how it parses the raw millidegrees, and presents it to the user.
  • Result: The process takes 40 to 90 seconds on a CPU and makes 4 separate LLM queries. However, the output has been executed and self-corrected against the host system, so it is far less likely to need manual debugging by the developer.
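The write, run, and correct cycle in Scenario B boils down to a retry loop that feeds the last error back into the next generation. The sketch below is not Hermes Agent's code; `fake_writer` and `fake_sandbox` are scripted stand-ins for the LLM and the sandbox so the control flow can be run without either.

```python
from typing import Callable, Tuple


def reflect_and_retry(task: str,
                      write_script: Callable[[str, str], str],
                      run_script: Callable[[str], Tuple[bool, str]],
                      max_attempts: int = 3) -> Tuple[str, int]:
    """Generate a script, execute it, and retry with the error as feedback."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        script = write_script(task, feedback)
        ok, output = run_script(script)
        if ok:
            return script, attempt
        feedback = output  # the error message becomes context for the next try
    raise RuntimeError(f"no working script after {max_attempts} attempts: {feedback}")


def fake_writer(task: str, feedback: str) -> str:
    """Stand-in for the LLM: falls back to sysfs once it sees the error."""
    if "command not found" in feedback:
        return "cat /sys/class/thermal/thermal_zone0/temp"
    return 'sensors | grep -i "Core 0"'


def fake_sandbox(script: str) -> Tuple[bool, str]:
    """Stand-in for the sandbox: lm-sensors is missing, sysfs works."""
    if script.startswith("sensors"):
        return False, "line 2: sensors: command not found"
    return True, "42000"


script, attempts = reflect_and_retry("check CPU temperature", fake_writer, fake_sandbox)
print(attempts, script)
```

Each attempt costs at least one LLM call, which is why the reflective approach is so much slower than the linear path on a CPU-only host.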

4.3 Multi-Agent Frameworks: CrewAI vs. LangGraph vs. AutoGen

If you want to move beyond single-agent loops and build networks of collaborating AI agents—where one agent does research, another compiles code, and a third runs tests—you must use a multi-agent framework.

Let's evaluate the top three frameworks from a resource and architecture perspective:

1. CrewAI

CrewAI is a high-level, role-based multi-agent framework. It operates on a simple mental model: you define Agents (with specific roles, backstories, and tools) and Tasks, then define a Crew to orchestrate them (either sequentially or hierarchically).

  • Pros: Fast prototyping, readable code, built-in task delegation.
  • Cons: Abstracted execution loops. It can make multiple duplicate LLM calls in the background to handle orchestration, which can freeze a local CPU-only host.
  • RAM Footprint: ~120 MB.

2. LangGraph

LangGraph is a low-level orchestration framework developed by LangChain. It models multi-agent workflows as a State Graph (a cyclic state machine) where nodes represent computation steps (or agent calls) and edges represent transition logic.

  • Pros: Complete control over agent decision paths, state persistence, time-travel debugging, and highly predictable prompt flows. Extremely efficient for CPU hosting because you control exactly when the LLM is queried.
  • Cons: Steeper learning curve; requires writing custom state management code.
  • RAM Footprint: ~80 MB.

3. AutoGen

AutoGen is a conversational multi-agent framework developed by Microsoft. It is designed around the concept of agent-to-agent conversation.

  • Pros: Highly dynamic, supports human-in-the-loop triggers easily, and is good for research.
  • Cons: Conversations can easily diverge into infinite loops, consuming massive CPU cycles and memory.
  • RAM Footprint: ~150 MB.

LangGraph State Machine Blueprint (agent_graph.py)

Here is a complete, lightweight Python script showing how to build a state-machine router using LangGraph and our local Ollama API to route queries between a general chat assistant and an SQL execution agent:

from typing import TypedDict, Annotated, Sequence
import operator
from langgraph.graph import StateGraph, END
import requests

# Define our shared system state
class AgentState(TypedDict):
    messages: Annotated[Sequence[str], operator.add]
    next_step: str

OLLAMA_URL = "http://localhost:11434"

# Node 1: Classifier Agent (Phi-4 Mini)
def classifier_agent(state: AgentState):
    user_message = state["messages"][-1]
    prompt = f"""Analyze the user query. Determine if the query requires querying a database (SQL) or if it is a general question.
Respond with exactly one word: 'database' or 'general'.

Query: {user_message}
Classification:"""
    
    response = requests.post(f"{OLLAMA_URL}/api/generate", json={
        "model": "phi4:mini",
        "prompt": prompt,
        "stream": False
    }, timeout=120)  # CPU inference is slow; fail instead of hanging forever
    response.raise_for_status()
    classification = response.json()["response"].strip().lower()
    
    # Clean output
    next_step = "database" if "database" in classification else "general"
    return {"next_step": next_step}

# Node 2: SQL Agent
def sql_executor(state: AgentState):
    user_message = state["messages"][-1]
    # In a real app, this would query postgres
    result = f"Executed mock query for request: {user_message}"
    return {"messages": [f"[SQL Runner Node]: {result}"]}

# Node 3: General Chat Agent
def general_assistant(state: AgentState):
    user_message = state["messages"][-1]
    prompt = f"Respond politely to: {user_message}"
    
    response = requests.post(f"{OLLAMA_URL}/api/generate", json={
        "model": "phi4:mini",
        "prompt": prompt,
        "stream": False
    }, timeout=120)  # CPU inference is slow; fail instead of hanging forever
    response.raise_for_status()
    answer = response.json()["response"]
    return {"messages": [f"[Assistant Node]: {answer}"]}

# Conditional router function
def router(state: AgentState):
    return state["next_step"]

# Build the Graph
workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("classifier", classifier_agent)
workflow.add_node("sql_node", sql_executor)
workflow.add_node("general_node", general_assistant)

# Set entry point
workflow.set_entry_point("classifier")

# Add conditional routing edges
workflow.add_conditional_edges(
    "classifier",
    router,
    {
        "database": "sql_node",
        "general": "general_node"
    }
)

# Connect worker nodes to the end
workflow.add_edge("sql_node", END)
workflow.add_edge("general_node", END)

# Compile graph
app = workflow.compile()

# Example run
if __name__ == "__main__":
    initial_state = {"messages": ["How many users are registered in the db?"]}
    result = app.invoke(initial_state)
    print("Final State Messages:\n", result["messages"])

By mapping the logic into explicit state nodes and edges, we avoid the overhead of complex, dynamic prompts, allowing our local CPU to run agent logic in a structured, resource-capped manner.


5. AI Coding Assistants: Aider vs. OpenHands

If you write code on your home server, you can use AI coding assistants to speed up your development lifecycle. Let's compare the leading tools that interface with local LLMs.

Aider

Aider is a Git-native, terminal-based AI pair programmer. You launch it directly in your git repository.

  • How it works: It parses your local directory, creates a map of your codebase (using tree-sitter), and sends only the relevant code symbols to the LLM. It lets you ask for modifications, automatically writes code blocks to your local files, and commits the changes with descriptive messages.
  • Local AI Compatibility: Highly compatible with Ollama. Works well with Devstral 7B or DeepSeek Coder models.
  • Memory Overhead: Negligible. Running Aider in your terminal consumes less than 50 MB of RAM because it is a simple CLI script.

OpenHands (formerly OpenDevin)

OpenHands is an autonomous software development platform with a web-based user interface.

  • How it works: It launches a web UI, accepts complex prompts (e.g., "Implement a password reset API"), and spins up an isolated Docker container containing a full Linux environment. The agent executes shell commands, installs dependencies, compiles code, runs tests inside the sandbox, and resolves errors autonomously.
  • Local AI Compatibility: Can run with local models, but requires high-capacity models (like Llama-3-70B) to perform reliably. Running it with lightweight 7B models on CPU often leads to loop failures or syntax errors.
  • Memory Overhead: Severe. OpenHands requires running its core server, a Docker-in-Docker system, and isolated sandbox containers. This stack consumes 2 GB to 4 GB of RAM before accounting for the LLM.

Coding Assistant Comparison

| Feature | Aider | OpenHands | SWE-agent | Kilo Code |
|---|---|---|---|---|
| Interface | Terminal CLI | Web Browser UI | Terminal CLI | IDE Plugin |
| Sandboxing | None (Runs on host) | Docker Container | Docker Container | None (Host) |
| Autonomous Level | Interactive pair | High (Task-driven) | High (Issue-driven) | Interactive |
| Host RAM Cost | <50 MB | ~3 GB | ~2 GB | <100 MB |
| CPU Speed Penalty | Minimal | High (Container init) | High | Minimal |
| 16GB Server Recommendation | Highly Recommended | Not Recommended | Not Recommended | Recommended |

On a 16 GB home server, Aider is the hands-down winner. It is lightweight, fast, and does not waste precious memory on heavy sandbox runtimes.


6. Visual AI Builders: Dify vs. Flowise

If you want to build custom AI workflows, chatbots, and retrieval-augmented generation (RAG) pipelines without writing boilerplate integration code, you can use visual builders.

Let's look at the resource footprints of the two leading platforms:

Dify

Dify is an enterprise-grade LLM application development platform. It features agent orchestration, RAG pipelines, prompt IDEs, and application monitoring.

  • The Stack: Dify is built as a complex microservices architecture. Running Dify via Docker Compose requires spinning up:
    • Dify API Service (Python/Flask)
    • Dify Worker Service (Celery tasks)
    • Dify Web UI (Next.js)
    • PostgreSQL (Data storage)
    • Redis (Cache & task queue)
    • Weaviate or Qdrant (Vector database for RAG)
    • Nginx (Routing proxy)
  • RAM Overhead: Running the default Dify stack consumes 2.5 GB to 3.5 GB of RAM at idle. This is a major portion of our 16 GB budget, leaving little room for Ollama or running APIs.

Flowise

Flowise is a lightweight, node-based drag-and-drop interface for building LangChain/LlamaIndex pipelines.

  • The Stack: Flowise is written in Node.js and compiled into a single application. It can run with a SQLite database for configuration storage and does not require complex vector engines to operate basic pipelines.
  • RAM Overhead: A bare-metal Flowise installation consumes only 120 MB to 200 MB of RAM at idle.

The Pragmatic Choice

For our 16 GB home server, Flowise is the optimal visual builder. It provides all the necessary tools to test prompt layouts and chain together local models, while preserving RAM for running Ollama models.


7. The Optimal AI Stack Configuration

To fit our AI services into our server budget without triggering OOM crashes, we will implement this specific stack configuration:

# Flowise Compose configuration - file path: /mnt/data/docker-compose.yml
services:
  flowise:
    image: flowiseai/flowise:latest
    container_name: flowise_canvas
    restart: always
    environment:
      - PORT=8083
      - DATABASE_PATH=/data
      - APIKEY_PATH=/data
    volumes:
      - /mnt/data/flowise/data:/data
    ports:
      - "127.0.0.1:8083:8083" # Admin only
    networks:
      - web_net
    deploy:
      resources:
        limits:
          memory: 384M

Resource Calculation with AI Active:

  • Host OS + Base Services (Caddy, DBs, Bots): ~4.5 GB
  • Flowise: ~250 MB
  • Ollama (Idle): ~100 MB
  • Ollama (Running Phi-4 Mini 3.8B): ~2.8 GB
  • Total RAM with Active AI: ~7.6 GB

This layout leaves more than 8 GB of memory free. Even if we swap the model to the larger Hermes 3 8B for a complex agent task, the total consumption rises to ~10.6 GB, staying safely below our 16 GB threshold.
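The budget above is easy to re-check whenever a service is added or a model is swapped. A minimal sketch using the figures from this section (the dictionary keys are illustrative labels, not real service names):

```python
BUDGET_GB = 16.0

# Figures taken from the resource calculation above, in GB.
base_services = {"host_os_and_base_services": 4.5, "flowise": 0.25, "ollama_idle": 0.1}
models = {"phi4_mini": 2.8, "hermes3_8b": 5.8}


def total_ram_gb(model: str) -> float:
    """Total expected RAM use with the given model loaded."""
    return sum(base_services.values()) + models[model]


for name in models:
    used = total_ram_gb(name)
    print(f"{name}: {used:.2f} GB used, {BUDGET_GB - used:.2f} GB free")
```

The exact sums are 7.65 GB and 10.65 GB, which the article rounds down to ~7.6 GB and ~10.6 GB.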

8. Implementing a Local RAG Pipeline with Qdrant and Ollama

A key use case for local LLMs is querying your private documents (PDFs, Markdown notes, source code). This is achieved through Retrieval-Augmented Generation (RAG).

A RAG pipeline has three steps:

  1. Chunking & Embedding: Splitting documents into small text chunks and converting them into dense vector arrays (embeddings) using a dedicated embedding model.
  2. Vector Storage: Saving the embeddings into a database optimized for similarity search.
  3. Contextual Querying: When you ask a question, the pipeline retrieves the most relevant chunks from the vector store, appends them as context to your prompt, and sends it to the LLM.
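The first step, chunking, is worth seeing concretely. The sketch below splits text into overlapping word windows; the function name and the default sizes are illustrative assumptions, and production pipelines often split on sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of chunk_size words, each sharing overlap words
    with the previous chunk so context is not cut off at the boundary."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each resulting chunk would then be embedded and stored as one point in the vector database.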

To run this locally with low memory overhead, we will deploy the Qdrant Vector Database and write a simple Python script using Ollama's embedding API.

+-----------------------------------------------------------------------------+
|                          LOCAL RAG PIPELINE FLOW                            |
|                                                                             |
|  [Document] -> [Chunker] -> [Ollama Embeddings API] -> [Qdrant Vector DB]   |
|                                  (nomic-embed-text)             |           |
|                                                                 |           |
|  [User Query] --------------------------------------------------+           |
|       |                                                                     |
|       v                                                                     |
|  [Qdrant Search] -> (Retrieves Top 3 Chunks)                                |
|                            |                                                |
|                            v                                                |
|  [Context + Query] -> [Ollama LLM (Phi-4)] -> [Answer Response]             |
+-----------------------------------------------------------------------------+

Step 8.1: Deploy Qdrant in Compose

Qdrant is written in Rust, extremely fast, and consumes less than 50 MB of RAM at idle. Add this service to your compose file:

services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant_vector_db
    restart: always
    ports:
      - "127.0.0.1:6333:6333"
    volumes:
      - /mnt/data/qdrant/storage:/qdrant/storage
    deploy:
      resources:
        limits:
          memory: 256M

Step 8.2: Create the Python RAG Script (rag_worker.py)

This script uses nomic-embed-text (a highly accurate 278 MB embedding model) to embed document chunks and store them in Qdrant, then queries the local LLM.

First, pull the embedding model:

docker exec -it ollama_ai ollama pull nomic-embed-text

Create the script:

import os
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import requests

# Config
QDRANT_URL = "http://localhost:6333"
OLLAMA_URL = "http://localhost:11434"
COLLECTION_NAME = "local_documents"

client = QdrantClient(url=QDRANT_URL)

# 1. Initialize Collection in Qdrant (Nomic embeds have 768 dimensions)
if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )

def get_embedding(text):
    response = requests.post(f"{OLLAMA_URL}/api/embeddings", json={
        "model": "nomic-embed-text",
        "prompt": text
    })
    return response.json()["embedding"]

# 2. Embed and Index a Document Chunk
def index_chunk(chunk_id, text):
    vector = get_embedding(text)
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[
            PointStruct(
                id=chunk_id,
                vector=vector,
                payload={"text": text}
            )
        ]
    )

# 3. Retrieve Context and Query LLM
def query_rag(user_query):
    query_vector = get_embedding(user_query)
    
    # Search Qdrant for top 3 matching chunks
    # (query_points replaces the deprecated client.search in recent qdrant-client releases)
    search_result = client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        limit=3
    )
    
    context = "\n".join([hit.payload["text"] for hit in search_result.points])
    
    # Construct Contextual Prompt
    prompt = f"""Use the following pieces of context to answer the user query at the end.
If you do not know the answer, say that you do not know.

Context:
{context}

Query: {user_query}
Answer:"""

    # Call local LLM (Phi-4 Mini)
    response = requests.post(f"{OLLAMA_URL}/api/generate", json={
        "model": "phi4:mini",
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]

# Example Execution
if __name__ == "__main__":
    # Index some test document chunks
    index_chunk(1, "The HP 15s-du2077TU server has an Intel i5-1035G1 CPU with 4 cores.")
    index_chunk(2, "The primary NVMe SSD is used for operating system binaries and active databases.")
    
    # Run query
    result = query_rag("What processor does the HP server have?")
    print("AI Answer:\n", result)

This RAG script executes locally on your CPU with sub-second vector search latency, keeping your data locked inside your local storage volume.

Step 8.3: Vector Database and Index Comparison for 16GB RAM

When choosing a vector database for a resource-constrained server, you must evaluate both the storage engine's overhead and the mathematical index structure used for search.

Let's compare the leading options for local deployment:

| Vector Store | Base RAM Footprint | Search Latency (CPU) | Storage Overhead | 16GB Server Recommendation |
|---|---|---|---|---|
| Qdrant | ~35 MB–50 MB | Extremely Low (<5ms) | Low | Highly Recommended. Written in Rust, highly optimized, native cgroup compatibility. |
| pgvector | ~0 MB extra | Low (<15ms) | Medium | Highly Recommended if already hosting PostgreSQL. Zero extra engine overhead. |
| Chroma | ~120 MB–250 MB | Low (<20ms) | Low | Recommended only for basic Python prototyping. |
| Milvus / Weaviate | ~1.5 GB–3.0 GB | Low (<5ms) | High | Not Recommended. Heavy enterprise-oriented Go/C++ services with multiple dependencies; likely to cause system OOM crashes. |

Optimizing Index Structures (HNSW vs. IVF)

The Hierarchical Navigable Small World (HNSW) algorithm is the gold standard for vector similarity search. It constructs a multi-layer graph of vectors, allowing search queries to traverse the graph logarithmically.

  • The Memory Cost: By default, HNSW holds the entire search graph in memory (RAM) to deliver low-millisecond search latency. For high-dimensional embeddings (e.g. 1536-dimension models), this can consume hundreds of megabytes of RAM as your document count grows.
  • The CPU Solution: In Qdrant, we can configure HNSW to run on-disk rather than in memory. This swaps graph nodes into memory on demand using memory-mapped files (mmap), reducing RAM usage by up to 70% with a negligible increase in search latency.
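To put a number on the memory cost, the sketch below estimates the size of the raw float32 vectors alone. It deliberately ignores the HNSW graph links and stored payloads, which add more on top, so treat the result as a lower bound; the function name is illustrative.

```python
def raw_vector_mb(num_vectors: int, dim: int, bytes_per_float: int = 4) -> float:
    """Size of the raw float32 vectors in MB (1 MB = 1e6 bytes), index excluded."""
    return num_vectors * dim * bytes_per_float / 1e6


for dim in (768, 1536):
    print(f"100,000 vectors x {dim} dims: {raw_vector_mb(100_000, dim):.1f} MB")
```

At 100,000 chunks, 768-dimension vectors already occupy about 307 MB and 1536-dimension vectors about 614 MB, both above a 256 MB container limit if everything has to stay in RAM.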

Create a custom Qdrant configuration /mnt/data/qdrant/config.yaml:

storage:
  # Keep the HNSW graph on disk (memory-mapped) instead of fully in RAM
  hnsw_index:
    on_disk: true

  # Limit how many segments the optimizer maintains per collection
  optimizers:
    default_segment_number: 2

Mount this configuration inside your Qdrant container:

    volumes:
      - /mnt/data/qdrant/storage:/qdrant/storage
      - /mnt/data/qdrant/config.yaml:/qdrant/config/production.yaml

Combined with the 256 MB cgroup limit in the compose file, this keeps Qdrant's memory use bounded even if you index hundreds of thousands of document chunks, at the cost of some extra disk reads during search.


9. Deploying Open WebUI: The Ultimate Chat Interface

To interact with your local Ollama models using a polished interface similar to ChatGPT, we will deploy Open WebUI.

Open WebUI is a feature-rich client that supports:

  • Multi-model chats with markdown rendering.
  • Document uploading and automatic RAG processing (utilizing a built-in PyTorch embeddings engine).
  • User access control and sharing.
  • Integration with external API gateways.

Add Open WebUI to your compose file:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open_webui
    restart: always
    ports:
      - "127.0.0.1:8084:8080" # Exposed locally
    volumes:
      - /mnt/data/open-webui/data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama_ai:11434
      - WEBUI_SECRET_KEY=generate_a_secure_webui_key
    networks:
      - web_net
    deploy:
      resources:
        limits:
          memory: 512M

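Note: Rather than publishing this port directly, route Open WebUI through Caddy or reach it privately over Tailscale so you can log in securely with your local credentials. For the ollama_ai hostname in OLLAMA_BASE_URL to resolve, the Ollama container must be attached to the same web_net network as Open WebUI.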


10. Code Synthesis in Action with Aider

To illustrate why Aider is the recommended tool for our resource-constrained home server, let's walk through an interactive coding session.

Step 10.1: Launch Aider with local Ollama

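Ensure your local Ollama container is running and has the model you plan to use pulled. The example below uses phi4:mini; a code-focused model such as devstral or qwen3:4b can be substituted. From your server terminal, export the host configuration: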

# Point Aider to your local Ollama endpoint
export OLLAMA_API_BASE="http://localhost:11434"

# Run Aider, specifying the model to use
aider --model ollama/phi4:mini

Step 10.2: Prompting for Code Edits

Aider launches an interactive chat interface. You can add specific source files to the session context:

/add src/webhook.ts

Now, ask Aider to write a new feature:

Replace the console.log calls in the webhook server with structured logging using pino.

Step 10.3: Autonomous Application

Aider parses the file structure, crafts the edits, and outputs a diff:

<<<<<<< SEARCH
console.log(`Webhook server is running on port 8000`);
=======
const logger = pino({ level: 'info' });
logger.info(`Webhook server is running on port 8000`);
>>>>>>> REPLACE

It writes the change directly to src/webhook.ts and commits the code automatically:

Commit 4a8b2d1: Refactored console logs to use pino logging in webhook server.

The editing logic runs inside Aider's lightweight CLI process while Ollama performs the inference, so the tooling itself needs only a tiny fraction of the CPU and memory required by heavy browser-based agent stacks.
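The SEARCH/REPLACE block shown earlier can be read as an exact-match substitution on the file's text. The sketch below illustrates that idea only; it is not Aider's implementation, and the function name is hypothetical.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of search; refuse ambiguous or missing matches."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(search, replace, 1)


original = "const port = 8000;\nconsole.log(`Webhook server is running on port 8000`);\n"
edited = apply_search_replace(
    original,
    "console.log(`Webhook server is running on port 8000`);",
    "const logger = pino({ level: 'info' });\n"
    "logger.info(`Webhook server is running on port 8000`);",
)
print(edited)
```

Requiring a unique match is what makes this edit format workable with small local models: the model only has to reproduce a short snippet verbatim rather than regenerate the whole file.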


Next Steps

We now have a fully functional local AI engine and workflow automation layer running on our server. We understand the limits of CPU-only inference, have deployed Ollama, selected optimized models, and configured a lightweight visual canvas using Flowise.

In our fifth and final installment, Part 5, we will conduct a complete financial and operational analysis. We will calculate the exact electricity cost of running our HP laptop 24/7 in India, compare the total cost of ownership (TCO) over 3 years against commercial cloud providers (AWS, GCP, Hetzner, Hostinger), map our physical service directory, and conclude with a hybrid architecture blueprint.
