Windows Home Server on HP 15s — Part 3: Local AI Stack with Ollama, Open WebUI, and Odysseus

This is the part most people build a home server for in 2026: running large language models locally, with no API costs, no data leaving your network, and no usage limits. Your HP 15s-du2077TU — with its upgraded 16 GB of dual-channel DDR4-2666 RAM — is capable of this. Not at the blazing speed of an RTX 4090, but at a throughput that is useful and free.

Before we write a single command, we need to understand the physics.

1. The Physics of CPU-Only LLM Inference

Why RAM Bandwidth Determines Token Speed

LLM inference in the "decode" phase (generating tokens one at a time) is not compute-bound. It is memory-bandwidth-bound. Here is why:

Every token generated by a model requires the CPU to load the model's weight matrices from RAM into cache, perform matrix-vector multiplications, and write the results back. For a 7B parameter model at 4-bit quantization (~4 GB in size), every single token requires reading most or all of those 4 GB through the memory bus.

The critical formula:

Token Speed (tokens/sec) ≈ Memory Bandwidth (GB/s) / Model Size (GB)

For your HP 15s-du2077TU:
  Dual-channel DDR4-2666: 2 channels × 2666 MT/s × 8 bytes ≈ 42.7 GB/s (theoretical)
  Real-world effective bandwidth: ~70–80% of peak ≈ 30–34 GB/s

With Phi-3.5 Mini (3.8B params, Q4_K_M ≈ 2.2 GB):
  Estimated speed: 30 GB/s ÷ 2.2 GB ≈ 13.6 tokens/sec

With Qwen2.5 3B (Q4_K_M ≈ 1.9 GB):
  Estimated speed: 30 GB/s ÷ 1.9 GB ≈ 15.8 tokens/sec

With Llama 3.1 8B (Q4_K_M ≈ 4.7 GB):
  Estimated speed: 30 GB/s ÷ 4.7 GB ≈ 6.4 tokens/sec

With Llama 3.1 70B (Q4_K_M ≈ 42 GB — IMPOSSIBLE on 16 GB):
  Would require ≥ 48 GB RAM to load. Skip.

These are theoretical estimates. Real-world speeds are affected by:

Context length (a longer context means a larger KV cache to read on every token)
CPU core frequency (turbo boost at 3.6 GHz vs. sustained 1.8 GHz under thermal throttling)
Whether the integrated GPU, which shares system RAM, is competing for memory bandwidth
Windows background processes consuming RAM bus bandwidth
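The arithmetic above can be reproduced with a short Python sketch. It is only a back-of-the-envelope calculator: the function names are illustrative, and the 30 GB/s effective bandwidth is the conservative end of the range assumed above, not a measured value.

```python
def theoretical_bandwidth_gbs(channels: int, mega_transfers: float, bus_bytes: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers * bus_bytes / 1000.0


def estimated_tokens_per_sec(effective_bandwidth_gbs: float, model_size_gb: float) -> float:
    """Upper-bound decode speed, assuming each token streams the whole model once."""
    return effective_bandwidth_gbs / model_size_gb


if __name__ == "__main__":
    peak = theoretical_bandwidth_gbs(channels=2, mega_transfers=2666)
    print(f"Theoretical peak: {peak:.1f} GB/s")

    effective = 30.0  # assumed effective bandwidth in GB/s (about 70% of peak)
    for name, size_gb in [("Phi-3.5 Mini", 2.2), ("Qwen2.5 3B", 1.9), ("8B model", 4.7)]:
        speed = estimated_tokens_per_sec(effective, size_gb)
        print(f"{name}: ~{speed:.1f} tokens/sec")
```

Treat the results as ceilings rather than predictions; the factors listed above all push real throughput lower.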

Model Size Guide for 16 GB RAM

Your 16 GB RAM is shared between Windows (~3 GB), WSL2 overhead (~200 MB), Docker services (~2–3 GB), and the AI stack. In practice, Ollama can use:

Total RAM:          16 GB
Windows (idle):    - 2.8 GB
WSL2 OS:           - 0.2 GB
Docker services:   - 2.5 GB (Part 2 stack: NC + VW + n8n + Kuma + Portainer)
─────────────────────────────
Available for AI:  ~10.5 GB

Model selection based on available RAM (approximate loaded sizes):

Model	Params	Quant	RAM Need	Fit?	Est. Speed
`qwen2.5:1.5b`	1.5B	Q4_K_M	0.9 GB	✅ Yes	~20–30 t/s
`llama3.2:1b`	1B	Q4_K_M	0.6 GB	✅ Yes	~30+ t/s
`phi3.5:3.8b`	3.8B	Q4_K_M	2.2 GB	✅ Yes	~12–15 t/s
`qwen2.5:3b`	3B	Q4_K_M	1.9 GB	✅ Yes	~14–17 t/s
`qwen2.5-coder:3b`	3B	Q4_K_M	1.9 GB	✅ Yes	~14–17 t/s
`mistral:7b`	7B	Q4_K_M	4.1 GB	✅ Yes	~7–9 t/s
`llama3.1:8b`	8B	Q4_K_M	4.7 GB	✅ Yes	~6–8 t/s
`qwen2.5:7b`	7B	Q4_K_M	4.4 GB	✅ Yes	~7–9 t/s
`codellama:13b`	13B	Q4_K_M	7.9 GB	⚠️ Tight	~3–4 t/s
`llama3.1:70b`	70B	Q4_K_M	~42 GB	❌ No	N/A
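A quick way to sanity-check the "Fit?" column is to subtract the reserved memory from the total and compare each model against what is left. The sketch below uses the figures from the breakdown above; the 1 GB headroom for KV cache and runtime buffers is an assumption of this example, not a number from Ollama.

```python
TOTAL_RAM_GB = 16.0

# Figures from the RAM breakdown above.
RESERVED_GB = {
    "Windows (idle)": 2.8,
    "WSL2 OS": 0.2,
    "Docker services": 2.5,
}


def available_for_ai(total_gb: float = TOTAL_RAM_GB) -> float:
    """RAM left for models after the OS and existing containers."""
    return round(total_gb - sum(RESERVED_GB.values()), 1)


def fits(model_size_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the model plus an assumed headroom fits in the available RAM."""
    return model_size_gb + headroom_gb <= available_for_ai()


if __name__ == "__main__":
    print(f"Available for AI: ~{available_for_ai()} GB")
    for tag, size_gb in [("phi3.5:3.8b", 2.2), ("llama3.1:8b", 4.7),
                         ("codellama:13b", 7.9), ("llama3.1:70b", 42.0)]:
        print(f"{tag}: {'fits' if fits(size_gb) else 'does not fit'}")
```

Note that this checks against total system RAM only. A container memory limit lower than the available RAM is a separate, stricter ceiling.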

Recommended starting model: phi3.5:3.8b — excellent reasoning, coding ability, and multilingual support at a size that feels responsive on this hardware.

For coding specifically: qwen2.5-coder:3b, which is trained specifically for code generation and can hold its own against larger general-purpose models on many coding tasks.

The Thermal Constraint

The i5-1035G1 has a 15 W TDP. During LLM inference, which is memory-bound, the CPU rarely approaches thermal limits because the bottleneck is the memory controller, not the ALUs. However, if you are simultaneously running the full Docker stack (Nextcloud, n8n, etc.) AND inference, watch temperatures.

Inside WSL2 Ubuntu:

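# Install temperature monitoring
sudo apt update && sudo apt install lm-sensors -y
sudo sensors-detect  # Press Enter for all questions
sensors  # View current temperatures
# Note: WSL2 runs inside a lightweight VM and often does not expose the
# host's hardware sensors. If `sensors` reports nothing, read temperatures
# with a Windows-side monitoring tool instead.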

Safe operating range: CPU Package < 75°C under sustained inference. Warning threshold: > 85°C. At 100°C the CPU throttles to prevent damage.

2. Ollama — The Local LLM Inference Engine

Ollama is an open-source runtime for local LLM inference. It provides:

A simple native REST API (POST /api/generate, /api/chat), plus OpenAI-compatible endpoints under /v1 (for example /v1/chat/completions)
Automatic model downloading and management
Multi-model support with hot-swapping
Modelfile support for model customization

2.1 Add Ollama to Docker Compose

Open your compose file and add the Ollama service:

nano ~/server/docker-compose.yml

Add below the existing services:

  # ═══════════════════════════════════════════════════════════════
  # OLLAMA — LOCAL LLM INFERENCE ENGINE
  # Provides an OpenAI-compatible REST API for local models
  # Models stored on NVMe SSD for fast loading
  # Access: localhost:11434 (NOT exposed publicly — internal only)
  # ═══════════════════════════════════════════════════════════════
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"  # ONLY localhost — no public exposure
    environment:
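      # Unload model after 5 minutes of inactivity (frees RAM)
      # 5m is also Ollama's default; set it explicitly so it is easy to tune
      OLLAMA_KEEP_ALIVE: "5m"
      # Number of threads to use for inference
      # Physical cores (4) often outperform hyperthreads for LLM workloads
      # Caution: this variable may not be honored by the server; the
      # documented way to set threads is the "num_thread" request option
      # or "PARAMETER num_thread" in a Modelfile
      OLLAMA_NUM_THREADS: "4"
      # Maximum number of requests allowed to wait in the queue
      # (parallel requests per model are controlled by OLLAMA_NUM_PARALLEL)
      OLLAMA_MAX_QUEUE: "3"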
    volumes:
      # Store model weights on NVMe SSD for fast access
      - ./ollama:/root/.ollama
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 8G      # Allow up to 8 GB for model + runtime
          cpus: "4.0"     # All 4 physical cores

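Why OLLAMA_KEEP_ALIVE matters: Ollama keeps a model in memory for a while after the last request so that follow-up prompts skip the load time. Five minutes is also Ollama's default, so setting it explicitly mainly documents the behaviour and makes it easy to tune. Be careful with long or infinite values (-1 keeps the model loaded indefinitely): if you load a 4.7 GB model and then stop chatting, those 4.7 GB remain consumed and other services (Nextcloud, n8n) may trigger OOM kills. With the 5-minute timeout, Ollama unloads the model weights and returns memory to the system.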
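Keep-alive can also be set per request through the `keep_alive` field of Ollama's native API, which is handy when you want to free memory right away without restarting the container. The sketch below only builds and sends the request bodies; the helper names are illustrative, and the URL assumes the localhost port mapping from the compose file above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed: the port mapping shown above


def unload_request(model: str) -> dict:
    """Body asking Ollama to unload `model` immediately (keep_alive of 0)."""
    return {"model": model, "keep_alive": 0, "stream": False}


def pin_request(model: str, keep_alive: str = "30m") -> dict:
    """Body that loads `model` without a prompt and keeps it for `keep_alive`."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def post_generate(body: dict, base_url: str = OLLAMA_URL) -> dict:
    """POST a body to /api/generate and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())
```

With the container running, `post_generate(unload_request("phi3.5"))` should release the model's memory immediately, and `post_generate(pin_request("phi3.5"))` preloads it before a longer session.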

2.2 Deploy Ollama and Pull Models

# Start Ollama
docker compose up -d ollama

# Wait for it to initialize
sleep 10

# Verify the API is responding
curl http://localhost:11434/api/version
# Should return: {"version":"0.x.x"}

# Pull your first model (Phi-3.5 Mini — excellent all-rounder)
docker exec ollama ollama pull phi3.5

# Pull a coding-specialized model
docker exec ollama ollama pull qwen2.5-coder:3b

# Pull a conversational model for quick responses
docker exec ollama ollama pull llama3.2:1b

# List installed models
docker exec ollama ollama list

Model files are downloaded to ~/server/ollama/models/ inside WSL2 (on your NVMe SSD). Phi-3.5 is ~2.2 GB; plan for 5–10 GB of model storage depending on which models you install.

2.3 Test Inference via REST API

# Test inference — first token will be slow (model loading)
# Subsequent tokens much faster (model already in memory)
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3.5",
    "prompt": "Explain how Cloudflare Tunnel works in one paragraph.",
    "stream": false
  }'

# Test the OpenAI-compatible chat endpoint
curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3.5",
    "messages": [
      {"role": "user", "content": "Why is a 256 GB NVMe SSD better than a SATA SSD?"}
    ]
  }'
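The non-streaming reply from /api/generate includes timing fields (`eval_count`, `eval_duration`, and `load_duration`, with durations in nanoseconds), so you can compare real throughput against the bandwidth estimate from Section 1. A minimal sketch follows; the sample values are made up for illustration and are not a measurement from this hardware.

```python
def decode_tokens_per_sec(response: dict) -> float:
    """Generated tokens per second, from Ollama's timing fields (nanoseconds)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def load_seconds(response: dict) -> float:
    """Time spent loading the model, in seconds (0 if the field is absent)."""
    return response.get("load_duration", 0) / 1e9


# Illustrative values only, shaped like a reply from /api/generate.
sample = {
    "eval_count": 130,
    "eval_duration": 10_000_000_000,
    "load_duration": 2_500_000_000,
}
print(f"{decode_tokens_per_sec(sample):.1f} tokens/sec, load {load_seconds(sample):.1f} s")
```

To use it on a real reply, save the curl output to a file and pass the result of `json.load` to these functions.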

3. Open WebUI — ChatGPT-Style Interface for Your Local Models

Open WebUI provides a polished, feature-rich web interface similar in feel to ChatGPT or Claude. It connects to Ollama's API and provides:

Multi-model chat with model switching
Conversation history with search
File upload for document analysis (RAG)
Custom system prompts and "characters"
API key management for sharing access
Image generation (if using compatible models)

3.1 Add Open WebUI to Docker Compose

  # ═══════════════════════════════════════════════════════════════
  # OPEN WEBUI — AI CHAT INTERFACE
  # ChatGPT-like UI connecting to your local Ollama instance
  # Access: ai.yourdomain.com
  # ═══════════════════════════════════════════════════════════════
  open_webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open_webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "127.0.0.1:8084:8080"
    environment:
      # Connect to Ollama using Docker service name (internal network)
      OLLAMA_BASE_URL: "http://ollama:11434"
      # Generate with: openssl rand -hex 32
      WEBUI_SECRET_KEY: "replace_with_secure_random_string"
      # Disable user registration after your account is created
      ENABLE_SIGNUP: "true"  # Change to "false" after setup
    volumes:
      - ./open-webui:/app/backend/data
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

docker compose up -d open_webui
sleep 15

# Access it at http://localhost:8084
# Or through Cloudflare Tunnel at https://ai.yourdomain.com

On first access:

Create your admin account.
Go to Settings → Connections → verify Ollama URL is set to http://ollama:11434.
Click Refresh next to "Ollama" — you should see your models listed.
Under Settings → Users → set "Default User Role" to user (not admin) for security.

4. Odysseus AI — The Autonomous Agent Workspace

Odysseus is the most powerful tool in this stack. Unlike Open WebUI, which provides a chat interface, Odysseus operates as a full AI workspace: the LLM can take multi-step actions such as searching the web, reading and writing files, managing your calendar, drafting and sending emails via your SMTP server, and executing code.

Think of it as giving your local AI model hands and tools.

4.1 Feature Overview

Category	Odysseus Capabilities
LLM Backends	Ollama (local), llama.cpp, OpenAI API, OpenRouter
Agent Loops	Multi-step planning, web search, shell execution, file I/O
Email	IMAP/SMTP integration, AI-drafting, automated triage
Calendar	CalDAV sync (Nextcloud compatible), task scheduling
Documents	Markdown editor, HTML, CSV, AI-assisted writing
Memory	Persistent ChromaDB vector memory across sessions
Model Picker	Hardware-aware model recommendations

4.2 Add Odysseus to Docker Compose

Odysseus requires ChromaDB as its vector memory backend:

  # ═══════════════════════════════════════════════════════════════
  # CHROMA — VECTOR DATABASE (Odysseus memory backend)
  # Stores agent memories and enables semantic search
  # ═══════════════════════════════════════════════════════════════
  chroma:
    image: chromadb/chroma:latest
    container_name: chroma
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - ./chroma:/chroma/chroma
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "0.5"

  # ═══════════════════════════════════════════════════════════════
  # ODYSSEUS AI — AUTONOMOUS AGENT WORKSPACE
  # Full AI workspace with multi-step agents, email, calendar tools
  # Access: odysseus.yourdomain.com
  # ═══════════════════════════════════════════════════════════════
  odysseus:
    image: ghcr.io/pewdiepie-archdaemon/odysseus:latest
    container_name: odysseus
    restart: unless-stopped
    depends_on:
      - ollama
      - chroma
    ports:
      - "127.0.0.1:7000:7000"
    environment:
      # Point to local Ollama (Docker internal network)
      OLLAMA_HOST: "http://ollama:11434"
      # Point to ChromaDB for persistent memory
      CHROMA_HOST: "http://chroma:8000"
      # Public URL for the workspace (for link generation)
      ODYSSEUS_BASE_URL: "https://odysseus.yourdomain.com"
      # Default model for autonomous agent loops
      # Phi-3.5 is fast enough for agent steps on this hardware
      ODYSSEUS_DEFAULT_MODEL: "phi3.5"
      # Enable memory persistence
      MEMORY_ENABLED: "true"
      # Secret key (generate: openssl rand -hex 32)
      SECRET_KEY: "replace_with_secure_random_string"
    volumes:
      - ./odysseus:/app/data
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

4.3 Deploy and Configure Odysseus

# Start ChromaDB first
docker compose up -d chroma
sleep 5

# Start Odysseus
docker compose up -d odysseus
sleep 20

# Check startup logs
docker compose logs --tail=30 odysseus

Initial Setup:

Access https://odysseus.yourdomain.com (or http://localhost:7000).

Check logs for the initial admin credentials:

docker compose logs odysseus | grep -i "password\|credentials\|token"

Connect to Local Ollama:

In Odysseus Settings → LLM Backends → Add Backend
Type: Ollama
URL: http://ollama:11434 (use Docker service name, not localhost)
Test connection → select available models

Configure the Model Cookbook (Hardware Recommendations): Odysseus will scan your hardware and recommend the best models. For the i5-1035G1 with 16 GB RAM, it typically recommends:

Primary: phi3.5 (best balance of quality/speed)
Fast: llama3.2:1b (for quick queries and agent intermediate steps)
Coding: qwen2.5-coder:3b (for code generation tasks)

Connect Email (Optional):

Settings → Integrations → Email
IMAP server: your mail provider's settings
This enables the agent to read and draft emails on your behalf

4.4 Sample Agent Use Cases

Once Odysseus is running with Ollama as its backend, you can assign it complex multi-step tasks:

Example 1: Research + Summary

"Search the web for the latest developments in India's 5G rollout,
 summarize the key points, and save the summary to a markdown file
 called 5g-india-research.md"

The agent will: (1) perform web searches, (2) read multiple pages, (3) synthesize the information, (4) write the file to your storage.

Example 2: Email Triage

"Check my inbox, find all emails about unpaid invoices from the last
 week, draft polite follow-up replies, and show me for approval before
 sending."

Example 3: Scheduled Automation

"Every Monday morning at 8 AM, check the news about stock markets
 and cryptocurrency, generate a brief summary, and post it to my
 Nextcloud notes."

5. Flowise — Visual AI Pipeline Builder

Flowise is a drag-and-drop tool for building AI pipelines visually. It is the fastest way to create:

Custom chatbots with retrieval-augmented generation (RAG)
Document Q&A systems (upload PDFs, get answers)
Custom API endpoints powered by your local models
Multi-step AI workflows without writing code

5.1 Add Flowise to Docker Compose

  # ═══════════════════════════════════════════════════════════════
  # FLOWISE — VISUAL AI PIPELINE BUILDER
  # Drag-and-drop LangChain/LlamaIndex workflow designer
  # Access: flowise.yourdomain.com
  # ═══════════════════════════════════════════════════════════════
  flowise:
    image: flowiseai/flowise:latest
    container_name: flowise
    restart: unless-stopped
    ports:
      - "127.0.0.1:8083:3000"
    environment:
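      # Admin credentials (change the password before exposing the service)
      FLOWISE_USERNAME: admin
      FLOWISE_PASSWORD: changeme_flowise_password
      # Where Flowise stores the encryption key for saved credentials
      SECRETKEY_PATH: /root/.flowise
      # Optional fixed encryption key used instead of the stored one.
      # This is the key itself, not a boolean. Generate with: openssl rand -hex 32
      FLOWISE_SECRETKEY_OVERWRITE: "replace_with_secure_random_string"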
    volumes:
      - ./flowise:/root/.flowise
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 768M
          cpus: "2.0"

docker compose up -d flowise

Build a Document Q&A Pipeline in Flowise:

Access https://flowise.yourdomain.com
Create a new Chatflow
Drag in these nodes:
- Ollama node → configure: URL http://ollama:11434, model phi3.5
- PDF File document loader (upload your documents)
- Recursive Character Text Splitter (chunk size: 1000, overlap: 100)
- Chroma vector store → URL http://chroma:8000
- Ollama Embeddings → same URL
- Conversational Retrieval QA Chain
Connect: PDF → Splitter → Chroma ← Embeddings; Ollama → QA Chain ← Chroma → Chat output
Save and test: upload a PDF (e.g., your electricity bill, a research paper), then ask questions about its contents
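To see what the chunk size and overlap settings in the splitter node mean, here is a deliberately simplified fixed-width splitter in Python. It is not Flowise's or LangChain's recursive algorithm (which also tries to break on paragraph and sentence boundaries); it only illustrates how a 1000-character window with a 100-character overlap walks through a document.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    document = "".join(str(i % 10) for i in range(2500))
    pieces = split_with_overlap(document)
    print([len(piece) for piece in pieces])
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk, which tends to help retrieval quality at the cost of storing some text twice.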

6. Qdrant — Production Vector Database for RAG

Qdrant is a more production-grade vector database than ChromaDB, designed for high-performance similarity search. Use it when your document corpus grows beyond a few hundred documents.

6.1 Add Qdrant to Docker Compose

  # ═══════════════════════════════════════════════════════════════
  # QDRANT — VECTOR SEARCH DATABASE
  # High-performance vector database for RAG pipelines
  # REST API on port 6333, gRPC on 6334
  # ═══════════════════════════════════════════════════════════════
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    restart: unless-stopped
    ports:
      - "127.0.0.1:6333:6333"   # REST API
      - "127.0.0.1:6334:6334"   # gRPC API
    volumes:
      - ./qdrant/storage:/qdrant/storage
      - ./qdrant/config:/qdrant/config
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

docker compose up -d qdrant

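# Test the REST API (the root endpoint reports name and version)
curl http://localhost:6333/
# Should return JSON containing: "title":"qdrant - vector search engine","version":"x.x.x"
# Liveness probe: curl http://localhost:6333/healthz  (returns a short plain-text message)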

# Create a collection for document embeddings
curl -X PUT http://localhost:6333/collections/documents \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    }
  }'
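The collection above is configured for 768-dimensional vectors compared by cosine similarity. The sketch below shows what that metric computes and how a point could be shaped for Qdrant's REST upsert; the helper names are illustrative, and the size of 768 simply mirrors the collection definition, so it must match whatever embedding model you actually use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 same direction, 0 orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def upsert_body(point_id: int, vector: list[float], payload: dict,
                expected_size: int = 768) -> dict:
    """Request body for an upsert, rejecting vectors of the wrong dimension."""
    if len(vector) != expected_size:
        raise ValueError(f"expected {expected_size} dimensions, got {len(vector)}")
    return {"points": [{"id": point_id, "vector": list(vector), "payload": payload}]}


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A body built this way would be sent with PUT to /collections/documents/points. Checking the dimension on the client side gives a clearer error than waiting for the server to reject the request.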

7. Complete AI Stack Resource Budget

Now let us verify that all AI services fit within our RAM envelope:

RAM Budget with Full AI Stack Running
═════════════════════════════════════════════════════════
Windows 10 (base processes, Explorer, Defender):  ~2.8 GB
WSL2 Linux kernel + Ubuntu OS:                    ~0.2 GB
─────────────────────────────────────────────────────────
Docker Services (Part 2):
  Caddy:                                          ~0.1 GB
  Portainer:                                      ~0.1 GB
  MariaDB (Nextcloud):                            ~0.3 GB
  Nextcloud:                                      ~0.4 GB
  Vaultwarden:                                    ~0.1 GB
  Uptime Kuma:                                    ~0.1 GB
  n8n:                                            ~0.3 GB
─────────────────────────────────────────────────────────
AI Stack (Part 3):
  Ollama daemon (no model loaded):                ~0.1 GB
  Open WebUI:                                     ~0.2 GB
  Odysseus workspace:                             ~0.2 GB
  ChromaDB:                                       ~0.2 GB
  Flowise:                                        ~0.3 GB
  Qdrant:                                         ~0.2 GB
─────────────────────────────────────────────────────────
Subtotal (no model loaded):                       ~5.6 GB
─────────────────────────────────────────────────────────
When phi3.5 is loaded (inference):                +2.2 GB
  Total with phi3.5:                              ~7.8 GB

When llama3.1:8b is loaded (inference):           +4.7 GB
  Total with 8B model:                            ~10.3 GB
─────────────────────────────────────────────────────────
16 GB Total Available
Headroom (with 8B model):                         ~5.7 GB ✅
═════════════════════════════════════════════════════════

All services fit comfortably. The OLLAMA_KEEP_ALIVE=5m setting ensures the model is evicted after 5 minutes of inactivity, returning 2–5 GB to the system.
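The budget table can be re-added in a few lines of Python, which makes it easy to plug in your own measurements from `docker stats`. The figures below are copied from the table; the dictionary layout is just one convenient way to organise them.

```python
TOTAL_GB = 16.0

BASE = {"Windows 10": 2.8, "WSL2 + Ubuntu": 0.2}
PART2 = {"Caddy": 0.1, "Portainer": 0.1, "MariaDB": 0.3, "Nextcloud": 0.4,
         "Vaultwarden": 0.1, "Uptime Kuma": 0.1, "n8n": 0.3}
PART3 = {"Ollama daemon": 0.1, "Open WebUI": 0.2, "Odysseus": 0.2,
         "ChromaDB": 0.2, "Flowise": 0.3, "Qdrant": 0.2}


def subtotal_gb() -> float:
    """RAM used by everything except a loaded model."""
    groups = (BASE, PART2, PART3)
    return round(sum(sum(group.values()) for group in groups), 1)


def total_with_model(model_gb: float) -> float:
    """RAM used once a model of the given size is loaded."""
    return round(subtotal_gb() + model_gb, 1)


def headroom_gb(model_gb: float) -> float:
    """RAM left over with the given model loaded."""
    return round(TOTAL_GB - total_with_model(model_gb), 1)


if __name__ == "__main__":
    print(f"Subtotal: {subtotal_gb()} GB")
    print(f"With phi3.5: {total_with_model(2.2)} GB")
    print(f"With 8B model: {total_with_model(4.7)} GB, headroom {headroom_gb(4.7)} GB")
```

Keep in mind that the model figures cover weights only. The KV cache for long contexts comes on top, so real headroom is somewhat smaller.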

8. Selecting the Right Model for the Right Task

Not every task needs the most powerful model. Using a 1B model for a quick lookup is significantly faster than loading an 8B model:

Use Case	Recommended Model	Why
Quick Q&A, simple tasks	`llama3.2:1b`	~30+ t/s, instant responses
General reasoning, writing	`phi3.5:3.8b`	Best quality/speed balance
Code generation, review	`qwen2.5-coder:3b`	Trained specifically for code
Complex analysis, long tasks	`qwen2.5:7b`	Higher quality, patience needed
Hindi/multilingual tasks	`qwen2.5:3b`	Strong multilingual support
Agent planning steps	`phi3.5:3.8b`	Good tool-use understanding
Document summarization	`mistral:7b`	Long context window support
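If you call Ollama from your own scripts, the table translates naturally into a small lookup. The task keys below are hypothetical labels invented for this sketch; the model tags are the ones from the table.

```python
# Hypothetical task labels mapped to the model tags recommended above.
MODEL_FOR_TASK = {
    "quick_qa": "llama3.2:1b",
    "general": "phi3.5:3.8b",
    "code": "qwen2.5-coder:3b",
    "analysis": "qwen2.5:7b",
    "multilingual": "qwen2.5:3b",
    "agent": "phi3.5:3.8b",
    "summarize": "mistral:7b",
}


def pick_model(task: str, default: str = "phi3.5:3.8b") -> str:
    """Return the recommended model tag for a task, or the default if unknown."""
    return MODEL_FOR_TASK.get(task, default)


if __name__ == "__main__":
    for task in ("quick_qa", "code", "translation"):
        print(f"{task}: {pick_model(task)}")
```

Falling back to the general-purpose model for unknown tasks keeps a script working when a new task type appears, at the cost of possibly not using the best-suited model.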

Creating Custom Model Configurations

Ollama's Modelfile syntax lets you customize model behavior:

# Create a custom system prompt for a coding assistant
docker exec -it ollama sh -c "cat > /tmp/CodingAssistant << 'EOF'
FROM qwen2.5-coder:3b
SYSTEM \"\"\"You are an expert software engineer. You write clean, modular,
well-commented code. You always follow SOLID principles. You prefer
functional patterns where appropriate. When showing code, always include
comments explaining non-obvious logic. Format all code in proper markdown
code blocks with the language specified. Ask clarifying questions if the
requirements are ambiguous before writing any code.\"\"\"
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF
ollama create coding-assistant -f /tmp/CodingAssistant"

# Verify the new model appears
docker exec ollama ollama list

# Test it
curl -X POST http://localhost:11434/api/chat \
  -d '{"model":"coding-assistant","messages":[{"role":"user","content":"Write a Python function to validate an Indian phone number"}],"stream":false}'

9. Performance Optimization Tips

Maximize Inference Speed

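# Set CPU governor to performance mode inside WSL2
# (Note: WSL2 usually does not expose cpufreq, so this path may not exist;
# in that case CPU frequency is governed by the Windows power plan.
# Even where it works, it may not persist across WSL2 restarts.)
if [ -d /sys/devices/system/cpu/cpu0/cpufreq ]; then
  echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
fi

# Monitor CPU frequency during inference (only if cpufreq is available)
watch -n 1 "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"
# Values are in kHz: aim to see values near 3600000 (3.6 GHz turbo boost)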

# On Windows side: ensure CPU is at max performance during inference
# (The "Server - Always On" power plan from Part 1 handles this)
# Verify no thermal throttling via Windows Task Manager:
# Task Manager → Performance → CPU → check if clock speed is at max

Reduce Context Window for Speed

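Every token in the context window adds to the KV cache, which must be held in RAM and read during each decoding step, so a larger context increases both memory use and memory traffic. For quick tasks, reduce the context: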

# Test with a smaller context window (faster for short conversations)
curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "phi3.5",
    "prompt": "Summarize this in 2 sentences.",
    "options": {
      "num_ctx": 2048,
      "num_thread": 4,
      "num_batch": 512
    },
    "stream": false
  }'
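To get a feel for how much memory the context window can claim, the sketch below estimates the KV cache size with the standard formula (two tensors per layer, keys and values, each of KV heads × head dimension × context length). The Phi-3.5 Mini dimensions used here (32 layers, 32 KV heads, head dimension 96) and the 16-bit cache precision are assumptions of this example; check the model card and your Ollama settings before relying on the numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_value: int = 2) -> int:
    """Size of a full KV cache: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


def to_mib(n_bytes: int) -> float:
    """Convert bytes to MiB."""
    return n_bytes / 2**20


if __name__ == "__main__":
    # Assumed architecture values for Phi-3.5 Mini (verify against the model card).
    layers, kv_heads, head_dim = 32, 32, 96
    for ctx in (2048, 4096, 8192):
        size = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
        print(f"num_ctx={ctx}: ~{to_mib(size):.0f} MiB of KV cache")
```

Under these assumptions, halving `num_ctx` halves the cache, which is why a smaller context helps on a machine where both RAM and bandwidth are scarce.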

Monitor Real-Time Resource Usage

# Inside WSL2 — watch memory and CPU during inference
watch -n 2 "docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'"
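If you want to act on these numbers from a script (for example, to warn when a container nears its limit), the memory column that `docker stats` prints as "used / limit" can be parsed as follows. This is a minimal sketch: the function names are illustrative, and only the units listed in the table are handled.

```python
import re

# Unit suffixes as printed by docker stats, mapped to bytes.
_UNIT_BYTES = {
    "B": 1,
    "kB": 1000, "MB": 1000**2, "GB": 1000**3,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
}


def parse_size(text: str) -> float:
    """Convert a string such as '412.3MiB' into a number of bytes."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*", text)
    if match is None or match.group(2) not in _UNIT_BYTES:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNIT_BYTES[match.group(2)]


def memory_fraction(mem_usage: str) -> float:
    """Fraction of the limit in use, from a 'used / limit' string."""
    used, limit = (parse_size(part) for part in mem_usage.split("/"))
    return used / limit


if __name__ == "__main__":
    print(f"{memory_fraction('4.7GiB / 8GiB'):.0%} of the container limit in use")
```

A value close to 1.0 for the ollama container suggests that the loaded model plus its context is about to hit the memory limit set in the compose file.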

Summary and What's Next

In Part 3, you have:

✅ Understood the physics of CPU-only LLM inference — memory bandwidth math and why dual-channel DDR4-2666 is the limiting factor.
✅ Deployed Ollama with optimized environment variables (KEEP_ALIVE, NUM_THREADS) and pulled recommended models for 16 GB hardware.
✅ Configured Open WebUI as a polished chat interface connecting to your local Ollama.
✅ Deployed Odysseus AI workspace with autonomous agents, persistent memory via ChromaDB, and multi-step tool execution.
✅ Set up Flowise for visual AI pipeline building with a working document Q&A example.
✅ Deployed Qdrant as a production-grade vector database for RAG.
✅ Verified the complete RAM budget — all services run comfortably within 16 GB with headroom.

In Part 4, we tackle operations: Task Scheduler automation edge cases, PowerShell health monitoring with Telegram alerts, handling Windows sleep resume (clock drift fix), DNS resolution failures, WSL2 OOM recovery, and configuring the laptop for seamless dual-use as both a daily driver and a background server.

Continue to Part 4: Automation, Monitoring, and Dual-Use Operations →