Windows Home Server on HP 15s — Part 3: Local AI Stack with Ollama, Open WebUI, and Odysseus
This is the part most people build a home server for in 2026: running large language models locally, with no API costs, no data leaving your network, and no usage limits. Your HP 15s-du2077TU — with its upgraded 16 GB of dual-channel DDR4-2666 RAM — is capable of this. Not at the blazing speed of an RTX 4090, but at a throughput that is useful and free.
Before we write a single command, we need to understand the physics.
1. The Physics of CPU-Only LLM Inference
Why RAM Bandwidth Determines Token Speed
LLM inference in the "decode" phase (generating tokens one at a time) is not compute-bound. It is memory-bandwidth-bound. Here is why:
Every token generated by a model requires the CPU to load the model's weight matrices from RAM into cache, perform matrix-vector multiplications, and write the results back. For a 7B parameter model at 4-bit quantization (~4 GB in size), every single token requires reading most or all of those 4 GB through the memory bus.
The critical formula:
Token Speed (tokens/sec) = Memory Bandwidth (GB/s) / Model Size (GB)
For your HP 15s-du2077TU:
Dual-channel DDR4-2666: 2 channels × 2666 MT/s × 8 bytes ≈ 42.7 GB/s (theoretical)
Real-world effective bandwidth: ~70–80% = ~30–34 GB/s
With a Phi-3.5 Mini (3.8B params, Q4_K_M ≈ 2.2 GB):
Estimated speed: 30 GB/s ÷ 2.2 GB ≈ 14 tokens/sec
With Qwen2.5 3B (Q4_K_M ≈ 1.9 GB):
Estimated speed: 30 GB/s ÷ 1.9 GB ≈ 16 tokens/sec
With Llama 3.1 8B (Q4_K_M ≈ 4.7 GB):
Estimated speed: 30 GB/s ÷ 4.7 GB ≈ 6 tokens/sec
With Llama 3.1 70B (Q4_K_M ≈ 42 GB — IMPOSSIBLE on 16 GB):
Would require ≥ 48 GB RAM to load. Skip.
These are theoretical estimates. Real-world speeds are affected by:
- Context length: the KV cache grows with the context window and is read on every decode step
- CPU core frequency (turbo boost at 3.6 GHz vs. sustained 1.8 GHz under thermal throttling)
- Whether the integrated GPU, which shares system RAM, is competing for memory bandwidth
- Windows background processes consuming RAM bus bandwidth
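The bandwidth formula above can be sketched in a few lines of Python. This is only a rough upper-bound calculator: the function names and the 70% efficiency factor are assumptions for illustration, and the model sizes are the ones quoted in this section.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: float, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB)."""
    return channels * mega_transfers * bus_bytes / 1000.0


def estimate_tokens_per_sec(model_gb: float, effective_gbs: float) -> float:
    """Upper-bound decode speed: each token streams the whole model from RAM once."""
    return effective_gbs / model_gb


if __name__ == "__main__":
    peak = peak_bandwidth_gbs(channels=2, mega_transfers=2666)
    effective = 0.7 * peak  # assumed 70% real-world efficiency
    print(f"peak: {peak:.1f} GB/s, effective: {effective:.1f} GB/s")
    # Model sizes (GB) as quoted in this section
    for name, size_gb in [("phi3.5", 2.2), ("qwen2.5:3b", 1.9), ("llama3.1:8b", 4.7)]:
        speed = estimate_tokens_per_sec(size_gb, effective)
        print(f"{name}: ~{speed:.1f} tokens/sec")
```

Treat the printed numbers as ceilings; the factors listed above all push real throughput lower.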
Model Size Guide for 16 GB RAM
Your 16 GB RAM is shared between Windows (~3 GB), WSL2 overhead (~200 MB), Docker services (~2–3 GB), and the AI stack. In practice, Ollama can use:
Total RAM: 16 GB
Windows (idle): - 2.8 GB
WSL2 OS: - 0.2 GB
Docker services: - 2.5 GB (Part 2 stack: NC + VW + n8n + Kuma + Portainer)
─────────────────────────────
Available for AI: ~10.5 GB
Model selection based on available RAM (approximate loaded sizes):
| Model | Params | Quant | RAM Need | Fit? | Est. Speed |
|---|---|---|---|---|---|
| qwen2.5:1.5b | 1.5B | Q4_K_M | 0.9 GB | ✅ Yes | ~20–30 t/s |
| llama3.2:1b | 1B | Q4_K_M | 0.6 GB | ✅ Yes | ~30+ t/s |
| phi3.5:3.8b | 3.8B | Q4_K_M | 2.2 GB | ✅ Yes | ~12–15 t/s |
| qwen2.5:3b | 3B | Q4_K_M | 1.9 GB | ✅ Yes | ~14–17 t/s |
| qwen2.5-coder:3b | 3B | Q4_K_M | 1.9 GB | ✅ Yes | ~14–17 t/s |
| mistral:7b | 7B | Q4_K_M | 4.1 GB | ✅ Yes | ~7–9 t/s |
| llama3.1:8b | 8B | Q4_K_M | 4.7 GB | ✅ Yes | ~6–8 t/s |
| qwen2.5:7b | 7B | Q4_K_M | 4.4 GB | ✅ Yes | ~7–9 t/s |
| codellama:13b | 13B | Q4_K_M | 7.9 GB | ✅ Yes | ~3–4 t/s |
| llama3.1:70b | 70B | Q4_K_M | ~42 GB | ❌ No | N/A |
Recommended starting model: phi3.5:3.8b — excellent reasoning,
coding ability, and multilingual support at a size that feels responsive
on this hardware.
For coding specifically: qwen2.5-coder:3b, which is trained specifically
for code generation and is often competitive with larger general-purpose models on coding tasks.
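To see how the fit column follows from the RAM budget, here is a minimal sketch. The reserved amounts and model sizes come from this section; the 1 GB safety margin (for KV cache and runtime overhead) and the function names are assumptions, not something Ollama enforces.

```python
TOTAL_GB = 16.0
# Reserved memory as estimated in this section
RESERVED_GB = {"windows": 2.8, "wsl2": 0.2, "docker_services": 2.5}

# Approximate loaded sizes (GB) from the table above
MODELS_GB = {
    "llama3.2:1b": 0.6,
    "phi3.5:3.8b": 2.2,
    "qwen2.5-coder:3b": 1.9,
    "llama3.1:8b": 4.7,
    "codellama:13b": 7.9,
    "llama3.1:70b": 42.0,
}


def available_for_ai(total_gb: float, reserved: dict) -> float:
    """RAM left over once Windows, WSL2 and the Part 2 stack are accounted for."""
    return total_gb - sum(reserved.values())


def models_that_fit(models: dict, budget_gb: float, margin_gb: float = 1.0) -> list:
    """Models whose weights plus an assumed safety margin fit in the budget."""
    return [name for name, size in models.items() if size + margin_gb <= budget_gb]


if __name__ == "__main__":
    budget = available_for_ai(TOTAL_GB, RESERVED_GB)
    print(f"available for AI: ~{budget:.1f} GB")
    print("fits:", ", ".join(models_that_fit(MODELS_GB, budget)))
```

Raising the margin is a quick way to see which models become marginal once you want longer context windows.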
The Thermal Constraint
The i5-1035G1 has a 15 W TDP. During LLM inference, which is memory-bound, the CPU rarely approaches thermal limits because the bottleneck is the memory controller, not the ALUs. However, if you are simultaneously running the full Docker stack (Nextcloud, n8n, etc.) and inference, watch temperatures.
Inside WSL2 Ubuntu:
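# Install temperature monitoring
sudo apt install lm-sensors -y
sudo sensors-detect # Press Enter for all questions
sensors # View current temperatures
# Note: WSL2 runs in a lightweight VM and often exposes no hardware sensors.
# If this prints "No sensors found", read temperatures with a Windows-side tool instead.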
Safe operating range: CPU Package < 75°C under sustained inference. Warning threshold: > 85°C. At 100°C the CPU throttles to prevent damage.
2. Ollama — The Local LLM Inference Engine
Ollama is an open-source runtime for local LLM inference. It provides:
- A simple REST API (POST /api/generate, /api/chat), plus OpenAI-compatible endpoints under /v1
- Automatic model downloading and management
- Multi-model support with hot-swapping
- Modelfile support for model customization
2.1 Add Ollama to Docker Compose
Open your compose file and add the Ollama service:
nano ~/server/docker-compose.yml
Add below the existing services:
# ═══════════════════════════════════════════════════════════════
# OLLAMA — LOCAL LLM INFERENCE ENGINE
# Provides an OpenAI-compatible REST API for local models
# Models stored on NVMe SSD for fast loading
# Access: localhost:11434 (NOT exposed publicly — internal only)
# ═══════════════════════════════════════════════════════════════
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434" # localhost only; no public exposure
    environment:
      # Unload the model after 5 minutes of inactivity (frees RAM).
      # 5m is also Ollama's default; "-1" would keep models loaded indefinitely.
      OLLAMA_KEEP_ALIVE: "5m"
      # Intended thread count: physical cores (4) often outperform hyperthreads
      # for LLM workloads. Ollama's documented thread setting is the num_thread
      # option (per request or in a Modelfile), so this variable may be ignored.
      OLLAMA_NUM_THREADS: "4"
      # Maximum number of requests Ollama queues while busy before rejecting new ones
      # (parallel handling is controlled separately by OLLAMA_NUM_PARALLEL)
      OLLAMA_MAX_QUEUE: "3"
    volumes:
      # Store model weights on NVMe SSD for fast access
      - ./ollama:/root/.ollama
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 8G # Allow up to 8 GB for model + runtime
          cpus: "4.0" # All 4 physical cores
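Why OLLAMA_KEEP_ALIVE=5m matters: Ollama keeps a loaded model in memory for as long as the keep-alive window lasts. If you load a 4.7 GB model and then stop chatting, those 4.7 GB stay consumed until the timer expires, and with a long or infinite keep-alive other services (Nextcloud, n8n) may trigger OOM kills. Five minutes is Ollama's default; setting it explicitly makes the behavior visible and easy to tune. After the timeout, Ollama unloads the model weights and returns the memory to the system.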
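The keep-alive window can also be set per request through the `keep_alive` field of Ollama's native API, which is handy when one long task should hold the model longer than the server-wide setting. The sketch below only builds the request bodies; the helper names are hypothetical, and `post_json` assumes the localhost port published in the compose file.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # port published by the compose file


def build_generate_payload(model: str, prompt: str, keep_alive="5m") -> dict:
    """Body for POST /api/generate with a per-request keep-alive override."""
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep_alive}


def unload_payload(model: str) -> dict:
    """A prompt-less request with keep_alive=0 asks Ollama to unload the model now."""
    return {"model": model, "keep_alive": 0}


def post_json(path: str, payload: dict, timeout: float = 300.0) -> dict:
    """Send a JSON body to the local Ollama API and decode the JSON reply."""
    request = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running Ollama container):
#   post_json("/api/generate", build_generate_payload("phi3.5", "Hello", keep_alive="30m"))
#   post_json("/api/generate", unload_payload("phi3.5"))  # free the RAM immediately
```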
2.2 Deploy Ollama and Pull Models
# Start Ollama
docker compose up -d ollama
# Wait for it to initialize
sleep 10
# Verify the API is responding
curl http://localhost:11434/api/version
# Should return: {"version":"0.x.x"}
# Pull your first model (Phi-3.5 Mini — excellent all-rounder)
docker exec ollama ollama pull phi3.5
# Pull a coding-specialized model
docker exec ollama ollama pull qwen2.5-coder:3b
# Pull a conversational model for quick responses
docker exec ollama ollama pull llama3.2:1b
# List installed models
docker exec ollama ollama list
Model files are downloaded to ~/server/ollama/models/ inside WSL2
(on your NVMe SSD). Phi-3.5 is ~2.2 GB; plan for 5–10 GB of model
storage depending on which models you install.
2.3 Test Inference via REST API
# Test inference — first token will be slow (model loading)
# Subsequent tokens much faster (model already in memory)
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "phi3.5",
"prompt": "Explain how Cloudflare Tunnel works in one paragraph.",
"stream": false
}'
# Test the OpenAI-compatible chat endpoint
curl -X POST http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
  -d '{
    "model": "phi3.5",
    "messages": [
      {"role": "user", "content": "Why is a 256 GB NVMe SSD better than a SATA SSD?"}
    ]
  }'
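To compare real throughput against the estimates from section 1, you can read the timing fields that a non-streaming /api/generate response includes: `eval_count` (tokens generated), `eval_duration` and `load_duration` (both in nanoseconds). A minimal sketch follows; the function names are hypothetical and the sample values are made up for illustration, not measured on this laptop.

```python
def decode_tokens_per_sec(response: dict) -> float:
    """Generated tokens divided by decode time (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def load_seconds(response: dict) -> float:
    """Time spent loading the model before the first token, in seconds."""
    return response.get("load_duration", 0) / 1e9


if __name__ == "__main__":
    # Illustrative values only; substitute the JSON returned by your own request.
    sample = {
        "eval_count": 130,
        "eval_duration": 10_000_000_000,
        "load_duration": 2_500_000_000,
    }
    print(f"decode speed: {decode_tokens_per_sec(sample):.1f} tokens/sec")
    print(f"model load:   {load_seconds(sample):.1f} s")
```

A large `load_duration` on the first request and a near-zero one afterwards confirms the "first token is slow" behavior noted above.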
3. Open WebUI — ChatGPT-Style Interface for Your Local Models
Open WebUI provides a polished, feature-rich web interface similar in feel to ChatGPT or Claude. It connects to Ollama's API and provides:
- Multi-model chat with model switching
- Conversation history with search
- File upload for document analysis (RAG)
- Custom system prompts and "characters"
- API key management for sharing access
- Image generation (if using compatible models)
3.1 Add Open WebUI to Docker Compose
# ═══════════════════════════════════════════════════════════════
# OPEN WEBUI — AI CHAT INTERFACE
# ChatGPT-like UI connecting to your local Ollama instance
# Access: ai.yourdomain.com
# ═══════════════════════════════════════════════════════════════
  open_webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open_webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "127.0.0.1:8084:8080"
    environment:
      # Connect to Ollama using Docker service name (internal network)
      OLLAMA_BASE_URL: "http://ollama:11434"
      # Generate with: openssl rand -hex 32
      WEBUI_SECRET_KEY: "replace_with_secure_random_string"
      # Disable user registration after your account is created
      ENABLE_SIGNUP: "true" # Change to "false" after setup
    volumes:
      - ./open-webui:/app/backend/data
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
docker compose up -d open_webui
sleep 15
# Access it at http://localhost:8084
# Or through Cloudflare Tunnel at https://ai.yourdomain.com
On first access:
- Create your admin account.
- Go to Settings → Connections and verify the Ollama URL is set to http://ollama:11434.
- Click Refresh next to "Ollama"; you should see your models listed.
- Under Settings → Users, set "Default User Role" to user (not admin) for security.
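The compose file asks for a random value in WEBUI_SECRET_KEY. If openssl is not at hand, Python's standard `secrets` module produces an equivalent 64-character hex string. This is a small sketch; the function name is hypothetical.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically strong randomness as a hex string."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste the printed value into the compose file in place of the placeholder.
    print(generate_secret_key())
```

Generate a separate key for each service rather than reusing one value.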
4. Odysseus AI — The Autonomous Agent Workspace
Odysseus is the most powerful tool in this stack. Unlike Open WebUI, which provides a chat interface, Odysseus operates as a full AI workspace in which the LLM can take multi-step actions: searching the web, reading and writing files, managing your calendar, drafting and sending emails via your SMTP server, and executing code.
Think of it as giving your local AI model hands and tools.
4.1 Feature Overview
| Category | Odysseus Capabilities |
|---|---|
| LLM Backends | Ollama (local), llama.cpp, OpenAI API, OpenRouter |
| Agent Loops | Multi-step planning, web search, shell execution, file I/O |
| Email | IMAP/SMTP integration, AI-drafting, automated triage |
| Calendar | CalDAV sync (Nextcloud compatible), task scheduling |
| Documents | Markdown editor, HTML, CSV, AI-assisted writing |
| Memory | Persistent ChromaDB vector memory across sessions |
| Model Picker | Hardware-aware model recommendations |
4.2 Add Odysseus to Docker Compose
Odysseus requires ChromaDB as its vector memory backend:
# ═══════════════════════════════════════════════════════════════
# CHROMA — VECTOR DATABASE (Odysseus memory backend)
# Stores agent memories and enables semantic search
# ═══════════════════════════════════════════════════════════════
  chroma:
    image: chromadb/chroma:latest
    container_name: chroma
    restart: unless-stopped
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - ./chroma:/chroma/chroma
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "0.5"
# ═══════════════════════════════════════════════════════════════
# ODYSSEUS AI — AUTONOMOUS AGENT WORKSPACE
# Full AI workspace with multi-step agents, email, calendar tools
# Access: odysseus.yourdomain.com
# ═══════════════════════════════════════════════════════════════
  odysseus:
    image: ghcr.io/pewdiepie-archdaemon/odysseus:latest
    container_name: odysseus
    restart: unless-stopped
    depends_on:
      - ollama
      - chroma
    ports:
      - "127.0.0.1:7000:7000"
    environment:
      # Point to local Ollama (Docker internal network)
      OLLAMA_HOST: "http://ollama:11434"
      # Point to ChromaDB for persistent memory
      CHROMA_HOST: "http://chroma:8000"
      # Public URL for the workspace (for link generation)
      ODYSSEUS_BASE_URL: "https://odysseus.yourdomain.com"
      # Default model for autonomous agent loops
      # Phi-3.5 is fast enough for agent steps on this hardware
      ODYSSEUS_DEFAULT_MODEL: "phi3.5"
      # Enable memory persistence
      MEMORY_ENABLED: "true"
      # Secret key (generate: openssl rand -hex 32)
      SECRET_KEY: "replace_with_secure_random_string"
    volumes:
      - ./odysseus:/app/data
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
4.3 Deploy and Configure Odysseus
# Start ChromaDB first
docker compose up -d chroma
sleep 5
# Start Odysseus
docker compose up -d odysseus
sleep 20
# Check startup logs
docker compose logs --tail=30 odysseus
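The fixed `sleep` calls above are a guess at startup time. An alternative is to poll until the service answers. The sketch below shows one way to do that with the standard library; the helper names, retry count, and interval are assumptions, and the URL in the usage comment is the localhost port from the compose file.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with a 2xx status, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and timeouts are all OSError subclasses
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int = 30,
                     interval: float = 1.0, sleep=time.sleep) -> bool:
    """Call check() up to `attempts` times, pausing `interval` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


# Example (requires the container to be starting):
#   ready = wait_until_ready(lambda: http_ok("http://localhost:7000"))
#   print("Odysseus is up" if ready else "Odysseus did not respond in time")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.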
Initial Setup:
- Access https://odysseus.yourdomain.com (or http://localhost:7000).
- Check logs for the initial admin credentials:
  docker compose logs odysseus | grep -i "password\|credentials\|token"
- Log in and complete the setup wizard.
Connect to Local Ollama:
- In Odysseus, go to Settings → LLM Backends → Add Backend
- Type: Ollama
- URL: http://ollama:11434 (use the Docker service name, not localhost)
- Test connection → select available models
Configure the Model Cookbook (Hardware Recommendations): Odysseus will scan your hardware and recommend the best models. For the i5-1035G1 with 16 GB RAM, it typically recommends:
- Primary: phi3.5 (best balance of quality/speed)
- Fast: llama3.2:1b (for quick queries and agent intermediate steps)
- Coding: qwen2.5-coder:3b (for code generation tasks)
Connect Email (Optional):
- Settings → Integrations → Email
- IMAP server: your mail provider's settings
- This enables the agent to read and draft emails on your behalf
4.4 Sample Agent Use Cases
Once Odysseus is running with Ollama as its backend, you can assign it complex multi-step tasks:
Example 1: Research + Summary
"Search the web for the latest developments in India's 5G rollout,
summarize the key points, and save the summary to a markdown file
called 5g-india-research.md"
The agent will: (1) perform web searches, (2) read multiple pages, (3) synthesize the information, (4) write the file to your storage.
Example 2: Email Triage
"Check my inbox, find all emails about unpaid invoices from the last
week, draft polite follow-up replies, and show me for approval before
sending."
Example 3: Scheduled Automation
"Every Monday morning at 8 AM, check the news about stock markets
and cryptocurrency, generate a brief summary, and post it to my
Nextcloud notes."
5. Flowise — Visual AI Pipeline Builder
Flowise is a drag-and-drop tool for building AI pipelines visually. It is the fastest way to create:
- Custom chatbots with retrieval-augmented generation (RAG)
- Document Q&A systems (upload PDFs, get answers)
- Custom API endpoints powered by your local models
- Multi-step AI workflows without writing code
5.1 Add Flowise to Docker Compose
# ═══════════════════════════════════════════════════════════════
# FLOWISE — VISUAL AI PIPELINE BUILDER
# Drag-and-drop LangChain/LlamaIndex workflow designer
# Access: flowise.yourdomain.com
# ═══════════════════════════════════════════════════════════════
  flowise:
    image: flowiseai/flowise:latest
    container_name: flowise
    restart: unless-stopped
    ports:
      - "127.0.0.1:8083:3000"
    environment:
      # Admin credentials
      FLOWISE_USERNAME: admin
      FLOWISE_PASSWORD: changeme_flowise_password
      # Directory where Flowise stores its encryption key
      SECRETKEY_PATH: /root/.flowise
      # Encryption key for stored credentials. This is the key value itself,
      # not a boolean. Generate with: openssl rand -hex 32
      FLOWISE_SECRETKEY_OVERWRITE: "replace_with_secure_random_string"
    volumes:
      - ./flowise:/root/.flowise
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 768M
          cpus: "2.0"
docker compose up -d flowise
Build a Document Q&A Pipeline in Flowise:
- Access https://flowise.yourdomain.com
- Create a new Chatflow
- Drag in these nodes:
  - Ollama node → configure: URL http://ollama:11434, model phi3.5
  - PDF File document loader (upload your documents)
  - Recursive Character Text Splitter (chunk size: 1000, overlap: 100)
  - Chroma vector store → URL http://chroma:8000
  - Ollama Embeddings → same URL
  - Conversational Retrieval QA Chain
- Connect: PDF → Splitter → Chroma ← Embeddings; Ollama → QA Chain ← Chroma → Chat output
- Save and test: upload a PDF (e.g., your electricity bill, a research paper), then ask questions about its contents
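To get a feel for what the splitter settings (chunk size 1000, overlap 100) do to a document, here is a deliberately simplified fixed-window chunker. It is not Flowise's recursive splitter, which also tries to break on paragraph and sentence boundaries; it only illustrates how size and overlap interact. The function name is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into fixed-size windows where each window repeats the
    last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    document = "".join(str(i % 10) for i in range(2500))
    pieces = chunk_text(document)
    print([len(piece) for piece in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.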
6. Qdrant — Production Vector Database for RAG
Qdrant is a more production-grade vector database than ChromaDB, designed for high-performance similarity search. Use it when your document corpus grows beyond a few hundred documents.
6.1 Add Qdrant to Docker Compose
# ═══════════════════════════════════════════════════════════════
# QDRANT — VECTOR SEARCH DATABASE
# High-performance vector database for RAG pipelines
# REST API on port 6333, gRPC on 6334
# ═══════════════════════════════════════════════════════════════
  qdrant:
    image: qdrant/qdrant:latest
    container_name: qdrant
    restart: unless-stopped
    ports:
      - "127.0.0.1:6333:6333" # REST API
      - "127.0.0.1:6334:6334" # gRPC API
    volumes:
      - ./qdrant/storage:/qdrant/storage
      - ./qdrant/config:/qdrant/config
    networks:
      - server_net
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
docker compose up -d qdrant
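# Test the REST API (the root endpoint reports name and version)
curl http://localhost:6333/
# Should return something like: {"title":"qdrant - vector search engine","version":"x.x.x"}
# For a bare liveness probe, use: curl http://localhost:6333/healthz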
# Create a collection for document embeddings
curl -X PUT http://localhost:6333/collections/documents \
-H "Content-Type: application/json" \
-d '{
"vectors": {
"size": 768,
"distance": "Cosine"
}
}'
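The collection above is configured for 768-dimensional vectors compared by cosine distance. The sketch below shows what that comparison computes and how a point body for Qdrant's upsert endpoint (PUT /collections/documents/points) can be assembled. The helper names and the dimension check are assumptions for illustration; the embedding values would come from whichever embedding model you use.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def build_upsert_body(point_id: int, vector, payload: dict, expected_dim: int = 768) -> dict:
    """JSON body for upserting one point; rejects vectors of the wrong size."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return {"points": [{"id": point_id, "vector": list(vector), "payload": payload}]}


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
    body = build_upsert_body(1, [0.0] * 768, {"source": "example.pdf"})
    print(len(body["points"][0]["vector"]))
```

Checking the dimension on the client side gives a clearer error than the one Qdrant returns when a vector does not match the collection.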
7. Complete AI Stack Resource Budget
Now let us verify that all AI services fit within our RAM envelope:
RAM Budget with Full AI Stack Running
═════════════════════════════════════════════════════════
Windows 10 (base processes, Explorer, Defender): ~2.8 GB
WSL2 Linux kernel + Ubuntu OS: ~0.2 GB
─────────────────────────────────────────────────────────
Docker Services (Part 2):
Caddy: ~0.1 GB
Portainer: ~0.1 GB
MariaDB (Nextcloud): ~0.3 GB
Nextcloud: ~0.4 GB
Vaultwarden: ~0.1 GB
Uptime Kuma: ~0.1 GB
n8n: ~0.3 GB
─────────────────────────────────────────────────────────
AI Stack (Part 3):
Ollama daemon (no model loaded): ~0.1 GB
Open WebUI: ~0.2 GB
Odysseus workspace: ~0.2 GB
ChromaDB: ~0.2 GB
Flowise: ~0.3 GB
Qdrant: ~0.2 GB
─────────────────────────────────────────────────────────
Subtotal (no model loaded): ~5.6 GB
─────────────────────────────────────────────────────────
When phi3.5 is loaded (inference): +2.2 GB
Total with phi3.5: ~7.8 GB
When llama3.1:8b is loaded (inference): +4.7 GB
Total with 8B model: ~10.3 GB
─────────────────────────────────────────────────────────
16 GB Total Available
Headroom (with 8B model): ~5.7 GB ✅
═════════════════════════════════════════════════════════
All services fit comfortably. The OLLAMA_KEEP_ALIVE=5m setting ensures
the model is evicted after 5 minutes of inactivity, returning 2–5 GB to
the system.
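If you add or remove services, the budget is easy to recompute. This sketch re-adds the line items from the table above; the dictionary layout and function names are assumptions, and the figures are this article's estimates rather than measurements.

```python
TOTAL_GB = 16.0

BASE = {"Windows 10": 2.8, "WSL2 + Ubuntu": 0.2}
PART2 = {
    "Caddy": 0.1, "Portainer": 0.1, "MariaDB": 0.3, "Nextcloud": 0.4,
    "Vaultwarden": 0.1, "Uptime Kuma": 0.1, "n8n": 0.3,
}
PART3 = {
    "Ollama daemon": 0.1, "Open WebUI": 0.2, "Odysseus": 0.2,
    "ChromaDB": 0.2, "Flowise": 0.3, "Qdrant": 0.2,
}


def subtotal_gb(*groups: dict) -> float:
    """Sum every line item across the given groups, rounded to 0.1 GB."""
    return round(sum(sum(group.values()) for group in groups), 1)


def headroom_gb(total_gb: float, services_gb: float, model_gb: float) -> float:
    """RAM left after services and one loaded model, rounded to 0.1 GB."""
    return round(total_gb - services_gb - model_gb, 1)


if __name__ == "__main__":
    services = subtotal_gb(BASE, PART2, PART3)
    print(f"services, no model loaded: {services} GB")
    for name, size_gb in [("phi3.5", 2.2), ("llama3.1:8b", 4.7)]:
        print(f"headroom with {name}: {headroom_gb(TOTAL_GB, services, size_gb)} GB")
```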
8. Selecting the Right Model for the Right Task
Not every task needs the most powerful model. Using a 1B model for a quick lookup is significantly faster than loading an 8B model:
| Use Case | Recommended Model | Why |
|---|---|---|
| Quick Q&A, simple tasks | llama3.2:1b | ~30+ t/s, instant responses |
| General reasoning, writing | phi3.5:3.8b | Best quality/speed balance |
| Code generation, review | qwen2.5-coder:3b | Trained specifically for code |
| Complex analysis, long tasks | qwen2.5:7b | Higher quality, patience needed |
| Hindi/multilingual tasks | qwen2.5:3b | Strong multilingual support |
| Agent planning steps | phi3.5:3.8b | Good tool-use understanding |
| Document summarization | mistral:7b | Long context window support |
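If you call Ollama from your own scripts, the table above can be encoded as a small lookup so each request goes to the cheapest suitable model. This is a sketch: the task keys and the fallback choice are assumptions, while the model tags are the ones from the table.

```python
# Task keys are hypothetical labels; model tags come from the table above.
MODEL_BY_TASK = {
    "quick_qa": "llama3.2:1b",
    "general": "phi3.5:3.8b",
    "code": "qwen2.5-coder:3b",
    "analysis": "qwen2.5:7b",
    "multilingual": "qwen2.5:3b",
    "agent": "phi3.5:3.8b",
    "summarize": "mistral:7b",
}
DEFAULT_MODEL = "phi3.5:3.8b"  # assumed fallback: the recommended all-rounder


def pick_model(task: str) -> str:
    """Return the model tag for a task label, falling back to the default."""
    return MODEL_BY_TASK.get(task.strip().lower(), DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ["code", "quick_qa", "unknown-task"]:
        print(f"{task} -> {pick_model(task)}")
```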
Creating Custom Model Configurations
Ollama's Modelfile syntax lets you customize model behavior:
# Create a custom system prompt for a coding assistant
docker exec -it ollama sh -c "cat > /tmp/CodingAssistant << 'EOF'
FROM qwen2.5-coder:3b
SYSTEM \"\"\"You are an expert software engineer. You write clean, modular,
well-commented code. You always follow SOLID principles. You prefer
functional patterns where appropriate. When showing code, always include
comments explaining non-obvious logic. Format all code in proper markdown
code blocks with the language specified. Ask clarifying questions if the
requirements are ambiguous before writing any code.\"\"\"
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF
ollama create coding-assistant -f /tmp/CodingAssistant"
# Verify the new model appears
docker exec ollama ollama list
# Test it
curl -X POST http://localhost:11434/api/chat \
-d '{"model":"coding-assistant","messages":[{"role":"user","content":"Write a Python function to validate an Indian phone number"}],"stream":false}'
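If you maintain several custom assistants, generating the Modelfile text from a template avoids the shell quoting in the heredoc above. The sketch below renders only the three instructions used in this section (FROM, SYSTEM, PARAMETER); the function name is hypothetical, and you would still pass the resulting file to `ollama create`.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM and PARAMETER lines."""
    if '"""' in system:
        raise ValueError('system prompt must not contain triple quotes')
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        base="qwen2.5-coder:3b",
        system="You are an expert software engineer. You write clean, modular code.",
        parameters={"temperature": 0.1, "num_ctx": 4096},
    )
    print(text, end="")
```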
9. Performance Optimization Tips
Maximize Inference Speed
# CPU governor: WSL2 runs inside a lightweight VM and usually does not expose
# cpufreq controls, so CPU speed is governed by the Windows power plan (below).
# If your kernel does expose them, this switches to performance mode
# (note: it may not persist across WSL2 restarts):
if [ -d /sys/devices/system/cpu/cpu0/cpufreq ]; then
  echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  # Monitor CPU frequency during inference
  watch -n 1 "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"
  # Aim to see values near 3,600,000 (3.6 GHz turbo boost)
fi
# On Windows side: ensure CPU is at max performance during inference
# (The "Server - Always On" power plan from Part 1 handles this)
# Verify no thermal throttling via Windows Task Manager:
# Task Manager → Performance → CPU → check if clock speed is at max
Reduce Context Window for Speed
Every token in the context window adds to the KV cache, which must be held in RAM and read on each decode step, so a larger context costs both memory and bandwidth. For quick tasks, reduce the context:
# Test with a smaller context window (faster for short conversations)
curl -X POST http://localhost:11434/api/generate \
-d '{
"model": "phi3.5",
"prompt": "Summarize this in 2 sentences.",
"options": {
"num_ctx": 2048,
"num_thread": 4,
"num_batch": 512
},
"stream": false
}'
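The cost of a larger num_ctx can be estimated from the size of the KV cache, which stores one key and one value vector per layer for every token in context. The sketch below uses the standard size formula; the architecture numbers (32 layers, 32 KV heads, head dimension 96, 16-bit entries) are assumed values for a Phi-3-mini-class model, so check the model card before relying on them.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Size of a full KV cache: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    # Assumed architecture values; not taken from this article
    layers, kv_heads, head_dim = 32, 32, 96
    for n_ctx in (2048, 4096, 8192):
        size_gib = kv_cache_bytes(layers, kv_heads, head_dim, n_ctx) / 2**30
        print(f"num_ctx={n_ctx}: ~{size_gib:.2f} GiB of KV cache")
```

Under these assumptions, halving num_ctx from 4096 to 2048 halves the cache, which is memory the model no longer has to allocate or scan.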
Monitor Real-Time Resource Usage
# Inside WSL2 — watch memory and CPU during inference
watch -n 2 "docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'"
Summary and What's Next
In Part 3, you have:
- ✅ Understood the physics of CPU-only LLM inference — memory bandwidth math and why dual-channel DDR4-2666 is the limiting factor.
- ✅ Deployed Ollama with optimized environment variables (KEEP_ALIVE, NUM_THREADS) and pulled recommended models for 16 GB hardware.
- ✅ Configured Open WebUI as a polished chat interface connecting to your local Ollama.
- ✅ Deployed Odysseus AI workspace with autonomous agents, persistent memory via ChromaDB, and multi-step tool execution.
- ✅ Set up Flowise for visual AI pipeline building with a working document Q&A example.
- ✅ Deployed Qdrant as a production-grade vector database for RAG.
- ✅ Verified the complete RAM budget — all services run comfortably within 16 GB with headroom.
In Part 4, we tackle operations: Task Scheduler automation edge cases, PowerShell health monitoring with Telegram alerts, handling Windows sleep resume (clock drift fix), DNS resolution failures, WSL2 OOM recovery, and configuring the laptop for seamless dual-use as both a daily driver and a background server.
Continue to Part 4: Automation, Monitoring, and Dual-Use Operations →