Part 18: Observability - Prometheus, Grafana & OpenTelemetry
As backends scale from simple monoliths to distributed microservices and serverless architectures, diagnosing why a transaction failed or why a database request stalled becomes much harder. Traditional "monitoring", which merely checks whether a service is online, is no longer sufficient.
Observability is the practice of inferring the internal state of a system from its external outputs (signals). As of 2026, the ecosystem has largely converged on OpenTelemetry (OTel) for instrumentation and on the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir), complemented by Pyroscope for profiling, for ingestion and analysis. This guide provides an architectural breakdown and a blueprint of exactly 30 curated resources for mastering modern observability.
Telemetry Signals: The Three Pillars and Beyond
Modern systems collect four core classes of telemetry data:
Telemetry Ingestion Pipeline
┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│    METRICS    │  │     LOGS      │  │    TRACES     │  │   PROFILING   │
└───────┬───────┘  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘
        │                  │                  │                  │
        └──────────────────┼──────────────────┼──────────────────┘
                           ▼
              [ OpenTelemetry Collector ]
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
[ Prometheus / Mimir ] [ Loki ]      [ Tempo / Jaeger ]
    (Metrics TSDB)   (Logs Index)       (Trace Spans)
- Metrics:
  - Numeric values aggregated over time (e.g., CPU utilization, HTTP request rates, active database connections).
  - Fast to query and cheap to store.
  - Used to trigger Slack or PagerDuty alerts when thresholds are violated.
- Logs:
  - Timestamped text payloads representing discrete application events.
  - Emitting logs as structured JSON is standard practice on modern platforms, because it lets downstream parsers query, index, and organize entries efficiently.
- Distributed Tracing:
  - A record of a request's journey across service boundaries, usually visualized as a timeline.
  - Maps out Spans (single operations such as a database query or an HTTP call) linked by a shared Trace ID, allowing you to pinpoint the service causing an API bottleneck.
- Continuous Profiling:
  - Continuously measures resource usage (CPU, memory, threads) at the function level in production.
  - Visualized via Flame Graphs to show the code paths leaking memory or stalling threads.
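To make the structured-logging point concrete, here is a minimal sketch that uses only the Python standard library. The `JsonLineFormatter` class, the `structured_demo` logger name, and the `order_id` field are hypothetical names chosen for illustration; a production service would more likely rely on a dedicated logging library.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    # Extra fields copied into the output when present on the record.
    EXTRA_FIELDS = ("trace_id", "order_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("structured_demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # keep repeated calls from duplicating output
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
demo_logger = build_logger(buffer)
demo_logger.info("order placed", extra={"trace_id": "abc123", "order_id": 42})

# Each line is machine-parseable, so a log backend can filter on fields
# instead of running regular expressions over free text.
entry = json.loads(buffer.getvalue())
print(entry["message"], entry["order_id"])
```

Because every entry is a JSON object, a query such as "all records where `order_id` is 42" becomes a field lookup rather than a text search.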
1. Unified Collection with OpenTelemetry SDKs & Collector
Master OpenTelemetry SDK architectures and unified collectors with these 5 resources.
Subtopic Resources
| Resource Name & Metadata | Access Category | Status & Skip Conditions |
|---|---|---|
| Observability Engineering by Charity Majors, Liz Fong-Jones, & George Miranda (O'Reilly) The definitive book on structured telemetry pipelines. | Book | Required |
| OpenTelemetry Fundamentals (Pluralsight Course) Hands-on video training detailing SDK integrations and collector modes. | Video Course | Required |
| OpenTelemetry Official Collector Documentation Reference manual for configuring receivers, processors, and exporters. | Documentation | Required |
| OpenTelemetry Crash Course by TechWorld with Nana (YouTube) Visual walkthrough on tracing spans and collector configurations. | Video Stream | Required |
| OpenTelemetry Collector Pipeline Sandbox (StackBlitz) Interactive sandbox to test and validate pipeline processors. | Interactive Sandbox | Required |
Resource Identification & Access
- Observability Engineering
  - Direct URL: https://www.oreilly.com/library/view/observability-engineering/9781492097723/
  - Search Identification: Search O'Reilly for "Observability Engineering Charity Majors"
- OpenTelemetry Fundamentals
  - Direct URL: https://www.pluralsight.com/courses/opentelemetry-fundamentals
  - Search Identification: Search Pluralsight for "OpenTelemetry Fundamentals"
- OpenTelemetry Collector Documentation
  - Direct URL: https://opentelemetry.io/docs/collector/
  - Search Identification: Search OpenTelemetry Docs for "Collector Architecture receivers processors exporters"
- OpenTelemetry Crash Course
  - Direct URL: https://www.youtube.com/watch?v=r8H46V41R6A
  - Search Identification: Search YouTube for "TechWorld with Nana OpenTelemetry Crash Course"
- OpenTelemetry Collector Pipeline Sandbox
  - Direct URL: https://stackblitz.com/edit/opentelemetry-collector-sandbox
  - Search Identification: Search StackBlitz for "OpenTelemetry collector pipeline testing"
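The receiver, processor, and exporter stages that the Collector documentation describes can be pictured with a toy model. This is a plain-Python sketch of the data flow only, not the Collector's API or its configuration format; `Pipeline`, `drop_health_checks`, `add_environment`, and the record fields are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Record = dict  # one telemetry item, e.g. a span or a log line


@dataclass
class Pipeline:
    """Toy receiver -> processors -> exporter chain (illustrative only)."""

    processors: List[Callable[[List[Record]], List[Record]]]
    exporter: Callable[[List[Record]], None]
    batch_size: int = 2
    _pending: List[Record] = field(default_factory=list)

    def receive(self, record: Record) -> None:
        # Receiver stage: accept a record and buffer it until a batch is full.
        # Batching is folded into the pipeline here for brevity.
        self._pending.append(record)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Processor stage: run the batch through each processor in order,
        # then hand whatever survives to the exporter.
        batch, self._pending = self._pending, []
        for process in self.processors:
            batch = process(batch)
        if batch:
            self.exporter(batch)


def drop_health_checks(batch: List[Record]) -> List[Record]:
    """Processor that filters out noisy health-check records."""
    return [r for r in batch if r.get("path") != "/healthz"]


def add_environment(batch: List[Record]) -> List[Record]:
    """Processor that enriches every record with a static attribute."""
    return [{**r, "env": "dev"} for r in batch]


exported: List[List[Record]] = []  # stands in for a backend such as Loki
pipeline = Pipeline([drop_health_checks, add_environment], exported.append)
for path in ["/api/checkout", "/healthz", "/api/cart"]:
    pipeline.receive({"path": path})
pipeline.flush()  # export whatever is still buffered
print(exported)
```

The point of the shape is that filtering and enrichment happen once, in the middle, so applications and storage backends do not each need their own copy of that logic.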
2. Prometheus Pull Architecture, TSDB Mechanics & PromQL
Master metrics collection and alerting with these 5 curated resources.
Subtopic Resources
| Resource Name & Metadata | Access Category | Status & Skip Conditions |
|---|---|---|
| Prometheus: Up & Running by Julien Pivotto & Brian Brazil In-depth book covering TSDB compression and PromQL syntax. | Book | Required |
| Prometheus & Grafana - The Complete Guide by Stephane Maarek (Udemy) Popular video guide on metric scraping and alertmanager configurations. | Video Course | Required |
| Prometheus Query Language (PromQL) Guide (Prometheus) Official reference detailing range vectors, instant queries, and aggregations. | Documentation | Required |
| Prometheus Deep Dive: Storage Engine by Julius Volz Core co-founder lecture explaining time-series index serialization. | Video Stream | Required |
| Interactive PromQL Exercises Playground (StackBlitz) Interactive sandbox containing raw metrics to run PromQL queries. | Interactive Sandbox | Alternative (Skip if "Prometheus & Grafana - The Complete Guide" is completed) |
Resource Identification & Access
- Prometheus: Up & Running
  - Direct URL: https://www.oreilly.com/library/view/prometheus-up/9781492034131/
  - Search Identification: Search O'Reilly for "Prometheus Up and Running Brian Brazil"
- Prometheus & Grafana - The Complete Guide
  - Direct URL: https://www.udemy.com/course/prometheus-grafana/
  - Search Identification: Search Udemy for "Prometheus and Grafana Stephane Maarek"
- Prometheus Query Language (PromQL) Guide
  - Direct URL: https://prometheus.io/docs/prometheus/latest/querying/basics/
  - Search Identification: Search Prometheus Docs for "Querying basics PromQL vectors"
- Prometheus Deep Dive: Storage Engine
  - Direct URL: https://www.youtube.com/watch?v=hTz1c80rVvQ
  - Search Identification: Search YouTube for "Julius Volz Prometheus architecture storage"
- Interactive PromQL Exercises Playground
  - Direct URL: https://stackblitz.com/edit/promql-sandbox-exercises
  - Search Identification: Search StackBlitz for "PromQL interactive queries simulation"
3. Custom Grafana Dashboards & Dynamic Data Visualization
Learn to build dynamic, useful dashboards with these 5 resources.
Subtopic Resources
| Resource Name & Metadata | Access Category | Status & Skip Conditions |
|---|---|---|
| Observability with Grafana by Rob Chapman & Peter Holmes Practical handbook covering multi-tenant visual rendering patterns. | Book | Required |
| Grafana Learning Paths (Grafana Academy) Free, interactive official courses on panel styling and metrics mapping. | Video Course | Required |
| Grafana Panel Plugins & Variables Reference (Grafana Docs) Official documentation detailing template queries and variable inputs. | Documentation | Required |
| How to Build Professional Grafana Dashboards (YouTube) Grafana Labs video on alert thresholds and telemetry layouts. | Video Stream | Required |
| Grafana Live Playground (Grafana Play) Official public interactive sandbox featuring fully populated metrics. | Interactive Sandbox | Required |
Resource Identification & Access
- Observability with Grafana
  - Direct URL: https://www.manning.com/books/observability-with-grafana
  - Search Identification: Search Manning for "Observability with Grafana Chapman Holmes"
- Grafana Learning Paths
  - Direct URL: https://grafana.com/tutorials/
  - Search Identification: Search Grafana for "Grafana tutorials dashboards getting started"
- Grafana Panel Plugins & Variables Reference
  - Direct URL: https://grafana.com/docs/grafana/latest/panels-visualizations/
  - Search Identification: Search Grafana Docs for "Panels Visualizations variables query"
- How to Build Professional Grafana Dashboards
  - Direct URL: https://www.youtube.com/watch?v=B9JbZ1Zl65U
  - Search Identification: Search YouTube for "Grafana Labs dynamic production dashboards"
- Grafana Live Playground
  - Direct URL: https://play.grafana.org/
  - Search Identification: Search Web for "Grafana Play public sandbox server"
4. Distributed Tracing with OpenTelemetry, Tempo & Jaeger
Master context propagation and trace visualization with these 5 resources.
Subtopic Resources
| Resource Name & Metadata | Access Category | Status & Skip Conditions |
|---|---|---|
| Mastering Distributed Tracing by Yuri Shkuro (Packt) Definitive book written by the creator of Jaeger. | Book | Required |
| Distributed Tracing Bootcamp (Udemy) Visual course tracing microservices across HTTP/gRPC boundaries. | Video Course | Required |
| Grafana Tempo Documentation (Grafana Docs) Technical manual covering high-performance, object-storage tracing. | Documentation | Required |
| Wtf is Distributed Tracing? Why do we need Spans? (YouTube) Humorous and insightful lecture by Charity Majors on spans and context. | Video Stream | Required |
| OpenTelemetry Jaeger Tracing Sandbox (StackBlitz) Interactive Node.js sandbox executing trace spans locally. | Interactive Sandbox | Alternative (Skip if "Mastering Distributed Tracing" by Yuri Shkuro is completed) |
Resource Identification & Access
- Mastering Distributed Tracing
  - Direct URL: https://www.packtpub.com/product/mastering-distributed-tracing/9781788627856
  - Search Identification: Search Packt for "Mastering Distributed Tracing Yuri Shkuro"
- Distributed Tracing Bootcamp
  - Direct URL: https://www.udemy.com/course/distributed-tracing/
  - Search Identification: Search Udemy for "Distributed Tracing Bootcamp Jaeger"
- Grafana Tempo Documentation
  - Direct URL: https://grafana.com/docs/tempo/latest/
  - Search Identification: Search Grafana Tempo Docs for "Tracing storage formats"
- Wtf is Distributed Tracing? Why do we need Spans?
  - Direct URL: https://www.youtube.com/watch?v=Yf1eZ029Jso
  - Search Identification: Search YouTube for "Charity Majors WTF is Distributed Tracing"
- OpenTelemetry Jaeger Tracing Sandbox
  - Direct URL: https://stackblitz.com/edit/opentelemetry-jaeger-tracing-sandbox
  - Search Identification: Search StackBlitz for "OpenTelemetry Jaeger trace spans"
5. Structured Application Logging & Log Aggregation with Loki
Ingest, index, and query application logs at scale with these 5 resources.
Subtopic Resources
| Resource Name & Metadata | Access Category | Status & Skip Conditions |
|---|---|---|
| Log Management in the Cloud (Packt Publishing) Book covering centralized cloud logging patterns and structured indices. | Book | Required |
| Grafana Loki: Modern Log Ingestion & Querying (Udemy) Practical video series demonstrating Loki agent routing configurations. | Video Course | Required |
| Grafana Loki: LogQL Reference (Grafana Docs) Official LogQL syntax parser guides and dynamic formatting tutorials. | Documentation | Required |
| Like Prometheus, But for Logs: Loki Architecture (YouTube) Official architectural breakdown video explaining log metadata labels. | Video Stream | Required |
| Loki LogQL Query Sandbox (StackBlitz) Interactive sandbox template validating LogQL queries locally. | Interactive Sandbox | Required |
Resource Identification & Access
- Log Management in the Cloud
  - Direct URL: https://www.packtpub.com/product/log-management-in-the-cloud/9781801815123
  - Search Identification: Search Packt for "Log Management in the Cloud"
- Grafana Loki: Modern Log Ingestion
  - Direct URL: https://www.udemy.com/course/grafana-loki/
  - Search Identification: Search Udemy for "Grafana Loki log ingestion"
- Grafana Loki: LogQL Reference
  - Direct URL: https://grafana.com/docs/loki/latest/logql/
  - Search Identification: Search Grafana Loki Docs for "LogQL query guide"
- Like Prometheus, But for Logs: Loki Architecture
  - Direct URL: https://www.youtube.com/watch?v=Vl03qGpyE7A
  - Search Identification: Search YouTube for "Grafana Labs Loki Architecture Logs"
- Loki LogQL Query Sandbox
  - Direct URL: https://stackblitz.com/edit/grafana-loki-logql-sandbox
  - Search Identification: Search StackBlitz for "Loki LogQL query exercises"
6. Continuous Profiling Internals (Grafana Pyroscope)
Find CPU hot paths and memory leaks in production with these 5 resources.
Subtopic Resources
| Resource Name & Metadata | Access Category | Status & Skip Conditions |
|---|---|---|
| Systems Performance by Brendan Gregg (O'Reilly) The master reference book for performance engineering and tracing. | Book | Required |
| Performance Engineering & Continuous Profiling (LinkedIn) Video track demonstrating CPU profiling and memory leak tracing. | Video Course | Required |
| Grafana Pyroscope: Continuous Profiling (Grafana Docs) Official integration guide covering continuous stack tracing agents. | Documentation | Required |
| Continuous Profiling: The Fourth Pillar of Observability (YouTube) Grafana Labs presentation detailing flame graph performance overlays. | Video Stream | Required |
| Profiling Python & Node applications using Pyroscope (StackBlitz) Interactive sandbox generating real flame graphs from CPU load. | Interactive Sandbox | Alternative (Skip if "Systems Performance" by Brendan Gregg is completed) |
Resource Identification & Access
- Systems Performance
  - Direct URL: https://www.oreilly.com/library/view/systems-performance-2nd/9780136821694/
  - Search Identification: Search O'Reilly for "Systems Performance Brendan Gregg"
- Performance Engineering & Continuous Profiling
  - Direct URL: https://www.linkedin.com/learning/performance-engineering-and-continuous-profiling
  - Search Identification: Search LinkedIn Learning for "Continuous Profiling Pyroscope"
- Grafana Pyroscope: Continuous Profiling
  - Direct URL: https://grafana.com/docs/pyroscope/latest/
  - Search Identification: Search Grafana Pyroscope Docs for "Flame graphs continuous profiling"
- Continuous Profiling: The Fourth Pillar
  - Direct URL: https://www.youtube.com/watch?v=F3a7dZpB17k
  - Search Identification: Search YouTube for "Grafana Labs Pyroscope Continuous Profiling"
- Profiling Python & Node applications
  - Direct URL: https://stackblitz.com/edit/pyroscope-continuous-profiling-sandbox
  - Search Identification: Search StackBlitz for "Grafana Pyroscope flame graphs node"
Portfolio Project Lab: Instrumented FastAPI Microservice
Objective
Create a fully instrumented, production-grade FastAPI microservice utilizing OpenTelemetry SDKs to collect transaction metrics, propagate distributed tracing, and write structured JSON application logs.
1. Project Dependencies
Create a requirements.txt containing the necessary OpenTelemetry and application libraries:
fastapi==0.110.0
uvicorn==0.28.0
opentelemetry-api==1.23.0
opentelemetry-sdk==1.23.0
opentelemetry-instrumentation-fastapi==0.44b0
opentelemetry-exporter-otlp-proto-grpc==1.23.0
python-json-logger==2.0.7
2. Structured JSON Logger & Telemetry SDK Configuration
Save this file as logger_config.py in your workspace directory:
import logging
import sys

from pythonjsonlogger import jsonlogger


def setup_logger():
    logger = logging.getLogger("api_logger")
    logger.setLevel(logging.INFO)

    # In production, logs must write to standard output in clean JSON.
    # StreamHandler defaults to stderr, so pass sys.stdout explicitly.
    log_handler = logging.StreamHandler(sys.stdout)
    # Each field named in the format string becomes a key in the JSON output;
    # trace_id and span_id are supplied per call through `extra`.
    formatter = jsonlogger.JsonFormatter(
        "%(asctime)s %(levelname)s %(message)s %(trace_id)s %(span_id)s"
    )
    log_handler.setFormatter(formatter)
    logger.addHandler(log_handler)
    return logger


logger = setup_logger()
3. Fully Instrumented Core Application
Save this code block as main.py. It implements automatic trace context propagation and metric generation:
import asyncio
import random
import time

from fastapi import FastAPI, Request

from logger_config import logger

# OpenTelemetry API and SDK imports
from opentelemetry import metrics, trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# 1. Initialize tracing with a console exporter for local validation
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("fastapi-service-tracer")

# 2. Initialize metrics with a console exporter for checking counts
metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("fastapi-service-metrics")

# Declare a custom counter metric to track transaction volumes
order_counter = meter.create_counter(
    name="api_orders_processed_total",
    description="Total volume of processed client checkout orders",
    unit="1",
)

# 3. Initialize the FastAPI application
app = FastAPI(title="Instrumented API Service")


# Global middleware that measures latency and logs one structured line per request
@app.middleware("http")
async def add_telemetry_headers(request: Request, call_next):
    start_time = time.perf_counter()  # monotonic clock, suited to durations

    # Retrieve the context of the currently active OpenTelemetry span.
    # get_current_span() never returns None, so check validity instead.
    span_context = trace.get_current_span().get_span_context()
    trace_id = format(span_context.trace_id, "032x") if span_context.is_valid else "0"
    span_id = format(span_context.span_id, "016x") if span_context.is_valid else "0"

    # Pass execution to the next handler
    response = await call_next(request)
    duration = time.perf_counter() - start_time

    # Log the request outcome as structured JSON
    logger.info(
        "HTTP Request Processed",
        extra={
            "http_method": request.method,
            "http_path": request.url.path,
            "http_status": response.status_code,
            "duration_seconds": duration,
            "trace_id": trace_id,
            "span_id": span_id,
        },
    )

    # Expose the trace ID on the response for client troubleshooting
    response.headers["X-Trace-ID"] = trace_id
    return response


@app.get("/api/checkout")
async def checkout():
    # Wrap database logic inside a dedicated custom sub-span
    with tracer.start_as_current_span("database_save_transaction") as span:
        # Simulate processing latency. time.sleep() would block the event
        # loop inside an async handler, so await asyncio.sleep() instead.
        latency = random.uniform(0.01, 0.15)
        await asyncio.sleep(latency)
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "INSERT INTO orders (total) VALUES (99)")

    # Increment the custom OpenTelemetry counter (printed by the console exporter)
    order_counter.add(1, {"status": "success"})
    return {"status": "Order Placed Successfully"}


# Auto-instrument FastAPI routes
FastAPIInstrumentor.instrument_app(app)
To run this instrumented service locally:
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8080
Perform an API hit:
curl http://localhost:8080/api/checkout
The trace spans and the structured JSON log line, complete with injected Trace and Span IDs, appear on the console right away. The counter is printed when PeriodicExportingMetricReader next exports, which by default happens every 60 seconds.
Common Observability Interview Questions
1. Explain the difference between Logs, Metrics, and Distributed Tracing.
- Answer:
  - Metrics are numeric aggregations over time windows (e.g. request count). They are lightweight, fast to query, and cheap to store, making them ideal for high-level alerting.
  - Logs are timestamped strings or structured objects representing discrete application events. They provide rich details but are expensive to store.
  - Distributed Tracing tracks requests across system boundaries by passing context headers (Trace IDs). It shows dynamic dependency maps and pinpoints service bottlenecks.
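As an illustration of why metrics are cheap to aggregate, the sketch below derives a per-second rate from cumulative counter samples, treating any drop in value as a counter reset. It is a simplified stand-in for what a query such as PromQL's `rate()` computes (the real function also extrapolates to the window edges); `per_second_rate` and the sample values are made up for the example.

```python
from typing import List, Tuple

# (unix timestamp in seconds, cumulative counter value)
Sample = Tuple[float, float]


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase across the samples, tolerating resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")

    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter restarted from zero (e.g. a process restart), so
            # the whole current value counts as new increase.
            increase += current

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical scrapes of a request counter every 15 seconds, with one
# restart between the third and fourth scrape.
scrapes = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 20.0), (60.0, 50.0)]
print(per_second_rate(scrapes))
```

Five small numbers are enough to answer "how many requests per second over the last minute", whereas answering the same question from logs would mean scanning every individual request entry.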
2. How does OpenTelemetry context propagation work across HTTP boundaries?
- Answer: Context propagation passes metadata (Trace ID, Span ID) across service boundaries by injecting specific key-value pairs into HTTP headers (using standards like W3C Trace Context). The sending service injects headers like traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The receiving service reads this header, extracts the IDs, and binds its local spans to that parent trace.
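To make the header layout concrete, here is a small standard-library sketch that parses a `traceparent` value and derives the header a service would send on its own outgoing call. It only handles the version `00` layout, and the names `TraceParent`, `parse_traceparent`, and `child_of` are hypothetical; real services should rely on the OpenTelemetry propagators rather than hand-rolled parsing.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte marks the trace as sampled.
        return int(self.flags, 16) & 0x01 == 0x01

    def header(self) -> str:
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags}"


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    if parsed.version != "00":
        return None  # this sketch only understands version 00
    # All-zero trace or parent IDs are not valid identifiers.
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed


def child_of(parent: TraceParent) -> TraceParent:
    """Keep the trace ID but mint a new span ID for the outgoing call."""
    return parent._replace(parent_id=secrets.token_hex(8))


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = child_of(incoming)
print(incoming.trace_id, incoming.sampled)
print(outgoing.header())
```

The trace ID stays constant along the whole request path while the parent ID changes at every hop, which is what lets a backend reassemble the spans into one tree.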
3. What is Continuous Profiling, and why is it valuable compared to standard metrics?
- Answer: Continuous profiling continuously measures resource usage (CPU time, heap memory allocations, thread counts) at the function and code-line level in production with minimal overhead. While standard metrics tell you if CPU usage is high, continuous profiling points you directly to the exact lines of code causing the bottleneck, visualized via Flame Graphs.
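The step from raw stack samples to a flame graph can be sketched in a few lines. The code below collapses sampled call stacks into the semicolon-separated "folded" text format that flame graph tooling commonly consumes, then ranks functions by the share of samples in which they were on top of the stack. The sample data and function names are invented; a real profiler would capture these stacks from a running process.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Each sample is one captured call stack, outermost frame first.
# These stacks are invented for illustration.
samples: List[Sequence[str]] = [
    ("main", "handle_request", "serialize_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "serialize_json"),
    ("main", "handle_request", "serialize_json"),
    ("main", "gc_collect"),
]


def fold_stacks(stacks: List[Sequence[str]]) -> Counter:
    """Count identical stacks, keyed by their 'a;b;c' folded form."""
    return Counter(";".join(stack) for stack in stacks)


def self_time_share(stacks: List[Sequence[str]]) -> Dict[str, float]:
    """Fraction of samples in which each function was the innermost frame."""
    leaves = Counter(stack[-1] for stack in stacks)
    total = sum(leaves.values())
    return {name: count / total for name, count in leaves.most_common()}


folded = fold_stacks(samples)
for stack, count in folded.most_common():
    print(stack, count)  # one line per unique stack, widest first

shares = self_time_share(samples)
print(shares)
```

A CPU metric would report only that the process is busy; the folded counts show which call path accounts for most of the samples, which is the information a flame graph draws as box widths.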
Next Steps
Now that you have learned to monitor and profile distributed systems, we will integrate these checks into automated release workflows.