Part 18: Observability - Prometheus, Grafana & OpenTelemetry

Master OpenTelemetry SDK instrumentation, Prometheus pull architectures, Grafana dashboard panels, Tempo spans, Loki LogQL, and Pyroscope flame graphs. Complete 30-resource blueprint.


As backends scale from simple monoliths to distributed microservices and serverless functions, diagnosing why a transaction failed or why a database request stalled becomes much harder. Traditional monitoring, which merely checks whether a service is online, is no longer sufficient.

Observability is the practice of inferring the internal state of a system from its external outputs (signals). As of 2026, the industry has largely converged on OpenTelemetry (OTel) for instrumentation, and the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir), often paired with Pyroscope for profiling, is a common choice for ingestion and analysis. This guide provides an architectural overview and a blueprint of 30 curated resources for learning modern observability.


Telemetry Signals: The Three Pillars and Beyond

Modern systems collect four core classes of telemetry data:

                  Telemetry Ingestion Pipeline
 ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
 │    METRICS    │  │     LOGS      │  │    TRACES     │  │   PROFILING   │
 └───────┬───────┘  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘
         │                  │                  │                  │
         └──────────────────┼──────────────────┼──────────────────┘
                            ▼
               [ OpenTelemetry Collector ]
                            │
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
 [ Prometheus / Mimir ]  [ Loki ]     [ Tempo / Jaeger ]
     (Metrics TSDB)    (Logs Index)     (Trace Spans)
  1. Metrics:
    • Numeric values aggregated over time (e.g., CPU utilization, HTTP request rates, active database connections).
    • Fast to query and cheap to store compared with logs and traces.
    • Leveraged to trigger instant Slack/PagerDuty alerts when thresholds are violated.
  2. Logs:
    • Timestamps and text payloads representing distinct, discrete application events.
    • Emitting logs as structured JSON is standard practice on modern platforms, because it lets downstream parsers query, index, and organize entries efficiently.
  3. Distributed Tracing:
    • Visual representation of a request's journey across service boundaries.
    • Maps out Spans (single operations such as a database query or an HTTP call) linked by a shared Trace ID, allowing you to pinpoint the service causing an API bottleneck.
  4. Continuous Profiling:
    • Measures resource usage (CPU, memory, threads) at the function level while the service runs in production.
    • Visualized via Flame Graphs to show the exact lines of code leaking memory or stalling threads.
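To make the span and Trace ID relationship concrete, here is a minimal standard-library sketch. The `Span` fields, the helper names, and the sample data are hypothetical and do not follow any tracing backend's schema; the sketch also assumes child spans run one after another, so overlapping children would be subtracted twice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per operation
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: list[Span]) -> float:
    """Time spent in the span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_span(trace_id: str, spans: list[Span]) -> Span:
    """Return the span with the largest self time in the given trace."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    return max(in_trace, key=lambda s: self_time_ms(s, in_trace))


# Hypothetical trace: gateway -> orders -> postgres
spans = [
    Span("t1", "a", None, "gateway", "GET /api/checkout", 0.0, 120.0),
    Span("t1", "b", "a", "orders", "create_order", 5.0, 115.0),
    Span("t1", "c", "b", "postgres", "INSERT orders", 10.0, 105.0),
]

bottleneck = slowest_span("t1", spans)
print(bottleneck.service, self_time_ms(bottleneck, spans))
```

Ranking by self time rather than total duration keeps the root span from always winning, since a parent's duration includes everything beneath it.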

1. Unified Collection with OpenTelemetry SDKs & Collector

Master OpenTelemetry SDK architectures and unified collectors with these 5 resources.

Subtopic Resources

Resource Name & Metadata | Access Category | Status & Skip Conditions
Observability Engineering by Charity Majors, Liz Fong-Jones, & George Miranda (O'Reilly): The definitive book on structured telemetry pipelines. | Book | Required
OpenTelemetry Fundamentals (Pluralsight Course): Hands-on video training detailing SDK integrations and collector modes. | Video Course | Required
OpenTelemetry Official Collector Documentation: Reference manual for configuring receivers, processors, and exporters. | Documentation | Required
OpenTelemetry Crash Course by TechWorld with Nana (YouTube): Visual walkthrough on tracing spans and collector configurations. | Video Stream | Required
OpenTelemetry Collector Pipeline Sandbox (StackBlitz): Interactive sandbox to test and validate pipeline processors. | Interactive Sandbox | Required

Resource Identification & Access

  • Observability Engineering
    • Direct URL: https://www.oreilly.com/library/view/observability-engineering/9781492097723/
    • Search Identification: Search O'Reilly for "Observability Engineering Charity Majors"
  • OpenTelemetry Fundamentals
    • Direct URL: https://www.pluralsight.com/courses/opentelemetry-fundamentals
    • Search Identification: Search Pluralsight for "OpenTelemetry Fundamentals"
  • OpenTelemetry Collector Documentation
    • Direct URL: https://opentelemetry.io/docs/collector/
    • Search Identification: Search OpenTelemetry Docs for "Collector Architecture receivers processors exporters"
  • OpenTelemetry Crash Course
    • Direct URL: https://www.youtube.com/watch?v=r8H46V41R6A
    • Search Identification: Search YouTube for "TechWorld with Nana OpenTelemetry Crash Course"
  • OpenTelemetry Collector Pipeline Sandbox
    • Direct URL: https://stackblitz.com/edit/opentelemetry-collector-sandbox
    • Search Identification: Search StackBlitz for "OpenTelemetry collector pipeline testing"

2. Prometheus Pull Architecture, TSDB Mechanics & PromQL

Master metrics collection and alerting with these 5 curated resources.

Subtopic Resources

Resource Name & Metadata | Access Category | Status & Skip Conditions
Prometheus: Up & Running by Julien Pivotto & Brian Brazil: In-depth book covering TSDB compression and PromQL syntax. | Book | Required
Prometheus & Grafana - The Complete Guide by Stephane Maarek (Udemy): Popular video guide on metric scraping and Alertmanager configurations. | Video Course | Required
Prometheus Query Language (PromQL) Guide (Prometheus): Official reference detailing range vectors, instant queries, and aggregations. | Documentation | Required
Prometheus Deep Dive: Storage Engine by Julius Volz: Lecture by a Prometheus co-founder explaining time-series index serialization. | Video Stream | Required
Interactive PromQL Exercises Playground (StackBlitz): Interactive sandbox containing raw metrics to run PromQL queries. | Interactive Sandbox | Alternative (skip if "Prometheus & Grafana - The Complete Guide" is completed)

Resource Identification & Access

  • Prometheus: Up & Running
    • Direct URL: https://www.oreilly.com/library/view/prometheus-up/9781492034131/
    • Search Identification: Search O'Reilly for "Prometheus Up and Running Brian Brazil"
  • Prometheus & Grafana - The Complete Guide
    • Direct URL: https://www.udemy.com/course/prometheus-grafana/
    • Search Identification: Search Udemy for "Prometheus and Grafana Stephane Maarek"
  • Prometheus Query Language (PromQL) Guide
    • Direct URL: https://prometheus.io/docs/prometheus/latest/querying/basics/
    • Search Identification: Search Prometheus Docs for "Querying basics PromQL vectors"
  • Prometheus Deep Dive: Storage Engine
    • Direct URL: https://www.youtube.com/watch?v=hTz1c80rVvQ
    • Search Identification: Search YouTube for "Julius Volz Prometheus architecture storage"
  • Interactive PromQL Exercises Playground
    • Direct URL: https://stackblitz.com/edit/promql-sandbox-exercises
    • Search Identification: Search StackBlitz for "PromQL interactive queries simulation"

3. Custom Grafana Dashboards & Dynamic Data Visualization

Learn to build dynamic, useful dashboards with these 5 resources.

Subtopic Resources

Resource Name & Metadata | Access Category | Status & Skip Conditions
Observability with Grafana by Rob Chapman & Peter Holmes: Practical handbook covering multi-tenant visual rendering patterns. | Book | Required
Grafana Learning Paths (Grafana Academy): Free, interactive official courses on panel styling and metrics mapping. | Video Course | Required
Grafana Panel Plugins & Variables Reference (Grafana Docs): Official documentation detailing template queries and variable inputs. | Documentation | Required
How to Build Professional Grafana Dashboards (YouTube): Grafana Labs video on alert thresholds and telemetry layouts. | Video Stream | Required
Grafana Live Playground (Grafana Play): Official public interactive sandbox featuring fully populated metrics. | Interactive Sandbox | Required

Resource Identification & Access

  • Observability with Grafana
    • Direct URL: https://www.manning.com/books/observability-with-grafana
    • Search Identification: Search Manning for "Observability with Grafana Chapman Holmes"
  • Grafana Learning Paths
    • Direct URL: https://grafana.com/tutorials/
    • Search Identification: Search Grafana for "Grafana tutorials dashboards getting started"
  • Grafana Panel Plugins & Variables Reference
    • Direct URL: https://grafana.com/docs/grafana/latest/panels-visualizations/
    • Search Identification: Search Grafana Docs for "Panels Visualizations variables query"
  • How to Build Professional Grafana Dashboards
    • Direct URL: https://www.youtube.com/watch?v=B9JbZ1Zl65U
    • Search Identification: Search YouTube for "Grafana Labs dynamic production dashboards"
  • Grafana Live Playground
    • Direct URL: https://play.grafana.org/
    • Search Identification: Search Web for "Grafana Play public sandbox server"

4. Distributed Tracing with OpenTelemetry, Tempo & Jaeger

Master context propagation and trace visualization with these 5 resources.

Subtopic Resources

Resource Name & Metadata | Access Category | Status & Skip Conditions
Mastering Distributed Tracing by Yuri Shkuro (Packt): Definitive book written by the creator of Jaeger. | Book | Required
Distributed Tracing Bootcamp (Udemy): Visual course tracing microservices across HTTP/gRPC boundaries. | Video Course | Required
Grafana Tempo Documentation (Grafana Docs): Technical manual covering high-performance, object-storage tracing. | Documentation | Required
Wtf is Distributed Tracing? Why do we need Spans? (YouTube): Humorous and insightful lecture by Charity Majors on spans and context. | Video Stream | Required
OpenTelemetry Jaeger Tracing Sandbox (StackBlitz): Interactive Node.js sandbox executing trace spans locally. | Interactive Sandbox | Alternative (skip if "Mastering Distributed Tracing" by Yuri Shkuro is completed)

Resource Identification & Access

  • Mastering Distributed Tracing
    • Direct URL: https://www.packtpub.com/product/mastering-distributed-tracing/9781788627856
    • Search Identification: Search Packt for "Mastering Distributed Tracing Yuri Shkuro"
  • Distributed Tracing Bootcamp
    • Direct URL: https://www.udemy.com/course/distributed-tracing/
    • Search Identification: Search Udemy for "Distributed Tracing Bootcamp Jaeger"
  • Grafana Tempo Documentation
    • Direct URL: https://grafana.com/docs/tempo/latest/
    • Search Identification: Search Grafana Tempo Docs for "Tracing storage formats"
  • Wtf is Distributed Tracing? Why do we need Spans?
    • Direct URL: https://www.youtube.com/watch?v=Yf1eZ029Jso
    • Search Identification: Search YouTube for "Charity Majors WTF is Distributed Tracing"
  • OpenTelemetry Jaeger Tracing Sandbox
    • Direct URL: https://stackblitz.com/edit/opentelemetry-jaeger-tracing-sandbox
    • Search Identification: Search StackBlitz for "OpenTelemetry Jaeger trace spans"

5. Structured Application Logging & Log Aggregation with Loki

Ingest, index, and query application logs at scale with these 5 resources.

Subtopic Resources

Resource Name & Metadata | Access Category | Status & Skip Conditions
Log Management in the Cloud (Packt Publishing): Book covering centralized cloud logging patterns and structured indices. | Book | Required
Grafana Loki: Modern Log Ingestion & Querying (Udemy): Practical video series demonstrating Loki agent routing configurations. | Video Course | Required
Grafana Loki: LogQL Reference (Grafana Docs): Official LogQL syntax parser guides and dynamic formatting tutorials. | Documentation | Required
Like Prometheus, But for Logs: Loki Architecture (YouTube): Official architectural breakdown video explaining log metadata labels. | Video Stream | Required
Loki LogQL Query Sandbox (StackBlitz): Interactive sandbox template validating LogQL queries locally. | Interactive Sandbox | Required

Resource Identification & Access

  • Log Management in the Cloud
    • Direct URL: https://www.packtpub.com/product/log-management-in-the-cloud/9781801815123
    • Search Identification: Search Packt for "Log Management in the Cloud"
  • Grafana Loki: Modern Log Ingestion
    • Direct URL: https://www.udemy.com/course/grafana-loki/
    • Search Identification: Search Udemy for "Grafana Loki log ingestion"
  • Grafana Loki: LogQL Reference
    • Direct URL: https://grafana.com/docs/loki/latest/logql/
    • Search Identification: Search Grafana Loki Docs for "LogQL query guide"
  • Like Prometheus, But for Logs: Loki Architecture
    • Direct URL: https://www.youtube.com/watch?v=Vl03qGpyE7A
    • Search Identification: Search YouTube for "Grafana Labs Loki Architecture Logs"
  • Loki LogQL Query Sandbox
    • Direct URL: https://stackblitz.com/edit/grafana-loki-logql-sandbox
    • Search Identification: Search StackBlitz for "Loki LogQL query exercises"

6. Continuous Profiling Internals (Grafana Pyroscope)

Find CPU hot paths and memory leaks in production with these 5 resources.

Subtopic Resources

Resource Name & Metadata | Access Category | Status & Skip Conditions
Systems Performance by Brendan Gregg (O'Reilly): The master reference book for performance engineering and tracing. | Book | Required
Performance Engineering & Continuous Profiling (LinkedIn): Video track demonstrating CPU profiling and memory leak tracing. | Video Course | Required
Grafana Pyroscope: Continuous Profiling (Grafana Docs): Official integration guide covering continuous stack tracing agents. | Documentation | Required
Continuous Profiling: The Fourth Pillar of Observability (YouTube): Grafana Labs presentation detailing flame graph performance overlays. | Video Stream | Required
Profiling Python & Node applications using Pyroscope (StackBlitz): Interactive sandbox generating real flame graphs from CPU load. | Interactive Sandbox | Alternative (skip if "Systems Performance" by Brendan Gregg is completed)

Resource Identification & Access

  • Systems Performance
    • Direct URL: https://www.oreilly.com/library/view/systems-performance-2nd/9780136821694/
    • Search Identification: Search O'Reilly for "Systems Performance Brendan Gregg"
  • Performance Engineering & Continuous Profiling
    • Direct URL: https://www.linkedin.com/learning/performance-engineering-and-continuous-profiling
    • Search Identification: Search LinkedIn Learning for "Continuous Profiling Pyroscope"
  • Grafana Pyroscope: Continuous Profiling
    • Direct URL: https://grafana.com/docs/pyroscope/latest/
    • Search Identification: Search Grafana Pyroscope Docs for "Flame graphs continuous profiling"
  • Continuous Profiling: The Fourth Pillar
    • Direct URL: https://www.youtube.com/watch?v=F3a7dZpB17k
    • Search Identification: Search YouTube for "Grafana Labs Pyroscope Continuous Profiling"
  • Profiling Python & Node applications
    • Direct URL: https://stackblitz.com/edit/pyroscope-continuous-profiling-sandbox
    • Search Identification: Search StackBlitz for "Grafana Pyroscope flame graphs node"

Portfolio Project Lab: Instrumented FastAPI Microservice

Objective

Build an instrumented FastAPI microservice that uses the OpenTelemetry SDK to record transaction metrics, create trace spans, and write structured JSON application logs. The lab exports to the console so that you can validate each signal locally before pointing it at a real backend.

1. Project Dependencies

Create a requirements.txt containing the necessary OpenTelemetry and application libraries:

fastapi==0.110.0
uvicorn==0.28.0
opentelemetry-api==1.23.0
opentelemetry-sdk==1.23.0
opentelemetry-instrumentation-fastapi==0.44b0
opentelemetry-exporter-otlp-proto-grpc==1.23.0
python-json-logger==2.0.7

2. Structured JSON Logger & Telemetry SDK Configuration

Save this file as logger_config.py in your workspace directory:

import logging
import sys
from pythonjsonlogger import jsonlogger

def setup_logger():
    logger = logging.getLogger("api_logger")
    logger.setLevel(logging.INFO)

    # Avoid attaching a second handler if setup_logger() is called again
    if logger.handlers:
        return logger

    # In production, logs must write to standard output in clean JSON.
    # StreamHandler defaults to stderr, so pass stdout explicitly.
    log_handler = logging.StreamHandler(sys.stdout)
    formatter = jsonlogger.JsonFormatter(
        '%(asctime)s %(levelname)s %(message)s %(trace_id)s %(span_id)s'
    )
    log_handler.setFormatter(formatter)
    logger.addHandler(log_handler)
    return logger

logger = setup_logger()
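The format string above expects `trace_id` and `span_id` on every record, but a record only carries them when the caller passes `extra=`. One way to fill them in centrally is a `logging.Filter`. The sketch below uses only the standard library: `current_ids` is a hypothetical stand-in for the tracing SDK's current-span lookup, and `MiniJsonFormatter` is a throwaway formatter so the example runs without python-json-logger.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the tracing SDK's "current span" lookup.
current_ids = contextvars.ContextVar(
    "current_ids", default=("0" * 32, "0" * 16)
)


class TraceContextFilter(logging.Filter):
    """Attach trace_id/span_id to records that do not already carry them."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_ids.get()
        if not hasattr(record, "trace_id"):
            record.trace_id = trace_id
        if not hasattr(record, "span_id"):
            record.span_id = span_id
        return True  # never drop the record


class MiniJsonFormatter(logging.Formatter):
    """Tiny JSON formatter used only for this demonstration."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "levelname": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(MiniJsonFormatter())
handler.addFilter(TraceContextFilter())

demo_logger = logging.getLogger("filter_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

# Simulate being inside a traced request, then log without extra=
current_ids.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
demo_logger.info("order saved")

entry = json.loads(stream.getvalue())
print(entry["trace_id"], entry["span_id"])
```

In logger_config.py the equivalent hookup would presumably be a call to `log_handler.addFilter(...)` inside `setup_logger()`, with the filter reading the IDs from the active OpenTelemetry span instead of a context variable.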

3. Fully Instrumented Core Application

Save this code block as main.py. It implements automatic trace context propagation and metric generation:

import asyncio
import random
import time
from fastapi import FastAPI, Request
from logger_config import logger

# Import OpenTelemetry core APIs
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export import ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# 1. Initialize Tracing Engine & Console Exporters for local validation
provider = TracerProvider()
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("fastapi-service-tracer")

# 2. Initialize Metrics Engine & Console Exporters for checking counts
metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("fastapi-service-metrics")

# Declare a custom counter metric to track transaction volumes
order_counter = meter.create_counter(
    name="api_orders_processed_total",
    description="Total volume of processed client checkout orders",
    unit="1"
)

# 3. Initialize FastAPI Application
app = FastAPI(title="Instrumented API Service")

# Inject global middleware to intercept requests, measure latency, and log details
@app.middleware("http")
async def add_telemetry_headers(request: Request, call_next):
    # perf_counter() is monotonic, so it is safer than time.time() for durations
    start_time = time.perf_counter()

    # get_current_span() never returns None. Outside a trace it returns a
    # non-recording span with an invalid context, so check is_valid instead.
    span_context = trace.get_current_span().get_span_context()
    if span_context.is_valid:
        trace_id = format(span_context.trace_id, '032x')
        span_id = format(span_context.span_id, '016x')
    else:
        trace_id = span_id = "0"

    # Pass execution to the next handler
    response = await call_next(request)

    duration = time.perf_counter() - start_time

    # Log structured transaction telemetry in clean JSON
    logger.info(
        "HTTP Request Processed",
        extra={
            "http_method": request.method,
            "http_path": request.url.path,
            "http_status": response.status_code,
            "duration_seconds": duration,
            "trace_id": trace_id,
            "span_id": span_id
        }
    )

    # Inject trace headers to HTTP responses for client troubleshooting
    response.headers["X-Trace-ID"] = trace_id
    return response

@app.get("/api/checkout")
async def checkout():
    # Wrap database logic inside a dedicated custom sub-span
    with tracer.start_as_current_span("database_save_transaction") as span:
        # Simulate dynamic processing latency. Use asyncio.sleep here:
        # time.sleep would block the event loop for every other request.
        latency = random.uniform(0.01, 0.15)
        await asyncio.sleep(latency)

        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "INSERT INTO orders (total) VALUES (99)")

    # Increment our custom OpenTelemetry counter (exported to the console in this lab)
    order_counter.add(1, {"status": "success"})
    
    return {"status": "Order Placed Successfully"}

# Auto-instrument FastAPI routes
FastAPIInstrumentor.instrument_app(app)

To run this instrumented service locally:

pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8080

Send a request to the checkout endpoint:

curl http://localhost:8080/api/checkout

As soon as the request completes, the console shows the trace spans and a structured JSON log line carrying the injected trace and span IDs. The counter metric appears when the metric reader next exports, which by default is every 60 seconds.


Common Observability Interview Questions

1. Explain the difference between Logs, Metrics, and Distributed Tracing.

  • Answer:
    • Metrics are numeric aggregations over time windows (e.g. request count). They are lightweight, fast to query, and cheap to store, making them ideal for high-level alerting.
    • Logs are timestamped strings or structured objects representing discrete application events. They provide rich details but are expensive to store.
    • Distributed Tracing tracks requests across system boundaries by passing context headers (Trace IDs). It reveals how services depend on each other for a given request and pinpoints which one is the bottleneck.
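As an illustration of "numeric aggregations over time windows", the sketch below turns raw counter samples into a per-second rate. It is a simplified approximation of the idea behind PromQL's rate(), not a reimplementation: it handles counter resets but does no extrapolation to window boundaries. The function name and sample values are hypothetical.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase of a monotonically increasing counter.

    samples: (unix_timestamp, counter_value) pairs ordered by time.
    A drop in value is treated as a counter reset (process restart),
    so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 seconds; the process restarts before the last one
scrapes = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 20.0)]
print(per_second_rate(scrapes))
```

Four small samples are enough to answer "how many requests per second", which is why metrics stay cheap where the equivalent log volume would not.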

2. How does OpenTelemetry context propagation work across HTTP boundaries?

  • Answer: Context propagation passes metadata (Trace ID, Span ID) across service boundaries by injecting key-value pairs into HTTP headers, following a standard such as W3C Trace Context. The sending service injects a header like traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01, whose four fields are the version, the trace ID, the parent span ID, and the trace flags. The receiving service reads this header, extracts the IDs, and creates its local spans as children of that parent within the same trace.
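The extract and inject steps can be sketched by hand with the standard library. This is only an illustration of the version 00 header layout; in a real service the OpenTelemetry propagators do this for you. The names `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), lowercase
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Extract step: return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent, new_span_id: str) -> str:
    """Inject step: same trace ID, but this service's span becomes the parent."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx, "b7ad6b7169203331")
print(ctx.trace_id, ctx.sampled)
print(outgoing)
```

The trace ID is unchanged from hop to hop, while the parent ID is replaced by the span ID of whichever service makes the next outbound call.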

3. What is Continuous Profiling, and why is it valuable compared to standard metrics?

  • Answer: Continuous profiling samples resource usage (CPU time, heap memory allocations, thread counts) at the function and code-line level in production, with low enough overhead to leave running all the time. Standard metrics tell you that CPU usage is high; a continuous profiler shows which functions and lines are responsible, visualized via Flame Graphs.
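The data behind a flame graph is simple to sketch: a sampling profiler records call stacks, identical stacks are collapsed into counts, and the width of each frame is the number of samples in which it appears. The function names and samples below are hypothetical, and this illustrates the aggregation only, not how Pyroscope stores profiles.

```python
from collections import Counter


def fold_stacks(samples: list[list[str]]) -> Counter:
    """Collapse sampled call stacks (root first) into 'a;b;c' -> sample count."""
    return Counter(";".join(stack) for stack in samples)


def inclusive_counts(folded: Counter) -> Counter:
    """Samples in which each function appears anywhere on the stack."""
    totals: Counter = Counter()
    for stack, count in folded.items():
        # set() ensures a recursive function is counted once per sample
        for func in set(stack.split(";")):
            totals[func] += count
    return totals


# Hypothetical CPU samples captured at a fixed interval
samples = [
    ["main", "handle_request", "serialize_json"],
    ["main", "handle_request", "serialize_json"],
    ["main", "handle_request", "query_db"],
    ["main", "gc"],
]

folded = fold_stacks(samples)
widths = inclusive_counts(folded)

for stack, count in sorted(folded.items()):
    print(stack, count)
```

Here `serialize_json` is on the stack in two of four samples, so it would span half the width of the graph, which is exactly the signal a CPU-utilization metric cannot give you.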

Next Steps

Now that you have learned to monitor and profile distributed systems, we will integrate these checks into automated release workflows.

Proceed to Part 19: CI/CD Pipelines →
