Career Guide 29 May 2026 14 min read

Part 18: Observability - Prometheus, Grafana & OpenTelemetry

Master OpenTelemetry SDK instrumentation, Prometheus pull architectures, Grafana dashboard panels, Tempo spans, Loki LogQL, and Pyroscope flame graphs. Complete 30-resource blueprint.

By Chirag Singhal

As backends scale from simple monoliths to distributed microservices and serverless functions, diagnosing why a transaction failed or why a database request stalled gets much harder, because a single request may cross many processes. Traditional "monitoring", which only checks whether a service is online, is no longer sufficient.

Observability is the practice of inferring the internal state of a system from its external outputs (signals). In 2026, the ecosystem has largely converged on OpenTelemetry (OTel) for instrumentation and, for ingestion and analysis, the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) with Pyroscope added for profiling. This guide provides an architectural breakdown and a blueprint of 30 curated resources for mastering modern observability.

Telemetry Signals: The Three Pillars and Beyond

Modern systems collect four core classes of telemetry data:

                  Telemetry Ingestion Pipeline
 ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
 │    METRICS    │  │     LOGS      │  │    TRACES     │  │   PROFILING   │
 └───────┬───────┘  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘
         │                  │                  │                  │
         └──────────────────┼──────────────────┼──────────────────┘
                            ▼
               [ OpenTelemetry Collector ]
                            │
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
 [ Prometheus / Mimir ]  [ Loki ]     [ Tempo / Jaeger ]
     (Metrics TSDB)    (Logs Index)     (Trace Spans)

Metrics:
- Numeric values aggregated over time (e.g., CPU utilization, HTTP request rates, active database connections).
- Fast to query and cheap to store, because each series is just a stream of timestamped numbers.
- Leveraged to trigger instant Slack/PagerDuty alerts when thresholds are violated.
Logs:
- Timestamps and text payloads representing distinct, discrete application events.
- Emitting logs as structured JSON is the norm on modern platforms, because it lets downstream parsers query, index, and organize entries by field rather than by pattern-matching free text.
Distributed Tracing:
- Visual representation of a request's journey across service boundaries.
- Maps out Spans (single operations inside a database or HTTP call) linked by a shared Trace ID, allowing you to pinpoint the exact service causing an API bottleneck.
Continuous Profiling:
- Continuously samples resource usage (CPU, memory, threads) at the function level in production.
- Visualized via Flame Graphs to show the exact lines of code leaking memory or stalling threads.
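
The tracing bullets above can be made concrete with a small model. The sketch below is not the OpenTelemetry data model: the `Span` dataclass, the service names, and the timings are all hypothetical, and it assumes child spans run one after another. It only shows how spans that share a Trace ID are grouped and how the bottleneck is found by subtracting child time from each span.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not the OpenTelemetry type)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    duration_ms: float


def self_time_ms(span: Span, spans: List[Span]) -> float:
    """Duration of a span minus the time spent in its direct children."""
    children = [
        s for s in spans
        if s.trace_id == span.trace_id and s.parent_id == span.span_id
    ]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_span(trace_id: str, spans: List[Span]) -> Span:
    """Return the span with the largest self time within one trace."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    return max(in_trace, key=lambda s: self_time_ms(s, in_trace))


# Illustrative data: one request crossing three made-up services.
SPANS = [
    Span("t1", "a", None, "gateway", "GET /api/checkout", 180.0),
    Span("t1", "b", "a", "orders", "create_order", 160.0),
    Span("t1", "c", "b", "orders-db", "INSERT orders", 140.0),
    Span("t2", "d", None, "gateway", "GET /health", 2.0),
]

bottleneck = slowest_span("t1", SPANS)
print(bottleneck.service, bottleneck.name)
```

The gateway span has the longest total duration, but almost all of it is spent waiting on its children, which is why the comparison uses self time rather than raw duration.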

1. Unified Collection with OpenTelemetry SDKs & Collector

Master OpenTelemetry SDK architectures and unified collectors with these 5 resources.

Subtopic Resources

Resource Name & Metadata	Access Category	Status & Skip Conditions
Observability Engineering by Charity Majors, Liz Fong-Jones, & George Miranda (O'Reilly): The definitive book on structured telemetry pipelines.	Book	Required
OpenTelemetry Fundamentals (Pluralsight Course): Hands-on video training detailing SDK integrations and collector modes.	Video Course	Required
OpenTelemetry Official Collector Documentation: Reference manual for configuring receivers, processors, and exporters.	Documentation	Required
OpenTelemetry Crash Course by TechWorld with Nana (YouTube): Visual walkthrough on tracing spans and collector configurations.	Video Stream	Required
OpenTelemetry Collector Pipeline Sandbox (StackBlitz): Interactive sandbox to test and validate pipeline processors.	Interactive Sandbox	Required

Resource Identification & Access

Observability Engineering
- Direct URL: https://www.oreilly.com/library/view/observability-engineering/9781492097723/
- Search Identification: Search O'Reilly for "Observability Engineering Charity Majors"
OpenTelemetry Fundamentals
- Direct URL: https://www.pluralsight.com/courses/opentelemetry-fundamentals
- Search Identification: Search Pluralsight for "OpenTelemetry Fundamentals"
OpenTelemetry Collector Documentation
- Direct URL: https://opentelemetry.io/docs/collector/
- Search Identification: Search OpenTelemetry Docs for "Collector Architecture receivers processors exporters"
OpenTelemetry Crash Course
- Direct URL: https://www.youtube.com/watch?v=r8H46V41R6A
- Search Identification: Search YouTube for "TechWorld with Nana OpenTelemetry Crash Course"
OpenTelemetry Collector Pipeline Sandbox
- Direct URL: https://stackblitz.com/edit/opentelemetry-collector-sandbox
- Search Identification: Search StackBlitz for "OpenTelemetry collector pipeline testing"

2. Prometheus Pull Architecture, TSDB Mechanics & PromQL

Master metrics collection and alerting with these 5 curated resources.

Subtopic Resources

Resource Name & Metadata	Access Category	Status & Skip Conditions
Prometheus: Up & Running by Julien Pivotto & Brian Brazil: In-depth book covering TSDB compression and PromQL syntax.	Book	Required
Prometheus & Grafana - The Complete Guide by Stephane Maarek (Udemy): Popular video guide on metric scraping and alertmanager configurations.	Video Course	Required
Prometheus Query Language (PromQL) Guide (Prometheus): Official reference detailing range vectors, instant queries, and aggregations.	Documentation	Required
Prometheus Deep Dive: Storage Engine by Julius Volz: Core co-founder lecture explaining time-series index serialization.	Video Stream	Required
Interactive PromQL Exercises Playground (StackBlitz): Interactive sandbox containing raw metrics to run PromQL queries.	Interactive Sandbox	Alternative (Skip if "Prometheus & Grafana - The Complete Guide" is completed)

Resource Identification & Access

Prometheus: Up & Running
- Direct URL: https://www.oreilly.com/library/view/prometheus-up/9781492034131/
- Search Identification: Search O'Reilly for "Prometheus Up and Running Brian Brazil"
Prometheus & Grafana - The Complete Guide
- Direct URL: https://www.udemy.com/course/prometheus-grafana/
- Search Identification: Search Udemy for "Prometheus and Grafana Stephane Maarek"
Prometheus Query Language (PromQL) Guide
- Direct URL: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Search Identification: Search Prometheus Docs for "Querying basics PromQL vectors"
Prometheus Deep Dive: Storage Engine
- Direct URL: https://www.youtube.com/watch?v=hTz1c80rVvQ
- Search Identification: Search YouTube for "Julius Volz Prometheus architecture storage"
Interactive PromQL Exercises Playground
- Direct URL: https://stackblitz.com/edit/promql-sandbox-exercises
- Search Identification: Search StackBlitz for "PromQL interactive queries simulation"

3. Custom Grafana Dashboards & Dynamic Data Visualization

Learn to build dynamic, useful dashboards with these 5 resources.

Subtopic Resources

Resource Name & Metadata	Access Category	Status & Skip Conditions
Observability with Grafana by Rob Chapman & Peter Holmes: Practical handbook covering multi-tenant visual rendering patterns.	Book	Required
Grafana Learning Paths (Grafana Academy): Free, interactive official courses on panel styling and metrics mapping.	Video Course	Required
Grafana Panel Plugins & Variables Reference (Grafana Docs): Official documentation detailing template queries and variable inputs.	Documentation	Required
How to Build Professional Grafana Dashboards (YouTube): Grafana Labs video on alert thresholds and telemetry layouts.	Video Stream	Required
Grafana Live Playground (Grafana Play): Official public interactive sandbox featuring fully populated metrics.	Interactive Sandbox	Required

Resource Identification & Access

Observability with Grafana
- Direct URL: https://www.manning.com/books/observability-with-grafana
- Search Identification: Search Manning for "Observability with Grafana Chapman Holmes"
Grafana Learning Paths
- Direct URL: https://grafana.com/tutorials/
- Search Identification: Search Grafana for "Grafana tutorials dashboards getting started"
Grafana Panel Plugins & Variables Reference
- Direct URL: https://grafana.com/docs/grafana/latest/panels-visualizations/
- Search Identification: Search Grafana Docs for "Panels Visualizations variables query"
How to Build Professional Grafana Dashboards
- Direct URL: https://www.youtube.com/watch?v=B9JbZ1Zl65U
- Search Identification: Search YouTube for "Grafana Labs dynamic production dashboards"
Grafana Live Playground
- Direct URL: https://play.grafana.org/
- Search Identification: Search Web for "Grafana Play public sandbox server"

4. Distributed Tracing with OpenTelemetry, Tempo & Jaeger

Master context propagation and trace visualization with these 5 resources.

Subtopic Resources

Resource Name & Metadata	Access Category	Status & Skip Conditions
Mastering Distributed Tracing by Yuri Shkuro (Packt): Definitive book written by the creator of Jaeger.	Book	Required
Distributed Tracing Bootcamp (Udemy): Visual course tracing microservices across HTTP/gRPC boundaries.	Video Course	Required
Grafana Tempo Documentation (Grafana Docs): Technical manual covering high-performance, object-storage tracing.	Documentation	Required
Wtf is Distributed Tracing? Why do we need Spans? (YouTube): Humorous and insightful lecture by Charity Majors on spans and context.	Video Stream	Required
OpenTelemetry Jaeger Tracing Sandbox (StackBlitz): Interactive Node.js sandbox executing trace spans locally.	Interactive Sandbox	Alternative (Skip if "Mastering Distributed Tracing" by Yuri Shkuro is completed)

Resource Identification & Access

Mastering Distributed Tracing
- Direct URL: https://www.packtpub.com/product/mastering-distributed-tracing/9781788627856
- Search Identification: Search Packt for "Mastering Distributed Tracing Yuri Shkuro"
Distributed Tracing Bootcamp
- Direct URL: https://www.udemy.com/course/distributed-tracing/
- Search Identification: Search Udemy for "Distributed Tracing Bootcamp Jaeger"
Grafana Tempo Documentation
- Direct URL: https://grafana.com/docs/tempo/latest/
- Search Identification: Search Grafana Tempo Docs for "Tracing storage formats"
Wtf is Distributed Tracing? Why do we need Spans?
- Direct URL: https://www.youtube.com/watch?v=Yf1eZ029Jso
- Search Identification: Search YouTube for "Charity Majors WTF is Distributed Tracing"
OpenTelemetry Jaeger Tracing Sandbox
- Direct URL: https://stackblitz.com/edit/opentelemetry-jaeger-tracing-sandbox
- Search Identification: Search StackBlitz for "OpenTelemetry Jaeger trace spans"

5. Structured Application Logging & Log Aggregation with Loki

Ingest, index, and query application logs at scale with these 5 resources.

Subtopic Resources

Resource Name & Metadata	Access Category	Status & Skip Conditions
Log Management in the Cloud (Packt Publishing): Book covering centralized cloud logging patterns and structured indices.	Book	Required
Grafana Loki: Modern Log Ingestion & Querying (Udemy): Practical video series demonstrating Loki agent routing configurations.	Video Course	Required
Grafana Loki: LogQL Reference (Grafana Docs): Official LogQL syntax parser guides and dynamic formatting tutorials.	Documentation	Required
Like Prometheus, But for Logs: Loki Architecture (YouTube): Official architectural breakdown video explaining log metadata labels.	Video Stream	Required
Loki LogQL Query Sandbox (StackBlitz): Interactive sandbox template validating LogQL queries locally.	Interactive Sandbox	Required

Resource Identification & Access

Log Management in the Cloud
- Direct URL: https://www.packtpub.com/product/log-management-in-the-cloud/9781801815123
- Search Identification: Search Packt for "Log Management in the Cloud"
Grafana Loki: Modern Log Ingestion
- Direct URL: https://www.udemy.com/course/grafana-loki/
- Search Identification: Search Udemy for "Grafana Loki log ingestion"
Grafana Loki: LogQL Reference
- Direct URL: https://grafana.com/docs/loki/latest/logql/
- Search Identification: Search Grafana Loki Docs for "LogQL query guide"
Like Prometheus, But for Logs: Loki Architecture
- Direct URL: https://www.youtube.com/watch?v=Vl03qGpyE7A
- Search Identification: Search YouTube for "Grafana Labs Loki Architecture Logs"
Loki LogQL Query Sandbox
- Direct URL: https://stackblitz.com/edit/grafana-loki-logql-sandbox
- Search Identification: Search StackBlitz for "Loki LogQL query exercises"

6. Continuous Profiling Internals (Grafana Pyroscope)

Find CPU hot paths and memory leaks in production with these 5 resources.

Subtopic Resources

Resource Name & Metadata	Access Category	Status & Skip Conditions
Systems Performance by Brendan Gregg (O'Reilly): The master reference book for performance engineering and tracing.	Book	Required
Performance Engineering & Continuous Profiling (LinkedIn): Video track demonstrating CPU profiling and memory leak tracing.	Video Course	Required
Grafana Pyroscope: Continuous Profiling (Grafana Docs): Official integration guide covering continuous stack tracing agents.	Documentation	Required
Continuous Profiling: The Fourth Pillar of Observability (YouTube): Grafana Labs presentation detailing flame graph performance overlays.	Video Stream	Required
Profiling Python & Node applications using Pyroscope (StackBlitz): Interactive sandbox generating real flame graphs from CPU load.	Interactive Sandbox	Alternative (Skip if "Systems Performance" by Brendan Gregg is completed)

Resource Identification & Access

Systems Performance
- Direct URL: https://www.oreilly.com/library/view/systems-performance-2nd/9780136821694/
- Search Identification: Search O'Reilly for "Systems Performance Brendan Gregg"
Performance Engineering & Continuous Profiling
- Direct URL: https://www.linkedin.com/learning/performance-engineering-and-continuous-profiling
- Search Identification: Search LinkedIn Learning for "Continuous Profiling Pyroscope"
Grafana Pyroscope: Continuous Profiling
- Direct URL: https://grafana.com/docs/pyroscope/latest/
- Search Identification: Search Grafana Pyroscope Docs for "Flame graphs continuous profiling"
Continuous Profiling: The Fourth Pillar
- Direct URL: https://www.youtube.com/watch?v=F3a7dZpB17k
- Search Identification: Search YouTube for "Grafana Labs Pyroscope Continuous Profiling"
Profiling Python & Node applications
- Direct URL: https://stackblitz.com/edit/pyroscope-continuous-profiling-sandbox
- Search Identification: Search StackBlitz for "Grafana Pyroscope flame graphs node"

Portfolio Project Lab: Instrumented FastAPI Microservice

Objective

Create a fully instrumented, production-grade FastAPI microservice utilizing OpenTelemetry SDKs to collect transaction metrics, propagate distributed tracing, and write structured JSON application logs.

1. Project Dependencies

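Create a requirements.txt containing the necessary OpenTelemetry and application libraries. The OTLP gRPC exporter is listed so the service can later ship telemetry to a Collector; the lab code below uses only the console exporters: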

fastapi==0.110.0
uvicorn==0.28.0
opentelemetry-api==1.23.0
opentelemetry-sdk==1.23.0
opentelemetry-instrumentation-fastapi==0.44b0
opentelemetry-exporter-otlp-proto-grpc==1.23.0
python-json-logger==2.0.7

2. Structured JSON Logger Configuration

Save this file as logger_config.py in your workspace directory:

import logging
from pythonjsonlogger import jsonlogger

def setup_logger():
    logger = logging.getLogger("api_logger")
    logger.setLevel(logging.INFO)
    
    # In production, logs must write to standard output in clean JSON
    log_handler = logging.StreamHandler()
    formatter = jsonlogger.JsonFormatter(
        '%(asctime)s %(levelname)s %(message)s %(trace_id)s %(span_id)s'
    )
    log_handler.setFormatter(formatter)
    logger.addHandler(log_handler)
    return logger

logger = setup_logger()
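
The formatter above reserves trace_id and span_id fields, but they are only filled in when each logging call passes them through `extra`. One possible way to avoid repeating that at every call site is a `logging.Filter` that stamps every record. The sketch below uses only the standard library so it can run on its own: the `current_trace_id` context variable is a hypothetical stand-in for the active OpenTelemetry span context, and `MiniJsonFormatter` is a throwaway substitute for python-json-logger.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the active OpenTelemetry span context.
current_trace_id = contextvars.ContextVar("current_trace_id", default="0" * 32)


class TraceContextFilter(logging.Filter):
    """Copy the trace ID from the context variable onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


class MiniJsonFormatter(logging.Formatter):
    """Tiny JSON formatter so the sketch needs no third-party package."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


stream = io.StringIO()  # captured in memory so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(MiniJsonFormatter())
handler.addFilter(TraceContextFilter())

demo_logger = logging.getLogger("trace_filter_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root logger
demo_logger.addHandler(handler)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
demo_logger.info("order saved")  # no extra= needed at the call site

entry = json.loads(stream.getvalue())
print(entry)
```

In the real service the filter would read the IDs from the current span instead of a context variable; the enrichment pattern stays the same.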

3. Fully Instrumented Core Application

Save this code block as main.py. It implements automatic trace context propagation and metric generation:

import time
import random
from fastapi import FastAPI, Request
from logger_config import logger

# Import OpenTelemetry core APIs
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export import ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# 1. Initialize Tracing Engine & Console Exporters for local validation
provider = TracerProvider()
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("fastapi-service-tracer")

# 2. Initialize Metrics Engine & Console Exporters for checking counts
metric_reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))
meter = metrics.get_meter("fastapi-service-metrics")

# Declare a custom counter metric to track transaction volumes
order_counter = meter.create_counter(
    name="api_orders_processed_total",
    description="Total volume of processed client checkout orders",
    unit="1"
)

# 3. Initialize FastAPI Application
app = FastAPI(title="Instrumented API Service")

# Inject global middleware to intercept requests, measure latency, and log details
@app.middleware("http")
async def add_telemetry_headers(request: Request, call_next):
    start_time = time.perf_counter()  # monotonic clock, unaffected by system clock changes
    
    # Retrieve the active span's context. get_current_span() never returns None,
    # so validity is checked on the span context rather than by truthiness
    span_context = trace.get_current_span().get_span_context()
    trace_id = format(span_context.trace_id, '032x') if span_context.is_valid else "0"
    span_id = format(span_context.span_id, '016x') if span_context.is_valid else "0"
    
    # Pass execution to the next handler
    response = await call_next(request)
    
    duration = time.perf_counter() - start_time
    
    # Log structured transaction telemetry metrics in clean JSON
    logger.info(
        "HTTP Request Processed",
        extra={
            "http_method": request.method,
            "http_path": request.url.path,
            "http_status": response.status_code,
            "duration_seconds": duration,
            "trace_id": trace_id,
            "span_id": span_id
        }
    )
    
    # Inject trace headers to HTTP responses for client troubleshooting
    response.headers["X-Trace-ID"] = trace_id
    return response

@app.get("/api/checkout")
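def checkout():
    # Declared with plain def so FastAPI runs it in a worker thread; the blocking
    # time.sleep below would stall the event loop inside an async def handler
    # Wrap database logic inside a dedicated custom sub-span
    with tracer.start_as_current_span("database_save_transaction") as span:
        # Simulate dynamic processing latency
        latency = random.uniform(0.01, 0.15)
        time.sleep(latency)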
        
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "INSERT INTO orders (total) VALUES (99)")
        
    # Increment the custom OpenTelemetry counter (exported to the console in this lab)
    order_counter.add(1, {"status": "success"})
    
    return {"status": "Order Placed Successfully"}

# Auto-instrument FastAPI routes
FastAPIInstrumentor.instrument_app(app)

To run this instrumented service locally:

pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8080

Perform an API hit:

curl http://localhost:8080/api/checkout

Trace spans and structured JSON logs, complete with injected Trace and Span IDs, appear on the console immediately. The counter metric shows up the next time the PeriodicExportingMetricReader exports, which by default happens every 60 seconds.

Common Observability Interview Questions

1. Explain the difference between Logs, Metrics, and Distributed Tracing.

Answer:
- Metrics are numeric aggregations over time windows (e.g. request count). They are lightweight, fast to query, and cheap to store, making them ideal for high-level alerting.
- Logs are timestamped strings or structured objects representing discrete application events. They provide rich details but are expensive to store.
- Distributed Tracing tracks requests across system boundaries by passing context headers (Trace IDs). It shows dynamic dependency maps and pinpoints service bottlenecks.
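
To illustrate what "numeric aggregations over time windows" means in practice, the sketch below turns cumulative counter samples into a per-second rate and treats a drop in value as a counter reset. It is a simplified illustration loosely modelled on what PromQL's rate() does; the real function also extrapolates to the window boundaries, which is omitted here. The sample values are made up.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, cumulative_counter_value)


def simple_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase across the samples, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from zero.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Made-up scrape results taken 15 seconds apart; the counter resets after the third.
scrapes = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 10.0), (60.0, 40.0)]
print(simple_rate(scrapes))
```

Five numbers summarize a minute of traffic here, which is why metrics stay cheap where the equivalent log lines would not.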

2. How does OpenTelemetry context propagation work across HTTP boundaries?

Answer: Context propagation passes metadata (Trace ID, Span ID) across service boundaries by injecting specific key-value pairs into HTTP headers (using standards like W3C Trace Context). The sending service injects headers like traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. The receiving service reads this header, extracts the IDs, and binds its local spans to that parent trace.
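
The header layout described in this answer can be parsed with a few lines of standard-library Python. This is a sketch for interview-style explanation, not a replacement for the SDK's propagators: it accepts only the four-field version 00 layout, and the names `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00).
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or unsupported."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Sketch limitation: only version 00 is handled. All-zero IDs are invalid.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent, new_span_id: str) -> str:
    """Header a service sends downstream: same trace ID, its own span ID."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
outgoing = child_traceparent(parent, "b7ad6b7169203331")  # made-up span ID
print(outgoing)
```

The trace ID is carried through unchanged while the span ID is replaced at each hop, which is what lets a backend stitch the spans from every service into one trace.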

3. What is Continuous Profiling, and why is it valuable compared to standard metrics?

Answer: Continuous profiling samples resource usage (CPU time, heap memory allocations, thread counts) at the function and code-line level in production with minimal overhead. While standard metrics tell you that CPU usage is high, continuous profiling points you to the code responsible for the bottleneck, visualized via Flame Graphs.
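
A flame graph is built from many sampled call stacks. The sketch below shows the aggregation step only, assuming a profiler has already captured stacks at a fixed interval: identical stacks are counted into the semicolon-separated "folded" form commonly fed to flame graph tools, and leaf frames are counted to find the hottest function. The stack samples and function names are made up.

```python
from collections import Counter
from typing import Iterable, Tuple

Stack = Tuple[str, ...]  # outermost frame first, e.g. ("main", "handler", "query_db")


def fold_stacks(samples: Iterable[Stack]) -> Counter:
    """Count identical stacks; each count becomes the width of a flame graph bar."""
    return Counter(";".join(stack) for stack in samples)


def self_time_by_function(samples: Iterable[Stack]) -> Counter:
    """Count the samples in which each function was the leaf (executing) frame."""
    return Counter(stack[-1] for stack in samples if stack)


# Made-up samples, as if the call stack had been captured at a fixed interval.
SAMPLES = [
    ("main", "handle_request", "serialize_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "serialize_json"),
    ("main", "handle_request", "serialize_json"),
    ("main", "gc_collect"),
]

folded = fold_stacks(SAMPLES)
for stack, count in folded.most_common():
    print(stack, count)

hottest, hits = self_time_by_function(SAMPLES).most_common(1)[0]
print(hottest, hits)
```

A CPU metric for this process would report a single utilization number; the folded stacks show that most samples land in one serialization function.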

Next Steps

Now that you have learned to monitor and profile distributed systems, we will integrate these checks into automated release workflows.

Proceed to Part 19: CI/CD Pipelines →
