5 min read·Lesson 1 of 10

The Three Pillars of Observability

Observability vs monitoring; what metrics, logs, and traces each give you; and how the three pillars combine into a useful whole.

Observability is the practice of building systems you can understand from the outside. It is more than monitoring — monitoring tells you that something is wrong; observability helps you figure out why.

Monitoring vs Observability

Monitoring                          | Observability
------------------------------------+-----------------------------------------------------------
Pre-defined dashboards and alerts   | Ad-hoc queries on rich, high-cardinality data
"Known unknowns": disk space, CPU   | "Unknown unknowns": what is causing the new latency spike?
Often per-host metrics              | Per-request, per-customer, per-feature
Tells you something is wrong        | Helps you find what and why

The two complement each other. Strong observability does not eliminate alerts; it makes alerts more useful and incidents shorter.

The Three Pillars

The community has settled on three signal types. Each excels at different tasks.

1. Metrics

Numbers tracked over time: request rate, error rate, queue depth, CPU usage, p99 latency.

  • Cheap: aggregated, sampled at fixed intervals
  • Fast: small, indexable, queryable in milliseconds
  • Great for: dashboards, alerts, capacity planning, long-term trends
  • Limitation: you cannot ask "show me the metrics for this one request" — they are aggregated
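To make the aggregation trade-off concrete, here is a minimal sketch of what a metrics pipeline conceptually does with one scrape window. The request data, the `summarize` helper, and the 60-second interval are all assumptions made for illustration; real systems such as Prometheus use counters and histograms rather than sorting raw samples.

```python
import math
from collections import Counter

# Hypothetical raw observations for one 60-second window: (status_code, latency_ms).
# A real metrics pipeline does not keep these; it keeps only the aggregates below.
requests = [(200, 42), (200, 51), (500, 930), (200, 38), (200, 47),
            (200, 61), (200, 44), (503, 1200), (200, 40), (200, 55)]

def summarize(window, interval_s=60):
    """Collapse a window of requests into a handful of numbers."""
    latencies = sorted(latency for _, latency in window)
    statuses = Counter(status for status, _ in window)
    errors = sum(count for status, count in statuses.items() if status >= 500)
    # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
    p99_index = math.ceil(0.99 * len(latencies)) - 1
    return {
        "request_rate_per_s": len(window) / interval_s,
        "error_rate": errors / len(window),
        "p99_latency_ms": latencies[p99_index],
    }

summary = summarize(requests)
print(summary)
```

Ten requests become three numbers. That is why metrics are cheap to store and fast to query, and also why nothing in `summary` can tell you which request returned the 503.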

2. Logs

Discrete, timestamped events: "user 42 logged in", "DB query took 800ms", "OOMKilled". Either plain text or structured (JSON).

  • Rich: full context per event
  • Flexible: easy to add new fields
  • Great for: forensics, debugging, audit trails
  • Limitation: expensive to store and search at scale; indexing high-cardinality fields bloats log indexes
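As an illustration of structured logging, the sketch below emits one JSON object per event using only the Python standard library. The `JsonFormatter` class, the `context` key passed through `extra=`, and the field names are choices made for this example, not a standard schema.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed via extra={"context": {...}} becomes an attribute on the
        # record; "context" is a name chosen for this sketch.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

logger.info("DB query took 800ms", extra={"context": {"user_id": 42, "duration_ms": 800}})

parsed = json.loads(stream.getvalue())
print(parsed)
```

Adding a new field is a one-line change at the call site, which is the flexibility described above. The cost is that every event carries its full context, so volume grows with traffic.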

3. Traces

A trace follows one request as it traverses multiple services. Each hop is a "span" that records its parent span ID (the root span has none), so you can reconstruct the call tree.

  • Essential for microservices and serverless
  • Pinpoints latency to a specific component
  • Great for: "why is this endpoint slow?", "which downstream service is failing?"
  • Limitation: tracing every request is expensive — sampling is required at scale
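The parent span ID is what makes the call tree recoverable. The following sketch rebuilds a tree from a flat list of spans and follows the slowest child at each level. The `Span` dataclass, the span IDs, and the `db.query` span are hypothetical; real tracing systems carry more fields, such as trace ID, start time, and attributes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int

# Spans usually arrive as a flat, unordered list; IDs and durations are illustrative.
spans = [
    Span("a1", None, "GET /checkout", 1800),
    Span("b1", "a1", "auth-service", 60),
    Span("b2", "a1", "cart-service", 110),
    Span("b3", "a1", "payment-service", 1400),
    Span("c1", "b3", "db.query", 1350),
]

def children_of(spans):
    """Index spans by parent ID so each node's children can be looked up directly."""
    index = {}
    for span in spans:
        index.setdefault(span.parent_id, []).append(span)
    return index

def render(spans):
    """Return the call tree as indented text lines, depth-first from the root."""
    index = children_of(spans)
    lines = []

    def walk(span, depth):
        lines.append(f"{'  ' * depth}{span.name} {span.duration_ms}ms")
        for child in index.get(span.span_id, []):
            walk(child, depth + 1)

    for root in index.get(None, []):
        walk(root, 0)
    return lines

def slowest_path(spans):
    """Follow the slowest child at each level, starting from the root span."""
    index = children_of(spans)
    current = index[None][0]
    path = [current.name]
    while index.get(current.span_id):
        current = max(index[current.span_id], key=lambda s: s.duration_ms)
        path.append(current.name)
    return path

print("\n".join(render(spans)))
print(" -> ".join(slowest_path(spans)))
```

Following the slowest child is a simplification: it assumes child spans run sequentially, so it can mislead when calls run in parallel.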

The Pillars in One Picture

    Time →

Metrics:    rate ─────────╱╲────────────╱╲
            error rate ────╱╲ ──── spike at 14:32

Logs:       (event)  (event)  (event)  (ERR: timeout)  (event)

Trace:      [GET /checkout 1.8s] ┐
            ├─ [auth-service 60ms]
            ├─ [cart-service 110ms]
            └─ [payment-service 1.4s] ← bottleneck

Metrics tell you "errors are spiking." Logs tell you "this specific request failed because the DB connection pool was exhausted." Traces tell you "out of the 1.8s this request took, 1.4s was spent waiting on payment-service."

A Pivot in Practice

An on-call engineer gets paged: "checkout error rate above 1%."

  1. Open the metrics dashboard — confirm error rate is up since 14:32.
  2. Click through to traces filtered to service=checkout AND status=error. See most failures involve calls to payment-service.
  3. Open a sample trace — most time is spent waiting on a DB query that times out.
  4. Pivot to logs filtered to service=payment-service AND level=ERROR. See "connection pool exhausted, deploy at 14:30 doubled traffic on this DB".
  5. Roll back the deploy. Page resolved in 8 minutes.

That fluid pivoting between metrics → traces → logs is what mature observability looks like. Each pillar feeds the next.
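Pivoting works only when the signals share identifiers. The sketch below mimics steps 2 to 4 with in-memory lists standing in for a trace store and a log store. The record shapes, the `slowest_span` field, and the assumption that log lines carry a `trace_id` are all hypothetical; whether your logs include a trace ID depends on how you instrument.

```python
from collections import Counter

# Hypothetical stand-ins for a trace backend and a log backend.
traces = [
    {"trace_id": "t1", "service": "checkout", "status": "error", "slowest_span": "payment-service"},
    {"trace_id": "t2", "service": "checkout", "status": "ok", "slowest_span": "cart-service"},
    {"trace_id": "t3", "service": "checkout", "status": "error", "slowest_span": "payment-service"},
]
logs = [
    {"trace_id": "t1", "service": "payment-service", "level": "ERROR", "message": "connection pool exhausted"},
    {"trace_id": "t2", "service": "payment-service", "level": "INFO", "message": "charge ok"},
    {"trace_id": "t3", "service": "payment-service", "level": "ERROR", "message": "connection pool exhausted"},
]

# Step 2: filter traces to service=checkout AND status=error.
failed = [t for t in traces if t["service"] == "checkout" and t["status"] == "error"]

# Step 3: find the downstream service that dominates the failing traces.
suspect, _ = Counter(t["slowest_span"] for t in failed).most_common(1)[0]

# Step 4: pivot to that service's ERROR logs, joined on the shared trace ID.
failed_ids = {t["trace_id"] for t in failed}
evidence = [
    entry["message"]
    for entry in logs
    if entry["service"] == suspect
    and entry["level"] == "ERROR"
    and entry["trace_id"] in failed_ids
]

print(suspect, evidence)
```

The join on `trace_id` is the important part. Without a shared key, the pivot from traces to logs falls back to matching on service name and time window, which is slower and less precise.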

Cardinality: The Hidden Constraint

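Cardinality is the number of unique combinations of label values for a metric or log field. http_requests_total{method="GET",status="200"} is low cardinality: a few methods times a few status codes yields a handful of series. Add a user_id label and every user multiplies that count, so with a million users, 10 series become 10 million.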

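  • Metrics backends slow down as cardinality grows (Prometheus keeps every active series in memory), and hosted ones such as Datadog also bill by it.
  • Logs tolerate high cardinality but are expensive to scan at scale.
  • Traces sit in the middle: spans can carry high-cardinality attributes, and sampling keeps the volume affordable.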

Rule of thumb: keep metric labels to bounded, finite values (status code, region, service name). Anything per-user or per-request belongs in logs or traces.
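The arithmetic behind the rule of thumb is a simple product. In the sketch below, the label names and the counts of distinct values are assumptions chosen to match the 10 series and 10 million series figures above.

```python
import math

def series_count(distinct_values_per_label):
    """Worst-case number of time series: the product of distinct values per label."""
    return math.prod(distinct_values_per_label.values())

# Hypothetical label sets for http_requests_total.
bounded = {"method": 5, "status": 2}
with_user_id = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))       # 10
print(series_count(with_user_id))  # 10000000
```

This is an upper bound, since not every combination occurs in practice. It still shows why a single unbounded label dominates the total.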

The Modern Stack

Four common shapes:

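Stack                 | Metrics                                       | Logs                     | Traces
----------------------+-----------------------------------------------+--------------------------+------------------------------------
Open source (Grafana) | Prometheus / Mimir                            | Loki                     | Tempo
Open source (Elastic) | Elastic / Beats                               | Logstash + Elasticsearch | Elastic APM
Commercial (unified)  | Datadog, New Relic, Dynatrace, Honeycomb, Splunk (one platform for all three signals)
Cloud-native          | CloudWatch / Azure Monitor / Cloud Monitoring | same as Metrics          | X-Ray / App Insights / Cloud Trace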

Whatever you choose, instrument with OpenTelemetry, a vendor-neutral standard for emitting metrics, logs, and traces. Switching backends then becomes mostly an exporter configuration change rather than a rewrite of your instrumentation code.

What Comes Next

The next lesson digs into when to reach for each pillar — the right tool for each kind of question — before we dive into Prometheus, Grafana, log aggregation, and OpenTelemetry in detail.

Key Takeaways

  • Monitoring tells you whether known things are broken; observability lets you ask new questions.
  • Metrics are numeric, aggregated, cheap, and great for dashboards and alerts.
  • Logs are individual events; rich detail per event, expensive at scale.
  • Traces show one request flowing through many services — essential for microservices.
  • Modern observability platforms unify all three so you can pivot between them.
