Observability is the practice of building systems you can understand from the outside. It is more than monitoring — monitoring tells you that something is wrong; observability helps you figure out why.
Monitoring vs Observability
| Monitoring | Observability |
|---|---|
| Pre-defined dashboards and alerts | Ad-hoc queries on rich, high-cardinality data |
| "Known unknowns" — disk space, CPU | "Unknown unknowns" — what is causing the new latency spike? |
| Often per-host metrics | Per-request, per-customer, per-feature |
| Tells you something is wrong | Helps you find what and why |
The two complement each other. Strong observability does not eliminate alerts; it makes alerts more useful and incidents shorter.
The Three Pillars
The community has settled on three signal types. Each excels at different tasks.
1. Metrics
Numbers tracked over time: request rate, error rate, queue depth, CPU usage, p99 latency.
- Cheap: aggregated, sampled at fixed intervals
- Fast: small, indexable, queryable in milliseconds
- Great for: dashboards, alerts, capacity planning, long-term trends
- Limitation: you cannot ask "show me the metrics for this one request" — they are aggregated
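To make the aggregation trade-off concrete, here is a minimal sketch of a counter and a fixed-bucket histogram in plain Python. The names (`Counter`, `Histogram`, `quantile_upper_bound`) and the bucket bounds are assumptions for illustration, not the API of any real metrics library.

```python
from bisect import bisect_left
from collections import defaultdict


class Counter:
    """Monotonic counter keyed by label values (hypothetical sketch)."""

    def __init__(self):
        self.series = defaultdict(int)

    def inc(self, **labels):
        # Each distinct label combination is one time series.
        key = tuple(sorted(labels.items()))
        self.series[key] += 1


class Histogram:
    """Fixed-bucket latency histogram: stores counts, not individual requests."""

    def __init__(self, bounds_ms):
        self.bounds = sorted(bounds_ms)
        # One extra bucket catches everything above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value_ms):
        # A value equal to a bound lands in that bound's bucket.
        self.counts[bisect_left(self.bounds, value_ms)] += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket holding quantile q, or None if empty."""
        total = sum(self.counts)
        if total == 0:
            return None
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= q * total:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


requests = Counter()
latency = Histogram([50, 100, 250, 500, 1000])

# Simulated traffic: (method, status, latency in ms).
for method, status, ms in [("GET", "200", 40), ("GET", "200", 90),
                           ("GET", "500", 800), ("POST", "200", 120)]:
    requests.inc(method=method, status=status)
    latency.observe(ms)

print(dict(requests.series))
print(latency.counts, latency.quantile_upper_bound(0.99))
```

Only bucket counts survive, so the sketch can report that p99 falls at or below some bound but cannot say which request was the slow one. That is the limitation described in the last bullet.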
2. Logs
Discrete, timestamped events: "user 42 logged in", "DB query took 800ms", "OOMKilled". Either plain text or structured (JSON).
- Rich: full context per event
- Flexible: easy to add new fields
- Great for: forensics, debugging, audit trails
- Limitation: expensive to store and search at scale; indexing high-cardinality fields bloats the index and slows queries
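As a rough illustration of structured logging, the sketch below emits one JSON object per event using only Python's standard `logging` module. The `JsonFormatter` class, the `fields` key passed through `extra`, and the field names are assumptions made for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the event.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("checkout")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the example's output out of the root logger

log.info("user logged in", extra={"fields": {"user_id": 42}})
log.warning("DB query slow", extra={"fields": {"duration_ms": 800, "table": "orders"}})

# Because every event is structured, filtering is a field lookup, not a regex.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
slow = [e for e in events if e.get("duration_ms", 0) >= 500]
print(slow)
```

Adding a new field is a one-line change at the call site, which is the flexibility the list above refers to. The cost is that every field is stored with every event.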
3. Traces
A trace follows one request as it traverses multiple services. Each hop is a "span" with a parent span ID, so you can reconstruct the call tree.
- Essential for microservices and serverless
- Pinpoints latency to a specific component
- Great for: "why is this endpoint slow?", "which downstream service is failing?"
- Limitation: tracing every request is expensive — sampling is required at scale
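The parent span ID is what makes the call tree recoverable. The following sketch rebuilds a tree from unordered spans and picks the span with the most self time. The `Span` fields, the helper names, and the durations are illustrative assumptions, and `self_time` assumes child spans run one after another rather than in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


def build_tree(spans):
    """Return the root span and a map from parent ID to child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children


def self_time(span, children):
    """Time spent in the span itself, assuming children run sequentially."""
    return span.duration_ms - sum(c.duration_ms for c in children[span.span_id])


def render(span, children, depth=0):
    """Return the tree as indented text lines, one per span."""
    lines = ["  " * depth + f"{span.name} {span.duration_ms}ms"]
    for child in children[span.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans typically arrive unordered from different services.
spans = [
    Span("c", "a", "cart-service", 110),
    Span("a", None, "GET /checkout", 1800),
    Span("d", "a", "payment-service", 1400),
    Span("b", "a", "auth-service", 60),
]

root, children = build_tree(spans)
print("\n".join(render(root, children)))

slowest = max(spans, key=lambda s: self_time(s, children))
print("bottleneck:", slowest.name)
```

Ranking by self time rather than total duration keeps the root span from always looking like the culprit, since its duration includes every child.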
The Pillars in One Picture
Time →
Metrics:  rate        ─────────╱╲────────────╱╲
          error rate  ────╱╲ ──── spike at 14:32
Logs:     (event) (event) (event) (ERR: timeout) (event)
Trace:    [GET /checkout 1.8s] ┐
            ├─ [auth-service 60ms]
            ├─ [cart-service 110ms]
            └─ [payment-service 1.4s] ← bottleneck
Metrics tell you "errors are spiking." Logs tell you "this specific request failed because the DB connection pool was exhausted." Traces tell you "out of the 1.8s this request took, 1.4s was spent waiting on payment-service."
A Pivot in Practice
An on-call engineer gets paged: "checkout error rate above 1%."
- Open the metrics dashboard — confirm error rate is up since 14:32.
- Click through to traces filtered to service=checkout AND status=error. See most failures involve calls to payment-service.
- Open a sample trace: most time is spent waiting on a DB query that times out.
- Pivot to logs filtered to service=payment-service AND level=ERROR. See "connection pool exhausted, deploy at 14:30 doubled traffic on this DB".
- Roll back the deploy. Page resolved in 8 minutes.
That fluid pivoting between metrics → traces → logs is what mature observability looks like. Each pillar feeds the next.
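The pivot works because the signals share fields to join on. The sketch below imitates it with in-memory lists standing in for a trace store and a log store. The record shapes, the `where` helper, and the `downstream` field are hypothetical; real backends expose their own query languages.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for a trace store and a log store.
traces = [
    {"trace_id": "t1", "service": "checkout", "status": "error", "downstream": "payment-service"},
    {"trace_id": "t2", "service": "checkout", "status": "ok", "downstream": "cart-service"},
    {"trace_id": "t3", "service": "checkout", "status": "error", "downstream": "payment-service"},
    {"trace_id": "t4", "service": "checkout", "status": "error", "downstream": "auth-service"},
]
logs = [
    {"trace_id": "t1", "service": "payment-service", "level": "ERROR", "msg": "connection pool exhausted"},
    {"trace_id": "t2", "service": "cart-service", "level": "INFO", "msg": "cart loaded"},
    {"trace_id": "t3", "service": "payment-service", "level": "ERROR", "msg": "connection pool exhausted"},
]


def where(records, **conditions):
    """Return records whose fields equal every given condition (an AND filter)."""
    return [r for r in records if all(r.get(k) == v for k, v in conditions.items())]


# Step 1: confirm the failure rate among checkout traces.
checkout = where(traces, service="checkout")
failed = where(checkout, status="error")
error_rate = len(failed) / len(checkout)

# Step 2: find the downstream service that appears most often in failing traces.
suspect, _ = Counter(t["downstream"] for t in failed).most_common(1)[0]

# Step 3: pivot to that service's error logs, joined on trace_id.
failed_ids = {t["trace_id"] for t in failed}
evidence = [entry for entry in where(logs, service=suspect, level="ERROR")
            if entry["trace_id"] in failed_ids]

print(f"error rate {error_rate:.0%}, suspect {suspect}")
print({entry["msg"] for entry in evidence})
```

The join on `trace_id` in step 3 is the important part: without a shared identifier in both logs and traces, the engineer would be matching events by timestamp and guesswork.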
Cardinality: The Hidden Constraint
Cardinality is the number of unique combinations of label values for a metric or log field. http_requests_total{method="GET",status="200"} is low cardinality. With a million users, adding user_id to the labels turns 10 series into 10 million.
- Metrics backends slow down as cardinality grows, because each label combination is a separate time series (Prometheus keeps every active series in memory); hosted services such as Datadog also bill by it.
- Logs tolerate high cardinality but are expensive to scan at scale.
- Traces with sampling sit in the middle.
Rule of thumb: keep metric labels to bounded, finite values (status code, region, service name). Anything per-user or per-request belongs in logs or traces.
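The arithmetic behind that rule is a simple product, sketched below. The label counts are illustrative rather than measured, and `series_count`, `check_labels`, and the `ALLOWED` set are hypothetical helpers, not part of any metrics client.

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series: the product of each label's distinct values."""
    return prod(label_values.values())


# Hypothetical label sets: number of distinct values per label.
bounded = {"method": 5, "status": 2}
with_user = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))    # 10
print(series_count(with_user))  # 10000000


def check_labels(labels, allowed):
    """Reject any label that is not on the bounded allowlist."""
    unknown = sorted(set(labels) - set(allowed))
    if unknown:
        raise ValueError(f"unbounded or unknown labels: {unknown}")
    return labels


ALLOWED = {"method", "status", "region", "service"}
check_labels({"method": "GET", "status": "200"}, ALLOWED)  # passes
```

A guard like `check_labels` in a shared instrumentation wrapper is one possible way to catch an unbounded label in code review or tests, before it reaches the metrics backend.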
The Modern Stack
Four common shapes:
| Stack | Metrics | Logs | Traces |
|---|---|---|---|
| Open source (Grafana) | Prometheus / Mimir | Loki | Tempo |
| Open source (Elastic) | Elastic / Beats | Logstash + Elasticsearch | Elastic APM |
| Commercial (unified) | Datadog, New Relic, Dynatrace, Honeycomb, Splunk | same | same |
| Cloud-native | CloudWatch / Azure Monitor / Cloud Monitoring | same | X-Ray / App Insights / Cloud Trace |
Whatever you choose, instrument with OpenTelemetry — a vendor-neutral standard. Switching backends becomes a config change, not a code rewrite.
What Comes Next
The next lesson digs into when to reach for each pillar — the right tool for each kind of question — before we dive into Prometheus, Grafana, log aggregation, and OpenTelemetry in detail.