The Three Pillars of Observability

Observability is the practice of building systems you can understand from the outside. It is more than monitoring — monitoring tells you that something is wrong; observability helps you figure out why.

Monitoring vs Observability

Monitoring	Observability
Pre-defined dashboards and alerts	Ad-hoc queries on rich, high-cardinality data
"Known unknowns" — disk space, CPU	"Unknown unknowns" — what is causing the new latency spike?
Often per-host metrics	Per-request, per-customer, per-feature
Tells you something is wrong	Helps you find what and why

The two complement each other. Strong observability does not eliminate alerts; it makes alerts more useful and incidents shorter.

The Three Pillars

The community has settled on three signal types. Each excels at different tasks.

1. Metrics

Numbers tracked over time: request rate, error rate, queue depth, CPU usage, p99 latency.

Cheap: aggregated, sampled at fixed intervals
Fast: small, indexable, queryable in milliseconds
Great for: dashboards, alerts, capacity planning, long-term trends
Limitation: you cannot ask "show me the metrics for this one request" — they are aggregated
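
To make the aggregation trade-off concrete, here is a minimal sketch of how raw observations collapse into a few stored numbers. The sample data, the 10-second interval, and the function names are assumptions made for illustration; real metrics systems use counters and histograms rather than sorting raw samples.

```python
import math
from collections import Counter

# Hypothetical raw observations for one interval: (status_code, latency_ms).
requests = [(200, 12.0), (200, 15.0), (500, 950.0), (200, 11.0), (200, 14.0),
            (200, 13.0), (404, 9.0), (200, 16.0), (200, 10.0), (200, 18.0)]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def aggregate(observations, interval_s):
    """Collapse raw observations into the handful of numbers a metrics system keeps."""
    statuses = Counter(status for status, _ in observations)
    errors = sum(n for status, n in statuses.items() if status >= 500)
    latencies = [latency for _, latency in observations]
    return {
        "request_rate_per_s": len(observations) / interval_s,
        "error_rate": errors / len(observations),
        "p99_latency_ms": percentile(latencies, 99),
    }

summary = aggregate(requests, interval_s=10)
print(summary)
```

After aggregation only three numbers remain, which is why the summary is cheap to store and why it can no longer say which request took 950 ms.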

2. Logs

Discrete, timestamped events: "user 42 logged in", "DB query took 800ms", "OOMKilled". Either plain text or structured (JSON).

Rich: full context per event
Flexible: easy to add new fields
Great for: forensics, debugging, audit trails
Limitation: expensive to store and search at scale; indexing high-cardinality fields bloats log indexes and slows queries
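
As an illustration of structured logging, the sketch below emits one JSON object per event using only Python's standard logging module. The JsonFormatter class, the make_logger helper, and the "context" key are names chosen for this example, not part of any library.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # "context" is a key chosen for this sketch.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)

def make_logger(stream=None):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger("payment-service")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

log = make_logger()
log.info("DB query took 800ms", extra={"context": {"user_id": 42, "duration_ms": 800}})
```

Adding a new field is a one-line change at the call site, which is the flexibility described above; the cost is that every extra field is stored with every event.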

3. Traces

A trace follows one request as it traverses multiple services. Each hop is a "span" that records its parent span's ID (the root span has none), so you can reconstruct the call tree.

Essential for microservices and serverless
Pinpoints latency to a specific component
Great for: "why is this endpoint slow?", "which downstream service is failing?"
Limitation: tracing every request is expensive — sampling is required at scale
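
One common way to sample is to decide per trace, up front, from a hash of the trace ID. The sketch below shows that idea under stated assumptions: the keep_trace function, the 5% rate, and the trace ID format are all hypothetical, and real tracers may instead decide after the request finishes.

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether to keep a trace.

    Hashing the trace ID means every service reaches the same decision
    for the same trace, so a sampled trace keeps all of its spans.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Hypothetical trace IDs; roughly 5% of them should be kept.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, sample_rate=0.05)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

The trade-off is that a fixed rate drops rare failures along with routine traffic, which is one reason some teams keep all error traces and sample only the successful ones.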

The Pillars in One Picture

    Time →

Metrics:    rate ─────────╱╲────────────╱╲
            error rate ────╱╲ ──── spike at 14:32

Logs:       (event)  (event)  (event)  (ERR: timeout)  (event)

Trace:      [GET /checkout 1.8s] ┐
            ├─ [auth-service 60ms]
            ├─ [cart-service 110ms]
            └─ [payment-service 1.4s] ← bottleneck

Metrics tell you "errors are spiking." Logs tell you "this specific request failed because the DB connection pool was exhausted." Traces tell you "out of the 1.8s this request took, 1.4s was spent waiting on payment-service."
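
The trace in the picture can be sketched as data. The span names and durations below come from the picture; the span IDs, the Span class, and the helper functions are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int

# Spans from the picture, listed in arbitrary arrival order.
spans = [
    Span("c", "a", "cart-service", 110),
    Span("a", None, "GET /checkout", 1800),
    Span("d", "a", "payment-service", 1400),
    Span("b", "a", "auth-service", 60),
]

def build_tree(spans):
    """Index spans by parent ID and return (root, children_by_parent)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children

def render(span, children, depth=0):
    """Return the call tree as indented lines, depth-first."""
    lines = [f"{'  ' * depth}{span.name} {span.duration_ms}ms"]
    for child in sorted(children[span.span_id], key=lambda s: s.span_id):
        lines.extend(render(child, children, depth + 1))
    return lines

def slowest_path(span, children):
    """Follow the slowest child at each level to see where the time goes."""
    path = [span.name]
    while children[span.span_id]:
        span = max(children[span.span_id], key=lambda s: s.duration_ms)
        path.append(span.name)
    return path

root, children = build_tree(spans)
print("\n".join(render(root, children)))
print(" -> ".join(slowest_path(root, children)))
```

Because each span carries only its own ID and its parent's ID, the tree can be rebuilt regardless of the order in which spans arrive at the backend.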

A Pivot in Practice

An on-call engineer gets paged: "checkout error rate above 1%."

Open the metrics dashboard — confirm error rate is up since 14:32.
Click through to traces filtered to service=checkout AND status=error. See most failures involve calls to payment-service.
Open a sample trace — most time is spent waiting on a DB query that times out.
Pivot to logs filtered to service=payment-service AND level=ERROR. See "connection pool exhausted, deploy at 14:30 doubled traffic on this DB".
Roll back the deploy. Page resolved in 8 minutes.

That fluid pivoting from metrics to traces to logs is what mature observability looks like. Each pillar feeds the next.
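
The trace and log filters in this walkthrough can be sketched against in-memory records. Everything below is hand-written, hypothetical data standing in for what a tracing and a logging backend would return; in particular, the shared trace_id field used to join the two is an assumption of this sketch, as is the failed_call field.

```python
from collections import Counter

# Hypothetical trace summaries; field names mirror the filters in the walkthrough.
traces = [
    {"trace_id": "t1", "service": "checkout", "status": "error", "failed_call": "payment-service"},
    {"trace_id": "t2", "service": "checkout", "status": "ok", "failed_call": None},
    {"trace_id": "t3", "service": "checkout", "status": "error", "failed_call": "payment-service"},
    {"trace_id": "t4", "service": "checkout", "status": "error", "failed_call": "cart-service"},
    {"trace_id": "t5", "service": "search", "status": "error", "failed_call": "index-service"},
]
# Hypothetical log events, each tagged with the trace it belongs to.
logs = [
    {"trace_id": "t1", "service": "payment-service", "level": "ERROR", "message": "connection pool exhausted"},
    {"trace_id": "t2", "service": "payment-service", "level": "INFO", "message": "payment accepted"},
    {"trace_id": "t3", "service": "payment-service", "level": "ERROR", "message": "connection pool exhausted"},
    {"trace_id": "t4", "service": "cart-service", "level": "ERROR", "message": "cart not found"},
]

def where(records, **conditions):
    """Keep records whose fields equal every given condition."""
    return [r for r in records if all(r.get(k) == v for k, v in conditions.items())]

# Traces: failing checkout requests, grouped by the downstream call that failed.
failing = where(traces, service="checkout", status="error")
suspect, _ = Counter(t["failed_call"] for t in failing).most_common(1)[0]

# Logs: ERROR events from the suspect service, restricted to the failing traces.
failing_ids = {t["trace_id"] for t in failing}
evidence = [e for e in where(logs, service=suspect, level="ERROR") if e["trace_id"] in failing_ids]

print(suspect)
for entry in evidence:
    print(entry["trace_id"], entry["message"])
```

The pivot works only because both signal types carry a common identifier; without one, the engineer is left matching timestamps by eye.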

Cardinality: The Hidden Constraint

Cardinality is the number of unique combinations of label values for a metric or log field. http_requests_total{method="GET",status="200"} is low cardinality: a handful of methods and status codes yields a handful of series. Adding a user_id label for a service with a million users turns 10 series into 10 million.

Metrics tools slow down as cardinality grows (Prometheus), and hosted ones such as Datadog also bill by it.
Logs tolerate high-cardinality values inside each event but are expensive to scan at scale.
Traces with sampling sit in the middle.

Rule of thumb: keep metric labels to bounded, finite values (status code, region, service name). Anything per-user or per-request belongs in logs or traces.
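
The arithmetic behind this rule is a simple product. In the sketch below, the label counts (2 methods, 5 status codes, one million users) are assumed values chosen to reproduce the "10 series into 10 million" example, and series_count is a hypothetical helper.

```python
from math import prod

def series_count(label_values: dict) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_values.values())

# Assumed counts for illustration: 2 methods x 5 status codes.
bounded = {"method": 2, "status": 5}
# A per-user label multiplies every existing series by the number of users.
unbounded = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))    # 10
print(series_count(unbounded))  # 10000000
```

Each new label multiplies the total rather than adding to it, which is why a single unbounded label is enough to overwhelm a metrics backend.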

The Modern Stack

Four common shapes:

Stack	Metrics	Logs	Traces
Open source (Grafana)	Prometheus / Mimir	Loki	Tempo
Open source (Elastic)	Elastic / Beats	Logstash + Elasticsearch	Elastic APM
Commercial (unified)	Datadog, New Relic, Dynatrace, Honeycomb, Splunk (each covers all three signals)
Cloud-native	CloudWatch / Azure Monitor / Cloud Monitoring	CloudWatch Logs / Azure Monitor Logs / Cloud Logging	X-Ray / Application Insights / Cloud Trace

Whatever you choose, instrument with OpenTelemetry, a vendor-neutral standard for emitting all three signals. Switching backends then becomes mostly an exporter configuration change rather than a code rewrite.

What Comes Next

The next lesson digs into when to reach for each pillar — the right tool for each kind of question — before we dive into Prometheus, Grafana, log aggregation, and OpenTelemetry in detail.