When to Use Metrics, Logs, or Traces — Observability and Monitoring | CertQnA

"Should this be a metric, a log, or a trace?" is the most common observability design question. There are clear rules of thumb.

Reach for Metrics When…

You want a dashboard or an alert
The question is "how many / how often / how long"
Long retention (months) at low cost is important
You need fast aggregation across many machines

Examples: requests per second, p95 latency, queue depth, error rate, CPU utilisation, business KPIs (signups per minute, revenue per hour).
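To make "pre-aggregated" concrete, here is a minimal in-process sketch using only the Python standard library. The class name `RequestMetrics`, the bucket boundaries, and the label set are assumptions for illustration, not any vendor's API. The point is that a metric stores counts, never individual events.

```python
import math
from collections import Counter

# Hypothetical bucket upper bounds in seconds; real systems choose their own.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5, math.inf)


class RequestMetrics:
    """Keeps only aggregates: a count per label set and a duration histogram."""

    def __init__(self):
        self.requests = Counter()          # (route, status) -> request count
        self.duration_buckets = Counter()  # bucket upper bound -> observation count

    def observe(self, route, status, seconds):
        self.requests[(route, status)] += 1
        # Record the observation in the first bucket large enough to hold it.
        for bound in BUCKETS:
            if seconds <= bound:
                self.duration_buckets[bound] += 1
                break

    def error_rate(self):
        """Fraction of requests that returned a 5xx status."""
        total = sum(self.requests.values())
        errors = sum(n for (_, status), n in self.requests.items() if status >= 500)
        return errors / total if total else 0.0

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        total = sum(self.duration_buckets.values())
        if total == 0:
            raise ValueError("no observations recorded")
        target = q * total
        seen = 0
        for bound in BUCKETS:
            seen += self.duration_buckets[bound]
            if seen >= target:
                return bound
        return BUCKETS[-1]
```

Calling `observe("/checkout", 200, 0.03)` a million times still leaves one counter entry and one bucket entry, which is why metrics stay cheap at long retention. The trade-off is that a quantile read from buckets is only as precise as the bucket boundaries.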

Reach for Logs When…

You need the full context of an individual event
You will read the data rarely, but in great detail when something goes wrong
Audit, compliance, or security forensics is involved
The data is naturally textual (stack traces, command lines, request bodies)

Examples: stack traces, login attempts, configuration changes, slow-query logs, ad-hoc debug output.
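A log line keeps the full context of one event. The sketch below uses Python's standard `logging` module to emit one JSON object per event, including the stack trace. The logger name `checkout`, the `fields` key passed through `extra=`, and the IDs are all hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Renders each record as a single JSON object on one line."""

    def format(self, record):
        event = {
            "level": record.levelname,
            "message": record.getMessage(),
            # `fields` is a hypothetical attribute supplied by the caller via extra=.
            **getattr(record, "fields", {}),
        }
        if record.exc_info:
            event["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(event)


def build_logger(stream):
    """Returns a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: capture one failed operation with its identifiers and stack trace.
stream = io.StringIO()
log = build_logger(stream)
try:
    1 / 0
except ZeroDivisionError:
    log.error(
        "payment failed",
        exc_info=True,
        extra={"fields": {"request_id": "req-123", "user_id": "u-42"}},
    )
print(stream.getvalue())
```

High-cardinality identifiers such as `user_id` are fine here because each one costs only the bytes of the line it appears in.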

Reach for Traces When…

The question is "why is this slow" or "which downstream broke"
A request crosses two or more services
You need to see causality between operations within a request

Examples: a checkout request that touches 8 services, a Lambda + SQS + DynamoDB chain, a SaaS app where a customer reports a single slow operation.
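What a trace adds over a log is the parent/child link between operations. The toy tracer below is a single-process sketch under that assumption; `Tracer`, `Span`, and `slowest_child` are hypothetical names, and a real system would also propagate the trace ID across service boundaries (for example in the W3C Trace Context `traceparent` header).

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float = 0.0
    end: float = 0.0

    @property
    def duration(self):
        return self.end - self.start


class Tracer:
    """Toy in-process tracer that collects finished spans in a list."""

    def __init__(self):
        self.finished = []
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            name=name,
            # Children inherit the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            start=time.perf_counter(),
        )
        self._stack.append(span)
        try:
            yield span
        finally:
            span.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(span)

    def slowest_child(self, parent):
        """Answers 'which downstream call took longest?' for one span."""
        children = [s for s in self.finished if s.parent_id == parent.span_id]
        return max(children, key=lambda s: s.duration, default=None)


# Usage: a checkout request with two downstream calls.
tracer = Tracer()
with tracer.span("checkout") as root:
    with tracer.span("inventory"):
        pass
    with tracer.span("payment"):
        time.sleep(0.02)  # stand-in for a slow dependency
```

After the request finishes, `tracer.slowest_child(root)` points at the `payment` span, which is the kind of causal question that neither a latency metric nor an isolated log line answers directly.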

The Decision Table

Question	Metric	Log	Trace
Are we receiving traffic?	✅
How many errors per minute?	✅
Is the p99 over our SLO?	✅
What was the exact error?		✅	✅
Why did request X take 4 seconds?			✅
Which user had a problem at 3:14pm?		✅	✅
Did anyone change a security group?		✅
Is service A or service B the bottleneck?	partial		✅
How fast are we burning our error budget?	✅

Cost Drivers

Vendors price by volume, but the unit they count differs for each signal. Understanding the drivers prevents budget surprises.

Signal	Cost driver
Metrics	Number of unique time series (cardinality) × retention
Logs	Bytes ingested + bytes indexed + retention
Traces	Spans ingested (after sampling)

High-cardinality metrics are one of the most common sources of wasted observability spend. If you put user_id in a Prometheus label, every distinct user creates its own time series, so your bill explodes and queries slow to a crawl. If you put it in a log line, you pay only for the bytes.
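The arithmetic behind that warning is a simple product. In the sketch below, the label names and the number of distinct values per label are made-up figures for illustration; the result is a worst-case bound, since not every combination necessarily occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series for one metric.

    Each distinct combination of label values is its own series, so the
    upper bound is the product of the distinct-value counts per label.
    """
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for a request-duration metric.
base = {"service": 20, "route": 50, "status": 5}
with_user = {**base, "user_id": 100_000}

print(series_count(base))       # 20 * 50 * 5
print(series_count(with_user))  # the same, multiplied by 100,000 users
```

Under these assumed numbers the metric grows from 5,000 series to 500,000,000, purely from adding one label.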

Sampling

Metrics are pre-aggregated, so sampling does not apply: every event is counted, but only the aggregate is stored.
Logs can be sampled at high volume — keep all errors, sample 1% of debug.
Traces almost always sample. Strategies:
- Head-based: decide when the trace starts (cheap, simple, may miss interesting traces)
- Tail-based: buffer each trace and decide after it finishes, keeping all errors and slow requests (more expensive to run, but it keeps the traces you most want)
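The two strategies differ only in what information is available when the decision is made. This is a minimal sketch, assuming spans are plain dicts with hypothetical `trace_id`, `start`, `end`, and `error` keys, and a hypothetical one-second threshold for "slow".

```python
import hashlib


def head_sample(trace_id, rate):
    """Decide at trace start, using nothing but the trace ID.

    Hashing makes the decision deterministic, so every service that sees
    the same trace ID reaches the same answer without coordination.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


def tail_sample(spans, rate, slow_seconds=1.0):
    """Decide after the trace has finished, with all spans buffered.

    Every trace with an error or a long duration is kept; the remaining
    ordinary traces are sampled at the given rate.
    """
    has_error = any(s["error"] for s in spans)
    duration = max(s["end"] for s in spans) - min(s["start"] for s in spans)
    if has_error or duration >= slow_seconds:
        return True
    return head_sample(spans[0]["trace_id"], rate)
```

The cost difference is visible in the signatures: `head_sample` needs one string, while `tail_sample` needs every span of the trace held in memory until the trace completes.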

Retention

Signal	Typical retention	Rationale
Metrics	13 months	Capacity planning, year-over-year comparison
Logs	30–90 days hot, 1+ year cold	Cold tier kept for compliance
Traces	7–30 days	Most useful while incidents are recent
Audit logs	1–7 years	Varies by regulation

Tier the storage. Use hot, indexed storage for the last 30 days and cheap object storage, with slower queries, for older data.
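A tiering policy reduces to a function of record age. The sketch below assumes the 30-day hot window from the text and takes the total retention as a parameter; the function name and the tier labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)  # assumed hot, indexed window


def storage_tier(event_time, now, retention):
    """Returns 'hot', 'cold', or 'expired' for a record of the given age."""
    age = now - event_time
    if age > retention:
        return "expired"  # past retention, eligible for deletion
    return "hot" if age <= HOT_WINDOW else "cold"


# Usage: a 45-day-old log line under a one-year retention policy.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=45), now, timedelta(days=365)))
```

In practice this decision usually lives in the storage backend's lifecycle configuration rather than in application code; the function only shows the rule being applied.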

Wide Events: A Newer Idea

Honeycomb and others advocate for "wide events" — a single structured log per request with hundreds of fields (user_id, region, build_id, dependency latencies, feature flags, error class). You then derive metrics and traces from those events on the fly.

Wide events unify the three pillars: store events, derive aggregates, follow causality through them. The approach is powerful, but the storage layer has to handle it (a column store in the style of Druid or ClickHouse).
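"Derive metrics on the fly" means grouping stored events by whichever field turns out to matter. The events and field values below are invented for illustration, and a real deployment would run this as a query in the event store rather than in application code.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, many fields each.
events = [
    {"request_id": "r1", "user_id": "u1", "region": "eu", "build_id": "b7", "status": 200, "duration_ms": 120},
    {"request_id": "r2", "user_id": "u2", "region": "eu", "build_id": "b8", "status": 500, "duration_ms": 900},
    {"request_id": "r3", "user_id": "u1", "region": "us", "build_id": "b8", "status": 500, "duration_ms": 850},
    {"request_id": "r4", "user_id": "u3", "region": "us", "build_id": "b7", "status": 200, "duration_ms": 110},
]


def error_rate_by(events, field):
    """Error rate grouped by an arbitrary field, computed at query time."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        errors[key] += event["status"] >= 500  # True counts as 1
    return {key: errors[key] / totals[key] for key in totals}


print(error_rate_by(events, "region"))
print(error_rate_by(events, "build_id"))
```

With this sample data, grouping by `region` shows nothing unusual while grouping by `build_id` isolates every failure to one build. No one had to decide in advance that `build_id` should be a metric label, which is the main argument for the approach.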

Anti-Patterns

Logging everything as text — useless to your platform without parsing
High-cardinality metric labels (user_id, request_id) — kills Prometheus
One metric per customer — same problem
Tracing every request unsampled in production — pay through the nose
Logging at DEBUG in production by default — noise drowns signal
Logs as your alerting source — too slow, too unreliable; alert on metrics

The Right Default

For a typical web service:

Emit metrics for every request — service, route, status code, duration histogram.
Emit one structured log per request at INFO with the same fields plus user/customer/request IDs.
Emit a trace span for the request, with sub-spans for downstream calls.
Sample traces at 1–10% but keep all error traces.
Alert from metrics; investigate with traces; reconstruct stories with logs.

This pattern carries you a long way before you need anything custom.