5 min read·Lesson 2 of 10

When to Use Metrics, Logs, or Traces

Pick the right signal type for each question. This lesson covers the trade-offs of cardinality, cost, retention, and granularity that drive observability design.

"Should this be a metric, a log, or a trace?" is the most common observability design question. There are clear rules of thumb.

Reach for Metrics When…

  • You want a dashboard or an alert
  • The question is "how many / how often / how long"
  • Long retention (months) at low cost is important
  • You need fast aggregation across many machines

Examples: requests per second, p95 latency, queue depth, error rate, CPU utilisation, business KPIs (signups per minute, revenue per hour).

Reach for Logs When…

  • You need the full context of an individual event
  • You will look at this rarely but in great detail when something goes wrong
  • Audit, compliance, or security forensics is involved
  • The data is naturally textual (stack traces, command lines, request bodies)

Examples: stack traces, login attempts, configuration changes, slow-query logs, ad-hoc debug output.

Reach for Traces When…

  • The question is "why is this slow" or "which downstream broke"
  • A request crosses two or more services
  • You need to see causality between operations within a request

Examples: a checkout request that touches 8 services, a Lambda + SQS + DynamoDB chain, a SaaS app where a customer reports a single slow operation.

The Decision Table

Question | Metric | Log | Trace
Are we receiving traffic? | ✓ | - | -
How many errors per minute? | ✓ | - | -
Is the p99 over our SLO? | ✓ | - | -
What was the exact error? | - | ✓ | -
Why did request X take 4 seconds? | - | - | ✓
Which user had a problem at 3:14pm? | - | ✓ | -
Did anyone change a security group? | - | ✓ | -
Is service A or service B the bottleneck? | partial | - | ✓
What is our 95th percentile error budget burn? | ✓ | - | -

Cost Drivers

Vendors price by quantity. Understanding the drivers prevents budget surprises.

Signal | Cost driver
Metrics | Number of unique time series (cardinality) × retention
Logs | Bytes ingested + bytes indexed + retention
Traces | Spans ingested (after sampling)

The biggest waste in observability spend is high-cardinality metrics. If you put user_id in a Prometheus label, every distinct user creates a new time series for each combination of the other labels, so your bill explodes and queries slow to a crawl. If you put it in a log line, you pay only for the bytes.
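To make the cardinality arithmetic concrete, here is a minimal Python sketch. The label names and the number of distinct values per label are invented for illustration, and the product is a worst-case upper bound, since not every label combination occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-count metric.
bounded = {"service": 20, "route": 50, "status_code": 10}
with_user_id = {**bounded, "user_id": 100_000}

print(series_count(bounded))       # 10000
print(series_count(with_user_id))  # 1000000000
```

Adding one unbounded label multiplies the series count by the number of users, which is why the same field is cheap in a log line and expensive in a metric label.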

Sampling

  • Metrics are pre-aggregated; sampling is the wrong concept.
  • Logs can be sampled at high volume — keep all errors, sample 1% of debug.
  • Traces almost always sample. Strategies:
    • Head-based: decide when the trace starts (cheap and simple, but may miss interesting traces)
    • Tail-based: buffer traces and decide after they finish, keeping all errors and slow ones (more expensive to run, but it keeps the traces you actually want)
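The two trace-sampling strategies can be sketched roughly as follows. This is an illustration under stated assumptions, not any vendor's implementation: the span fields (trace_id, error, duration_ms) and the slow_ms threshold are hypothetical names chosen for the example.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before any span exists.
    Hashing the ID means every service reaches the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def tail_sample(spans: list[dict], slow_ms: float, rate: float) -> bool:
    """Tail-based: decide after the trace has finished. Keep every trace
    with an error or a slow span, and a hashed sample of the rest."""
    if not spans:
        return False
    if any(s["error"] for s in spans):
        return True
    if any(s["duration_ms"] >= slow_ms for s in spans):
        return True
    return head_sample(spans[0]["trace_id"], rate)
```

The tail-based function needs the complete list of spans, which is the buffering cost the bullet above refers to; the head-based function needs only the trace ID.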

Retention

Signal | Typical retention
Metrics | 13 months (capacity planning, year-over-year)
Logs | 30–90 days hot, 1+ year cold for compliance
Traces | 7–30 days (most useful while incidents are recent)
Audit logs | 1–7 years (varies by regulation)

Tier the storage. Keep the last 30 days in hot, indexed storage; move older data to cheap object storage and accept slower queries.
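A tiering policy like this can be expressed as a small function. The sketch below is illustrative only: the retention limits are picked from the upper ends of the ranges above (with "1+ year" taken as one year and 13 months approximated as 396 days), and the signal names are hypothetical keys, not any product's configuration.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)  # the last 30 days stay in indexed storage

# Assumed retention limits, chosen for illustration.
RETENTION = {
    "metrics": timedelta(days=396),       # roughly 13 months
    "logs": timedelta(days=365),          # cold tier, taken as one year
    "traces": timedelta(days=30),
    "audit_logs": timedelta(days=7 * 365),
}


def storage_tier(signal: str, event_time: datetime, now: datetime) -> str:
    """Return "hot", "cold", or "expired" for a record of the given age."""
    age = now - event_time
    if age > RETENTION[signal]:
        return "expired"
    return "hot" if age <= HOT_WINDOW else "cold"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(storage_tier("logs", now - timedelta(days=5), now))     # hot
print(storage_tier("logs", now - timedelta(days=90), now))    # cold
print(storage_tier("traces", now - timedelta(days=45), now))  # expired
```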

Wide Events: A Newer Idea

Honeycomb and others advocate for "wide events" — a single structured log per request with hundreds of fields (user_id, region, build_id, dependency latencies, feature flags, error class). You then derive metrics and traces from those events on the fly.

It is the unification of the three pillars: store events, derive aggregates, follow causality through them. The approach is powerful, but it works only if the storage layer can handle wide, high-cardinality rows, which in practice means a columnar store such as Druid or ClickHouse.
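As a rough sketch of deriving aggregates on the fly, the Python below groups a handful of wide events by an arbitrary field at query time. The events and field names are invented for the example, and a real system would run this as a query against a columnar store rather than over an in-memory list.

```python
from collections import defaultdict

# Hypothetical wide events, trimmed to a few fields for readability.
events = [
    {"route": "/checkout", "region": "eu", "status": 200, "duration_ms": 120},
    {"route": "/checkout", "region": "eu", "status": 500, "duration_ms": 900},
    {"route": "/checkout", "region": "us", "status": 200, "duration_ms": 80},
    {"route": "/search", "region": "us", "status": 200, "duration_ms": 40},
]


def summarize(events: list[dict], group_by: str) -> dict:
    """Derive request count, error rate, and max duration per group."""
    groups = defaultdict(list)
    for event in events:
        groups[event[group_by]].append(event)
    summary = {}
    for key, rows in groups.items():
        errors = sum(1 for r in rows if r["status"] >= 500)
        summary[key] = {
            "requests": len(rows),
            "error_rate": errors / len(rows),
            "max_duration_ms": max(r["duration_ms"] for r in rows),
        }
    return summary


print(summarize(events, "route"))
print(summarize(events, "region"))
```

Because the grouping field is chosen at query time, the same stored events answer "error rate per route" and "error rate per region" without defining either metric in advance.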

Anti-Patterns

  • Logging everything as text — useless to your platform without parsing
  • High-cardinality metric labels (user_id, request_id) — kills Prometheus
  • One metric per customer — same problem
  • Tracing every request unsampled in production — pay through the nose
  • Logging at DEBUG in production by default — noise drowns signal
  • Logs as your alerting source — too slow, too unreliable; alert on metrics

The Right Default

For a typical web service:

  1. Emit metrics for every request — service, route, status code, duration histogram.
  2. Emit one structured log per request at INFO with the same fields plus user/customer/request IDs.
  3. Emit a trace span for the request, with sub-spans for downstream calls.
  4. Sample traces at 1–10% but keep all error traces.
  5. Alert from metrics; investigate with traces; reconstruct stories with logs.

This pattern carries you a long way before you need anything custom.
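Steps 1 to 3 of the list above might look roughly like this in Python. The sketch uses only the standard library; request_count, duration_buckets, and exported_spans are hypothetical in-memory stand-ins for a metrics client and a trace exporter, and the bucket boundaries are assumed values.

```python
import json
import time
import uuid
from bisect import bisect_left
from collections import Counter

# Hypothetical in-memory stand-ins for a metrics client and a trace exporter.
request_count = Counter()
duration_buckets = Counter()
exported_spans = []

BUCKETS_MS = [50, 100, 250, 500, 1000]  # assumed histogram upper bounds


def handle_request(service, route, user_id, work):
    """Run `work` and emit a metric, a structured log line, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    try:
        work()
        status = 200
    except Exception:
        status = 500
    duration_ms = (time.perf_counter() - start) * 1000

    # 1. Metrics: bounded labels only, no user or request IDs.
    request_count[(service, route, status)] += 1
    i = bisect_left(BUCKETS_MS, duration_ms)
    le = BUCKETS_MS[i] if i < len(BUCKETS_MS) else float("inf")
    duration_buckets[(service, route, le)] += 1

    # 2. One structured log line: the same fields plus the IDs.
    log_line = json.dumps({
        "level": "INFO", "service": service, "route": route,
        "status": status, "duration_ms": round(duration_ms, 3),
        "user_id": user_id, "trace_id": trace_id,
    })
    print(log_line)

    # 3. A span for the request; downstream calls would add child spans.
    exported_spans.append({
        "trace_id": trace_id, "name": route,
        "duration_ms": duration_ms, "error": status >= 500,
    })
    return status, log_line
```

Calling handle_request("shop", "/checkout", "u1", some_function) increments one bounded time series, prints one JSON log line carrying the user and trace IDs, and records one span. The high-cardinality fields appear only in the log and the span, never in the metric labels.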

Key Takeaways

  • Use metrics for "how much, how often, how long" questions and all alerts.
  • Use logs when you need a full record of what happened, especially audit and forensics.
  • Use traces for "why is this request slow" and any cross-service question.
  • Cardinality, retention, and cost shape your design more than feature lists.
  • Modern systems often emit logs as structured events that double as low-volume metrics.
