"Should this be a metric, a log, or a trace?" is the most common observability design question. There are clear rules of thumb.
Reach for Metrics When…
- You want a dashboard or an alert
- The question is "how many / how often / how long"
- Long retention (months) at low cost is important
- You need fast aggregation across many machines
Examples: requests per second, p95 latency, queue depth, error rate, CPU utilisation, business KPIs (signups per minute, revenue per hour).
Reach for Logs When…
- You need the full context of an individual event
- You will look at this rarely but in great detail when something goes wrong
- Audit, compliance, or security forensics is involved
- The data is naturally textual (stack traces, command lines, request bodies)
Examples: stack traces, login attempts, configuration changes, slow-query logs, ad-hoc debug output.
Reach for Traces When…
- The question is "why is this slow" or "which downstream broke"
- A request crosses two or more services
- You need to see causality between operations within a request
Examples: a checkout request that touches 8 services, a Lambda + SQS + DynamoDB chain, a SaaS app where a customer reports a single slow operation.
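To make "causality between operations" concrete, here is a minimal sketch of what a trace lets you compute. The `Span` class, the span names, and the durations are all hypothetical, and the calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    duration_ms: float


def self_times(spans):
    """Map span name -> time spent in the span itself, excluding direct children.

    Assumes children run sequentially, so their durations can be subtracted.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# Hypothetical 4-second checkout request.
trace = [
    Span("a", None, "checkout", 4000.0),
    Span("b", "a", "cart", 200.0),
    Span("c", "a", "payment", 3500.0),
    Span("d", "c", "fraud-check", 3300.0),
]

times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])
```

The parent/child links are what metrics and logs lack: they show that the time went to a call two hops away from the service the customer actually hit.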
The Decision Table
| Question | Metric | Log | Trace |
|---|---|---|---|
| Are we receiving traffic? | ✅ | | |
| How many errors per minute? | ✅ | | |
| Is the p99 over our SLO? | ✅ | | |
| What was the exact error? | | ✅ | ✅ |
| Why did request X take 4 seconds? | | | ✅ |
| Which user had a problem at 3:14pm? | | ✅ | ✅ |
| Did anyone change a security group? | | ✅ | |
| Is service A or service B the bottleneck? | partial | | ✅ |
| What is our 95th percentile error budget burn? | ✅ | | |
Cost Drivers
Vendors price by volume, but the unit of volume differs per signal. Knowing what drives each bill prevents budget surprises.
| Signal | Cost driver |
|---|---|
| Metrics | Number of unique time series (cardinality) × retention |
| Logs | Bytes ingested + bytes indexed + retention |
| Traces | Spans ingested (after sampling) |
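High-cardinality metrics are usually the biggest source of wasted observability spend. Every distinct combination of label values is its own time series, so putting user_id in a Prometheus label multiplies the series count by the number of users: your bill explodes and queries slow to a crawl. Put the same user_id in a log line and you pay only for the bytes.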
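A quick back-of-the-envelope calculation shows how fast this grows. The label names and counts below are illustrative assumptions, and the product is an upper bound, since not every combination of label values necessarily occurs in practice.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values())


# Hypothetical label set for a request counter.
base = {"service": 20, "route": 50, "status_code": 5}
with_user = {**base, "user_id": 100_000}

print(max_series(base))       # 5000
print(max_series(with_user))  # 500000000
```

One extra label turns a few thousand series into hundreds of millions, which is why per-user detail belongs in logs or traces rather than in metric labels.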
Sampling
- Metrics are pre-aggregated; sampling is the wrong concept.
- Logs can be sampled at high volume — keep all errors, sample 1% of debug.
- Traces are almost always sampled. There are two strategies:
  - Head-based: decide when the trace starts (cheap and simple, but may miss interesting traces)
  - Tail-based: buffer each trace and decide after it finishes, keeping all errors and slow requests (more expensive, but keeps the traces you actually need)
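The difference between the two trace strategies can be sketched in a few lines. This is an illustration, not how any particular tracing backend implements sampling; the function names, the dict-based span shape, and the 1000 ms "slow" threshold are assumptions.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at trace start, from the trace ID alone.

    Hashing the ID makes the decision deterministic for a given trace ID.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def tail_sample(spans: list, rate: float, slow_ms: float = 1000.0) -> bool:
    """Tail-based: decide once all spans of the trace have been buffered."""
    if any(span["error"] for span in spans):
        return True  # always keep failed traces
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True  # always keep slow traces
    # Otherwise fall back to the same probabilistic rule as head sampling.
    return head_sample(spans[0]["trace_id"], rate)
```

The cost difference is visible in the signatures: `head_sample` needs only an ID, while `tail_sample` needs every span of the trace held in memory until the request completes.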
Retention
| Signal | Typical retention |
|---|---|
| Metrics | 13 months (capacity planning, year-over-year comparison) |
| Logs | 30–90 days hot, 1+ year cold for compliance |
| Traces | 7–30 days (most useful while incidents are recent) |
| Audit logs | 1–7 years (varies by regulation) |
Tier the storage: keep the last 30 days in hot, indexed storage, and move older data to cheap object storage where queries are slower.
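As a rough sketch, a tiering policy is just a lookup on signal type and data age. The `RETENTION_DAYS` table and the `placement` function are hypothetical, and the day counts are single values picked from the ranges above rather than recommendations.

```python
# Hypothetical retention policy, in days.
RETENTION_DAYS = {
    "metrics": {"hot": 30, "total": 395},  # roughly 13 months
    "logs": {"hot": 30, "total": 365},
    "traces": {"hot": 30, "total": 30},    # never moved to cold storage
}


def placement(signal: str, age_days: int) -> str:
    """Return "hot", "cold", or "expired" for data of the given age."""
    policy = RETENTION_DAYS[signal]
    if age_days > policy["total"]:
        return "expired"
    if age_days <= policy["hot"]:
        return "hot"
    return "cold"


print(placement("logs", 10), placement("logs", 200), placement("traces", 31))
```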
Wide Events: A Newer Idea
Honeycomb and others advocate for "wide events" — a single structured log per request with hundreds of fields (user_id, region, build_id, dependency latencies, feature flags, error class). You then derive metrics and traces from those events on the fly.
This unifies the three pillars: store events, derive aggregates, and follow causality through them. It is powerful, but the storage backend has to cope with wide, high-cardinality rows, which in practice means a columnar store such as Druid or ClickHouse.
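To illustrate "derive metrics on the fly", the sketch below aggregates a handful of in-memory events by an arbitrary field. The event fields and the `summarize` helper are assumptions made for illustration; a real system would run the equivalent query inside the column store rather than in application code.

```python
import math
from collections import defaultdict

# Hypothetical wide events, trimmed to a few fields.
events = [
    {"route": "/checkout", "status": 200, "duration_ms": 120, "user_id": "u1", "region": "eu"},
    {"route": "/checkout", "status": 500, "duration_ms": 900, "user_id": "u2", "region": "us"},
    {"route": "/checkout", "status": 200, "duration_ms": 150, "user_id": "u1", "region": "eu"},
    {"route": "/cart", "status": 200, "duration_ms": 40, "user_id": "u3", "region": "us"},
]


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(events, key):
    """Group events by any field and derive count, error rate, and p95 latency."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event)
    return {
        value: {
            "count": len(group),
            "error_rate": sum(e["status"] >= 500 for e in group) / len(group),
            "p95_ms": p95([e["duration_ms"] for e in group]),
        }
        for value, group in groups.items()
    }


by_route = summarize(events, "route")
by_user = summarize(events, "user_id")  # high cardinality is fine here
```

Because the grouping key is chosen at query time, the same events answer "what is the error rate per route" and "what did user u1 experience" without declaring either as a metric in advance.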
Anti-Patterns
- Logging everything as text — useless to your platform without parsing
- High-cardinality metric labels (user_id, request_id) — kills Prometheus
- One metric per customer — same problem
- Tracing every request unsampled in production — pay through the nose
- Logging at DEBUG in production by default — noise drowns signal
- Logs as your alerting source — too slow, too unreliable; alert on metrics
The Right Default
For a typical web service:
- Emit metrics for every request — service, route, status code, duration histogram.
- Emit one structured log per request at INFO with the same fields plus user/customer/request IDs.
- Emit a trace span for the request, with sub-spans for downstream calls.
- Sample traces at 1–10% but keep all error traces.
- Alert from metrics; investigate with traces; reconstruct stories with logs.
This pattern carries you a long way before you need anything custom.
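The first two bullets can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a metrics client: `observe_request`, the bucket boundaries, and the field names are assumptions, and the cumulative buckets only mimic the style of a Prometheus histogram.

```python
import json
from collections import Counter

# Hypothetical histogram bucket upper bounds, in seconds.
BUCKETS_S = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf"))

request_count = Counter()     # keyed by (service, route, status): low cardinality
duration_buckets = Counter()  # keyed by (service, route, upper_bound)


def observe_request(service, route, status, duration_s, user_id, request_id):
    """Update low-cardinality metrics and return one structured log line."""
    request_count[(service, route, status)] += 1
    for bound in BUCKETS_S:
        if duration_s <= bound:
            # Cumulative: every bucket at or above the observed value is incremented.
            duration_buckets[(service, route, bound)] += 1
    # High-cardinality IDs go in the log line, never in the metric keys.
    return json.dumps({
        "level": "INFO",
        "service": service,
        "route": route,
        "status": status,
        "duration_s": duration_s,
        "user_id": user_id,
        "request_id": request_id,
    })


line = observe_request("shop", "/checkout", 200, 0.3, "u1", "r-123")
print(line)
```

Note the split: the metric keys carry only service, route, and status, while the user and request IDs appear solely in the log line.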