Distributed tracing answers the question metrics cannot: "where did my request actually spend its time?" In a microservice architecture, that answer often surprises people.
Traces, Spans, and Context
- A trace represents a single request flowing through your system.
- A span is one unit of work within that trace — an HTTP handler, a DB query, a downstream call.
- Each span has a trace_id (shared by all spans in the trace) and a span_id (unique). Spans link to a parent via parent_span_id.
- Spans carry attributes (key-value tags), events (timestamped log entries), and a status (OK / error).
trace_id = 4bf92...
[POST /checkout] span 1 (root) duration 1.8s
│
├─[ auth.verify ] span 2 60ms
│
├─[ cart.fetch ] span 3 110ms
│ │
│ └─[ redis.GET ] span 4 4ms
│
└─[ payment.charge ] span 5 1.4s ← bottleneck
│
└─[ stripe.api ] span 6 1.35s
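The parent links are what turn a flat list of spans into a tree like the one above. As a minimal sketch in plain JavaScript, assuming span records shaped like `{ spanId, parentSpanId, name, durationMs }` (a hypothetical shape for illustration, not the OTel SDK's data model), the bottleneck can be found by following the slowest child at each level:

```javascript
// Hypothetical flat span records, mirroring the example trace above.
const spans = [
  { spanId: '1', parentSpanId: null, name: 'POST /checkout', durationMs: 1800 },
  { spanId: '2', parentSpanId: '1', name: 'auth.verify', durationMs: 60 },
  { spanId: '3', parentSpanId: '1', name: 'cart.fetch', durationMs: 110 },
  { spanId: '4', parentSpanId: '3', name: 'redis.GET', durationMs: 4 },
  { spanId: '5', parentSpanId: '1', name: 'payment.charge', durationMs: 1400 },
  { spanId: '6', parentSpanId: '5', name: 'stripe.api', durationMs: 1350 },
];

// Group spans by parent ID so children can be looked up directly.
function indexByParent(allSpans) {
  const children = new Map();
  for (const span of allSpans) {
    const key = span.parentSpanId;
    if (!children.has(key)) children.set(key, []);
    children.get(key).push(span);
  }
  return children;
}

// Start at the root (no parent) and follow the slowest child at each level.
function slowestPath(allSpans) {
  const children = indexByParent(allSpans);
  const path = [];
  let current = (children.get(null) || [])[0];
  while (current) {
    path.push(current.name);
    const kids = children.get(current.spanId) || [];
    current = kids.reduce(
      (slow, s) => (!slow || s.durationMs > slow.durationMs ? s : slow),
      null
    );
  }
  return path;
}

console.log(slowestPath(spans).join(' > '));
// prints: POST /checkout > payment.charge > stripe.api
```

Tracing backends do this kind of reconstruction for you; the sketch only shows why every span needs both the shared trace_id and a parent_span_id.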
OpenTelemetry: The Standard
OpenTelemetry (OTel) is a CNCF project formed by merging two older projects, OpenTracing and OpenCensus. It defines:
- An API — what your code calls to create spans.
- An SDK — the implementation that exports them.
- A protocol (OTLP) — how spans/metrics/logs travel over the wire.
- A Collector — a sidecar/daemon that receives, processes, and forwards telemetry.
Crucially, OTel is vendor-neutral. Instrument once, switch backends (Tempo, Jaeger, Datadog, New Relic, Honeycomb) by changing the Collector config.
Auto-Instrumentation
Most languages have an OTel auto-instrumentation agent that wraps common libraries (HTTP clients, frameworks, DB drivers). Drop it in and you get traces with no code changes.
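# Java (port 4317 is OTLP over gRPC, so the protocol is set explicitly; recent agent versions default to http/protobuf)
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=checkout \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://collector:4317 \
  -jar app.jar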
# Node
node --require '@opentelemetry/auto-instrumentations-node/register' app.js
# Python
opentelemetry-instrument --service_name=checkout python app.py
Manual Instrumentation
For business logic auto-instrumentation cannot see, add spans manually:
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout');

async function processOrder(order) {
  // startActiveSpan makes the span current, so spans created inside become its children.
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', order.id);
    span.setAttribute('order.total_cents', order.total);
    try {
      await chargeCard(order);
      await fulfill(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end(); // always end the span, or it is never exported
    }
  });
}
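If several functions need the same try/catch/finally shape, it can be factored into a helper. The sketch below is one possible way to do that, not part of the OTel API: `withSpan` is a hypothetical name, and it takes the tracer as a parameter so it only relies on the `startActiveSpan(name, fn)` method and the span methods already used above.

```javascript
// Numeric values mirror SpanStatusCode in @opentelemetry/api (OK = 1, ERROR = 2).
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Hypothetical helper: runs fn inside a span, records the outcome,
// and always ends the span. `tracer` is any object that exposes
// startActiveSpan(name, fn) the way the OTel API tracer does.
function withSpan(tracer, name, attributes, fn) {
  return tracer.startActiveSpan(name, async (span) => {
    for (const [key, value] of Object.entries(attributes)) {
      span.setAttribute(key, value);
    }
    try {
      const result = await fn(span);
      span.setStatus({ code: STATUS_OK });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: STATUS_ERROR, message: err.message });
      throw err; // the caller still sees the original error
    } finally {
      span.end();
    }
  });
}

// Possible usage, assuming the tracer, chargeCard, and fulfill from the example above:
// const processOrder = (order) =>
//   withSpan(tracer, 'processOrder', { 'order.id': order.id }, async () => {
//     await chargeCard(order);
//     await fulfill(order);
//   });
```

Passing the tracer in keeps the helper easy to test with a stub, at the cost of one extra argument at each call site.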
Context Propagation
For a trace to span services, the trace ID and the caller's span ID must travel with each request across the network. The W3C Trace Context standard defines the traceparent header that carries them.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  trace_id                         parent_span_id   flags
             version
Auto-instrumentation handles this for HTTP, gRPC, Kafka, and most messaging libraries. If you write a custom client, inject the header on outgoing calls and extract it on incoming ones yourself.
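For a custom client, propagation comes down to parsing the incoming header and writing a new one with your own span ID as the parent. The following is a minimal sketch for version 00 of the header only, using Node's built-in crypto module; `parseTraceparent` and `formatTraceparent` are hypothetical helper names, and in real code the OTel propagator API would normally do this for you.

```javascript
const { randomBytes } = require('node:crypto');

// Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming header; returns null when it is malformed.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header).trim());
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // The spec treats version ff and all-zero IDs as invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the header for an outgoing call: same trace ID,
// with the caller's own span ID in the parent position.
function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Continue an incoming trace on an outgoing request.
const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoingSpanId = randomBytes(8).toString('hex'); // new 16-hex-char span ID
const headers = {
  traceparent: formatTraceparent({
    traceId: incoming.traceId,
    spanId: outgoingSpanId,
    sampled: incoming.sampled,
  }),
};
console.log(headers.traceparent);
```

The trace ID stays the same across every hop; only the span ID in the parent position changes, which is how the backend stitches the spans into one tree.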
The OTel Collector
The Collector is a small process that:
- Receives telemetry from your apps (OTLP, Jaeger, Zipkin formats).
- Processes — batching, sampling, attribute scrubbing, redaction.
- Exports to one or many backends (Tempo, Jaeger, Datadog, New Relic, Loki).
receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  tail_sampling:
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: random, type: probabilistic, probabilistic: { sampling_percentage: 5 } }
  batch: {}
exporters:
  otlphttp/tempo: { endpoint: https://tempo:4318 }
  # Datadog exporter from the contrib distribution; DD_API_KEY is an assumed env var name.
  datadog:
    api: { key: "${env:DD_API_KEY}", site: datadoghq.com }
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample first, then batch, so dropped traces are never batched.
      processors: [tail_sampling, batch]
      exporters: [otlphttp/tempo, datadog]
Sampling
Tracing every request gets expensive fast. Two main strategies:
- Head-based: decide at the start of the trace whether to keep it. Cheap and consistent across services. May miss interesting traces.
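- Tail-based: buffer the full trace at the Collector, then keep all errors and slow traces plus a small random sample. This keeps the traces you actually care about, but it costs memory and adds delay at the Collector, and every span of a trace must reach the same Collector instance.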
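Head-based sampling stays consistent across services when the decision is a pure function of the trace ID (or when downstream services simply honor the sampled flag in traceparent). The sketch below is a simplified illustration of the first idea, not OTel's exact sampler algorithm: `shouldSampleHead` is a hypothetical name, and it uses only the low 32 bits of the trace ID.

```javascript
// Map the trace ID to a number in [0, 1) and keep the trace when that
// number falls below the sampling ratio. Every service that sees the
// same trace ID reaches the same verdict without coordination.
function shouldSampleHead(traceId, ratio) {
  const low32 = parseInt(traceId.slice(-8), 16); // last 8 hex chars = 32 bits
  return low32 / 2 ** 32 < ratio;
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSampleHead(traceId, 0.05)); // false (this ID maps to about 0.055)
console.log(shouldSampleHead(traceId, 0.10)); // true
```

Because the verdict is made before anything is known about the request, an error that happens later in an unsampled trace is lost, which is the trade-off tail-based sampling addresses.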
A common production setting: 100% of error traces, 100% of slow traces (>1s), 1–5% of healthy traces.
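That policy can be written down as a small decision function. This is a sketch of the logic only, assuming a hypothetical buffered-trace shape of `{ spans: [{ status, durationMs, parentSpanId }] }` and using the root span's duration as the trace latency; the Collector's own processor is configured declaratively rather than coded like this.

```javascript
// Returns the name of the rule that keeps the trace, or null to drop it.
// `random` is injectable so the healthy-traffic sample can be tested.
function keepTrace(trace, { slowMs = 1000, healthyRatio = 0.05, random = Math.random } = {}) {
  // Rule 1: keep every trace that contains an error span.
  if (trace.spans.some((s) => s.status === 'ERROR')) return 'error';

  // Rule 2: keep every trace whose root span is slower than the threshold.
  const root = trace.spans.find((s) => s.parentSpanId === null);
  if (root && root.durationMs > slowMs) return 'slow';

  // Rule 3: keep a small random share of healthy traces.
  return random() < healthyRatio ? 'random' : null;
}

const checkout = {
  spans: [
    { status: 'OK', durationMs: 1800, parentSpanId: null },
    { status: 'OK', durationMs: 1400, parentSpanId: '1' },
  ],
};
console.log(keepTrace(checkout)); // 'slow'
```

The function needs the whole trace as input, which is exactly why tail sampling has to buffer spans until the trace is complete.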
Backends
| Backend | Notes |
|---|---|
| Jaeger | Long-established open-source tracing, simple, runs anywhere |
| Grafana Tempo | Object-storage-backed, cheap, integrates with Loki and Prometheus |
| Zipkin | Mature, simple, less feature-rich than Jaeger |
| Datadog APM | Excellent UX, full APM features, expensive |
| New Relic, Dynatrace | Enterprise APM |
| Honeycomb | Wide-event model, deep ad-hoc analysis, BubbleUp diff |
| AWS X-Ray, Azure App Insights, Google Cloud Trace | Cloud-native, OTel-compatible |
What Tracing Reveals
- The N+1 problem: a parent span with 200 small child spans that run one after another.
- The silent dependency: a service no one knew was on the critical path.
- The slow-but-rare path: a code branch that times out only for some users.
- The misattributed bottleneck: what you blamed on the DB was the JSON serializer.
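The N+1 pattern is regular enough to detect mechanically once the spans are available. The sketch below assumes a hypothetical span shape of `{ parentSpanId, name, startMs, endMs }` and a hypothetical `findNPlusOne` helper: it groups sibling spans by name and flags groups that are both large and strictly sequential.

```javascript
// Flag groups of same-named sibling spans that repeat many times in serial.
function findNPlusOne(spans, minRepeats = 10) {
  const groups = new Map();
  for (const span of spans) {
    const key = `${span.parentSpanId}|${span.name}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(span);
  }

  const findings = [];
  for (const group of groups.values()) {
    if (group.length < minRepeats) continue;
    const sorted = [...group].sort((a, b) => a.startMs - b.startMs);
    // Serial means each call starts only after the previous one has ended.
    const serial = sorted.every((s, i) => i === 0 || s.startMs >= sorted[i - 1].endMs);
    if (!serial) continue;
    findings.push({
      parentSpanId: sorted[0].parentSpanId,
      name: sorted[0].name,
      count: sorted.length,
      totalMs: sorted.reduce((sum, s) => sum + (s.endMs - s.startMs), 0),
    });
  }
  return findings;
}

// 200 sequential 2 ms queries under one parent span.
const demo = Array.from({ length: 200 }, (_, i) => ({
  parentSpanId: 'a1',
  name: 'db.query SELECT item',
  startMs: i * 3,
  endMs: i * 3 + 2,
}));
console.log(findNPlusOne(demo));
// prints one finding: count 200, totalMs 400
```

Each call looks harmless on its own at 2 ms; only the grouped view shows the 400 ms spent in a loop that one batched query could replace.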
The 80/20 of Adopting Tracing
- Run an OTel Collector (sidecar in Kubernetes, or per-host).
- Add auto-instrumentation to each service.
- Propagate traceparent at every network boundary.
- Send to a backend; even Jaeger running in a single container is enough to start.
- Add manual spans only around your most important business logic.
That gets you 80% of the value. Add tail sampling and richer attributes once you outgrow the basics.