Distributed Tracing with OpenTelemetry — Observability and Monitoring | CertQnA

Distributed tracing answers the question metrics cannot: "where did my request actually spend its time?" In a microservice architecture, that answer often surprises people.

Traces, Spans, and Context

A trace represents a single request flowing through your system.
A span is one unit of work within that trace — an HTTP handler, a DB query, a downstream call.
Each span has a trace_id (shared by all spans in the trace) and a span_id (unique within the trace). Every span except the root links to its parent via parent_span_id.
Spans carry attributes (key-value tags), events (timestamped log entries), and a status (unset, OK, or error).

trace_id = 4bf92...

[POST /checkout] span 1 (root)  duration 1.8s
    │
    ├─[ auth.verify ] span 2  60ms
    │
    ├─[ cart.fetch ] span 3  110ms
    │       │
    │       └─[ redis.GET ] span 4  4ms
    │
    └─[ payment.charge ] span 5  1.4s   ← bottleneck
            │
            └─[ stripe.api ] span 6  1.35s
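
One way to read a trace like this is to compute each span's self time: its duration minus the time spent in its direct children. The sketch below hard-codes the six spans from the diagram. The field names (id, parentId, name, durationMs) are illustrative and not an OpenTelemetry data format, and the subtraction assumes children run one after another without overlapping.

```javascript
// Spans from the diagram above, flattened. Field names are illustrative.
const spans = [
  { id: 1, parentId: null, name: 'POST /checkout', durationMs: 1800 },
  { id: 2, parentId: 1, name: 'auth.verify', durationMs: 60 },
  { id: 3, parentId: 1, name: 'cart.fetch', durationMs: 110 },
  { id: 4, parentId: 3, name: 'redis.GET', durationMs: 4 },
  { id: 5, parentId: 1, name: 'payment.charge', durationMs: 1400 },
  { id: 6, parentId: 5, name: 'stripe.api', durationMs: 1350 },
];

// Self time = own duration minus the durations of direct children.
// Assumes children run serially and never overlap.
function selfTimes(allSpans) {
  const childTotal = new Map(); // parent id -> summed child duration
  for (const s of allSpans) {
    if (s.parentId !== null) {
      childTotal.set(s.parentId, (childTotal.get(s.parentId) || 0) + s.durationMs);
    }
  }
  return allSpans
    .map((s) => ({ name: s.name, selfMs: s.durationMs - (childTotal.get(s.id) || 0) }))
    .sort((a, b) => b.selfMs - a.selfMs);
}

const ranked = selfTimes(spans);
console.log(ranked[0]); // { name: 'stripe.api', selfMs: 1350 }
```

Ranked this way, payment.charge itself accounts for only 50ms; nearly all of its 1.4s is the stripe.api call underneath it.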

OpenTelemetry: The Standard

OpenTelemetry (OTel) is a CNCF project formed by merging two earlier projects, OpenTracing and OpenCensus. It defines:

An API — what your code calls to create spans.
An SDK — the implementation that exports them.
A protocol (OTLP) — how spans/metrics/logs travel over the wire.
A Collector — a sidecar/daemon that receives, processes, and forwards telemetry.

Crucially, OTel is vendor-neutral. Instrument once, switch backends (Tempo, Jaeger, Datadog, New Relic, Honeycomb) by changing the Collector config.

Auto-Instrumentation

Most languages have an OTel auto-instrumentation agent that wraps common libraries (HTTP clients, frameworks, DB drivers). Drop it in and you get traces with no code changes.

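# Java (port 4317 is OTLP over gRPC, so set the protocol explicitly;
# recent agent versions default to http/protobuf on port 4318)
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=checkout \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://collector:4317 \
  -jar app.jar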

# Node
OTEL_SERVICE_NAME=checkout node --require '@opentelemetry/auto-instrumentations-node/register' app.js

# Python
opentelemetry-instrument --service_name=checkout python app.py

Manual Instrumentation

For business logic auto-instrumentation cannot see, add spans manually:

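import { trace, SpanStatusCode } from '@opentelemetry/api';

// The tracer name identifies this instrumentation scope in the backend.
const tracer = trace.getTracer('checkout');

async function processOrder(order) {
  // startActiveSpan makes the new span current, so spans created inside
  // chargeCard and fulfill become its children.
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('order.id', order.id);
    span.setAttribute('order.total_cents', order.total);
    try {
      await chargeCard(order);
      await fulfill(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end(); // always end the span, or it is never exported
    }
  });
}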

Context Propagation

For a trace to span services, the trace ID and the caller's span ID must travel with each request. The W3C Trace Context standard defines the traceparent header that carries them.

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
              │     │                                │                │
            version trace_id                       parent_span_id   flags

Auto-instrumentation handles this for the libraries it wraps, typically HTTP and gRPC clients and servers and common messaging clients such as Kafka. If you write a custom client, inject and extract the header yourself.
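
To make the header layout concrete, here is a minimal sketch of reading and writing traceparent by hand. It handles only version 00 of the format, the function names are hypothetical, and the outgoing span ID is a made-up value. It is meant to show the mechanics and is not a replacement for the SDK's propagator.

```javascript
// Minimal traceparent handling for version 00 of W3C Trace Context.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns { traceId, parentSpanId, sampled } or null if the header is invalid.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header).trim());
  if (!match) return null;
  const [, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Builds the header a custom client would attach to an outgoing request.
// spanId is the caller's current span, which becomes the parent downstream.
function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoing = formatTraceparent({
  traceId: incoming.traceId,
  spanId: 'b7ad6b7169203331', // hypothetical ID of the span created in this service
  sampled: incoming.sampled,
});
console.log(outgoing); // 00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01
```

The trace ID and sampled flag pass through unchanged while the span ID is replaced at each hop. In application code, prefer the propagation API in @opentelemetry/api (propagation.inject and propagation.extract) over hand-rolled parsing.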

The OTel Collector

A small process that:

Receives telemetry from your apps (OTLP, Jaeger, Zipkin formats).
Processes — batching, sampling, attribute scrubbing, redaction.
Exports to one or many backends (Tempo, Jaeger, Datadog, New Relic, Loki).

receivers:
  otlp:
    protocols: { grpc: {}, http: {} }

processors:
  batch: {}
  tail_sampling:
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow,   type: latency,    latency: { threshold_ms: 1000 } }
      - { name: random, type: probabilistic, probabilistic: { sampling_percentage: 5 } }

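exporters:
  otlphttp/tempo:    { endpoint: https://tempo:4318 }
  # Datadog uses its own exporter (Collector contrib distribution), not plain OTLP.
  datadog:
    api:
      site: datadoghq.com
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers:  [otlp]
      # Sample first, then batch what survives.
      processors: [tail_sampling, batch]
      exporters:  [otlphttp/tempo, datadog]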

Sampling

Tracing every request gets expensive fast. Two main strategies:

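Head-based: decide at the start of the trace whether to keep it. Cheap and consistent across services, but the decision is made before you know whether the request will fail or run slowly, so it can miss the interesting traces.
Tail-based: buffer the full trace at the Collector, then keep all errors and slow traces plus a small random sample. It keeps the traces you care about, but costs more memory and adds delay at the Collector, and every span of a trace must reach the same Collector instance.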
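
Head-based sampling stays consistent across services when the decision is a pure function of the trace ID. The sketch below illustrates that idea only. The function name is hypothetical, and the bucketing (last 32 bits of the ID) is an assumption for the example, not the algorithm any particular SDK uses.

```javascript
// Head-based sampling sketch: derive keep/drop from the trace ID so every
// service that sees the same ID reaches the same answer.
function shouldSampleHead(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex characters (32 bits) as a number in [0, 1).
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSampleHead(exampleTraceId, 0.10)); // true  (bucket is about 0.055)
console.log(shouldSampleHead(exampleTraceId, 0.05)); // false
```

With the OTel SDKs you would normally not write this yourself: setting OTEL_TRACES_SAMPLER=parentbased_traceidratio and OTEL_TRACES_SAMPLER_ARG=0.05 selects a built-in ratio sampler that also honors the caller's decision.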

A common production setting: 100% of error traces, 100% of slow traces (>1s), 1–5% of healthy traces.
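
That policy can be sketched as a single decision function applied once a trace is complete. The span fields (startMs, endMs, status), the function name, and the option names are illustrative assumptions. The Collector's tail_sampling processor does the real work, including buffering spans until the trace is ready.

```javascript
// Tail-based decision sketch: keep every error trace, every trace slower
// than slowMs, and a random fraction of the rest.
// Returns the reason the trace was kept, or null if it is dropped.
function tailDecision(spans, { slowMs = 1000, healthyRatio = 0.05, random = Math.random } = {}) {
  if (spans.some((s) => s.status === 'ERROR')) return 'error';
  // Trace duration = earliest span start to latest span end.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  if (end - start > slowMs) return 'slow';
  return random() < healthyRatio ? 'sampled' : null;
}

const failedTrace = [{ startMs: 0, endMs: 90, status: 'ERROR' }];
const slowTrace = [
  { startMs: 0, endMs: 1800, status: 'OK' },
  { startMs: 400, endMs: 1750, status: 'OK' },
];
const fastTrace = [{ startMs: 0, endMs: 120, status: 'OK' }];

console.log(tailDecision(failedTrace));                       // 'error'
console.log(tailDecision(slowTrace));                         // 'slow'
console.log(tailDecision(fastTrace, { random: () => 0.5 }));  // null (dropped)
```

Passing the random source in as an option keeps the healthy-trace branch testable; in normal use the default Math.random applies.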

Backends

Backend	Notes
Jaeger	CNCF-graduated open-source tracing, simple, runs anywhere
Grafana Tempo	Object-storage-backed, cheap, integrates with Loki and Prometheus
Zipkin	Mature, simple, less feature-rich than Jaeger
Datadog APM	Excellent UX, full APM features, expensive
New Relic, Dynatrace	Enterprise APM
Honeycomb	Wide-event model, deep ad-hoc analysis, BubbleUp diff
AWS X-Ray, Azure App Insights, Google Cloud Trace	Cloud-native, OTel-compatible

What Tracing Reveals

The N+1 problem — a parent span with 200 small child spans that run one after another.
The silent dependency — a service no one knew was on the critical path.
The slow-but-rare path — a code branch that times out only for some users.
The misattributed bottleneck — what you blamed on the DB was the JSON serializer.
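
The first pattern is simple enough to detect mechanically from exported spans. The sketch below flags any parent with many same-named children. The span fields (id, parentId, name), the function name, and the threshold of 20 are assumptions for illustration, and it does not check that the children actually ran serially.

```javascript
// Flags parents with many same-named children, a common N+1 signature.
function findNPlusOne(spans, threshold = 20) {
  const perParent = new Map(); // parentId -> Map(childName -> count)
  for (const s of spans) {
    if (s.parentId == null) continue; // root span has no parent
    if (!perParent.has(s.parentId)) perParent.set(s.parentId, new Map());
    const names = perParent.get(s.parentId);
    names.set(s.name, (names.get(s.name) || 0) + 1);
  }
  const nameById = new Map(spans.map((s) => [s.id, s.name]));
  const findings = [];
  for (const [parentId, names] of perParent) {
    for (const [childName, count] of names) {
      if (count >= threshold) {
        findings.push({ parent: nameById.get(parentId), child: childName, count });
      }
    }
  }
  return findings;
}

// Synthetic trace: one handler issuing 200 identical queries.
const syntheticTrace = [
  { id: 'a', parentId: null, name: 'GET /orders' },
  { id: 'b', parentId: 'a', name: 'SELECT orders' },
  ...Array.from({ length: 200 }, (_, i) => ({ id: `q${i}`, parentId: 'a', name: 'SELECT customer' })),
];
const findings = findNPlusOne(syntheticTrace);
console.log(findings); // one finding: GET /orders -> SELECT customer, count 200
```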

The 80/20 of Adopting Tracing

Run an OTel Collector (sidecar in Kubernetes, or per-host).
Add auto-instrumentation to each service.
Propagate traceparent at every network boundary.
Send to a backend. Even Jaeger running in a single container is enough to start.
Add manual spans only around your most important business logic.

That gets you 80% of the value. Add tail sampling and richer attributes once you outgrow the basics.