7 min read·Lesson 10 of 10

Observability: Metrics, Logs, and Traces

Learn the three pillars of observability — metrics, logs, and traces — and how tools like Prometheus, Grafana, and OpenTelemetry give DevOps teams deep insight into production systems.

Monitoring tells you when something is wrong. Observability tells you why. A system is observable when you can answer questions about its internal state just by examining its external outputs — without needing to deploy new instrumentation or restart processes.

Monitoring vs Observability

| | Monitoring | Observability |
| --- | --- | --- |
| Focus | Known failure modes | Unknown and novel failures |
| Question | "Is the service up?" | "Why is checkout slow for users in EU-West?" |
| Tooling | Dashboards, uptime checks | Metrics + logs + traces, correlation across them |
| Approach | Pre-defined alerts for expected failures | Exploratory debugging of production behaviour |

Pillar 1 — Metrics

Metrics are aggregated numerical measurements over time. They are cheap to store and query at scale. Common metric types:

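  • Counter: Monotonically increasing value (total requests, total errors). The raw total is rarely useful on its own, so you typically query its rate of increase over a time window.
  • Gauge: A value that goes up and down (current memory usage, queue depth).
  • Histogram: Counts observations into configurable buckets; used to estimate request duration percentiles (p50, p95, p99). The accuracy of the estimate depends on how the bucket boundaries are chosen.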
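The three metric types above can be sketched in plain Python. This is a minimal illustration, not the API of any real metrics client: the class names, the `quantile` method, and the bucket bounds are assumptions made for the example. The quantile estimate interpolates linearly inside the matching bucket, which is why bucket boundaries limit its accuracy.

```python
import bisect


class Counter:
    """Monotonically increasing value; it can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """A value that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets defined by upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per finite bound, plus a final overflow (+Inf) slot.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile(self, q):
        """Estimate a quantile by linear interpolation inside one bucket."""
        if self.total == 0:
            return None
        rank = q * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= rank:
                if i == len(self.bounds):
                    # Overflow bucket: report the highest finite bound.
                    return self.bounds[-1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                return lower + (upper - lower) * (rank - seen) / count
            seen += count
        return self.bounds[-1]
```

Note that the histogram never stores individual observations, only per-bucket counts. That is what keeps it cheap, and also why a percentile read from it is an estimate rather than an exact value.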

Prometheus is the standard open-source metrics system in cloud-native environments. It scrapes an HTTP endpoint (/metrics) on your services at regular intervals and stores the data in a time-series database. Grafana queries Prometheus (via PromQL) and renders dashboards and alerts.
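To make the scrape model concrete, here is a rough sketch of the kind of text a service returns from its metrics endpoint, using the Prometheus text exposition format (`# HELP` and `# TYPE` lines followed by samples). The function `render_metrics`, the tuple layout, and the metric names are hypothetical. The sketch assumes samples for the same metric are passed in consecutively and skips label-value escaping; a real service would normally rely on an official client library instead.

```python
def render_metrics(samples):
    """Render samples as Prometheus-style exposition text.

    `samples` is a list of (name, metric_type, help_text, labels, value)
    tuples, with samples of the same metric grouped together.
    """
    lines = []
    described = set()
    for name, metric_type, help_text, labels, value in samples:
        if name not in described:
            # Metadata lines are emitted once per metric name.
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} {metric_type}")
            described.add(name)
        if labels:
            label_str = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics([
        ("http_requests_total", "counter", "Total HTTP requests.",
         {"method": "GET", "status": "200"}, 1027),
        ("queue_depth", "gauge", "Jobs waiting in the queue.", {}, 4),
    ]))
```

Each scrape reads the current values; the time-series database turns successive scrapes into a history that PromQL can query.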

The Four Golden Signals

Google SRE defines four golden signals that should be monitored for any user-facing service:

| Signal | What it measures | Example SLI |
| --- | --- | --- |
| Latency | Time to serve a request (distinguish success from error latency) | p99 latency < 500ms |
| Traffic | Demand placed on the system | Requests per second |
| Errors | Rate of failed requests | Error rate < 0.1% |
| Saturation | How full the system is (CPU, memory, queue depth) | CPU utilisation < 80% |
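As a rough illustration of how the four signals could be derived from raw request data, the sketch below computes them for one time window. The `Request` record, the choice of HTTP 5xx as "failed", and passing CPU utilisation in as a ready-made number are all assumptions for the example. In practice these values usually come from a metrics system rather than from in-process lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_utilisation):
    """Summarise one window of traffic as the four golden signals."""
    # Latency is computed over successful requests only, so that fast
    # failures do not make the service look healthier than it is.
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        "latency_p99_ms": percentile(ok_durations, 99) if ok_durations else None,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": cpu_utilisation,
    }
```

Comparing each value against its target (for example, `latency_p99_ms < 500`) turns the indicator into a pass/fail check.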

Pillar 2 — Logs

Logs are immutable, timestamped records of discrete events. Use structured logging (JSON format) so logs can be parsed and queried programmatically:

{"timestamp":"2024-01-15T10:23:45Z","level":"error","service":"checkout","traceId":"a1b2c3","userId":"u-789","message":"payment declined","provider":"stripe","code":"card_declined"}
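One way to emit log lines shaped like the example above is a small JSON formatter on top of Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` and `build_logger` names, and the convention of passing per-event fields under a `context` key in `extra`, are assumptions of this example rather than anything the `logging` module prescribes.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Merge per-event fields passed via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Return a logger that writes JSON lines (to stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.error(
        "payment declined",
        extra={"context": {"traceId": "a1b2c3", "code": "card_declined"}},
    )
```

Including the trace ID in every log line is what later allows a log entry to be matched to the trace of the request that produced it.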

Key log management stacks:

  • ELK Stack: Elasticsearch (storage + query) + Logstash (collection and processing) + Kibana (visualisation)
  • EFK Stack: Elasticsearch + Fluentd (collection) + Kibana. Fluentd is a CNCF graduated project and a common log forwarder on Kubernetes
  • Grafana Loki: Log aggregation designed for cloud-native environments; it indexes only labels (metadata) rather than the full log text and stores the log content as compressed chunks, which is typically much cheaper than Elasticsearch at scale

Pillar 3 — Distributed Traces

In a microservices architecture, a single user request may touch ten services. A trace records the end-to-end journey of a request across all services. Each service adds a span to the trace — a timed unit of work with metadata. Traces help you:

  • Find which service introduced latency in a slow request
  • Understand dependencies between services
  • Detect cascading failure patterns

OpenTelemetry (OTel) is the CNCF vendor-neutral standard for distributed tracing (and metrics and logs). Instrument your application once with the OTel SDK, then export to any backend: Jaeger, Zipkin, Tempo, Datadog, or Honeycomb.
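To show how spans help locate latency, here is a simplified, library-free model of a trace. It is not the OpenTelemetry API: the `Span` fields and the `self_times` and `slowest_service` helpers are hypothetical, and the sketch assumes the child spans of any parent run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Time spent in each span itself, excluding its direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0)
        for s in spans
    }


def slowest_service(spans):
    """Return (service, self_time_ms) for the span with the most self time."""
    times = self_times(spans)
    worst = max(spans, key=lambda s: times[s.span_id])
    return worst.service, times[worst.span_id]
```

A root span that looks slow often turns out to have little self time: most of its duration is spent waiting on a downstream span, which is where the investigation should go.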

Alerting Best Practices

Bad alerts erode trust and cause alert fatigue — on-call engineers start ignoring pages because most are false positives. Good alerts are:

  • Symptom-based, not cause-based: Alert on "user-visible error rate > 1%" not "database CPU > 80%" (the DB may be busy but users unaffected).
  • Actionable: Every alert page should require immediate human action. If the action is "check if it resolves itself," make it a ticket, not a page.
  • Linked to SLOs: Alert when your error budget is being consumed faster than expected — this directly ties alerts to user impact.
  • Calibrated: Regularly review alert history. Remove or rework alerts that never lead to action. Fix alerts that fire too often.
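The SLO-linked alerting idea above is often expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is one possible formulation under stated assumptions. The function names are hypothetical, and requiring both a short and a long window to exceed the threshold is a design choice made here to reduce pages for brief blips, not something the lesson prescribes.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent.

    1.0 means errors arrive exactly at the rate the SLO allows;
    2.0 means the budget would be used up in half the SLO period.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(short_window, long_window, slo_target, threshold):
    """Page only when both windows burn budget faster than `threshold`.

    Each window is an (errors, requests) pair.
    """
    return all(
        burn_rate(errors, requests, slo_target) > threshold
        for errors, requests in (short_window, long_window)
    )
```

For example, with a 99.9% SLO, 20 errors in 1,000 requests is a burn rate of about 20: the budget is being spent roughly twenty times faster than planned. The right threshold depends on the SLO period and on how quickly a human needs to respond.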

Putting It Together

A modern observability stack for a Kubernetes-based system:

  • Metrics: Prometheus scrapes services; Grafana renders dashboards, and alerting rules route pages to PagerDuty
  • Logs: Fluentd/Fluent Bit collects pod logs and sends to Loki or Elasticsearch; Grafana queries both
  • Traces: OTel SDK instruments services; Tempo or Jaeger receives spans; Grafana links traces to logs and metrics for a unified view

The ability to correlate metrics, logs, and traces — jumping from a high-latency alert to the relevant trace to the error log — is what makes a system truly observable and dramatically reduces mean time to resolution (MTTR).
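The jump from alert to trace to log can be pictured as a join on a shared trace ID. The sketch below uses plain dictionaries as stand-ins for trace summaries and structured log entries. The field names (`service`, `duration_ms`, `traceId`, `level`) and the `correlate` helper are assumptions for illustration; in a real stack the observability backend performs this lookup.

```python
def correlate(alert_service, traces, logs, latency_threshold_ms):
    """Pair each slow trace of the alerting service with its error logs.

    `traces` and `logs` are lists of dicts that share a `traceId` field.
    """
    results = []
    for trace in traces:
        is_slow = trace["duration_ms"] > latency_threshold_ms
        if trace["service"] == alert_service and is_slow:
            related = [
                entry for entry in logs
                if entry.get("traceId") == trace["traceId"]
                and entry.get("level") == "error"
            ]
            results.append((trace, related))
    return results
```

The join only works if every signal carries the same identifier, which is why propagating the trace ID into log lines matters so much.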

Key Takeaways

  • Observability is the ability to understand system behaviour from its outputs — not just whether it is up or down.
  • The three pillars are metrics (aggregated numbers), logs (event records), and traces (request journeys).
  • Prometheus scrapes metrics; Grafana visualises them with dashboards and alerts.
  • Google's four golden signals — latency, traffic, errors, and saturation — are a practical starting point for any service.
  • OpenTelemetry is the vendor-neutral standard for instrumenting applications for metrics, logs, and traces.