Observability: Metrics, Logs, and Traces

Monitoring tells you when something is wrong. Observability tells you why. A system is observable when you can answer questions about its internal state just by examining its external outputs — without needing to deploy new instrumentation or restart processes.

Monitoring vs Observability

	Monitoring	Observability
Focus	Known failure modes	Unknown and novel failures
Question	"Is the service up?"	"Why is checkout slow for users in EU-West?"
Tooling	Dashboards, uptime checks	Metrics + logs + traces, correlation across them
Approach	Pre-defined alerts for expected failures	Exploratory debugging of production behaviour

Pillar 1 — Metrics

Metrics are numerical measurements aggregated over time. Because each metric is stored as a time series of numbers rather than as one record per event, metrics are cheap to store and query at scale. Common metric types:

Counter: Monotonically increasing value (total requests, total errors). Query the rate of increase.
Gauge: A value that goes up and down (current memory usage, queue depth).
Histogram: Counts observations in configurable buckets; used to estimate request duration percentiles (p50, p95, p99).
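The three metric types above can be sketched in plain Python. This is an illustration of their semantics only, not the API of any metrics library: the class names, method names, and the linear interpolation used for percentiles are assumptions made for this example.

```python
import bisect


class Counter:
    """Monotonically increasing value; it can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """A value that can be set to anything at any time."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, keyed by bucket upper bound."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra slot catches observations above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile(self, q):
        """Estimate the q-quantile by linear interpolation inside a bucket."""
        if self.total == 0:
            return None
        rank = q * self.total
        seen = 0
        lower = 0.0
        for bound, count in zip(self.bounds, self.counts):
            if count and seen + count >= rank:
                return lower + (bound - lower) * (rank - seen) / count
            seen += count
            lower = bound
        # The rank falls in the overflow bucket; the top bound is the best estimate.
        return self.bounds[-1]
```

Because a histogram keeps only bucket counts, a percentile read from it is an estimate whose accuracy depends on how well the bucket bounds match the real distribution of durations.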

Prometheus is the de facto standard open-source metrics system in cloud-native environments. It scrapes an HTTP endpoint (/metrics by default) on your services at regular intervals and stores the samples in its time-series database. Grafana queries Prometheus (via PromQL) to render dashboards and evaluate alert rules.
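What Prometheus reads from the /metrics endpoint is plain text, one sample per line. The sketch below renders one metric family in a simplified form of that text format. The function name render_metric is hypothetical, and escaping of special characters in label values is left out; a real service would normally rely on an official client library rather than hand-written rendering.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in a simplified Prometheus text format.

    `samples` is a list of (labels_dict, value) pairs.
    Assumption: label values contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total",
        "counter",
        "Total HTTP requests.",
        [({"method": "GET", "status": "200"}, 1027)],
    ))
```

Each distinct combination of label values is stored as its own time series, so labels with many possible values (user IDs, for example) are usually kept out of metrics and put in logs or traces instead.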

The Four Golden Signals

Google's Site Reliability Engineering book defines four golden signals that should be monitored for any user-facing service:

Signal	What it measures	Example SLI or target
Latency	Time to serve a request (distinguish success from error latency)	p99 latency < 500ms
Traffic	Demand placed on the system	Requests per second
Errors	Rate of failed requests	Error rate < 0.1%
Saturation	How full the system is (CPU, memory, queue depth)	CPU utilisation < 80%
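As a rough illustration of the table above, the four signals can be computed from one window of request records. The record fields (duration_ms, ok), the function name, and the choice of CPU utilisation as the saturation measure are assumptions for this sketch; in practice these values usually come from the metrics system rather than from raw records.

```python
import math


def golden_signals(requests, window_seconds, cpu_utilisation):
    """Summarise one time window of request records.

    Each record is a dict with `duration_ms` (float) and `ok` (bool).
    `cpu_utilisation` is a fraction between 0 and 1 reported by the host.
    """
    total = len(requests)
    failed = sum(1 for r in requests if not r["ok"])
    # Latency of successful requests only, so fast failures do not hide slowness.
    ok_durations = sorted(r["duration_ms"] for r in requests if r["ok"])
    if ok_durations:
        # Nearest-rank p99: smallest value with at least 99% of samples at or below it.
        index = max(0, math.ceil(0.99 * len(ok_durations)) - 1)
        p99 = ok_durations[index]
    else:
        p99 = None
    return {
        "latency_p99_ms": p99,
        "traffic_rps": total / window_seconds,
        "error_rate": failed / total if total else 0.0,
        "saturation_cpu": cpu_utilisation,
    }
```

Keeping failed requests out of the latency figure follows the note in the table: errors often return quickly, and mixing them in would make the service look faster than it is for successful requests.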

Pillar 2 — Logs

Logs are immutable, timestamped records of discrete events. Use structured logging (JSON format) so logs can be parsed and queried programmatically:

{"timestamp":"2024-01-15T10:23:45Z","level":"error","service":"checkout","traceId":"a1b2c3","userId":"u-789","message":"payment declined","provider":"stripe","code":"card_declined"}
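One way to produce entries shaped like the one above is a custom formatter for Python's standard logging module. This is a minimal sketch: the JsonFormatter class, the build_logger helper, and the convention of passing extra fields under a "fields" key are all assumptions made for this example, not part of any logging standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


log = build_logger("checkout")
log.error(
    "payment declined",
    extra={"fields": {"service": "checkout", "traceId": "a1b2c3", "provider": "stripe"}},
)
```

Including the traceId in every entry is what later allows a log line to be matched to the trace of the request that produced it.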

Key log management stacks:

ELK Stack: Elasticsearch (storage + query) + Logstash (collection) + Kibana (visualisation)
EFK Stack: Elasticsearch + Fluentd (collection) + Kibana. Fluentd is a CNCF graduated project and a common choice of log forwarder on Kubernetes
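Grafana Loki: Log aggregation designed for cloud-native environments; it indexes only a small set of labels (metadata) per log stream rather than the full log content, and stores the log lines themselves as compressed chunks, which typically makes it cheaper to run than Elasticsearch at scale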

Pillar 3 — Distributed Traces

In a microservices architecture, a single user request may touch ten or more services. A trace records the end-to-end journey of one request across all of them. Each service contributes one or more spans to the trace; a span is a timed unit of work with metadata such as the operation name, start and end times, and a reference to its parent span. Traces help you:

Find which service introduced latency in a slow request
Understand dependencies between services
Detect cascading failure patterns
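The first use listed above, finding which service introduced latency, can be sketched with a small span model. The Span fields and the self-time heuristic here are assumptions for illustration and do not mirror any tracing SDK; the sketch also assumes that the children of a span run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Return {span_id: time spent in the span itself, excluding its children}.

    Assumes sequential children; overlapping children would need interval
    merging instead of a plain sum.
    """
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def slowest_service(spans):
    """Name the service whose span has the largest self time."""
    times = self_times(spans)
    worst = max(spans, key=lambda s: times[s.span_id])
    return worst.service
```

Comparing self time rather than total duration matters because a parent span is always at least as long as its children: the root span is the longest in every trace, but it is rarely where the time is actually spent.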

OpenTelemetry (OTel) is the CNCF vendor-neutral standard for distributed tracing (and metrics and logs). Instrument your application once with the OTel SDK, then export to any backend: Jaeger, Zipkin, Tempo, Datadog, or Honeycomb.

Alerting Best Practices

Bad alerts erode trust and cause alert fatigue — on-call engineers start ignoring pages because most are false positives. Good alerts are:

Symptom-based, not cause-based: Alert on "user-visible error rate > 1%" not "database CPU > 80%" (the DB may be busy but users unaffected).
Actionable: Every alert page should require immediate human action. If the action is "check if it resolves itself," make it a ticket, not a page.
Linked to SLOs: Alert when your error budget is being consumed faster than expected — this directly ties alerts to user impact.
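Calibrated: Regularly review alert history. Re-examine alerts that have never fired, and remove them if the condition they guard against is no longer relevant or is covered elsewhere. Fix alerts that fire too often.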
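The SLO-linked alerting described above is often expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is one possible formulation, not a prescribed one. The function names, the use of two windows, and the threshold passed in by the caller are assumptions for this example.

```python
def burn_rate(error_rate, slo_target):
    """How fast the error budget is being spent, relative to plan.

    A value of 1.0 spends exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_page(long_window_error_rate, short_window_error_rate, slo_target, threshold):
    """Page only when both a long and a short window burn faster than threshold.

    The long window shows the problem is significant; the short window shows
    it is still happening.
    """
    return (
        burn_rate(long_window_error_rate, slo_target) >= threshold
        and burn_rate(short_window_error_rate, slo_target) >= threshold
    )
```

As a worked example, with a 99.9% SLO the budget is 0.1% of requests, so a 0.5% error rate is a burn rate of about 5. A burn rate of 14.4 sustained for one hour would consume 2% of a 30-day budget (14.4 / 720 hours), which is the kind of threshold that justifies a page rather than a ticket.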

Putting It Together

A modern observability stack for a Kubernetes-based system:

Metrics: Prometheus scrapes services; Grafana provides dashboards, and alerting rules send pages to PagerDuty
Logs: Fluentd/Fluent Bit collects pod logs and sends to Loki or Elasticsearch; Grafana queries both
Traces: OTel SDK instruments services; Tempo or Jaeger receives spans; Grafana links traces to logs and metrics for a unified view

The ability to correlate metrics, logs, and traces (jumping from a high-latency alert to the relevant trace, and from there to the error log) is what makes a system truly observable, because it shortens the path from symptom to cause and so reduces mean time to resolution (MTTR).
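The jump from a slow trace to its error logs can be sketched as a join on the trace ID. This assumes that log entries are dictionaries carrying a traceId field and an ISO 8601 UTC timestamp, as in the structured log example earlier, and that trace durations are already available as a mapping; the function names are hypothetical.

```python
def logs_for_trace(log_entries, trace_id):
    """Return the structured log entries that belong to one trace, oldest first."""
    matching = [e for e in log_entries if e.get("traceId") == trace_id]
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    return sorted(matching, key=lambda e: e["timestamp"])


def explain_slow_traces(traces, log_entries, latency_threshold_ms):
    """For each trace slower than the threshold, collect its error-level messages.

    `traces` maps trace ID to total duration in milliseconds.
    """
    findings = {}
    for trace_id, duration_ms in traces.items():
        if duration_ms > latency_threshold_ms:
            findings[trace_id] = [
                e["message"]
                for e in logs_for_trace(log_entries, trace_id)
                if e.get("level") == "error"
            ]
    return findings
```

The join only works if every service writes the same trace ID into its logs, which is why propagating that ID across service boundaries is a precondition for correlation.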