Monitoring tells you when something is wrong. Observability tells you why. A system is observable when you can answer questions about its internal state just by examining its external outputs — without needing to deploy new instrumentation or restart processes.
Monitoring vs Observability
| | Monitoring | Observability |
|---|---|---|
| Focus | Known failure modes | Unknown and novel failures |
| Question | "Is the service up?" | "Why is checkout slow for users in EU-West?" |
| Tooling | Dashboards, uptime checks | Metrics + logs + traces, correlation across them |
| Approach | Pre-defined alerts for expected failures | Exploratory debugging of production behaviour |
Pillar 1 — Metrics
Metrics are aggregated numerical measurements over time. They are cheap to store and query at scale. Common metric types:
- Counter: Monotonically increasing value (total requests, total errors). Query the rate of increase.
- Gauge: A value that goes up and down (current memory usage, queue depth).
- Histogram: Samples observations into buckets — used for request duration percentiles (p50, p95, p99).
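To make the counter and histogram types concrete, here is a minimal sketch in plain Python. It is not the Prometheus client API: `counter_rate`, `Histogram`, and `quantile_bound` are hypothetical names, the bucket bounds are arbitrary, and counter resets (for example after a process restart) are ignored.

```python
from bisect import bisect_left

def counter_rate(samples):
    """Per-second rate of increase between the first and last (timestamp_s, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

class Histogram:
    """Counts observations into buckets defined by upper bounds (in seconds)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One extra slot at the end acts as the overflow (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value):
        # bisect_left picks the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.bounds, value)] += 1

    def quantile_bound(self, q):
        """Smallest bucket upper bound covering at least a fraction q of observations."""
        target = q * sum(self.counts)
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            if running >= target:
                return bound

# A counter sampled twice, 60 seconds apart.
print(counter_rate([(0, 1000), (60, 1600)]))  # 10.0 requests per second

durations = Histogram([0.1, 0.25, 0.5, 1.0])
for d in [0.05, 0.08, 0.12, 0.2, 0.3, 0.4, 0.45, 0.7, 0.9, 2.0]:
    durations.observe(d)
print(durations.quantile_bound(0.5))   # 0.5
print(durations.quantile_bound(0.95))  # inf (falls in the overflow bucket)
```

The p95 result shows why bucket boundaries matter: a percentile estimated from buckets can only be as precise as the buckets around it, so a tail that lands in the overflow bucket cannot be resolved.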
Prometheus is the de facto standard open-source metrics system in cloud-native environments. It scrapes an HTTP endpoint (/metrics by default) on each of your services at a regular interval and stores the samples in a time-series database. Grafana queries Prometheus (via PromQL) and renders dashboards and alerts.
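What a scrape actually returns is plain text. The sketch below renders a few metrics in the Prometheus text exposition format; in practice a client library generates this for you, and `render_metrics` plus the metric names are hypothetical. Label values are not escaped here, which a real implementation would need to do.

```python
def render_metrics(metrics):
    """Render {name: (type, help, [(labels, value), ...])} as Prometheus exposition text."""
    lines = []
    for name, (metric_type, help_text, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            if label_str:
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

metrics = {
    "http_requests_total": (
        "counter",
        "Total HTTP requests.",
        [({"method": "GET", "status": "200"}, 1027)],
    ),
    "queue_depth": ("gauge", "Jobs waiting in the queue.", [({}, 3)]),
}
print(render_metrics(metrics), end="")
# # HELP http_requests_total Total HTTP requests.
# # TYPE http_requests_total counter
# http_requests_total{method="GET",status="200"} 1027
# # HELP queue_depth Jobs waiting in the queue.
# # TYPE queue_depth gauge
# queue_depth 3
```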
The Four Golden Signals
Google's SRE book defines four golden signals that should be monitored for any user-facing service:
| Signal | What it measures | Example SLI or target |
|---|---|---|
| Latency | Time to serve a request (distinguish success from error latency) | p99 latency < 500ms |
| Traffic | Demand placed on the system | Requests per second |
| Errors | Rate of failed requests | Error rate < 0.1% |
| Saturation | How full the system is (CPU, memory, queue depth) | CPU utilisation < 80% |
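As a rough illustration, the four signals can be computed from a window of request records. This is a sketch under assumptions: the `Request` record, the choice of HTTP 5xx as "failed", and passing CPU utilisation in as a ready-made number are all hypothetical, and the window is assumed to contain at least one successful request.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    duration_s: float
    status: int

def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def golden_signals(requests, window_s, cpu_busy_fraction):
    # Latency is measured on successful requests only, so that fast
    # failures do not make the service look healthier than it is.
    ok_durations = [r.duration_s for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        "latency_p99_s": percentile(ok_durations, 0.99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(failed) / len(requests),
        "saturation": cpu_busy_fraction,
    }

# 100 requests observed over a 10-second window.
window = [Request(0.1, 200)] * 98 + [Request(0.8, 200), Request(0.02, 500)]
print(golden_signals(window, window_s=10, cpu_busy_fraction=0.65))
# {'latency_p99_s': 0.8, 'traffic_rps': 10.0, 'error_rate': 0.01, 'saturation': 0.65}
```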
Pillar 2 — Logs
Logs are immutable, timestamped records of discrete events. Use structured logging (JSON format) so logs can be parsed and queried programmatically:
{"timestamp":"2024-01-15T10:23:45Z","level":"error","service":"checkout","traceId":"a1b2c3","userId":"u-789","message":"payment declined","provider":"stripe","code":"card_declined"}
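One way to produce entries shaped like the one above is a custom formatter on Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the hard-coded service name, and the convention of passing fields under an `extra={"context": ...}` key are assumptions for illustration, not a standard.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "checkout",  # assumption: one service per process
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.error(..., extra={"context": {...}}).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.error(
    "payment declined",
    extra={"context": {"traceId": "a1b2c3", "userId": "u-789",
                       "provider": "stripe", "code": "card_declined"}},
)
```

Carrying the `traceId` on every log line is what later lets you jump from a trace to the logs emitted while that request was being handled.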
Key log management stacks:
- ELK Stack: Elasticsearch (storage + query) + Logstash (collection) + Kibana (visualisation)
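- EFK Stack: Elasticsearch + Fluentd (collection) + Kibana; Fluentd is a CNCF graduated project and a common log forwarder on Kubernetes
- Grafana Loki: Log aggregation designed for cloud-native environments; it indexes only metadata (labels), not the full log content, and stores the log lines themselves as compressed chunks, which typically makes it much cheaper to run than Elasticsearch at scale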
Pillar 3 — Distributed Traces
In a microservices architecture, a single user request may touch ten services. A trace records the end-to-end journey of a request across all services. Each service adds a span to the trace — a timed unit of work with metadata. Traces help you:
- Find which service introduced latency in a slow request
- Understand dependencies between services
- Detect cascading failure patterns
OpenTelemetry (OTel) is a vendor-neutral CNCF project and the de facto standard for distributed tracing (and, increasingly, metrics and logs). Instrument your application once with the OTel SDK, then export to any compatible backend: Jaeger, Zipkin, Grafana Tempo, Datadog, or Honeycomb.
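To show how spans help locate latency, the sketch below models a trace as a flat list of spans and attributes time to each service. This is a toy data model, not the OpenTelemetry SDK API: the `Span` fields and `self_time_by_service` are hypothetical, and child spans are assumed to run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

def self_time_by_service(spans):
    """Time each service spent on its own work, excluding its direct child spans."""
    totals = {}
    for span in spans:
        child_time = sum(c.duration_ms for c in spans if c.parent_id == span.span_id)
        own_time = span.duration_ms - child_time
        totals[span.service] = totals.get(span.service, 0) + own_time
    return totals

trace = [
    Span("a1b2c3", "s1", None, "gateway", 0, 480),
    Span("a1b2c3", "s2", "s1", "checkout", 10, 470),
    Span("a1b2c3", "s3", "s2", "inventory", 20, 60),
    Span("a1b2c3", "s4", "s2", "payments", 70, 460),
]
times = self_time_by_service(trace)
print(times)                      # {'gateway': 20, 'checkout': 30, 'inventory': 40, 'payments': 390}
print(max(times, key=times.get))  # payments
```

The root span alone only says the request took 480 ms; the parent and child relationships are what show that 390 ms of it was spent in the payments service.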
Alerting Best Practices
Bad alerts erode trust and cause alert fatigue — on-call engineers start ignoring pages because most are false positives. Good alerts are:
- Symptom-based, not cause-based: Alert on "user-visible error rate > 1%" not "database CPU > 80%" (the DB may be busy but users unaffected).
- Actionable: Every alert page should require immediate human action. If the action is "check if it resolves itself," make it a ticket, not a page.
- Linked to SLOs: Alert when your error budget is being consumed faster than expected — this directly ties alerts to user impact.
- Calibrated: Regularly review alert history. Question alerts that have never fired and remove those nobody would act on. Fix or delete alerts that fire too often.
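The SLO-linked approach can be sketched as a burn-rate check: the observed error rate divided by the error rate the SLO allows. The function names, the window sizes, and the 14.4 threshold below are illustrative assumptions to be tuned per service, and each window is assumed to contain at least one request.

```python
def burn_rate(errors, total, slo):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1 - slo  # allowed error fraction, e.g. 0.001 for a 99.9% SLO
    return (errors / total) / budget

def should_page(long_window, short_window, slo, threshold):
    """Page only when both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so recovered incidents stop paging.
    """
    return (burn_rate(*long_window, slo) > threshold
            and burn_rate(*short_window, slo) > threshold)

SLO = 0.999  # 99.9% of requests succeed

# (errors, total requests) over the last hour and the last five minutes.
print(should_page((200, 10_000), (30, 1_000), SLO, threshold=14.4))  # True
print(should_page((200, 10_000), (0, 1_000), SLO, threshold=14.4))   # False
```

A burn rate of 1 means the budget would be used up exactly at the end of the SLO period; the first case above burns roughly 20 times faster than that, which is the kind of user-visible symptom worth a page.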
Putting It Together
A modern observability stack for a Kubernetes-based system:
- Metrics: Prometheus scrapes services; Grafana renders dashboards, and alerting rules route pages to PagerDuty
- Logs: Fluentd or Fluent Bit collects pod logs and ships them to Loki or Elasticsearch; Grafana can query either backend
- Traces: OTel SDK instruments services; Tempo or Jaeger receives spans; Grafana links traces to logs and metrics for a unified view
The ability to correlate metrics, logs, and traces — jumping from a high-latency alert to the relevant trace to the error log — is what makes a system truly observable and dramatically reduces mean time to resolution (MTTR).