Alerting and Incident Response

Alerts are the most expensive part of observability — every page costs human attention and sleep. The goal is the smallest, most actionable set of alerts you can run, with everything else as data.

Alert on Symptoms, Not Causes

Symptom alerts come from SLIs — what users actually experience. Causal alerts ("CPU > 90%") are noisy and miss problems that show up in unexpected ways.

Bad alert (cause)	Good alert (symptom)
CPU > 90%	p99 latency > SLO threshold
Disk > 80%	Writes failing for > 5 min
Pod restart count > 5	Error-budget burn rate above threshold
Memory at 95%	Service availability below SLO

Causal metrics still belong on dashboards — but they should not page. They help diagnose why the symptom alert fired.

The Three-Tier Severity Model

Severity	Channel	Response
Page (P1)	PagerDuty / Opsgenie phone call	Wakes a human; respond within 5–15 min
Ticket (P2)	Slack channel + Jira ticket	Review by the next business day
FYI (P3)	Email / low-traffic Slack channel	Informational only; may indicate a trend

Pages are sacred. If a page does not require immediate human action, downgrade it to a ticket. If you cannot tell what action is required, the alert is broken.
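
The downgrade rule above can be written as a small decision function. This is only a sketch: `Severity`, `classify`, and the two boolean inputs are hypothetical names chosen for illustration, not part of any paging tool.

```python
from enum import Enum


class Severity(Enum):
    PAGE = "P1"
    TICKET = "P2"
    FYI = "P3"


def classify(needs_immediate_action: bool, needs_any_action: bool) -> Severity:
    """Pick a tier: only work that cannot wait is allowed to page."""
    if needs_immediate_action:
        return Severity.PAGE
    if needs_any_action:
        return Severity.TICKET
    return Severity.FYI


print(classify(needs_immediate_action=False, needs_any_action=True))  # Severity.TICKET
```

If neither question can be answered for a given alert, that is the "alert is broken" case and the rule itself needs rework before it is assigned a tier.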

Every Alert Needs a Runbook

The runbook linked from every alert rule's annotations should cover:

What the alert means in business terms.
How to confirm the problem (dashboard URL, log query).
The first three diagnostic steps.
Common causes and known fixes.
How to escalate or roll back.
Owner team / Slack channel.

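- alert: ApiHighErrorRate
  expr: ...
  for: 5m                  # condition must hold for 5 minutes before the alert fires
  labels:
    severity: page         # Alertmanager routes match on labels, not annotations
    team: checkout
  annotations:
    summary: "Checkout API error rate above SLO"
    description: "Error rate is {{ $value }}%; the 99.9% SLO allows 0.1%."
    runbook: "https://runbooks/checkout-api/high-error-rate"
    dashboard: "https://grafana/d/checkout/overview"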

Rule: no alert ships to production without a runbook URL. This forces you to think through what the on-call engineer will actually do.
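
One way to enforce this rule is a check in CI that rejects rules without a runbook link. The sketch below assumes the rule file has already been parsed into plain dictionaries and that the annotation key is `runbook`, as in the example above. The function name and the `QueueBacklog` rule are hypothetical.

```python
def rules_missing_runbook(rules: list[dict]) -> list[str]:
    """Return the names of alert rules whose annotations lack a runbook URL."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        runbook = str(annotations.get("runbook", "")).strip()
        # Treat anything that is not an http(s) URL as missing.
        if not runbook.startswith(("http://", "https://")):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


rules = [
    {"alert": "ApiHighErrorRate",
     "annotations": {"runbook": "https://runbooks/checkout-api/high-error-rate"}},
    {"alert": "QueueBacklog",  # hypothetical rule with no runbook
     "annotations": {"summary": "Backlog growing"}},
]
print(rules_missing_runbook(rules))  # ['QueueBacklog']
```

A non-empty result would fail the build, so the missing runbook is noticed at review time rather than during a page.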

Tuning Alerts

Two failure modes:

Too quiet — real incidents go undetected. Fix by adding burn-rate alerts on SLOs.
Too noisy — alert fatigue, real ones get ignored. Fix by:
- Adding a for: duration so brief blips do not page.
- Requiring multi-window confirmation (both the 5m and the 1h window elevated).
- Routing low-severity to ticket, not page.
- Deleting alerts that fire weekly without action.
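
The burn-rate and multi-window ideas can be combined in one check. The sketch below treats burn rate as the observed error ratio divided by the error budget (1 minus the SLO target) and pages only when both windows exceed a threshold. The 99.9% target and the 14.4 threshold are assumptions for illustration (14.4 is the fast-burn value commonly used for a 30-day SLO window); the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(err_5m: float, err_1h: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the short and the long window exceed the threshold."""
    return (burn_rate(err_5m, slo_target) > threshold
            and burn_rate(err_1h, slo_target) > threshold)


print(should_page(err_5m=0.05, err_1h=0.0005))  # brief blip: False
print(should_page(err_5m=0.05, err_1h=0.02))    # sustained burn: True
```

The long window shows that the problem is significant, and the short window shows that it is still happening, so the alert clears soon after recovery.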

Track page count per week. A healthy on-call rotation gets at most 2–5 pages per week, and each of them should be actionable.
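
Tracking this does not need special tooling. As a rough sketch, assuming you can export the date of each page from your paging tool, the following counts pages per ISO week and flags weeks above a limit. The function names and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def pages_per_week(page_dates):
    """Count pages per (ISO year, ISO week number)."""
    return Counter(tuple(d.isocalendar())[:2] for d in page_dates)


def noisy_weeks(page_dates, limit=5):
    """Weeks whose page count exceeds `limit` (5 is the upper bound quoted above)."""
    counts = pages_per_week(page_dates)
    return sorted(week for week, n in counts.items() if n > limit)


# Hypothetical page log: six pages in ISO week 10 of 2024, two in week 11.
pages = [date(2024, 3, day) for day in (4, 4, 5, 6, 8, 9, 11, 12)]
print(noisy_weeks(pages))  # [(2024, 10)]
```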

Routing and Deduplication

Alertmanager, together with a paging tool such as PagerDuty or Opsgenie, handles:

Grouping: collapse 200 pod-restart alerts into one notification.
Routing: send the alert to the owning service team's on-call, not the platform team.
Inhibition: if the cluster is down, suppress per-service alerts, since the cluster alert already explains them.
Silences: temporary mutes during planned maintenance.
Escalation: if not acknowledged in N minutes, page the secondary or a manager. This is typically configured in the paging tool, since Alertmanager has no escalation policies of its own.

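# Receivers (pager-default, pager-checkout, slack-tickets) must be defined under a top-level receivers: key.
route:
  group_by: ['alertname', 'service']   # one notification per alertname + service
  group_wait: 30s                      # buffer before the first notification for a new group
  group_interval: 5m                   # wait before notifying about new alerts in an existing group
  repeat_interval: 4h                  # wait before re-sending a notification that is still firing
  receiver: pager-default              # fallback when no child route matches
  routes:
    - matchers: [severity="page", team="checkout"]
      receiver: pager-checkout
    - matchers: [severity="ticket"]
      receiver: slack-tickets

inhibit_rules:
  # While ClusterDown fires, mute page-severity alerts from the same cluster.
  - source_matchers: [alertname=ClusterDown]
    target_matchers: [severity="page"]
    equal: [cluster]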
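
To make grouping and inhibition concrete, here is a simplified model of what the configuration above asks for. It is not Alertmanager's implementation: matchers are reduced to exact label equality, self-inhibition is handled by simply skipping the alert itself, and the function names and sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (missing label = "")."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def matches(alert, matchers):
    """Exact-equality matching only; real matchers also support regex and negation."""
    return all(alert.get(name) == value for name, value in matchers.items())


def inhibited(alert, firing, source, target, equal):
    """True if alert matches target and another firing alert matches source
    while agreeing with it on every label listed in equal."""
    if not matches(alert, target):
        return False
    return any(
        other is not alert
        and matches(other, source)
        and all(other.get(label) == alert.get(label) for label in equal)
        for other in firing
    )


alerts = [
    {"alertname": "ClusterDown", "severity": "page", "cluster": "eu-1"},
    {"alertname": "ApiHighErrorRate", "service": "checkout",
     "severity": "page", "cluster": "eu-1"},
    {"alertname": "ApiHighErrorRate", "service": "checkout",
     "severity": "page", "cluster": "us-1"},
]
source = {"alertname": "ClusterDown"}
target = {"severity": "page"}

print(len(group_alerts(alerts, ["alertname", "service"])))       # 2
print(inhibited(alerts[1], alerts, source, target, ["cluster"]))  # True
print(inhibited(alerts[2], alerts, source, target, ["cluster"]))  # False
```

The two checkout alerts collapse into one notification group, and only the one in the cluster that is down gets muted.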

On-Call Practices

Rotation size of 6–8 engineers; primary + secondary; one-week shifts.
Compensation policy (extra pay or time off) recognises off-hours work.
Follow-the-sun rotations across regions where headcount allows — minimises night pages.
Hand-off rituals: brief outgoing engineer, review last week's pages.
Right of refusal: noisy alerts can be silenced by on-call and triaged the next day.
Onboarding: new engineers shadow a shift before serving as primary, and they own the runbook updates.

The Incident Response Loop

Detect: an alert fires.
Acknowledge: the on-call engineer acknowledges the page within minutes.
Declare: open an incident channel; appoint an Incident Commander.
Contain: stop the bleeding by rolling back, throttling, failing over, or scaling up.
Diagnose: use metrics, traces, and logs to find the cause.
Resolve: apply the fix; confirm metrics have returned to normal.
Close: communicate the resolution; schedule the postmortem.

The Incident Commander coordinates; they do not do the technical work. Communications, scribe, and tech leads are separate roles for big incidents.

Blameless Postmortems

Within a few business days, write up:

Timeline of what happened.
Customer impact (numbers, duration, blast radius).
Contributing factors (technical, organisational, process).
What went well, what went poorly, where we got lucky.
Action items with owners and deadlines.

Blameless means assuming everyone acted reasonably given what they knew at the time. The goal is system improvement, not finger-pointing. Engineers will only volunteer "I deployed the bad config at 14:30" in a culture where doing so is safe.

Status Pages and Comms

Public status page (Atlassian Statuspage, Instatus) for customer-impacting incidents.
Update every 30–60 minutes during an incident, even with "still investigating".
Internal incident channel for engineering coordination, separate from customer comms.
A clear template for incident updates; most teams find that a consistent format reduces panic.

The Health Indicator That Matters

Mature observability is measurable in three numbers:

Time to detect (TTD) — incident start to alert.
Time to acknowledge (TTA) — alert to human responding.
Time to resolve (TTR) — incident start to fix deployed.

If TTD is hours, your alerts need work. If TTR is hours, your runbooks and tooling do. Track these per incident; trend them over quarters.
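
Computing the three numbers per incident is simple once the four timestamps are recorded. A minimal sketch follows, using the definitions above; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime


def response_times(started, alerted, acknowledged, fix_deployed):
    """Return TTD, TTA, and TTR as timedeltas, per the definitions above."""
    return {
        "ttd": alerted - started,        # incident start to alert
        "tta": acknowledged - alerted,   # alert to human responding
        "ttr": fix_deployed - started,   # incident start to fix deployed
    }


# Hypothetical incident timeline.
t = response_times(
    started=datetime(2024, 3, 1, 14, 30),
    alerted=datetime(2024, 3, 1, 14, 37),
    acknowledged=datetime(2024, 3, 1, 14, 41),
    fix_deployed=datetime(2024, 3, 1, 15, 20),
)
print(t["ttd"], t["tta"], t["ttr"])  # 0:07:00 0:04:00 0:50:00
```

Storing these values alongside each postmortem makes the quarterly trend a simple aggregation.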

Course Conclusion

You can now design and operate an observability stack: pick the right signals, instrument with OpenTelemetry, run Prometheus + Grafana or a commercial APM, define SLIs/SLOs/error budgets, and respond to incidents like a team that has done it before. The certification exams (DOP-C02, AZ-400, Google PCA / SRE) test many of these concepts.

The remaining work is hands-on: instrument a real service, build the dashboards, write the runbooks, take a few on-call shifts. The fundamentals do not change — only your fluency improves with practice.