Runbooks, Playbooks, and Decision Trees (Incident Management & On-Call Engineering)

The engineer who built the payments service knows exactly what to do when the database pool fills up. The engineer paged at 3am who normally owns the search service does not. Runbooks and playbooks bridge this gap — packaging the domain expert's knowledge into a format that any competent engineer can execute under pressure.

Runbooks vs Playbooks

Aspect	Runbook	Playbook
Purpose	Execute a specific operational task	Navigate a complex incident scenario
Structure	Step-by-step procedure	Decision tree + linked runbooks
Trigger	Called from an alert or playbook	Triggered by an incident type
Examples	"Restart the payments service", "Rotate the API key", "Promote the DB replica"	"Payments outage", "Data breach response", "Database unavailable"

A playbook says "if checkout is down, start here, and follow these decision paths." A runbook says "to restart the payments service, do steps 1–8."

The Anatomy of a Good Runbook

# RB-047: Restart the Payments Service

**Severity context:** Appropriate for SEV3/4. For SEV1/2, call @alice or @bob before proceeding.
**Last reviewed:** 2026-04-10 by @alice
**Linked alerts:** payments-high-error-rate, payments-pod-crashlooping

---
## When to use this runbook
Use when payments-service pods are crash-looping or error rate is elevated and a 
simple restart is the intended mitigation (no DB or auth issues indicated).

## Prerequisites
- [ ] kubectl access to production cluster
- [ ] You are the declared IC or acting under IC direction

## Steps
1. Check current pod status:
   ```
   kubectl get pods -n production -l app=payments-service
   ```

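2. View recent logs (last 5 minutes, all lines):
   ```
   # --tail=-1 prints every matching line; with a label selector, kubectl otherwise shows only the last 10 lines per pod
   kubectl logs -n production -l app=payments-service --since=5m --tail=-1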
   ```
   
   **STOP:** If logs contain "data corruption" or "checksum mismatch", do NOT restart.
   Escalate to @alice immediately.

3. Initiate rolling restart:
   ```
   kubectl rollout restart deployment/payments-service -n production
   ```

4. Monitor rollout:
   ```
   kubectl rollout status deployment/payments-service -n production --timeout=120s
   ```

5. Verify error rate falling (check [Payments Dashboard](https://grafana.acme.io/d/payments)):
   - Success rate should return above 99% within 2 minutes
   - If not: proceed to RB-048 (Rollback Payments Service Deploy)

6. Update incident channel: "Rolling restart complete. Monitoring."

## Expected outcome
Pod crash-loop resolves. Error rate returns to <0.1%.

## If it doesn't work
→ Try: RB-048 (Rollback)
→ Escalate: #payments-team or @alice via PagerDuty

## Notes
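- The payments service takes ~90s to start due to migration checks on boot, so a rolling restart of several pods can outlast the 120s timeout in step 4. A timeout there only stops the status watch; it does not cancel the rollout. Re-run the step 4 command to keep watching.
- NEVER use `kubectl delete pod`: it removes the pod before a replacement is ready, and in-flight transactions that outlast the termination grace period are dropped.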
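
Step 2's STOP condition in RB-047 relies on a tired reader spotting two phrases in scrolling log output. The check can be sketched in Python as below. The function name `find_stop_markers` and the sample log lines are hypothetical; only the two marker strings come from the runbook. The sketch scans text that has already been fetched (for example, the saved output of the step 2 command) and does not contact the cluster.

```python
# Hypothetical helper: scan already-fetched log text for the STOP markers
# named in step 2 of RB-047. It does not talk to the cluster.
STOP_MARKERS = ("data corruption", "checksum mismatch")


def find_stop_markers(log_text: str) -> list[str]:
    """Return the log lines that contain any STOP marker (case-insensitive)."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(marker in lowered for marker in STOP_MARKERS):
            hits.append(line)
    return hits


if __name__ == "__main__":
    # Sample log lines invented for illustration.
    sample = (
        "2026-04-10T03:02:11Z INFO  starting migration checks\n"
        "2026-04-10T03:02:14Z ERROR ledger: Checksum mismatch on batch 17\n"
        "2026-04-10T03:02:15Z WARN  retrying connection\n"
    )
    hits = find_stop_markers(sample)
    if hits:
        print("STOP: do not restart. Escalate.")
        for line in hits:
            print("  " + line)
    else:
        print("No STOP markers found; continue to step 3.")
```

A match is case-insensitive on purpose, since log formats vary. An empty result only means the markers were absent from the text supplied, so the log window still has to cover the failure.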

Writing for the 3am Reader

The single most useful mental model for runbook writing: your reader is a competent engineer who is tired, stressed, alone, and has never operated this specific service before. Write for them.

Concrete implications:

Exact commands. Not "restart the service" — the exact kubectl rollout restart deployment/… command with namespace.
Warning boxes. Explicit STOP and WARNING callouts for destructive or irreversible actions.
Expected output. State what success looks like, e.g. 'You should see: deployment "payments-service" successfully rolled out'. This removes the "is this working?" uncertainty.
Links to dashboards. Direct URL to the specific dashboard, not "check Grafana."
Escalation paths. Who to page if this runbook doesn't resolve it.

A Playbook Structure

# PB-012: Payments Service Degraded

**Severity:** SEV1/SEV2
**Owner:** Payments team
**Last incident using this playbook:** INC-2934 (2026-05-27)

## Symptoms
- payments-high-error-rate alert fired
- Users reporting checkout failures
- Statuspage notifications from Stripe

## Step 1: Assess scope
Run: `kubectl get pods -n production -l app=payments-service`
Check: [Payments Dashboard](https://grafana.acme.io/d/payments)

→ **All pods running, high error rate?** Go to Step 2 (Application error)
→ **Pods crashlooping?** Go to RB-047 (Restart)
→ **All pods in Pending/Error?** Go to Step 3 (Infrastructure issue)

## Step 2: Application Error (pods healthy)

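Check if a deploy occurred in the last 30 minutes. `kubectl rollout history` lists revisions without timestamps, so read the AGE of the newest ReplicaSet (listed last) instead:
  `kubectl get replicasets -n production -l app=payments-service --sort-by=.metadata.creationTimestamp`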

→ **Recent deploy?** → RB-048 (Rollback Deploy)
→ **No recent deploy?** → Check DB connection pool (RB-052), then check Stripe status page

## Step 3: Infrastructure issue
Check node status: `kubectl get nodes`
Check Kubernetes events: `kubectl get events -n production --sort-by=.lastTimestamp | tail -30`

→ **Node not ready?** → Page SRE on-call for cluster investigation
→ **Resource quota exceeded?** → RB-053 (Resource cleanup)

## Communication
- SEV1: Update Statuspage immediately. Page @payments-team-lead.
- SEV2: Update Statuspage within 10 minutes.
- Customer-facing copy: "We are investigating issues with payment processing."

Keeping Runbooks Alive

Runbooks decay. The system that was complex six months ago has since been simplified; the command that used to work has been replaced. Stale runbooks are dangerous: a responder following one wastes time and may take the wrong action.

Practices that keep runbooks current:

Runbook ownership. Each runbook has a named owner responsible for keeping it accurate.
Post-incident review. After every SEV1/2, the postmortem includes "was the runbook accurate? What needs updating?"
Quarterly review cadence. A calendar reminder for each runbook owner to verify their documents every quarter.
Update in the same PR as the change. Deploying a new restart procedure? The runbook update goes in the same pull request.
Runbook coverage metric. Track what percentage of alerts have a linked runbook. Target: 100% for SEV1/2 alerts; 80%+ overall.
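
The review cadence and the coverage metric are both easy to compute once runbooks and alerts are listed somewhere machine-readable. The following is a minimal sketch under that assumption. The record layout, the 90-day threshold (standing in for "quarterly"), and every entry other than RB-047 and its two linked alerts are invented for illustration.

```python
from datetime import date

# Hypothetical inventory records; in practice these might be exported from
# the wiki and the alerting configuration.
RUNBOOKS = [
    {"id": "RB-047", "owner": "alice", "last_reviewed": date(2026, 4, 10)},
    {"id": "RB-048", "owner": "bob", "last_reviewed": date(2025, 11, 2)},
]
ALERTS = [
    {"name": "payments-high-error-rate", "severity": "SEV2", "runbook": "RB-047"},
    {"name": "payments-pod-crashlooping", "severity": "SEV3", "runbook": "RB-047"},
    {"name": "search-latency-high", "severity": "SEV3", "runbook": None},
]


def stale_runbooks(runbooks, today, max_age_days=90):
    """Return IDs of runbooks not reviewed within max_age_days."""
    return [
        rb["id"]
        for rb in runbooks
        if (today - rb["last_reviewed"]).days > max_age_days
    ]


def coverage(alerts, severities=None):
    """Percentage of alerts (optionally filtered by severity) with a runbook."""
    selected = [
        a for a in alerts if severities is None or a["severity"] in severities
    ]
    if not selected:
        return 100.0  # no alerts in scope, so nothing is uncovered
    linked = sum(1 for a in selected if a["runbook"])
    return 100.0 * linked / len(selected)


if __name__ == "__main__":
    today = date(2026, 6, 1)
    print("Stale:", stale_runbooks(RUNBOOKS, today))
    print(f"SEV1/2 coverage: {coverage(ALERTS, {'SEV1', 'SEV2'}):.0f}%")
    print(f"Overall coverage: {coverage(ALERTS):.0f}%")
```

Reporting SEV1/2 coverage separately from the overall figure matches the two targets above, so a gap in low-severity alerts does not hide a gap in the critical ones.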

Alerting-to-Runbook Linking

Every alert definition should include a direct runbook URL in its annotations:

# Prometheus alerting rule (the group name "payments" is illustrative)
groups:
  - name: payments
    rules:
      - alert: PaymentsHighErrorRate
        expr: rate(payments_errors_total[5m]) / rate(payments_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Payments error rate above 5%"
          description: "Current rate: {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.acme.io/runbooks/RB-047"
          dashboard_url: "https://grafana.acme.io/d/payments"

Alerting integrations such as PagerDuty and Opsgenie can surface these annotations in the alert notification, so the first thing the responder sees is a direct link to what to do.
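
To make the alert expression concrete, the sketch below reproduces its arithmetic in Python for a single 5-minute window. It is a simplification rather than a reimplementation of Prometheus: the real `rate()` also handles counter resets and extrapolates to the window edges, and a window with no requests yields no result in Prometheus, whereas this sketch returns 0.0. The counter values and function names are made up.

```python
WINDOW_SECONDS = 5 * 60
THRESHOLD = 0.05  # same 5% threshold as the alert expression


def per_second_rate(start_value, end_value, window_seconds):
    """Average per-second increase of a counter over the window."""
    return (end_value - start_value) / window_seconds


def error_ratio(errors_start, errors_end, requests_start, requests_end):
    """Errors per request over the window; 0.0 when there was no traffic."""
    request_rate = per_second_rate(requests_start, requests_end, WINDOW_SECONDS)
    if request_rate == 0:
        return 0.0  # simplification: avoid dividing by zero
    error_rate = per_second_rate(errors_start, errors_end, WINDOW_SECONDS)
    return error_rate / request_rate


if __name__ == "__main__":
    # Invented counter samples: 12,000 requests and 720 errors in 5 minutes.
    ratio = error_ratio(1_000, 1_720, 50_000, 62_000)
    print(f"error ratio = {ratio:.3f}, above threshold: {ratio > THRESHOLD}")
```

With these sample numbers the ratio is 720 / 12,000 = 0.06, which is above the 0.05 threshold. The rule's `for: 2m` clause means the condition must keep holding for two minutes before the alert fires.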

Decision Trees and Visual Formats

For complex scenarios, a prose playbook is harder to follow than a visual decision tree. Tools:

Mermaid diagrams: render natively in GitHub and Notion, and in Confluence and many other wikis through plugins
Flowcharts in Lucidchart / Miro: for printable or presentation-format playbooks
Decision tables: for scenarios with many variables that drive different actions

flowchart TD
  A[payments-high-error-rate fires] --> B{Pods healthy?}
  B -->|Yes| C{Recent deploy?}
  B -->|No| D[RB-047: Restart pods]
  C -->|Yes| E[RB-048: Rollback]
  C -->|No| F{"DB connections > 90%?"}
  F -->|Yes| G[RB-052: Reset DB pool]
  F -->|No| H[Check Stripe status page]
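
The same tree can also be kept as data, which makes it possible to test the paths or drive a chat-bot prompt from it. Below is a minimal Python sketch of the flowchart above. The question keys (`pods_healthy` and so on) are hypothetical names for facts the responder has already gathered; the leaf strings are copied from the diagram.

```python
# Decision tree mirroring the flowchart. Question keys are hypothetical
# names for boolean facts the responder (or a bot) has already gathered.
TREE = {
    "question": "pods_healthy",
    "no": "RB-047: Restart pods",
    "yes": {
        "question": "recent_deploy",
        "yes": "RB-048: Rollback",
        "no": {
            "question": "db_connections_over_90_percent",
            "yes": "RB-052: Reset DB pool",
            "no": "Check Stripe status page",
        },
    },
}


def next_action(tree, facts):
    """Walk the tree using boolean facts and return the leaf action string."""
    node = tree
    while isinstance(node, dict):
        answer = facts[node["question"]]  # raises KeyError if a fact is missing
        node = node["yes"] if answer else node["no"]
    return node


if __name__ == "__main__":
    facts = {
        "pods_healthy": True,
        "recent_deploy": False,
        "db_connections_over_90_percent": True,
    }
    print(next_action(TREE, facts))
```

Only the facts along the chosen path are read, so a responder who sees unhealthy pods does not need to answer the deploy or database questions first.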

Runbook Anti-Patterns

Wall of text. Long prose paragraphs without numbered steps or visual breaks. Nobody reads this at 3am.
"Check the dashboard." Which dashboard? Which panel? Link it.
Missing escalation. A runbook with no "if this doesn't work" path leaves the responder stranded.
Outdated commands. Commands that refer to old hostnames, removed tools, or renamed services.
Writing for the expert. Assuming the reader knows the system architecture. They don't at 3am.
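
Some of these anti-patterns can be caught mechanically before a runbook is merged. The sketch below is a rough heuristic linter in Python, not an established tool: the function name, the 80-word paragraph limit, and the matching rules are all assumptions chosen for illustration, and it will produce both false positives and misses.

```python
import re

MAX_PARAGRAPH_WORDS = 80  # arbitrary threshold for "wall of text"


def lint_runbook(markdown: str) -> list[str]:
    """Return a list of anti-pattern warnings for a Markdown runbook."""
    warnings = []

    # Missing escalation: nothing in the text mentions escalating.
    if "escalat" not in markdown.lower():
        warnings.append("missing escalation path")

    # "Check the dashboard": a line mentions a dashboard but carries no URL.
    for number, line in enumerate(markdown.splitlines(), start=1):
        lowered = line.lower()
        if "dashboard" in lowered and "http" not in lowered:
            warnings.append(f"line {number}: dashboard mentioned without a link")

    # Wall of text: a blank-line-separated paragraph with too many words.
    for paragraph in re.split(r"\n\s*\n", markdown):
        if len(paragraph.split()) > MAX_PARAGRAPH_WORDS:
            warnings.append(f"paragraph longer than {MAX_PARAGRAPH_WORDS} words")

    return warnings


if __name__ == "__main__":
    draft = "## Steps\n1. Restart the service.\n2. Check the dashboard.\n"
    for warning in lint_runbook(draft):
        print(warning)
```

Outdated commands and expert-only assumptions are not detectable this way; those still depend on the ownership and review practices described earlier.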

The next lesson addresses what happens after the incident closes: the blameless postmortem that turns incidents into lasting improvements.