An incident has been declared. What happens next? Without structure, this is the moment where well-intentioned engineers step on each other, multiple changes land simultaneously, and the path to resolution gets longer. This lesson covers the mechanics of running a response effectively.
Roles in Incident Response
| Role | Responsibility | Who fills it |
|---|---|---|
| Incident Commander (IC) | Coordinates the response; owns communication; delegates tasks; makes the call to escalate, mitigate, or close | On-call engineer (small incidents); dedicated IC rotation (SEV1/2) |
| Technical Lead (TL) | Owns diagnosis and mitigation decisions; implements or delegates fixes | Senior engineer with relevant context |
| Communications Lead (Comms) | Writes status updates, manages Statuspage, communicates with CS/sales/exec | Separate from IC on SEV1; IC handles on smaller incidents |
| Scribe | Documents timeline, decisions, and actions in real time | Any available engineer; often a dedicated rotation slot |
| Subject Matter Expert (SME) | Domain knowledge on specific systems (DB, networking, auth) | Called in on demand |
On small teams, one person may fill multiple roles. But the IC role should never be merged with the TL role on a SEV1/2 — someone must have cognitive bandwidth for coordination while others debug.
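The separation rule can be checked mechanically when roles are announced. The following is a minimal sketch, assuming roles are recorded as a mapping from role name to a person's handle; the function name `validate_roles` and the handles are hypothetical, not part of any particular incident tool.

```python
def validate_roles(severity: str, roles: dict[str, str]) -> list[str]:
    """Return a list of problems with a role assignment (empty if acceptable)."""
    problems = []
    # Every incident needs a coordinator (IC) and someone driving diagnosis (TL).
    for required in ("IC", "TL"):
        if required not in roles:
            problems.append(f"missing {required}")
    # On SEV1/2 the IC must not also be the TL.
    same_person = "IC" in roles and roles.get("IC") == roles.get("TL")
    if severity in ("SEV1", "SEV2") and same_person:
        problems.append("IC and TL must be different people on SEV1/2")
    return problems


print(validate_roles("SEV2", {"IC": "alice", "TL": "carol"}))  # []
print(validate_roles("SEV1", {"IC": "alice", "TL": "alice"}))
print(validate_roles("SEV3", {"IC": "bob", "TL": "bob"}))      # []
```

On a smaller incident the same person may hold both roles, so the third call reports no problems.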
The First 10 Minutes — Checklist
[ ] Acknowledge the alert (stop escalation timer)
[ ] Open incident channel: #inc-NNNN-shortdesc, where NNNN is the incident number (e.g. #inc-2934-checkout-down)
[ ] Declare severity, state impact briefly in channel
[ ] Post initial update to Statuspage (if external)
[ ] Identify IC and TL; announce them in the channel
[ ] Scribe starts documenting the timeline
[ ] Begin diagnosis — do NOT make changes yet
[ ] Loop in relevant SMEs if root cause is unclear after 5 minutes of initial assessment
The goal of the first 10 minutes is not to fix the problem — it is to orient everyone, stop the clock on escalation, and start learning before acting.
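Consistent channel names are easier to get right when they are generated rather than typed under pressure. As a rough sketch of the naming convention in the checklist, assuming the incident number and a short free-text description are already known (the function name `incident_channel` and the `max_len` limit are illustrative assumptions; check your chat tool's actual limit):

```python
import re


def incident_channel(incident_id: int, description: str, max_len: int = 80) -> str:
    """Build a channel name such as '#inc-2934-checkout-down'."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"#inc-{incident_id}-{slug}"
    # Trim to the assumed length limit without leaving a trailing hyphen.
    return name[:max_len].rstrip("-")


print(incident_channel(2934, "Checkout down"))  # #inc-2934-checkout-down
```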
Communication Structure
Scattered communication is one of the biggest time sinks during incidents. Enforce structure:
- One channel per incident, e.g. #inc-2934-checkout-down. Archive after close.
- Bridge call or video for SEV1/2. Text channels lose nuance; voice is faster for complex coordination.
- No DMs for incident information. If it matters, it goes in the channel where it can be seen.
- Status updates on a cadence. Post every 15–30 minutes, even if the update is only "No change: still investigating root cause." Silence creates anxiety in stakeholders.
Update Template
# Incident Update — 14:47 UTC
**Status:** Investigating
**Severity:** SEV2
**Impact:** Approximately 20% of users experiencing checkout failures (payment page 502 errors)
**Summary:** We have identified elevated error rates in the payments service since 14:22 UTC.
Database connections appear saturated. Rolling restart of the payments service underway.
**Next update:** 15:00 UTC
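Filling in the template by hand invites a forgotten "Next update" line. One possible way to generate it is sketched below, assuming updates are posted in UTC and the gap to the next update is capped at 30 minutes; the function name `render_update`, its parameters, and the example date are illustrative assumptions, and the header uses a plain hyphen.

```python
from datetime import datetime, timedelta, timezone

MAX_CADENCE_MINUTES = 30  # upper end of the 15-30 minute cadence


def render_update(now: datetime, status: str, severity: str, impact: str,
                  summary: str, cadence_minutes: int = 15) -> str:
    """Render a status update and commit to the time of the next one."""
    if not 0 < cadence_minutes <= MAX_CADENCE_MINUTES:
        raise ValueError("cadence must be between 1 and 30 minutes")
    next_update = now + timedelta(minutes=cadence_minutes)
    return "\n".join([
        f"# Incident Update - {now:%H:%M} UTC",
        f"**Status:** {status}",
        f"**Severity:** {severity}",
        f"**Impact:** {impact}",
        f"**Summary:** {summary}",
        f"**Next update:** {next_update:%H:%M} UTC",
    ])


update = render_update(
    now=datetime(2026, 1, 1, 14, 47, tzinfo=timezone.utc),  # arbitrary example date
    status="Investigating",
    severity="SEV2",
    impact="Approximately 20% of users experiencing checkout failures",
    summary="Elevated error rates in the payments service since 14:22 UTC.",
)
print(update)
```

Computing the next update time from the current time means every update carries a commitment, which is what keeps the cadence from slipping.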
Hypothesis-Driven Diagnosis
Unfocused investigation — "let's look at everything" — is slow. Use the scientific method:
- Gather data. What changed recently? (deployments, config changes, external events). What do metrics show?
- Form a hypothesis. "I think this is the new payments service deploy at 14:15."
- Test the hypothesis quickly. Can you confirm or refute it in under 5 minutes? If not, move to the next hypothesis.
- Act only when confident enough. A fix applied on a wrong hypothesis adds another change to the system and obscures the real root cause.
A common anti-pattern: changing multiple things simultaneously. If you restart the service, change a config flag, AND increase database pool size at the same time, you don't know what fixed it — making it impossible to build reliable runbooks.
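The one-change-at-a-time discipline can be made explicit in whatever the scribe or tooling uses to track actions. The class below is only an illustrative sketch (the name `ChangeLog` and its methods are hypothetical): it refuses a second change until the effect of the first has been recorded, so each outcome is tied to exactly one change.

```python
from typing import Optional


class ChangeLog:
    """Allow one mitigation change in flight at a time and record what each did."""

    def __init__(self) -> None:
        self.in_flight: Optional[str] = None
        self.history: list[tuple[str, str]] = []  # (change, observed outcome)

    def apply(self, change: str) -> None:
        # Refuse a second change until the first one's effect has been observed.
        if self.in_flight is not None:
            raise RuntimeError(f"still observing {self.in_flight!r}")
        self.in_flight = change

    def observe(self, outcome: str) -> None:
        # Tie the outcome to exactly one change, then free the slot.
        if self.in_flight is None:
            raise RuntimeError("no change in flight")
        self.history.append((self.in_flight, outcome))
        self.in_flight = None


log = ChangeLog()
log.apply("restart payments-service")
log.observe("no effect on error rate")
log.apply("increase DB pool size")
try:
    log.apply("flip config flag")  # rejected: pool-size change not yet observed
except RuntimeError as err:
    print(err)
```

The resulting history of (change, outcome) pairs is the raw material for a runbook entry.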
Mitigation vs Fix
Distinguish clearly:
- Mitigation: restores acceptable service without necessarily resolving the root cause. It is fast, and restoring service is an acceptable outcome while the incident is open. Examples: rollback, failover, rate-limiting the affected cohort, toggling a feature flag.
- Fix: resolves the root cause. May take longer. Do it after service is restored.
The priority during an incident is mitigation first. A 5-minute rollback that restores service is almost always preferable to a 30-minute forward-fix that risks making things worse.
Mitigation Toolkit
# Quick mitigation options: know these for your service
# 1. Rollback deployment (reverts to the previous revision)
kubectl rollout undo deployment/payments-service
# 2. Feature flag disable: turn off new-checkout-flow
curl -X PATCH https://flags.acme.io/api/flags/new-checkout-flow -H "Authorization: Bearer $FLAGS_TOKEN" -H "Content-Type: application/json" -d '{"enabled": false}'
# 3. Traffic shift (weighted routing): route 100% to the stable region
# The CLI requires --hosted-zone-id; HOSTED_ZONE_ID is assumed to hold your Route 53 zone ID
aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch file://failover-to-us-east-1.json
# 4. Cache clear (if stale cache is causing issues); FLUSHDB deletes every key in the selected database
redis-cli -h cache.prod.acme FLUSHDB
# 5. Database connection pool restart (restarting the pods resets their connection pools)
kubectl rollout restart deployment/payments-service
Keeping the Timeline
The scribe records everything. The timeline becomes the foundation for the postmortem:
14:22 UTC — Alert fired: payments-service error rate > 5% (threshold 1%)
14:24 UTC — @bob acknowledged. INC-2934 opened.
14:26 UTC — Severity SEV2 declared. @alice joins as IC. @carol joins as TL.
14:27 UTC — Statuspage updated: "Investigating payment processing issues"
14:30 UTC — TL: payments service CPU normal; DB connections at 98% of pool (max 100). New deploy at 14:15.
14:32 UTC — Hypothesis: connection leak in new deploy. Testing rollback.
14:35 UTC — Rollback initiated (kubectl rollout undo deployment/payments-service)
14:38 UTC — Error rate declining. DB connections dropping to 60%.
14:42 UTC — Error rate < 0.1%. Service restored. INC-2934 mitigated.
14:45 UTC — Statuspage: "Service restored. Investigating root cause."
15:00 UTC — INC-2934 resolved. Postmortem scheduled for 2026-05-29.
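A timeline with clean timestamps makes the usual postmortem durations a matter of subtraction. As a small worked example using the key timestamps from the timeline above, and assuming all events fall on the same UTC day (the helper `minutes_between` and the dictionary keys are illustrative names):

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day 'HH:MM' timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


timeline = {
    "alert_fired": "14:22",
    "acknowledged": "14:24",
    "mitigated": "14:42",
    "resolved": "15:00",
}

time_to_acknowledge = minutes_between(timeline["alert_fired"], timeline["acknowledged"])
time_to_mitigate = minutes_between(timeline["alert_fired"], timeline["mitigated"])
time_to_resolve = minutes_between(timeline["alert_fired"], timeline["resolved"])

print(time_to_acknowledge, time_to_mitigate, time_to_resolve)  # 2 20 38
```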
When to Escalate Mid-Incident
Escalate mid-incident when:
- No progress on mitigation after 20 minutes at current severity
- Impact has grown beyond initial assessment
- A different team's system is implicated (data loss, billing, auth)
- You need a decision only someone senior can make (take the system offline to prevent data corruption)
Over-escalating is fine. Under-escalating a SEV1 that runs for an hour without executive visibility is not.
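The four triggers above translate directly into a check the IC can run at each status update. This is a sketch under the assumption that the IC tracks these four facts explicitly; the `IncidentState` fields and the function name `escalation_reasons` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    minutes_without_progress: int  # time since mitigation last advanced
    impact_grew: bool              # impact exceeds the initial assessment
    other_team_implicated: bool    # another team's system is involved
    needs_senior_decision: bool    # e.g. taking the system offline


def escalation_reasons(state: IncidentState, stall_limit_minutes: int = 20) -> list[str]:
    """Return every reason to escalate; an empty list means stay the course."""
    reasons = []
    if state.minutes_without_progress >= stall_limit_minutes:
        reasons.append(f"no mitigation progress for {state.minutes_without_progress} min")
    if state.impact_grew:
        reasons.append("impact grew beyond initial assessment")
    if state.other_team_implicated:
        reasons.append("another team's system is implicated")
    if state.needs_senior_decision:
        reasons.append("decision requires someone senior")
    return reasons


state = IncidentState(25, False, True, False)
print(escalation_reasons(state))
```

Any non-empty result is enough to escalate, which matches the bias toward over-escalating.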
Incident Close
Closing an incident requires:
- Confirming user impact has returned to normal (check SLI metrics, error rates)
- Documenting any temporary mitigations still in place as action items
- Posting a statement in the incident channel confirming resolution
- Updating Statuspage to "Resolved"
- Scheduling the postmortem (within 72 hours for SEV1/2)
Do not declare an incident closed because the alert cleared — verify the underlying metrics are genuinely healthy.
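One way to make "genuinely healthy" concrete is to require a sustained run of good readings rather than a single one. The sketch below assumes error-rate samples (in percent) arrive at regular intervals, oldest first; the function name `safe_to_close`, the 1% threshold, and the five-sample window are illustrative assumptions to tune per service.

```python
def safe_to_close(error_rates: list[float], threshold: float, min_samples: int = 5) -> bool:
    """True only if the most recent min_samples readings are all below threshold."""
    recent = error_rates[-min_samples:]
    # Too few readings is treated as "not yet verified", not as healthy.
    return len(recent) >= min_samples and all(rate < threshold for rate in recent)


# Sustained recovery: the last five readings are all under 1%.
print(safe_to_close([6.2, 3.1, 0.9, 0.4, 0.1, 0.1, 0.08], threshold=1.0))  # True
# A brief dip followed by a spike inside the window is not recovery.
print(safe_to_close([6.2, 0.9, 1.4, 0.3, 0.2, 0.1], threshold=1.0))        # False
# Two good readings are not enough evidence yet.
print(safe_to_close([0.1, 0.1], threshold=1.0))                            # False
```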
Communication to Customers
External communication follows different rules from internal:
- Acknowledge early. Even "We are investigating an issue affecting X" is better than silence.
- State impact, not cause. "Payment processing is experiencing elevated errors" not "Our payments service database is running out of connections."
- Give time estimates conservatively. Under-promising and over-delivering preserves trust. Missing a stated estimate destroys it.
- Post-resolution summary. After resolution, update with a plain-language description of what happened, when, how it was mitigated, and what is being done to prevent recurrence.
The next lesson covers runbooks and playbooks — the pre-prepared documentation that makes all of this faster.