The Incident Response Process — Incident Management & On-Call Engineering | CertQnA

An incident has been declared. What happens next? Without structure, this is the moment where well-intentioned engineers step on each other, multiple changes land simultaneously, and the path to resolution gets longer. This lesson covers the mechanics of running a response effectively.

Roles in Incident Response

Role	Responsibility	Who fills it
Incident Commander (IC)	Coordinates the response; owns communication; delegates tasks; makes the call to escalate, mitigate, and close	On-call engineer (small incidents); dedicated IC rotation (SEV1/2)
Technical Lead (TL)	Owns diagnosis and mitigation decisions; implements or delegates fixes	Senior engineer with relevant context
Communications Lead (Comms)	Writes status updates, manages Statuspage, communicates with CS/sales/exec	Separate from IC on SEV1; IC handles on smaller incidents
Scribe	Documents timeline, decisions, and actions in real time	Any available engineer; often a dedicated rotation slot
Subject Matter Expert (SME)	Domain knowledge on specific systems (DB, networking, auth)	Called in on demand

On small teams, one person may fill multiple roles. But the IC role should never be merged with the TL role on a SEV1/2 — someone must have cognitive bandwidth for coordination while others debug.
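
The separation rule can be checked mechanically when roles are announced in the channel. The following is a minimal Python sketch, not part of any real incident tool: the `role_conflicts` function, the role keys ("IC", "TL", "Comms") and the severity labels are all illustrative assumptions.

```python
def role_conflicts(severity: str, assignments: dict[str, str]) -> list[str]:
    """Return human-readable problems with a role assignment.

    `assignments` maps a role name ("IC", "TL", "Comms", "Scribe") to a handle.
    Role names and severity labels are hypothetical, chosen for illustration.
    """
    problems = []
    for required in ("IC", "TL"):
        if required not in assignments:
            problems.append(f"{required} is not assigned")
    ic, tl = assignments.get("IC"), assignments.get("TL")
    # One person may hold several roles, but not IC and TL on a SEV1/2.
    if severity in ("SEV1", "SEV2") and ic is not None and ic == tl:
        problems.append("IC and TL must be different people on SEV1/2")
    # The role table keeps Comms separate from the IC on SEV1 only.
    if severity == "SEV1" and ic is not None and assignments.get("Comms") == ic:
        problems.append("Comms should be separate from the IC on SEV1")
    return problems


print(role_conflicts("SEV2", {"IC": "alice", "TL": "alice"}))
print(role_conflicts("SEV3", {"IC": "bob", "TL": "bob"}))
```

The first call reports the IC/TL clash; the second returns an empty list because merging roles is acceptable on a small incident.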

The First 10 Minutes — Checklist

[ ] Acknowledge the alert (stop escalation timer)
[ ] Open incident channel: #inc-<id>-shortdesc (e.g. #inc-2934-checkout-down)
[ ] Declare severity, state impact briefly in channel
[ ] Post initial update to Statuspage (if external)
[ ] Identify IC and TL; announce them in the channel
[ ] Scribe starts documenting the timeline
[ ] Begin diagnosis — do NOT make changes yet
[ ] Loop in relevant SMEs if root cause is unclear after 5 minutes of initial assessment

The goal of the first 10 minutes is not to fix the problem — it is to orient everyone, stop the clock on escalation, and start learning before acting.
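
Channel naming is one checklist step that is easy to automate, so nobody has to think about it under pressure. This is a rough sketch that assumes the incident has a numeric ID (as in #inc-2934-checkout-down); the `incident_channel` helper and its 80-character default limit are assumptions, so adjust the limit to whatever your chat tool enforces.

```python
import re


def incident_channel(incident_id: int, description: str, max_len: int = 80) -> str:
    """Build a channel name such as '#inc-2934-checkout-down'."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"inc-{incident_id}-{slug}"
    # Trim to the length limit without leaving a trailing hyphen.
    return "#" + name[:max_len].rstrip("-")


print(incident_channel(2934, "Checkout down"))
```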

Communication Structure

Scattered communication is one of the biggest time sinks during incidents. Enforce structure:

One channel per incident. #inc-2934-checkout-down. Archive after close.
Bridge call or video for SEV1/2. Text channels lose nuance; voice is faster for complex coordination.
No DMs for incident information. If it matters, it goes in the channel where it can be seen.
Status updates on a cadence. Every 15–30 minutes, even if "no change: still investigating root cause." Silence creates anxiety in stakeholders.
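
The cadence rule is simple enough for a bot to enforce by reminding the IC when the channel has gone quiet. A minimal sketch, assuming a hypothetical `update_overdue` helper and the 30-minute upper bound from the rule above as the default:

```python
from datetime import datetime, timedelta, timezone


def update_overdue(last_update: datetime, now: datetime,
                   cadence: timedelta = timedelta(minutes=30)) -> bool:
    """True when more than one cadence interval has passed since the last update."""
    return now - last_update > cadence


# Arbitrary example date; only the gap between the two times matters.
last = datetime(2024, 1, 1, 14, 27, tzinfo=timezone.utc)
print(update_overdue(last, last + timedelta(minutes=20)))
print(update_overdue(last, last + timedelta(minutes=20), cadence=timedelta(minutes=15)))
```

With the default cadence the first check passes; tightening the cadence to 15 minutes flags the same 20-minute gap as overdue.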

Update Template

# Incident Update — 14:47 UTC

**Status:** Investigating
**Severity:** SEV2
**Impact:** Approximately 20% of users experiencing checkout failures (payment page 502 errors)
**Summary:** We have identified elevated error rates in the payments service since 14:22 UTC. 
Database connections appear saturated. Rolling restart of the payments service underway.

**Next update:** 15:00 UTC
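
Filling in the template by hand is error-prone mid-incident, so some teams generate it. The sketch below reproduces the layout above in Python. The function names are hypothetical, and aligning the next update to the next 15-minute clock boundary is an assumption chosen because it matches the 14:47 to 15:00 example.

```python
from datetime import datetime, timedelta, timezone


def next_update_time(now: datetime, cadence_minutes: int = 15) -> datetime:
    """Next clock boundary that is a multiple of `cadence_minutes`, strictly after `now`."""
    minutes = now.hour * 60 + now.minute
    next_slot = (minutes // cadence_minutes + 1) * cadence_minutes
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(minutes=next_slot)


def render_update(now, status, severity, impact, summary, cadence_minutes=15):
    """Render a status update in the same layout as the template."""
    nxt = next_update_time(now, cadence_minutes)
    return "\n".join([
        # \u2014 is the dash character used in the template's heading.
        f"# Incident Update \u2014 {now:%H:%M} UTC",
        "",
        f"**Status:** {status}",
        f"**Severity:** {severity}",
        f"**Impact:** {impact}",
        f"**Summary:** {summary}",
        "",
        f"**Next update:** {nxt:%H:%M} UTC",
    ])


# Arbitrary example date; the time of day mirrors the template.
now = datetime(2024, 1, 1, 14, 47, tzinfo=timezone.utc)
print(render_update(now, "Investigating", "SEV2",
                    "Approximately 20% of users experiencing checkout failures",
                    "Elevated error rates in the payments service since 14:22 UTC."))
```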

Hypothesis-Driven Diagnosis

Unfocused investigation — "let's look at everything" — is slow. Use the scientific method:

Gather data. What changed recently (deployments, config changes, external events)? What do the metrics show?
Form a hypothesis. "I think this is the new payments service deploy at 14:15."
Test the hypothesis quickly. Can you confirm or refute it in under 5 minutes? If not, move to the next hypothesis.
Act only when confident enough. A wrong fix adds another change to the system, and every extra change makes the real root cause harder to isolate.

A common anti-pattern: changing multiple things simultaneously. If you restart the service, change a config flag, AND increase database pool size at the same time, you don't know what fixed it — making it impossible to build reliable runbooks.
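
The "what changed recently?" step can be turned into a short, ordered list of suspects. Below is a minimal sketch that assumes change events are available from a deploy log or audit trail. The `Change` record, the one-hour lookback, and the "most recent first" ordering are illustrative assumptions, and the ordering is a heuristic rather than a rule. Apart from the 14:15 deploy, the sample events are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    at: datetime
    kind: str          # e.g. "deploy", "config", "external"
    description: str


def suspects(changes, alert_at, lookback=timedelta(hours=1)):
    """Changes that landed within `lookback` before the alert, most recent first."""
    window_start = alert_at - lookback
    recent = [c for c in changes if window_start <= c.at <= alert_at]
    return sorted(recent, key=lambda c: c.at, reverse=True)


# Arbitrary example date; only the times of day matter here.
alert = datetime(2024, 1, 1, 14, 22)
changes = [
    Change(datetime(2024, 1, 1, 9, 0), "config", "hypothetical: CDN rule update"),
    Change(datetime(2024, 1, 1, 14, 15), "deploy", "payments service deploy"),
    Change(datetime(2024, 1, 1, 13, 40), "config", "hypothetical: pool size flag change"),
]
for c in suspects(changes, alert):
    print(c.at.strftime("%H:%M"), c.kind, c.description)
```

Each suspect becomes one hypothesis to test in turn, one change at a time.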

Mitigation vs Fix

Distinguish clearly:

Mitigation: restores acceptable service without necessarily resolving the root cause. It is fast, and good enough while the incident is open. Examples: rollback, failover, rate-limiting the affected cohort, toggling a feature flag.
Fix: resolves the root cause. May take longer. Do it after service is restored.

The priority during an incident is mitigation first. A 5-minute rollback that restores service is almost always preferable to a 30-minute forward-fix that risks making things worse.

Mitigation Toolkit

# Quick mitigation options — know these for your service

1. Rollback deployment
   kubectl rollout undo deployment/payments-service

2. Feature flag disable
   # Turn off new-checkout-flow
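   # flags.acme.io is an example endpoint; -d sends form-encoded data unless Content-Type is set
   curl -X PATCH https://flags.acme.io/api/flags/new-checkout-flow \
     -H "Authorization: Bearer $FLAGS_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"enabled": false}'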

3. Traffic shift (weighted routing)
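   # Route 100% to the stable region
   # --hosted-zone-id is required; $HOSTED_ZONE_ID is a placeholder for your zone's ID
   aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch file://failover-to-us-east-1.json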

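4. Cache clear (if a stale cache is causing issues)
   # FLUSHDB deletes every key in the selected database (DB 0 by default); expect a cold cache afterwards
   redis-cli -h cache.prod.acme FLUSHDB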

5. Database connection pool restart
   # Restarting the pods makes each new pod open a fresh connection pool
   kubectl rollout restart deployment/payments-service
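
Mitigation commands are often wrapped in a small script so they can be reviewed in the channel before anyone runs them. The following is a sketch only: `rollback_commands` and `run` are hypothetical helpers, the `prod` namespace is an assumption, and the default is a dry run that prints the commands without executing anything.

```python
import shlex
import subprocess


def rollback_commands(deployment: str, namespace: str = "default", timeout_s: int = 120):
    """argv lists for rolling back a deployment and waiting for the rollout to settle."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "-n", namespace, "rollout", "undo", target],
        ["kubectl", "-n", namespace, "rollout", "status", target, f"--timeout={timeout_s}s"],
    ]


def run(commands, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # raises if kubectl exits non-zero


# Dry run: paste the printed commands into the incident channel for the TL to confirm.
run(rollback_commands("payments-service", namespace="prod"))
```

Printing the exact command also gives the scribe something precise to copy into the timeline.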

Keeping the Timeline

The scribe records everything. The timeline becomes the foundation for the postmortem:

14:22 UTC — Alert fired: payments-service error rate > 5% (threshold 1%)
14:24 UTC — @bob acknowledged. INC-2934 opened.
14:26 UTC — Severity SEV2 declared. @alice joins as IC. @carol joins as TL.
14:27 UTC — Statuspage updated: "Investigating payment processing issues"
14:30 UTC — TL: payments service CPU normal; DB connections at 98% of pool (max 100). New deploy at 14:15.
14:32 UTC — Hypothesis: connection leak in new deploy. Testing rollback.
14:35 UTC — Rollback initiated (kubectl rollout undo deployment/payments-service)
14:38 UTC — Error rate declining. DB connections dropping to 60%.
14:42 UTC — Error rate < 0.1%. Service restored. INC-2934 mitigated.
14:45 UTC — Statuspage: "Service restored. Investigating root cause."
15:00 UTC — INC-2934 resolved. Postmortem scheduled for 2026-05-29.
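
A timeline with consistent timestamps makes the standard response metrics easy to compute for the postmortem. This minimal sketch copies the key events from the timeline above. The `minutes_between` helper and the event labels are assumptions, and it assumes the whole incident falls within a single UTC day.

```python
from datetime import datetime

# Key events copied from the timeline: (HH:MM in UTC, event label).
EVENTS = [
    ("14:22", "alert"),
    ("14:24", "acknowledged"),
    ("14:35", "rollback"),
    ("14:42", "mitigated"),
    ("15:00", "resolved"),
]


def minutes_between(events, start_label: str, end_label: str) -> int:
    """Whole minutes between two labelled events on the same day."""
    times = {label: datetime.strptime(hhmm, "%H:%M") for hhmm, label in events}
    delta = times[end_label] - times[start_label]
    return int(delta.total_seconds() // 60)


print("time to acknowledge:", minutes_between(EVENTS, "alert", "acknowledged"), "min")
print("time to mitigate:", minutes_between(EVENTS, "alert", "mitigated"), "min")
print("time to resolve:", minutes_between(EVENTS, "alert", "resolved"), "min")
```

For INC-2934 this gives 2 minutes to acknowledge, 20 minutes to mitigate, and 38 minutes to resolve.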

When to Escalate Mid-Incident

Escalate mid-incident when:

No progress on mitigation after 20 minutes at current severity
Impact has grown beyond initial assessment
A different team's system is implicated, or the incident touches a sensitive area (data loss, billing, auth)
You need a decision only someone senior can make (for example, taking the system offline to prevent data corruption)

Over-escalating is fine. Under-escalating a SEV1 that runs for an hour without executive visibility is not.
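
Writing the triggers down as an explicit check removes the temptation to wait "just five more minutes". Below is a minimal sketch of the four criteria as code. The `IncidentState` fields and the `escalation_reasons` function are hypothetical names, and the inputs would come from the IC's own judgment rather than from any real system.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    minutes_without_progress: int
    impact_grew: bool
    other_team_implicated: bool
    needs_senior_decision: bool


def escalation_reasons(state: IncidentState, stall_limit_minutes: int = 20) -> list[str]:
    """Return every trigger that currently applies; any non-empty result means escalate."""
    reasons = []
    if state.minutes_without_progress >= stall_limit_minutes:
        reasons.append(f"no mitigation progress for {state.minutes_without_progress} minutes")
    if state.impact_grew:
        reasons.append("impact has grown beyond the initial assessment")
    if state.other_team_implicated:
        reasons.append("another team's system is implicated")
    if state.needs_senior_decision:
        reasons.append("a senior decision is required")
    return reasons


state = IncidentState(minutes_without_progress=25, impact_grew=False,
                      other_team_implicated=True, needs_senior_decision=False)
for reason in escalation_reasons(state):
    print("escalate:", reason)
```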

Incident Close

Closing an incident requires:

Confirming user impact has returned to normal (check SLI metrics, error rates)
Documenting any temporary mitigations still in place as action items
Posting a statement in the incident channel confirming resolution
Updating Statuspage to "Resolved"
Scheduling the postmortem (within 72 hours for SEV1/2)

Do not declare an incident closed because the alert cleared — verify the underlying metrics are genuinely healthy.
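
One way to make "genuinely healthy" concrete is to require several consecutive good samples instead of trusting a single cleared alert. This is a rough sketch: the `recovered` function, the five-sample window, and the sample values are assumptions, while the 1% threshold mirrors the alert threshold in the example timeline.

```python
def recovered(error_rates, threshold: float = 0.01, required_consecutive: int = 5) -> bool:
    """True when the most recent `required_consecutive` samples are all below `threshold`."""
    if len(error_rates) < required_consecutive:
        return False  # not enough data to judge yet
    return all(rate < threshold for rate in error_rates[-required_consecutive:])


# Hypothetical per-minute error rates, oldest first.
samples = [0.05, 0.03, 0.004, 0.001, 0.001, 0.0008, 0.0009]
print(recovered(samples))
print(recovered(samples[:4]))
```

The full series passes because its last five samples are under 1%; the truncated series fails because it has fewer than five samples, which is the situation right after an alert clears.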

Communication to Customers

External communication follows different rules from internal:

Acknowledge early. Even "We are investigating an issue affecting X" is better than silence.
State impact, not cause. "Payment processing is experiencing elevated errors" not "Our payments service database is running out of connections."
Give time estimates conservatively. Under-promising and over-delivering preserves trust. Missing a stated estimate destroys it.
Post-resolution summary. After resolution, update with a plain-language description of what happened, when, how it was mitigated, and what is being done to prevent recurrence.

The next lesson covers runbooks and playbooks — the pre-prepared documentation that makes all of this faster.