Every system that runs in production will, at some point, fail. The organisations that handle these failures well — quickly, calmly, and without repeating the same mistakes — have something in common: they have built systems and culture around incident management. Those that don't tend to experience the same outages over and over, burning out engineers and eroding customer trust.
What Is an Incident?
An incident is any unplanned event that disrupts — or risks disrupting — the normal operation of a service. The definition is deliberately broad:
- A database server that crashed and recovered in 45 seconds
- A latency spike that caused 5% of requests to time out for 3 minutes
- An erroneous deploy that sent 500 errors to a small percentage of users
- A security event where customer data may have been exposed
- A third-party API dependency outage that degraded your checkout flow
All of these are incidents. The difference lies in their severity, impact, and required response — not whether they count.
Why Structure Matters
In the absence of structure, incident response degrades predictably:
- Multiple engineers make conflicting changes simultaneously
- Communication fragments across Slack, email, and Zoom calls with different audiences
- The person with the most context stays on a call for 8 hours while others idle
- Mitigation is delayed because nobody is explicitly leading diagnosis
- After resolution, no learning occurs — the same incident happens again in 6 weeks
Structured incident management solves each of these with explicit roles, communication channels, and a defined lifecycle.
The Incident Lifecycle
Every incident follows the same arc:
| Phase | Goal | Key activity |
|---|---|---|
| Detection | Know something is wrong | Alerting fires; on-call acknowledges |
| Triage | Understand scope and severity | Assess user impact; assign severity; escalate if needed |
| Mitigation | Restore service — even partially | Rollback, failover, feature flag, traffic shift |
| Resolution | Confirm normal operation restored | Verify metrics; close the incident |
| Postmortem | Learn and prevent recurrence | Timeline reconstruction, root cause analysis, action items |
ITIL 4 uses different vocabulary: restoring service falls under incident management (with a separate procedure for major incidents), and root-cause work falls under problem management. The lifecycle is the same in substance.
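The phases in the table above can be read as a small state machine. The sketch below is one hypothetical way to encode it in Python; the class names, the `advance` method, and the back-edge from resolution to mitigation (for a fix that does not hold) are illustrative assumptions, not part of any standard or tool.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = "detection"
    TRIAGE = "triage"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POSTMORTEM = "postmortem"


# Allowed transitions, following the order of the lifecycle table.
# Assumption: RESOLUTION may fall back to MITIGATION when verification
# shows the service has not actually recovered.
TRANSITIONS = {
    Phase.DETECTION: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.MITIGATION},
    Phase.MITIGATION: {Phase.RESOLUTION},
    Phase.RESOLUTION: {Phase.POSTMORTEM, Phase.MITIGATION},
    Phase.POSTMORTEM: set(),
}


class Incident:
    """Hypothetical incident record that only permits valid phase changes."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self, target: Phase) -> None:
        """Move to `target`, or raise ValueError if the step is not allowed."""
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"cannot move from {self.phase.value} to {target.value}"
            )
        self.phase = target
        self.history.append(target)


incident = Incident("checkout latency spike")
incident.advance(Phase.TRIAGE)
incident.advance(Phase.MITIGATION)
```

Encoding the transitions explicitly makes it impossible to, say, close an incident that was never triaged, which is the kind of discipline the lifecycle is meant to enforce.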
Key Metrics
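- MTTD (Mean Time to Detect). Average time from the start of a fault to the alert that detects it. Reflects observability coverage.
- MTTA (Mean Time to Acknowledge). Average time from an alert firing to a responder acknowledging it. Reflects alert quality and on-call responsiveness.
- MTTR (Mean Time to Restore/Resolve). Average time from the start of an incident until service is restored (or, in some definitions, until it is fully resolved). The main reliability health metric.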
- Incident frequency. How often incidents occur. Flat or rising frequency signals systemic reliability debt.
- Change failure rate. Percentage of deployments that cause an incident. One of the DORA metrics; a commonly cited target is below 15%.
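The metrics above reduce to simple arithmetic over incident timestamps. The following is a minimal sketch, assuming each incident is recorded with four timestamps; the `IncidentRecord` type and its field names are hypothetical, and MTTR is measured here from fault start to restoration, although some teams start the clock at detection instead.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Sequence


@dataclass
class IncidentRecord:
    fault_started: datetime  # when the fault began (often estimated afterwards)
    alert_fired: datetime    # when monitoring raised the alert
    acknowledged: datetime   # when on-call acknowledged the alert
    restored: datetime       # when service was restored


def _mean_minutes(seconds: Sequence[float]) -> float:
    return mean(seconds) / 60


def mttd(records: Sequence[IncidentRecord]) -> float:
    """Mean minutes from fault start to alert."""
    return _mean_minutes(
        [(r.alert_fired - r.fault_started).total_seconds() for r in records]
    )


def mtta(records: Sequence[IncidentRecord]) -> float:
    """Mean minutes from alert to acknowledgement."""
    return _mean_minutes(
        [(r.acknowledged - r.alert_fired).total_seconds() for r in records]
    )


def mttr(records: Sequence[IncidentRecord]) -> float:
    """Mean minutes from fault start to restoration (assumed definition)."""
    return _mean_minutes(
        [(r.restored - r.fault_started).total_seconds() for r in records]
    )


def change_failure_rate(deployments: int, failed_deployments: int) -> float:
    """Fraction of deployments that caused an incident."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return failed_deployments / deployments


records = [
    IncidentRecord(
        fault_started=datetime(2024, 1, 8, 10, 0),
        alert_fired=datetime(2024, 1, 8, 10, 4),
        acknowledged=datetime(2024, 1, 8, 10, 6),
        restored=datetime(2024, 1, 8, 10, 40),
    ),
    IncidentRecord(
        fault_started=datetime(2024, 1, 9, 14, 0),
        alert_fired=datetime(2024, 1, 9, 14, 2),
        acknowledged=datetime(2024, 1, 9, 14, 10),
        restored=datetime(2024, 1, 9, 14, 30),
    ),
]

print(mttd(records))  # 3.0 minutes
print(mtta(records))  # 5.0 minutes
print(mttr(records))  # 35.0 minutes
```

The sample data is invented for illustration. Whichever definitions a team picks, the important thing is to apply them consistently so that trends over time are comparable.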
The SRE Mental Model
Google's Site Reliability Engineering book popularised the discipline. Its central insight is to treat reliability as a software engineering problem, one that can be measured and improved, rather than as an operational ritual.
Three pillars:
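- SLIs, SLOs, SLAs. Service level indicators, objectives, and agreements quantify what good looks like. They give the team a data-driven signal for when to prioritise reliability over features.
- Error budgets. The error budget is the amount of unreliability the SLO permits (a 99.9% SLO leaves a 0.1% budget). If you haven't consumed your error budget, you can deploy fearlessly. If you've nearly exhausted it, you slow down and invest in reliability.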
- Toil reduction. Automate the repetitive operational work that doesn't produce lasting improvement, so engineers can spend time on high-leverage problems.
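The error-budget rule in the list above can be made concrete with a little arithmetic. The sketch below assumes a request-based availability SLO; the function names, the 80% "slow down" threshold, and the "freeze" posture on exhaustion are illustrative assumptions rather than a prescribed policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float   # failures the SLO tolerates in the window
    consumed_fraction: float  # share of the budget already spent

    @property
    def remaining_fraction(self) -> float:
        return max(0.0, 1.0 - self.consumed_fraction)


def compute_error_budget(
    slo_target: float, total_requests: int, failed_requests: int
) -> ErrorBudget:
    """Compute budget usage for an SLO such as 0.999 (99.9% success)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1.0 - slo_target) * total_requests
    if allowed == 0:
        # No traffic means no budget to spend.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    else:
        consumed = failed_requests / allowed
    return ErrorBudget(allowed_failures=allowed, consumed_fraction=consumed)


def release_posture(budget: ErrorBudget, slow_down_at: float = 0.8) -> str:
    """Map budget usage to a release posture (thresholds are assumptions)."""
    if budget.consumed_fraction >= 1.0:
        return "freeze"
    if budget.consumed_fraction >= slow_down_at:
        return "slow down"
    return "ship"


# A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures.
budget = compute_error_budget(0.999, 1_000_000, 400)
print(release_posture(budget))  # ship
```

With 400 failures, about 40% of the budget is spent, so releases continue as normal. At 900 failures the same function would return "slow down", which is the point at which the team shifts effort toward reliability work.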
ITIL 4 vs SRE
| Concept | ITIL 4 | SRE |
|---|---|---|
| Unplanned disruption | Incident | Incident |
| Restoration of service | Incident Management | Incident Response |
| Root cause work | Problem Management | Postmortem → Action Items |
| Reliability targets | SLA | SLO (usually stricter than the SLA; internal) |
| Operational documentation | Known error database | Runbook / Playbook |
| Change risk | Change Advisory Board | Error budget + progressive delivery |
ITIL is process-centric and enterprise-governance-oriented. SRE is engineering-centric and metrics-driven. Most organisations blend both — ITIL provides the audit trail and governance, SRE provides the engineering rigour.
Cultural Foundation
Process and tooling are necessary but insufficient. The best incident management cultures share:
- Psychological safety. Engineers report incidents and near-misses without fear of blame.
- Learning orientation. Incidents are treated as valuable information about system behaviour, not failures of individual engineers.
- Shared ownership. Everyone contributes to reliability — not just the team that receives the pages.
- Operational empathy. Product managers and leadership understand that feature velocity has a reliability cost.
The rest of this course builds the process and tooling on top of this foundation. The next lesson addresses the first structural decision: how to classify and escalate incidents.