5 min read·Lesson 1 of 8

Incident Management Fundamentals

What an incident is, why structured response matters, and how high-performing organisations think about reliability.

Every system that runs in production will, at some point, fail. The organisations that handle these failures well — quickly, calmly, and without repeating the same mistakes — have something in common: they have built systems and culture around incident management. Those that don't tend to experience the same outages over and over, burning out engineers and eroding customer trust.

What Is an Incident?

An incident is any unplanned event that disrupts — or risks disrupting — the normal operation of a service. The definition is deliberately broad:

  • A database server that crashed and recovered in 45 seconds
  • A latency spike that caused 5% of requests to time out for 3 minutes
  • A bad deploy that returned HTTP 500 errors to a small percentage of users
  • A security event where customer data may have been exposed
  • A third-party API dependency outage that degraded your checkout flow

All of these are incidents. They differ in severity, impact, and required response, not in whether they count.

Why Structure Matters

In the absence of structure, incident response degrades predictably:

  • Multiple engineers make conflicting changes simultaneously
  • Communication fragments across Slack, email, and Zoom calls with different audiences
  • The person with the most context stays on a call for 8 hours while others idle
  • Mitigation is delayed because nobody is explicitly leading diagnosis
  • After resolution, no learning occurs — the same incident happens again in 6 weeks

Structured incident management solves each of these with explicit roles, communication channels, and a defined lifecycle.

The Incident Lifecycle

Every incident follows the same arc:

| Phase | Goal | Key activity |
| --- | --- | --- |
| Detection | Know something is wrong | Alerting fires; on-call acknowledges |
| Triage | Understand scope and severity | Assess user impact; assign severity; escalate if needed |
| Mitigation | Restore service, even partially | Rollback, failover, feature flag, traffic shift |
| Resolution | Confirm normal operation restored | Verify metrics; close the incident |
| Postmortem | Learn and prevent recurrence | Timeline reconstruction, root cause analysis, action items |

ITIL 4 uses different vocabulary (incident management for restoring service, with major incidents usually handled under a dedicated procedure, and problem management for root-cause work), but the lifecycle is substantially the same.
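The lifecycle can be modelled as a simple ordered state machine. The sketch below is illustrative only: the `Phase` and `Incident` names are hypothetical, and it assumes a strictly linear progression, whereas real incidents sometimes loop back (for example from mitigation to triage when the scope turns out to be larger than expected).

```python
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    """Lifecycle phases, in the order given in the table above."""
    DETECTION = "detection"
    TRIAGE = "triage"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POSTMORTEM = "postmortem"


# Enum iteration preserves definition order, so this list is the lifecycle order.
_ORDER: List[Phase] = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase after `current`, or None once the postmortem is reached."""
    index = _ORDER.index(current)
    return _ORDER[index + 1] if index + 1 < len(_ORDER) else None


class Incident:
    """Hypothetical incident record that only moves forward through the phases."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECTION
        self.history: List[Phase] = [Phase.DETECTION]

    def advance(self) -> Phase:
        following = next_phase(self.phase)
        if following is None:
            raise ValueError("incident has already reached the postmortem phase")
        self.phase = following
        self.history.append(following)
        return following
```

Recording `history` alongside the current phase is a deliberate choice: it gives the postmortem a starting point for timeline reconstruction.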

Key Metrics

  • MTTA (Mean Time to Acknowledge). How long from an alert firing to a responder acknowledging it. Reflects alert quality and on-call responsiveness.
  • MTTD (Mean Time to Detect). How long from the fault starting to the alert firing. Reflects observability coverage.
  • MTTR (Mean Time to Restore/Resolve). How long from the start of the incident to restored service. Commonly treated as the headline reliability metric; teams choose different start points, so state the definition you use.
  • Incident frequency. How often incidents occur. Flat or rising frequency signals systemic reliability debt.
  • Change failure rate. Percentage of deployments that cause an incident. One of the DORA metrics; a common target is below 15%.
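These metrics reduce to simple arithmetic over incident timestamps. The following is a minimal sketch, not a standard implementation: `IncidentRecord` and its field names are hypothetical, and it assumes MTTR is measured from the start of the fault to restoration (some teams measure from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, List


@dataclass
class IncidentRecord:
    """Hypothetical record; field names are assumptions for illustration."""
    fault_started: datetime  # often estimated during the postmortem
    alert_fired: datetime
    acknowledged: datetime
    restored: datetime


def _mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    # Average in seconds, then convert back to a timedelta.
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(records: List[IncidentRecord]) -> timedelta:
    """Mean time from fault start to alert."""
    return _mean_delta(r.alert_fired - r.fault_started for r in records)


def mtta(records: List[IncidentRecord]) -> timedelta:
    """Mean time from alert to acknowledgement."""
    return _mean_delta(r.acknowledged - r.alert_fired for r in records)


def mttr(records: List[IncidentRecord]) -> timedelta:
    """Mean time from fault start to restored service (assumed definition)."""
    return _mean_delta(r.restored - r.fault_started for r in records)


def change_failure_rate(deployments: int, failed_deployments: int) -> float:
    """Fraction of deployments that caused an incident."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return failed_deployments / deployments
```

For example, two incidents that took 4 and 2 minutes to detect, 2 and 6 minutes to acknowledge, and 40 and 20 minutes to restore give an MTTD of 3 minutes, an MTTA of 4 minutes, and an MTTR of 30 minutes. Three failed deployments out of 40 is a change failure rate of 7.5%.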

The SRE Mental Model

Google's Site Reliability Engineering book popularised the discipline. Its central insight is to treat reliability as a software engineering problem, measurable and improvable, rather than as an operational ritual.

Three pillars:

  1. SLIs, SLOs, SLAs. Service level indicators, objectives, and agreements quantify what good looks like. They give the team a data-driven signal for when to focus on reliability rather than features.
  2. Error budgets. The error budget is the amount of unreliability the SLO permits. If you haven't consumed it, you can deploy fearlessly. If you've nearly exhausted it, you slow down and invest in reliability.
  3. Toil reduction. Automate the repetitive operational work that doesn't produce lasting improvement, so engineers can spend time on high-leverage problems.
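The error budget idea can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions: the SLO is a time-based availability target, the window is 30 days, and the function names and the 25% caution threshold are hypothetical choices for illustration, not values from the SRE book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def release_posture(remaining_fraction: float, caution_threshold: float = 0.25) -> str:
    """Map remaining budget to a release stance; the threshold is an assumption."""
    if remaining_fraction <= 0:
        return "freeze"      # budget exhausted: prioritise reliability work
    if remaining_fraction < caution_threshold:
        return "slow down"   # budget nearly exhausted
    return "ship"            # budget healthy: deploy normally
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime. With about 10.8 minutes spent, around three quarters of the budget remains and the posture is "ship"; with 40 minutes spent it becomes "slow down"; with 50 minutes spent the budget is overdrawn and the posture is "freeze".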

ITIL 4 vs SRE

| Concept | ITIL 4 | SRE |
| --- | --- | --- |
| Unplanned disruption | Incident | Incident |
| Restoration of service | Incident Management | Incident Response |
| Root cause work | Problem Management | Postmortem → Action Items |
| Reliability targets | SLA | SLO (stricter; internal-only) |
| Operational documentation | Known error database | Runbook / Playbook |
| Change risk | Change Advisory Board | Error budget + progressive delivery |

ITIL is process-centric and enterprise-governance-oriented. SRE is engineering-centric and metrics-driven. Most organisations blend both — ITIL provides the audit trail and governance, SRE provides the engineering rigour.

Cultural Foundation

Process and tooling are necessary but insufficient. The best incident management cultures share:

  • Psychological safety. Engineers report incidents and near-misses without fear of blame.
  • Learning orientation. Incidents are treated as valuable information about system behaviour, not failures of individual engineers.
  • Shared ownership. Everyone contributes to reliability — not just the team that receives the pages.
  • Operational empathy. Product managers and leadership understand that feature velocity has a reliability cost.

The rest of this course builds the process and tooling on top of this foundation. The next lesson addresses the first structural decision: how to classify and escalate incidents.

Key Takeaways

  • An incident is any unplanned event that disrupts, or risks disrupting, a service, from a brief blip to a multi-hour outage.
  • Structured incident management shortens mean time to restore (MTTR) and replaces ad-hoc chaos with explicit roles, communication channels, and a defined lifecycle.
  • The incident lifecycle: detection → triage → mitigation → resolution → postmortem.
  • ITIL 4 and SRE converge on the same core practices under different vocabularies.
  • Reliability is a feature — it must be actively engineered, not assumed.
