Incident Management Fundamentals — Incident Management & On-Call Engineering | CertQnA

Every system that runs in production will, at some point, fail. The organisations that handle these failures well — quickly, calmly, and without repeating the same mistakes — have something in common: they have built systems and culture around incident management. Those that don't tend to experience the same outages over and over, burning out engineers and eroding customer trust.

What Is an Incident?

An incident is any unplanned event that disrupts — or risks disrupting — the normal operation of a service. The definition is deliberately broad:

A database server that crashed and recovered in 45 seconds
A latency spike that caused 5% of requests to time out for 3 minutes
An erroneous deploy that sent 500 errors to a small percentage of users
A security event where customer data may have been exposed
A third-party API dependency outage that degraded your checkout flow

All of these are incidents. The difference lies in their severity, impact, and required response — not whether they count.

Why Structure Matters

In the absence of structure, incident response degrades predictably:

Multiple engineers make conflicting changes simultaneously
Communication fragments across Slack, email, and Zoom calls with different audiences
The person with the most context stays on a call for 8 hours while others idle
Mitigation is delayed because nobody is explicitly leading diagnosis
After resolution, no learning occurs — the same incident happens again in 6 weeks

Structured incident management solves each of these with explicit roles, communication channels, and a defined lifecycle.

The Incident Lifecycle

Every incident follows the same arc:

Phase	Goal	Key activity
Detection	Know something is wrong	Alerting fires; on-call acknowledges
Triage	Understand scope and severity	Assess user impact; assign severity; escalate if needed
Mitigation	Restore service — even partially	Rollback, failover, feature flag, traffic shift
Resolution	Confirm normal operation restored	Verify metrics; close the incident
Postmortem	Learn and prevent recurrence	Timeline reconstruction, root cause analysis, action items

ITIL 4 uses different terms and splits the work across practices (incident management for restoration, including major incidents, and problem management for root-cause work), but the lifecycle is similar in substance.
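
The phases in the table can be modelled as a small state machine, which is roughly how incident tooling keeps an incident from being closed before it has been mitigated. The following is a minimal sketch, not a reference implementation: the `Phase`, `ALLOWED`, and `Incident` names are hypothetical, and the rule that mitigation may loop back to triage is an assumption added for illustration.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = "detection"
    TRIAGE = "triage"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POSTMORTEM = "postmortem"


# Hypothetical transition rules: each phase leads to the next row of the table.
# Assumption: a mitigation that does not hold sends the incident back to triage.
ALLOWED = {
    Phase.DETECTION: {Phase.TRIAGE},
    Phase.TRIAGE: {Phase.MITIGATION},
    Phase.MITIGATION: {Phase.TRIAGE, Phase.RESOLUTION},
    Phase.RESOLUTION: {Phase.POSTMORTEM},
    Phase.POSTMORTEM: set(),
}


class Incident:
    """Tracks which lifecycle phase an incident is in."""

    def __init__(self, title):
        self.title = title
        self.phase = Phase.DETECTION
        self.history = [Phase.DETECTION]

    def advance(self, new_phase):
        """Move to new_phase, rejecting transitions the table does not allow."""
        if new_phase not in ALLOWED[self.phase]:
            raise ValueError(
                f"cannot move from {self.phase.value} to {new_phase.value}"
            )
        self.phase = new_phase
        self.history.append(new_phase)


incident = Incident("checkout latency spike")
incident.advance(Phase.TRIAGE)
incident.advance(Phase.MITIGATION)
incident.advance(Phase.RESOLUTION)
incident.advance(Phase.POSTMORTEM)
```

Skipping a phase, for example jumping from triage straight to postmortem, raises `ValueError` in this sketch. Real tooling may be more permissive.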

Key Metrics

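MTTD (Mean Time to Detect). Average time from the start of a fault to the alert firing. Reflects observability coverage.
MTTA (Mean Time to Acknowledge). Average time from the alert firing to a responder acknowledging it. Reflects alert quality and on-call responsiveness.
MTTR (Mean Time to Restore/Resolve). Average time from the start of an incident to service restoration. The main reliability health metric. Teams differ on whether the clock starts at fault onset or at detection, so state which convention you use.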
Incident frequency. How often incidents occur. Flat or rising frequency signals systemic reliability debt.
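Change failure rate. Percentage of deployments that cause a failure in production requiring remediation. One of the DORA metrics; 15% or lower is a commonly cited target, drawn from the top-performer band in DORA's reports.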
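
To make the time-based metrics concrete, here is a minimal sketch that computes them from incident timestamps. The record fields (`fault`, `alert`, `ack`, `restored`), the helper names, and the sample data are all hypothetical. It also assumes MTTR is measured from fault onset, which is only one of the conventions in use.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across all incidents."""
    gaps = [
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    ]
    return mean(gaps)


def change_failure_rate(deploys, failed_deploys):
    """Percentage of deployments that caused an incident."""
    return 100.0 * failed_deploys / deploys


# Illustrative data only: two incidents with made-up timestamps.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {
        "fault": t0,
        "alert": t0 + timedelta(minutes=4),
        "ack": t0 + timedelta(minutes=6),
        "restored": t0 + timedelta(minutes=34),
    },
    {
        "fault": t0,
        "alert": t0 + timedelta(minutes=2),
        "ack": t0 + timedelta(minutes=10),
        "restored": t0 + timedelta(minutes=62),
    },
]

mttd = mean_minutes(incidents, "fault", "alert")     # fault to alert: 3.0 minutes
mtta = mean_minutes(incidents, "alert", "ack")       # alert to acknowledgement: 5.0 minutes
mttr = mean_minutes(incidents, "fault", "restored")  # fault to restoration: 48.0 minutes
cfr = change_failure_rate(deploys=40, failed_deploys=3)  # 7.5 percent
```

Means hide outliers, so a single eight-hour incident can dominate MTTR. Reporting a median or percentile alongside the mean is a common complement.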

The SRE Mental Model

Google's Site Reliability Engineering book popularised the discipline. Its central insight is to treat reliability as a software engineering problem, one that is measurable and improvable, rather than as an operational ritual.

Three pillars:

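SLIs, SLOs, SLAs. An SLI measures some aspect of service behaviour, an SLO sets the internal target for it, and an SLA is the contractual commitment made to customers. Together they quantify what good looks like and give the team a data-driven signal for when to focus on reliability rather than features.
Error budgets. The error budget is the amount of unreliability the SLO permits (a 99.9% SLO leaves a 0.1% budget). If you haven't consumed your error budget, you can deploy fearlessly. If you've nearly exhausted it, you slow down and invest in reliability.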
Toil reduction. Automate the repetitive operational work that doesn't produce lasting improvement, so engineers can spend time on high-leverage problems.
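
The error budget arithmetic is simple enough to sketch. The example below assumes a request-based availability SLO. The function names, the 80% slow-down threshold, and the three policy labels are hypothetical choices for illustration, not part of any standard.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, fraction_of_budget_consumed) for a window.

    slo_target is a ratio such as 0.999 for a 99.9% availability SLO.
    """
    allowed = (1.0 - slo_target) * total_requests
    if allowed <= 0:
        # A 100% SLO leaves no budget at all; any failure exhausts it.
        return 0.0, float("inf") if failed_requests else 0.0
    return allowed, failed_requests / allowed


def release_policy(consumed_fraction, slow_down_at=0.8):
    """Map budget consumption to a release posture (thresholds are assumptions)."""
    if consumed_fraction >= 1.0:
        return "freeze"
    if consumed_fraction >= slow_down_at:
        return "slow down"
    return "ship"


# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failed requests.
allowed, consumed = error_budget(0.999, 1_000_000, 850)
posture = release_policy(consumed)  # about 85% of the budget used: "slow down"
```

The same calculation works for time-based SLOs by swapping requests for minutes in the window.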

ITIL 4 vs SRE

Concept	ITIL 4	SRE
Unplanned disruption	Incident	Incident
Restoration of service	Incident Management	Incident Response
Root cause work	Problem Management	Postmortem → Action Items
Reliability targets	SLA	SLO (internal target, usually stricter than the SLA)
Operational documentation	Known error database	Runbook / Playbook
Change risk	Change Advisory Board	Error budget + progressive delivery

ITIL is process-centric and enterprise-governance-oriented. SRE is engineering-centric and metrics-driven. Most organisations blend both — ITIL provides the audit trail and governance, SRE provides the engineering rigour.

Cultural Foundation

Process and tooling are necessary but insufficient. The best incident management cultures share:

Psychological safety. Engineers report incidents and near-misses without fear of blame.
Learning orientation. Incidents are treated as valuable information about system behaviour, not failures of individual engineers.
Shared ownership. Everyone contributes to reliability — not just the team that receives the pages.
Operational empathy. Product managers and leadership understand that feature velocity has a reliability cost.

The rest of this course builds the process and tooling on top of this foundation. The next lesson addresses the first structural decision: how to classify and escalate incidents.