Not all incidents are equal. A SEV1 — complete checkout unavailability for an e-commerce platform — demands immediate all-hands response and executive communication. A SEV5 — a single broken link on a help page — can wait for the next sprint. The severity classification system is what routes each incident to the right level of response without requiring individual judgment every time.
A Standard Severity Framework
| Level | Name | User Impact | Response SLA | Stakeholders |
|---|---|---|---|---|
| SEV1 | Critical | Full outage or data loss; all users affected | Acknowledge <5 min; engage <15 min | On-call, IC, VP/C-level, comms team |
| SEV2 | Major | Core feature broken; significant percentage of users | Acknowledge <15 min; engage <30 min | On-call, IC, Engineering lead, Customer Success |
| SEV3 | Minor | Degraded performance; workaround exists | Acknowledge <30 min; resolve same shift | On-call, Team lead |
| SEV4 | Low | Minor UX issue; very small percentage of users | Acknowledge next business day | On-call engineer |
| SEV5 | Cosmetic | No functional impact | Scheduled in normal sprint | Engineering team |
These are starting points. Every organisation should calibrate them to its own service criticality and team size. A startup's SEV2 might be a large enterprise's SEV4.
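One way to keep responders from having to recall the table under pressure is to encode it as data. The sketch below is a minimal illustration, not a prescribed implementation: the names `SeverityPolicy`, `POLICIES`, and `ack_breached` are hypothetical, and SEV4 and SEV5 are modelled as having no minute-level acknowledgement timer, which is an assumption about how "next business day" and "normal sprint" would be tracked.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    # Acknowledgement deadline from the table; None means no paging timer
    # (assumption for SEV4/SEV5, which are handled on business-day timescales).
    ack_within: Optional[timedelta]
    stakeholders: tuple


# Values copied from the framework table above.
POLICIES = {
    1: SeverityPolicy("Critical", timedelta(minutes=5),
                      ("On-call", "IC", "VP/C-level", "Comms team")),
    2: SeverityPolicy("Major", timedelta(minutes=15),
                      ("On-call", "IC", "Engineering lead", "Customer Success")),
    3: SeverityPolicy("Minor", timedelta(minutes=30),
                      ("On-call", "Team lead")),
    4: SeverityPolicy("Low", None, ("On-call engineer",)),
    5: SeverityPolicy("Cosmetic", None, ("Engineering team",)),
}


def ack_breached(sev: int, elapsed: timedelta) -> bool:
    """Return True if the acknowledgement deadline for this severity has passed."""
    policy = POLICIES[sev]
    return policy.ack_within is not None and elapsed >= policy.ack_within
```

A paging tool could call `ack_breached(1, timedelta(minutes=6))` to decide whether an unacknowledged SEV1 should be re-paged.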
Classification Criteria — User Impact First
Severity must be driven by observed or likely user impact, not by what the alert says or how interesting the technical problem is.
Ask these questions:
- Is a critical user flow unavailable? (Login, checkout, payment, core API) → SEV1/2
- What percentage of users are affected? 100% → SEV1; 25–100% → SEV1/2; 1–10% → SEV3; <1% → SEV4
- Is data loss occurring or possible? Any data loss risk → minimum SEV2; confirmed loss → SEV1
- Does a workaround exist? No workaround → escalate severity; workaround exists → may down-classify
- Is there a security dimension? Potential breach → SEV1 regardless of user-visible impact
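The questions above can be sketched as a small function that proposes a starting severity. This is an illustration under stated assumptions rather than a definitive rule set: the function name `classify` and its parameters are hypothetical, the 10–25% band is not defined in the list and is treated here as SEV2 (the more conservative choice), "SEV1/2" is resolved to SEV1 only when a critical flow is down, and "no workaround" is modelled as raising severity by one level.

```python
def classify(percent_affected: float,
             critical_flow_down: bool = False,
             data_loss: str = "none",
             has_workaround: bool = True,
             security_suspected: bool = False) -> int:
    """Propose an initial severity (1 = most severe) from user-impact questions."""
    if data_loss not in ("none", "possible", "confirmed"):
        raise ValueError("data_loss must be 'none', 'possible' or 'confirmed'")

    # Potential breach or confirmed data loss is SEV1 regardless of reach.
    if security_suspected or data_loss == "confirmed":
        return 1

    # Start from the share of users affected.
    if percent_affected >= 100:
        sev = 1
    elif percent_affected >= 25:
        sev = 1 if critical_flow_down else 2
    elif percent_affected >= 10:
        sev = 2  # assumption: this band is not defined in the text
    elif percent_affected >= 1:
        sev = 3
    else:
        sev = 4

    # A critical flow being down, or any data loss risk, means at least SEV2.
    if critical_flow_down or data_loss == "possible":
        sev = min(sev, 2)

    # No workaround: raise severity by one level (assumption on step size).
    if not has_workaround:
        sev = max(1, sev - 1)

    return sev
```

The result is a proposal for the responder to confirm or override. Down-classifying because a workaround exists is deliberately left to human judgment.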
The Declare-Early Principle
When in doubt, declare the higher severity. Standing down from a SEV1 that turns out to be a SEV2 costs little; under-responding to an actual SEV1 costs far more.
Common failure modes of under-declaring:
- The on-call engineer investigates quietly for 30 minutes before declaring, while customers who are already hitting the problem call support
- Treating a symptom as the root cause: "the queue is full" is declared as a SEV3, but the full queue was masking a database that had run out of disk, which is an actual SEV1
- Social pressure not to "overreact" — counterproductive and a cultural smell
"Classifying an incident is an estimate under uncertainty. You will sometimes be wrong. The system should make it easy to adjust, not punish the initial call."
Escalation Triggers
Escalation should be automatic, not a judgment call in the heat of an incident:
| Trigger | Action |
|---|---|
| No progress after N minutes at current severity | Escalate to next severity; page additional responders |
| Impact broader than initially assessed | Reclassify upward; notify new stakeholders |
| Financial transaction integrity at risk | Escalate to SEV1; engage Finance |
| Personal data at risk (GDPR, HIPAA) | Escalate to SEV1; engage Legal and DPO within SLA |
| On-call engineer unable to progress | Page secondary on-call or domain expert |
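To make escalation mechanical rather than a judgment call, some of the triggers in the table can be evaluated by a periodic check. The following is a minimal sketch, assuming a hypothetical `IncidentState` record and an `evaluate_triggers` function. The stall limit corresponds to the "N minutes" in the table and is left as a required parameter because the text does not fix a value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentState:
    severity: int                       # 1 = most severe
    last_progress_at: datetime          # last time responders recorded progress
    financial_integrity_at_risk: bool = False
    personal_data_at_risk: bool = False


def evaluate_triggers(state: IncidentState, now: datetime,
                      stall_limit: timedelta):
    """Return (new_severity, actions) after applying the escalation triggers."""
    severity = state.severity
    actions = []

    # No progress for the configured time: escalate one level and add responders.
    if now - state.last_progress_at >= stall_limit:
        severity = max(1, severity - 1)
        actions.append("page additional responders")

    # Financial or personal-data risk forces SEV1 and pulls in named stakeholders.
    if state.financial_integrity_at_risk:
        severity = 1
        actions.append("engage Finance")
    if state.personal_data_at_risk:
        severity = 1
        actions.append("engage Legal and DPO")

    return severity, actions
```

For example, a SEV3 with no recorded progress for 25 minutes and a 20-minute stall limit (an illustrative value) would come back as SEV2 with a request to page additional responders.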
De-Escalation
De-escalation (downgrading severity during an incident) is less common but valid:
- Initial pages indicate SEV1; investigation reveals impact is limited to a small cohort → reclassify to SEV2/3 but continue working
- Mitigation has reduced user impact even without full resolution → adjust severity to reflect actual current state
Always document severity changes in the incident timeline with the reasoning.
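One way to enforce that rule is to make the reason a required part of the reclassification itself. The sketch below assumes hypothetical `Incident` and `SeverityChange` types. It illustrates the idea only and does not describe any particular incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class SeverityChange:
    at: datetime
    old: int
    new: int
    reason: str


@dataclass
class Incident:
    severity: int
    timeline: List[SeverityChange] = field(default_factory=list)

    def reclassify(self, new_severity: int, reason: str,
                   at: Optional[datetime] = None) -> None:
        """Change severity (up or down) and record why in the timeline."""
        if not reason.strip():
            raise ValueError("a severity change needs a documented reason")
        if new_severity == self.severity:
            return  # nothing changed, nothing to record
        when = at or datetime.now(timezone.utc)
        self.timeline.append(
            SeverityChange(when, self.severity, new_severity, reason))
        self.severity = new_severity
```

Calling `incident.reclassify(3, "impact limited to one beta cohort")` on a SEV1 leaves a timestamped entry that the postmortem can later review.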
Security Incident Classification
Security incidents require a parallel track:
| Priority | Description | Response |
|---|---|---|
| P1 | Active breach; data exfiltration in progress | Immediate: Security, Legal, DPO, Exec |
| P2 | Confirmed unauthorised access; no ongoing exfiltration | Security team + Engineering; preserve evidence |
| P3 | Suspected breach; unverified indicators of compromise | Security team investigates; IR plan activated |
| P4 | Security policy violation; no external threat | Security team + HR as appropriate |
Security incidents often require evidence preservation — do not immediately restart or terminate resources. This conflicts with the "restore service first" principle of standard incidents. The security IR runbook must explicitly address this tension.
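A runbook can make this tension explicit with a simple guard in remediation tooling. The sketch below is hypothetical throughout: the `may_run` function, the action names, and the idea of an `evidence_preserved` flag are assumptions for illustration, and real evidence handling should follow your security team's procedure.

```python
# Actions that can destroy forensic evidence (illustrative list, not exhaustive).
DESTRUCTIVE_ACTIONS = frozenset({"restart", "terminate"})


def may_run(action: str, is_security_incident: bool,
            evidence_preserved: bool) -> bool:
    """Decide whether a remediation action may proceed right now."""
    if action not in DESTRUCTIVE_ACTIONS:
        return True
    if not is_security_incident:
        # Standard incidents follow "restore service first".
        return True
    # Security incidents: destructive actions wait until evidence is captured.
    return evidence_preserved
```

With this guard, `may_run("restart", True, False)` returns `False` until the evidence step is marked complete, while the same restart in a non-security incident proceeds immediately.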
Incident vs Service Request vs Problem
ITIL 4 distinguishes three things that get mixed up in practice:
- Incident — unplanned disruption; goal is restoration
- Service Request — a planned, expected request (provision access, deploy a feature) — not an incident
- Problem — the root cause underlying one or more incidents; managed separately over a longer timeframe
Problems are investigated through problem management, using the blameless postmortem process covered in a later lesson. A database disk-full condition that caused three incidents in six months is a single problem; each of those incidents is tracked separately.
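The incident/problem distinction can be illustrated by grouping incident records by their identified root cause. This is a minimal sketch, assuming incidents are available as `(incident_id, root_cause)` pairs. The function name `recurring_problems`, the incident IDs, and the cause labels are all hypothetical.

```python
from collections import Counter


def recurring_problems(incidents, min_incidents=2):
    """Map each root cause seen at least min_incidents times to its count."""
    counts = Counter(cause for _, cause in incidents if cause is not None)
    return {cause: n for cause, n in counts.items() if n >= min_incidents}


# Hypothetical records echoing the disk-full example above.
incidents = [
    ("INC-101", "db-disk-full"),
    ("INC-117", "db-disk-full"),
    ("INC-120", "expired-cert"),
    ("INC-131", "db-disk-full"),
    ("INC-140", None),  # root cause not yet identified
]
print(recurring_problems(incidents))
```

Each entry in the result is a candidate problem record. The individual incidents keep their own IDs and timelines.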
Documenting Your Severity Levels
Write down your severity definitions and publish them in your incident management runbook:
# Severity Definitions — Acme Corp
Last reviewed: 2026-Q1
## SEV1 — Critical (respond immediately 24/7)
- Full outage affecting >10% of users OR
- Any payment processing failure OR
- Any suspected data breach OR
- Any data loss (even partial)
## SEV2 — Major (respond within 30 min 24/7)
- Core feature unavailable for any users OR
- Degraded performance >2x normal latency for >25% of users OR
- Third-party integration completely down with no workaround
## SEV3 — Minor (respond within 4 hours during business hours)
- Non-core feature degraded OR
- Workaround available for most users OR
- <5% of users affected with no financial impact
## SEV4 / SEV5 — follow normal sprint process
Concrete definitions — with named examples from your own service — reduce classification disputes during high-stress incidents. The next lesson addresses who responds to these incidents and how the on-call system is designed.