5 min read·Lesson 2 of 8

Severity Levels and Incident Classification

Defining, applying, and communicating severity levels — the classification system that drives the right response to every incident.

Not all incidents are equal. A SEV1 — complete checkout unavailability for an e-commerce platform — demands immediate all-hands response and executive communication. A SEV5 — a single broken link on a help page — can wait for the next sprint. The severity classification system is what routes each incident to the right level of response without requiring individual judgment every time.

A Standard Severity Framework

| Level | Name | User Impact | Response SLA | Stakeholders |
|-------|------|-------------|--------------|--------------|
| SEV1 | Critical | Full outage or data loss; all users affected | Acknowledge <5 min; engage <15 min | On-call, IC, VP/C-level, comms team |
| SEV2 | Major | Core feature broken; significant percentage of users | Acknowledge <15 min; engage <30 min | On-call, IC, Engineering lead, Customer Success |
| SEV3 | Minor | Degraded performance; workaround exists | Acknowledge <30 min; resolve same shift | On-call, Team lead |
| SEV4 | Low | Minor UX issue; very small percentage of users | Acknowledge next business day | On-call engineer |
| SEV5 | Cosmetic | No functional impact | Scheduled in normal sprint | Engineering team |

These are starting points. Every organisation should calibrate them to its own service criticality and team size. A startup's SEV2 might be a large enterprise's SEV4.
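One way to keep a framework like this enforceable is to store it as data that paging tooling can read. The sketch below encodes the acknowledge and engage deadlines from the table above; the names `Severity`, `ResponsePolicy`, `POLICIES`, and `ack_overdue` are hypothetical, and the non-paging targets for SEV3 through SEV5 ("same shift", "next business day", "normal sprint") are simply represented as `None`.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class ResponsePolicy:
    name: str
    acknowledge_within: Optional[timedelta]  # None: no minute-level deadline
    engage_within: Optional[timedelta]       # None: not defined in the table
    stakeholders: tuple


# Values copied from the framework table; calibrate them for your own service.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        "Critical", timedelta(minutes=5), timedelta(minutes=15),
        ("On-call", "IC", "VP/C-level", "Comms team")),
    Severity.SEV2: ResponsePolicy(
        "Major", timedelta(minutes=15), timedelta(minutes=30),
        ("On-call", "IC", "Engineering lead", "Customer Success")),
    Severity.SEV3: ResponsePolicy(
        "Minor", timedelta(minutes=30), None, ("On-call", "Team lead")),
    Severity.SEV4: ResponsePolicy("Low", None, None, ("On-call engineer",)),
    Severity.SEV5: ResponsePolicy("Cosmetic", None, None, ("Engineering team",)),
}


def ack_overdue(severity: Severity, elapsed: timedelta) -> bool:
    """True once the acknowledge deadline for this severity has passed."""
    deadline = POLICIES[severity].acknowledge_within
    return deadline is not None and elapsed >= deadline
```

A paging loop could call `ack_overdue(Severity.SEV1, elapsed)` on each tick and re-page when it returns `True`.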

Classification Criteria — User Impact First

Severity must be driven by observed or likely user impact, not by what the alert says or how interesting the technical problem is.

Ask these questions:

  1. Is a critical user flow unavailable? (Login, checkout, payment, core API) → SEV1/2
  2. What percentage of users are affected? 100% → SEV1; 25–100% → SEV1/2; <10% → SEV3; <1% → SEV4
  3. Is data loss occurring or possible? Any data loss risk → minimum SEV2; confirmed loss → SEV1
  4. Does a workaround exist? No workaround → escalate severity; workaround exists → may down-classify
  5. Is there a security dimension? Potential breach → SEV1 regardless of user-visible impact
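The five questions can be sketched as a small decision function. This is an illustration under stated assumptions, not a definitive policy: the `Impact` fields and `classify` are hypothetical names, the 10–25% band (which the thresholds above leave open) is treated as SEV2 in line with declaring high when unsure, and "no workaround" is modelled as raising severity by one level.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    critical_flow_down: bool        # login, checkout, payment, core API
    percent_users_affected: float   # 0 to 100
    workaround_exists: bool
    data_loss_possible: bool = False
    data_loss_confirmed: bool = False
    potential_breach: bool = False


def classify(impact: Impact) -> int:
    """Return a severity from 1 (most severe) to 4."""
    # Questions 3 and 5: confirmed data loss or a potential breach is SEV1.
    if impact.potential_breach or impact.data_loss_confirmed:
        return 1

    # Question 2: baseline from the share of users affected.
    pct = impact.percent_users_affected
    if pct >= 100:
        sev = 1
    elif pct >= 10:
        sev = 2  # assumption: the unspecified 10-25% band is treated like 25-100%
    elif pct >= 1:
        sev = 3
    else:
        sev = 4

    # Questions 1 and 3: a critical flow down or data loss risk is at least SEV2.
    if impact.critical_flow_down or impact.data_loss_possible:
        sev = min(sev, 2)

    # Question 4: no workaround escalates one level (assumption).
    # Down-classifying is left to a human, so it is not automated here.
    if not impact.workaround_exists and sev > 1:
        sev -= 1
    return sev
```

For example, `classify(Impact(critical_flow_down=True, percent_users_affected=30, workaround_exists=False))` yields 1, while the same impact with a workaround yields 2.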

The Declare-Early Principle

When in doubt, declare higher severity. The cost of standing down from a SEV1 that turns out to be SEV2 is minimal — the cost of under-responding to an actual SEV1 is significant.

Common failure modes of under-declaring:

  • On-call engineer investigates quietly for 30 minutes before declaring — customers already hitting the problem are calling support during this delay
  • Treating a symptom as the root cause: "the queue is full" is declared as SEV3, but the full queue was masking a database that had run out of disk, which is an actual SEV1
  • Social pressure not to "overreact" — counterproductive and a cultural smell

"Classifying an incident is an estimate under uncertainty. You will sometimes be wrong. The system should make it easy to adjust, not punish the initial call."

Escalation Triggers

Escalation should be automatic, not a judgment call in the heat of an incident:

| Trigger | Action |
|---------|--------|
| No progress after N minutes at current severity | Escalate to next severity; page additional responders |
| Impact broader than initially assessed | Reclassify upward; notify new stakeholders |
| Financial transaction integrity at risk | Escalate to SEV1; engage Finance |
| Personal data at risk (GDPR, HIPAA) | Escalate to SEV1; engage Legal and DPO within SLA |
| On-call engineer unable to progress | Page secondary on-call or domain expert |
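To make these triggers automatic rather than a judgment call, they can be evaluated by a rule function on every status update. The following is a minimal sketch: `IncidentState` and `evaluate_triggers` are hypothetical names, the stall limit N is passed in because it is organisation-specific, and "reclassify upward" is modelled as a one-level bump, which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IncidentState:
    severity: int                   # 1 (most severe) to 5
    minutes_since_progress: int
    impact_grew: bool = False
    financial_integrity_at_risk: bool = False
    personal_data_at_risk: bool = False
    on_call_stuck: bool = False


def evaluate_triggers(state: IncidentState,
                      stall_limit_minutes: int) -> Tuple[int, List[str]]:
    """Return the severity the incident should now have, plus actions to take."""
    target = state.severity
    one_level_up = max(1, state.severity - 1)  # never escalate past SEV1
    actions: List[str] = []

    if state.minutes_since_progress >= stall_limit_minutes:
        target = min(target, one_level_up)
        actions.append("page additional responders")
    if state.impact_grew:
        target = min(target, one_level_up)
        actions.append("notify new stakeholders")
    if state.financial_integrity_at_risk:
        target = 1
        actions.append("engage Finance")
    if state.personal_data_at_risk:
        target = 1
        actions.append("engage Legal and DPO")
    if state.on_call_stuck:
        # This trigger adds responders without changing severity.
        actions.append("page secondary on-call or domain expert")
    return target, actions
```

Returning the actions as plain strings keeps the sketch independent of any particular paging tool; a real integration would map each one to a page or notification.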

De-Escalation

De-escalation (downgrading severity during an incident) is less common but valid:

  • Initial pages indicate SEV1; investigation reveals impact is limited to a small cohort → reclassify to SEV2/3 but continue working
  • Mitigation has reduced user impact even without full resolution → adjust severity to reflect actual current state

Always document severity changes in the incident timeline with the reasoning.
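The documentation rule is easier to follow when the tooling enforces it. The sketch below, using the hypothetical names `Incident`, `SeverityChange`, and `reclassify`, refuses a severity change that comes without a reason and appends each change to the timeline with a UTC timestamp.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class SeverityChange:
    at: datetime
    old: int
    new: int
    reason: str


@dataclass
class Incident:
    severity: int
    timeline: List[SeverityChange] = field(default_factory=list)

    def reclassify(self, new_severity: int, reason: str) -> None:
        """Change severity in either direction and record why."""
        if not reason.strip():
            raise ValueError("a severity change must record its reasoning")
        if new_severity == self.severity:
            return  # nothing changed, so nothing to log
        self.timeline.append(
            SeverityChange(datetime.now(timezone.utc),
                           self.severity, new_severity, reason))
        self.severity = new_severity
```

The same method covers escalation and de-escalation, so the timeline shows every reclassification in order for the later review.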

Security Incident Classification

Security incidents require a parallel track:

| Priority | Description | Response |
|----------|-------------|----------|
| P1 | Active breach; data exfiltration in progress | Immediate: Security, Legal, DPO, Exec |
| P2 | Confirmed unauthorised access; no ongoing exfiltration | Security team + Engineering; preserve evidence |
| P3 | Suspected breach; unverified indicators of compromise | Security team investigates; IR plan activated |
| P4 | Security policy violation; no external threat | Security team + HR as appropriate |

Security incidents often require evidence preservation — do not immediately restart or terminate resources. This conflicts with the "restore service first" principle of standard incidents. The security IR runbook must explicitly address this tension.
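As a rough illustration of the parallel track, the priority table and the evidence-preservation rule can be expressed as two small checks. Both function names are hypothetical, and the restart guard is only one possible way a runbook might resolve the tension; it assumes responders explicitly mark evidence as preserved before destructive actions are allowed.

```python
def security_priority(active_exfiltration: bool,
                      confirmed_access: bool,
                      suspected_compromise: bool) -> str:
    """Map observed indicators to the P1-P4 track, most severe first."""
    if active_exfiltration:
        return "P1"
    if confirmed_access:
        return "P2"
    if suspected_compromise:
        return "P3"
    return "P4"  # policy violation with no external threat


def may_restart(is_security_incident: bool, evidence_preserved: bool) -> bool:
    """Allow restart or termination only when it cannot destroy evidence."""
    return (not is_security_incident) or evidence_preserved
```

With this guard, a standard incident can restore service immediately, while a security incident is held until evidence has been captured.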

Incident vs Service Request vs Problem

ITIL 4 distinguishes three things that get mixed up in practice:

  • Incident — unplanned disruption; goal is restoration
  • Service Request — a planned, expected request (provision access, deploy a feature) — not an incident
  • Problem — the root cause underlying one or more incidents; managed separately over a longer timeframe

Problems are investigated via problem management — the blameless postmortem process covered in a later lesson. The same database disk-full issue that caused three incidents in six months is a problem; each individual incident is separate.

Documenting Your Severity Levels

Write down your severity definitions and publish them in your incident management runbook:

# Severity Definitions — Acme Corp
Last reviewed: 2026-Q1

## SEV1 — Critical (respond immediately 24/7)
- Full outage affecting >10% of users OR
- Any payment processing failure OR
- Any suspected data breach OR
- Any data loss (even partial)

## SEV2 — Major (respond within 30 min 24/7)
- Core feature unavailable for any users OR
- Degraded performance >2x normal latency for >25% of users OR
- Third-party integration completely down with no workaround

## SEV3 — Minor (respond within 4 hours during business hours)
- Non-core feature degraded OR
- Workaround available for most users OR
- <5% of users affected with no financial impact

## SEV4 / SEV5 — follow normal sprint process

Concrete definitions — with named examples from your own service — reduce classification disputes during high-stress incidents. The next lesson addresses who responds to these incidents and how the on-call system is designed.

Key Takeaways

  • Severity levels (SEV1–SEV5) dictate response time, roles engaged, and communication requirements.
  • Classification must be based on observed or likely user impact, not technical symptoms.
  • Err toward over-declaring severity — it is cheaper to stand down than to under-respond.
  • Security incidents require a parallel classification track with different escalation paths.
  • Review and revise your severity definitions annually — they drift from reality over time.
