Severity Levels and Incident Classification — Incident Management & On-Call Engineering | CertQnA

Not all incidents are equal. A SEV1 — complete checkout unavailability for an e-commerce platform — demands immediate all-hands response and executive communication. A SEV5 — a single broken link on a help page — can wait for the next sprint. The severity classification system is what routes each incident to the right level of response without requiring individual judgment every time.

A Standard Severity Framework

Level	Name	User Impact	Response SLA	Stakeholders
SEV1	Critical	Full outage or data loss; all users affected	Acknowledge <5 min; engage <15 min	On-call, IC, VP/C-level, comms team
SEV2	Major	Core feature broken; significant percentage of users	Acknowledge <15 min; engage <30 min	On-call, IC, Engineering lead, Customer Success
SEV3	Minor	Degraded performance; workaround exists	Acknowledge <30 min; resolve same shift	On-call, Team lead
SEV4	Low	Minor UX issue; very small percentage of users	Acknowledge next business day	On-call engineer
SEV5	Cosmetic	No functional impact	Scheduled in normal sprint	Engineering team

These are starting points. Every organisation should calibrate them to its own service criticality and team size. A startup's SEV2 might be a large enterprise's SEV4.
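Encoding the table as data lets paging and reporting tools read the same definitions that people do. The sketch below is one possible shape, not a prescribed schema: the names `Severity`, `ResponsePolicy`, `POLICIES`, and `is_ack_overdue` are hypothetical, and only the minute-level targets from the table are modelled (targets such as "next business day" are left as `None`).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Severity(IntEnum):
    """Lower number means more severe, matching the SEV1..SEV5 labels."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class ResponsePolicy:
    name: str
    ack_minutes: Optional[int]     # None: no minute-level acknowledgement target
    engage_minutes: Optional[int]  # None: no minute-level engagement target
    stakeholders: Tuple[str, ...]


# Values copied from the table above; calibrate them for your own organisation.
POLICIES = {
    Severity.SEV1: ResponsePolicy("Critical", 5, 15, ("On-call", "IC", "VP/C-level", "Comms team")),
    Severity.SEV2: ResponsePolicy("Major", 15, 30, ("On-call", "IC", "Engineering lead", "Customer Success")),
    Severity.SEV3: ResponsePolicy("Minor", 30, None, ("On-call", "Team lead")),
    Severity.SEV4: ResponsePolicy("Low", None, None, ("On-call engineer",)),
    Severity.SEV5: ResponsePolicy("Cosmetic", None, None, ("Engineering team",)),
}


def is_ack_overdue(severity: Severity, minutes_since_page: float) -> bool:
    """True when the acknowledgement target for this severity has been missed."""
    target = POLICIES[severity].ack_minutes
    return target is not None and minutes_since_page >= target
```

Using an `IntEnum` keeps comparisons natural (`Severity.SEV1 < Severity.SEV2`), which is convenient when a rule needs "at least SEV2".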

Classification Criteria — User Impact First

Severity must be driven by observed or likely user impact, not by what the alert says or how interesting the technical problem is.

Ask these questions:

Is a critical user flow unavailable? (Login, checkout, payment, core API) → SEV1/2
What percentage of users are affected? 100% → SEV1; 25–99% → SEV1/2; 1–10% → SEV3; <1% → SEV4 (the 10–25% band is a judgment call between SEV2 and SEV3)
Is data loss occurring or possible? Any data loss risk → minimum SEV2; confirmed loss → SEV1
Does a workaround exist? No workaround → escalate severity; workaround exists → may down-classify
Is there a security dimension? Potential breach → SEV1 regardless of user-visible impact
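The questions above can be sketched as a small decision function. This is an illustration under stated assumptions, not a definitive rule set: `Signals` and `classify` are hypothetical names, ambiguous "SEV1/2" outcomes start at SEV2, the unspecified 10–25% band is treated as SEV3, a missing workaround raises severity by exactly one level, and an existing workaround never lowers severity automatically (that stays a human call).

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass
class Signals:
    percent_users_affected: float
    critical_flow_down: bool = False   # login, checkout, payment, core API
    data_loss: str = "none"            # "none", "possible", or "confirmed"
    workaround_exists: bool = False
    security_breach_suspected: bool = False


def classify(s: Signals) -> Severity:
    # Hard overrides: a potential breach or confirmed data loss is always SEV1.
    if s.security_breach_suspected or s.data_loss == "confirmed":
        return Severity.SEV1

    # Start from the share of users affected.
    if s.percent_users_affected >= 100:
        level = 1
    elif s.percent_users_affected >= 25:
        level = 2   # assumption: "SEV1/2" starts at SEV2
    elif s.percent_users_affected >= 1:
        level = 3   # assumption: the 10-25% band is treated like <10%
    else:
        level = 4

    # A broken critical flow or any data-loss risk means at least SEV2.
    if s.critical_flow_down or s.data_loss == "possible":
        level = min(level, 2)

    # No workaround: escalate one level (assumption: exactly one).
    if not s.workaround_exists:
        level = max(1, level - 1)

    return Severity(level)
```

For example, `classify(Signals(percent_users_affected=40, critical_flow_down=True))` starts at SEV2 and is raised to SEV1 because no workaround exists. SEV5 is omitted because none of the questions above produces a purely cosmetic result.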

The Declare-Early Principle

When in doubt, declare higher severity. The cost of standing down from a SEV1 that turns out to be SEV2 is minimal — the cost of under-responding to an actual SEV1 is significant.

Common failure modes of under-declaring:

On-call engineer investigates quietly for 30 minutes before declaring — customers already hitting the problem are calling support during this delay
Treating a symptom as root cause ("the queue is full" declared as SEV3, but queue fullness was masking a database that ran out of disk — actual SEV1)
Social pressure not to "overreact" — counterproductive and a cultural smell

"Classifying an incident is an estimate under uncertainty. You will sometimes be wrong. The system should make it easy to adjust, not punish the initial call."

Escalation Triggers

Escalation should be automatic, not a judgment call in the heat of an incident:

Trigger	Action
No progress after N minutes at current severity	Escalate to next severity; page additional responders
Impact broader than initially assessed	Reclassify upward; notify new stakeholders
Financial transaction integrity at risk	Escalate to SEV1; engage Finance
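Personal data at risk (GDPR, HIPAA)	Escalate to SEV1; engage Legal and the DPO within the applicable notification deadline (GDPR, for example, allows 72 hours to notify the supervisory authority)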
On-call engineer unable to progress	Page secondary on-call or domain expert
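One way to make these triggers mechanical is to evaluate them on a timer against the current incident state. The following is a minimal sketch: `IncidentState` and `escalation_actions` are hypothetical names, and the "N minutes" threshold stays a parameter because the table does not fix a value. A broader-than-assessed impact only produces an action here, since how far to reclassify depends on the new assessment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IncidentState:
    severity: int                          # 1..5, where 1 is most severe
    minutes_without_progress: float
    impact_broader_than_assessed: bool = False
    financial_integrity_at_risk: bool = False
    personal_data_at_risk: bool = False
    oncall_blocked: bool = False


def escalation_actions(state: IncidentState,
                       no_progress_limit_minutes: float) -> Tuple[int, List[str]]:
    """Return the severity after escalation and the actions to take."""
    severity = state.severity
    actions: List[str] = []

    if state.minutes_without_progress >= no_progress_limit_minutes:
        severity = max(1, severity - 1)    # escalate to the next severity
        actions.append("Page additional responders")
    if state.impact_broader_than_assessed:
        actions.append("Reclassify upward and notify new stakeholders")
    if state.financial_integrity_at_risk:
        severity = 1
        actions.append("Engage Finance")
    if state.personal_data_at_risk:
        severity = 1
        actions.append("Engage Legal and DPO")
    if state.oncall_blocked:
        actions.append("Page secondary on-call or domain expert")

    return severity, actions
```

With a 30-minute limit, a SEV3 that has stalled for 45 minutes comes back as severity 2 with a single action to page additional responders.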

De-Escalation

De-escalation (downgrading severity during an incident) is less common but valid:

Initial pages indicate SEV1; investigation reveals impact is limited to a small cohort → reclassify to SEV2/3 but continue working
Mitigation has reduced user impact even without full resolution → adjust severity to reflect actual current state

Always document severity changes in the incident timeline with the reasoning.
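A simple way to enforce that rule is to make the reasoning a required argument of the reclassification call itself. The sketch below assumes an in-memory incident record; `Incident`, `SeverityChange`, and `reclassify` are hypothetical names, and a real incident tool would persist the timeline rather than hold it in a list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class SeverityChange:
    at: datetime
    old: int
    new: int
    reason: str


@dataclass
class Incident:
    severity: int                                   # 1..5, where 1 is most severe
    timeline: List[SeverityChange] = field(default_factory=list)

    def reclassify(self, new_severity: int, reason: str) -> None:
        """Change severity in either direction and record why."""
        if not 1 <= new_severity <= 5:
            raise ValueError("severity must be between 1 and 5")
        if not reason.strip():
            raise ValueError("a severity change must include the reasoning")
        if new_severity == self.severity:
            return                                  # nothing changed, nothing to log
        self.timeline.append(
            SeverityChange(datetime.now(timezone.utc), self.severity,
                           new_severity, reason.strip())
        )
        self.severity = new_severity


incident = Incident(severity=1)
incident.reclassify(3, "Impact limited to a small beta cohort")
```

The same method serves escalation and de-escalation, so the timeline shows every change in one place.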

Security Incident Classification

Security incidents require a parallel track:

Priority	Description	Response
P1	Active breach; data exfiltration in progress	Immediate: Security, Legal, DPO, Exec
P2	Confirmed unauthorised access; no ongoing exfiltration	Security team + Engineering; preserve evidence
P3	Suspected breach; unverified indicators of compromise	Security team investigates; IR plan activated
P4	Security policy violation; no external threat	Security team + HR as appropriate

Security incidents often require evidence preservation — do not immediately restart or terminate resources. This conflicts with the "restore service first" principle of standard incidents. The security IR runbook must explicitly address this tension.
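The priority table can be read top-down: the first row that matches wins. A minimal sketch of that ordering follows; `SecurityPriority` and `security_priority` are hypothetical names, and the four boolean inputs stand in for whatever evidence your security team actually collects.

```python
from enum import Enum
from typing import Optional


class SecurityPriority(Enum):
    P1 = "Active breach; data exfiltration in progress"
    P2 = "Confirmed unauthorised access; no ongoing exfiltration"
    P3 = "Suspected breach; unverified indicators of compromise"
    P4 = "Security policy violation; no external threat"


def security_priority(*, exfiltration_in_progress: bool = False,
                      unauthorised_access_confirmed: bool = False,
                      indicators_of_compromise: bool = False,
                      policy_violation: bool = False) -> Optional[SecurityPriority]:
    """Return the highest matching priority, or None if nothing applies."""
    if exfiltration_in_progress:
        return SecurityPriority.P1
    if unauthorised_access_confirmed:
        return SecurityPriority.P2
    if indicators_of_compromise:
        return SecurityPriority.P3
    if policy_violation:
        return SecurityPriority.P4
    return None
```

Checking the most severe condition first means an incident with both confirmed access and active exfiltration is never under-classified as P2.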

Incident vs Service Request vs Problem

ITIL 4 distinguishes three things that get mixed up in practice:

Incident — unplanned disruption; goal is restoration
Service Request — a planned, expected request (provision access, deploy a feature) — not an incident
Problem — the root cause underlying one or more incidents; managed separately over a longer timeframe

Problems are investigated through problem management, which draws on the blameless postmortem process covered in a later lesson. A database disk-full issue that caused three incidents in six months is a single problem; each of the three outages is a separate incident.
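To illustrate the distinction, recurring root causes can be surfaced by grouping closed incidents by cause. This is a sketch under assumptions: `problem_candidates` is a hypothetical helper, the incident IDs and cause tags are made up for the example, and the threshold of two incidents is arbitrary.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def problem_candidates(incidents: Iterable[Tuple[str, str]],
                       min_incidents: int = 2) -> Dict[str, List[str]]:
    """Group (incident_id, root_cause) pairs and keep causes that recur."""
    by_cause: Dict[str, List[str]] = defaultdict(list)
    for incident_id, root_cause in incidents:
        by_cause[root_cause].append(incident_id)
    return {cause: ids for cause, ids in by_cause.items()
            if len(ids) >= min_incidents}


# Illustrative data only: three incidents share one underlying problem.
history = [
    ("INC-101", "db-disk-full"),
    ("INC-117", "expired-cert"),
    ("INC-140", "db-disk-full"),
    ("INC-162", "db-disk-full"),
]
candidates = problem_candidates(history)
```

Here `candidates` contains only `db-disk-full` with its three incident IDs; the one-off certificate expiry stays an incident with no problem record.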

Documenting Your Severity Levels

Write down your severity definitions and publish them in your incident management runbook:

# Severity Definitions — Acme Corp
Last reviewed: 2026-Q1

## SEV1 — Critical (respond immediately 24/7)
- Full outage affecting >10% of users OR
- Any payment processing failure OR
- Any suspected data breach OR
- Any data loss (even partial)

## SEV2 — Major (respond within 30 min 24/7)
- Core feature unavailable for any users OR
- Degraded performance >2x normal latency for >25% of users OR
- Third-party integration completely down with no workaround

## SEV3 — Minor (respond within 4 hours during business hours)
- Non-core feature degraded OR
- Workaround available for most users OR
- <5% of users affected with no financial impact

## SEV4 / SEV5 — follow normal sprint process

Concrete definitions — with named examples from your own service — reduce classification disputes during high-stress incidents. The next lesson addresses who responds to these incidents and how the on-call system is designed.