On-Call Design and Sustainable Rotations — Incident Management & On-Call Engineering | CertQnA

On-call is the mechanism by which an organisation ensures that someone is always available to respond to incidents. Done well, it is a manageable rotation with a defined scope, reasonable alert volume, and clear escalation paths. Done poorly, it is a grind that burns out engineers and drives attrition in exactly the people you most need to retain.

On-Call Fundamentals

Three questions define any on-call system:

Who is on-call? Who is the primary responder for this service right now?
What are they responsible for? Which alerts route to them, and what is their scope of action?
What happens if they don't respond? The escalation policy.

Rotation Models

Model	Structure	Best for
Follow-the-sun	Teams in different time zones cover their own business hours	Global teams with >3 time zone coverage
Weekly primary	One engineer takes all pages for one week	Small teams; predictable schedule
Weekly primary + secondary	Second engineer takes escalations; rotates independently	Most common; provides backup
Bi-weekly shift	Two weeks on, compensated time off	Services with low page volume
Squad-based	Entire sub-team rotates; individuals cover sub-week shifts	High-load services needing deep domain knowledge
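As one way to picture the "weekly primary + secondary" model, here is a minimal sketch that generates a rota from a roster. The function name, the `secondary_offset` parameter, and the roster names are all hypothetical. With an offset of 1, last week's primary becomes this week's secondary, which is only one of several reasonable ways to rotate the backup.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks, secondary_offset=1):
    """Yield (week_start, primary, secondary) for a weekly primary + secondary rota.

    secondary_offset (hypothetical knob) is how many roster positions the
    secondary trails the primary by. It must not be a multiple of the roster
    size, or one person would hold both roles in the same week.
    """
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    if secondary_offset % n == 0:
        raise ValueError("secondary_offset would make primary == secondary")
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week - secondary_offset) % n]
        yield start + timedelta(weeks=week), primary, secondary


roster = ["ana", "ben", "chen", "dee"]  # placeholder names
schedule = list(weekly_rotation(roster, date(2026, 6, 1), weeks=4))
for week_start, primary, secondary in schedule:
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

Having the previous primary serve as secondary means the backup already has context on anything that happened the week before, at the cost of two consecutive weeks of some on-call duty.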

Rotation Size Rules of Thumb

A sustainable rotation requires enough engineers to keep each person on-call infrequently enough to recover between shifts. Google's SRE book puts the minimum at 8 engineers for a single-site team providing 24×7 coverage, assuming week-long shifts with both a primary and a secondary on-call, so that each engineer is on-call roughly one week in four. In practice:

5–6 engineers: acceptable for business-hours-only or very low page volume
8–12 engineers: sustainable 24×7 with weekly shifts and adequate recovery
<4 engineers: high burnout risk; consider shared rotation with an adjacent team or an on-call vendor supplement

If your team is too small for a safe rotation, that is a hiring or architectural problem — not an on-call scheduling problem.
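The arithmetic behind these rules of thumb is simple enough to sketch. The helper below is illustrative only; `oncall_load` and its parameters are hypothetical names, and it assumes every engineer takes an equal share of shifts.

```python
def oncall_load(team_size, shift_weeks=1, roles=1):
    """Return (weeks_per_cycle, fraction_of_time_on_call) for one engineer.

    roles is the number of on-call seats filled at any moment:
    1 for primary only, 2 for primary + secondary.
    """
    if team_size < roles:
        raise ValueError("team is smaller than the number of on-call seats")
    fraction = roles / team_size                    # share of weeks spent on-call
    cycle_weeks = team_size * shift_weeks / roles   # weeks from one shift start to the next
    return cycle_weeks, fraction


for size in (4, 6, 8, 12):
    cycle, fraction = oncall_load(size, roles=2)
    print(f"{size} engineers: on-call every {cycle:g} weeks ({fraction:.0%} of the time)")
```

With primary and secondary seats, 8 engineers works out to one week in four, while 4 engineers means each person holds a pager every other week.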

Alert Quality: The Most Important Variable

The single greatest determinant of on-call quality is alert actionability. Every alert that pages an engineer at 3am must meet all three criteria:

Urgency — it requires action faster than could wait for business hours
Actionability — the responder can take a specific action to mitigate or resolve it
Accuracy — it reflects a real problem, not a spurious signal

Alerts that fail these criteria are noise. Track:

Actionability rate: what percentage of pages required actual action? Target: >90%
Pages per on-call shift: Google's SRE book recommends a maximum of 2 incidents per 12-hour shift as a working limit; more than 5 is a reliability emergency
Out-of-hours pages: how many pages fire between 11pm and 7am local time? Each one is a sleep disruption

"An alert that fires when there is nothing actionable to do is not an alert — it is noise wearing a pager."
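The three metrics above can be computed from a simple list of page records. This is a minimal sketch: the `Page` type and its fields are hypothetical, and `fired_at` is assumed to already be in the responder's local time. Note that the 11pm to 7am window wraps past midnight, so the check needs an OR rather than a simple range test.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    fired_at: datetime      # assumed to be the responder's local time
    was_actionable: bool    # recorded by the responder after the fact


def is_out_of_hours(ts, start_hour=23, end_hour=7):
    # The window wraps past midnight, so either side of it counts.
    return ts.hour >= start_hour or ts.hour < end_hour


def shift_summary(pages):
    total = len(pages)
    actionable = sum(p.was_actionable for p in pages)
    return {
        "pages": total,
        "actionability_pct": round(100.0 * actionable / total, 1) if total else None,
        "out_of_hours": sum(is_out_of_hours(p.fired_at) for p in pages),
    }


pages = [
    Page(datetime(2026, 5, 27, 3, 10), True),
    Page(datetime(2026, 5, 27, 14, 0), False),
    Page(datetime(2026, 5, 27, 23, 30), True),
    Page(datetime(2026, 5, 28, 6, 59), True),
]
summary = shift_summary(pages)
print(summary)
```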

Escalation Policies

An escalation policy defines what happens automatically when the primary responder doesn't acknowledge within N minutes:

# Example PagerDuty escalation policy: Payments Service
Layer 1: Primary on-call  — ack within 5 min
Layer 2: Secondary on-call — if no ack after 5 min
Layer 3: Team lead         — if no ack after 10 min
Layer 4: Engineering manager — if no ack after 15 min (SEV1/2 only)
Repeat: 3 cycles, then page the entire on-call channel

Key design decisions:

Acknowledgement vs resolution. Escalation triggers on lack of acknowledgement — not lack of resolution. Once someone acknowledges, they own it until they explicitly hand off.
Escalation scope by severity. SEV4 and SEV5 should never escalate beyond Layer 1. SEV1 may need to reach VP/CEO for communication purposes.
Holiday and override coverage. Define who covers during planned absences. Paging tools such as PagerDuty support schedule overrides for exactly this purpose.
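To make the timing and severity scoping concrete, the sketch below models the example policy as data and answers "who has been paged so far?" for an unacknowledged alert. It is not PagerDuty's API; the names are hypothetical, the SEV3 cap is inferred from the example (Layer 4 is SEV1/2 only), and the repeat cycles are left out for brevity.

```python
# (minutes without acknowledgement, who gets paged), taken from the example policy
ESCALATION_LAYERS = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (10, "team lead"),
    (15, "engineering manager"),
]

# Highest layer each severity may reach. SEV4/5 stay at Layer 1 and
# Layer 4 is SEV1/2 only; the SEV3 cap of 3 is an inference from that.
MAX_LAYER_BY_SEVERITY = {1: 4, 2: 4, 3: 3, 4: 1, 5: 1}


def layers_paged(minutes_unacked, severity):
    """Return everyone notified so far for a page nobody has acknowledged."""
    cap = MAX_LAYER_BY_SEVERITY[severity]
    return [
        name
        for layer, (after_minutes, name) in enumerate(ESCALATION_LAYERS, start=1)
        if layer <= cap and minutes_unacked >= after_minutes
    ]


print(layers_paged(12, severity=1))
print(layers_paged(20, severity=4))
```

Because escalation keys off acknowledgement, a real implementation would stop evaluating this as soon as any layer acknowledges the page.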

On-Call Compensation

On-call work is work. Standard models:

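Flat stipend per rotation. Simple and predictable, but pay is the same whether the shift was quiet or brutal, so it does not reflect actual load.
Per-page compensation. Ties pay to actual disruption and makes noisy alerts visibly expensive to the organisation, but may disincentivise alert remediation (more noise = more pay) and can lead to gaming.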
Compensatory time off. Common in European organisations subject to working time regulations. For example, one hour of after-hours paging earns one hour of comp time.
Career recognition. On-call experience reflected in promotion criteria and performance review.

The right answer depends on your employment agreements, jurisdiction, and culture. The wrong answer is treating on-call as a free obligation.

On-Call Handoffs

The end of an on-call shift is a knowledge transfer moment. A good handoff:

Summarises any open or recently resolved incidents
Notes any known fragile systems or upcoming deployments
Documents any temporary mitigations still in place (a feature flag that was toggled but not explained is a trap for the next responder)
Passes ownership explicitly — the outgoing responder should not feel responsible once the handoff is complete

# On-Call Handoff Template
Shift: 2026-05-27 08:00 to 2026-05-28 08:00
Handing off to: @alice

Open incidents: None

Recent incidents:
- INC-2934 (SEV3) — Search API slowdown. Resolved at 14:32. Root cause: memory leak in indexer;
  workaround: rolling restart scheduled daily at 03:00 until fix deployed (INC-2934 action item).

Temporary states:
- Feature flag 'new-checkout-flow' is OFF in production. DO NOT enable without Product sign-off.
- us-east-2 RDS has elevated CPU (85%). Being investigated by DB team; alert threshold raised to 95% temporarily.

Upcoming changes:
- Payments service deploy at 10:00 (Alice's shift). Runbook: /runbooks/payments-deploy.md
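If handoffs are generated by tooling rather than written by hand, the template can be treated as a small data structure. The sketch below is one possible shape; the `Handoff` class and its field names are hypothetical and simply mirror the template's sections.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    shift: str
    handing_off_to: str
    open_incidents: list[str] = field(default_factory=list)
    recent_incidents: list[str] = field(default_factory=list)
    temporary_states: list[str] = field(default_factory=list)
    upcoming_changes: list[str] = field(default_factory=list)

    def render(self) -> str:
        def section(title, items):
            # Print "None" instead of omitting the section, so an empty
            # section reads as a deliberate statement, not an oversight.
            if not items:
                return f"{title}: None"
            return "\n".join([f"{title}:", *(f"- {item}" for item in items)])

        header = f"Shift: {self.shift}\nHanding off to: {self.handing_off_to}"
        return "\n\n".join([
            header,
            section("Open incidents", self.open_incidents),
            section("Recent incidents", self.recent_incidents),
            section("Temporary states", self.temporary_states),
            section("Upcoming changes", self.upcoming_changes),
        ])


note = Handoff(
    shift="2026-05-27 08:00 to 2026-05-28 08:00",
    handing_off_to="@alice",
    temporary_states=["Feature flag 'new-checkout-flow' is OFF in production."],
)
text = note.render()
print(text)
```

Always rendering all four sections means the incoming responder never has to wonder whether a missing heading means "nothing to report" or "forgot to fill it in".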

Onboarding to On-Call

Dropping a new engineer into primary on-call without preparation creates incidents of its own. A graduated programme:

Shadow rotation: new engineer shadows the primary responder on their tooling for 1–2 rotations, observing without acting
Reverse shadow: the experienced engineer shadows the new engineer, who leads responses with backup immediately available
Graduated primary: new engineer takes primary with an experienced secondary available on rapid escalation
Full primary: standard rotation once competence is verified

Document the required readiness criteria: familiarity with runbooks, completed incident simulation exercises, access to all relevant dashboards and tools.

Measuring On-Call Health

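# Weekly on-call metrics dashboard (example queries, MySQL syntax)

-- Pages per engineer per week, with out-of-hours pages (23:00 to 07:00) counted separately
SELECT engineer,
       YEARWEEK(fired_at) AS year_week,
       COUNT(*) AS pages,
       -- The window wraps past midnight, so BETWEEN 23 AND 7 would never match
       SUM(CASE WHEN HOUR(fired_at) >= 23 OR HOUR(fired_at) < 7 THEN 1 ELSE 0 END) AS oop_pages
FROM pagerduty_incidents
GROUP BY engineer, YEARWEEK(fired_at)
ORDER BY year_week, engineer;

-- Alert actionability over the last 30 days
SELECT COUNT(*) AS total,
       SUM(CASE WHEN was_actionable THEN 1 ELSE 0 END) AS actionable,
       ROUND(100.0 * SUM(CASE WHEN was_actionable THEN 1 ELSE 0 END) / COUNT(*), 1) AS pct
FROM pagerduty_incidents
WHERE fired_at > NOW() - INTERVAL 30 DAY;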

Review these metrics monthly. If pages-per-shift or out-of-hours pages trend upward, treat it as a production incident of its own — it requires investigation and action items.
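"Trending upward" can be made less subjective with a simple slope check over recent periods. The sketch below fits a least-squares line to a series of counts; the function names and the `min_slope` threshold are hypothetical, and the right threshold depends on your normal page volume.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (units per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def is_trending_up(values, min_slope=0.5):
    # min_slope is an illustrative threshold: flag when the metric grows
    # by at least this many pages per period on average.
    return trend_slope(values) >= min_slope


weekly_out_of_hours_pages = [3, 4, 6, 9]  # made-up sample data
print(trend_slope(weekly_out_of_hours_pages), is_trending_up(weekly_out_of_hours_pages))
```

A fitted slope is less sensitive to a single bad week than comparing only the first and last values, which makes it a reasonable trigger for opening an investigation.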

With rotation and escalation defined, the next lesson addresses how to actually run the response when an incident fires.