On-call is the mechanism by which an organisation ensures that someone is always available to respond to incidents. Done well, it is a manageable rotation with a defined scope, reasonable alert volume, and clear escalation paths. Done poorly, it is a grind that burns out engineers and drives attrition in exactly the people you most need to retain.
On-Call Fundamentals
Three questions define any on-call system:
- Who is on-call? Who is the primary responder for this service right now?
- What are they responsible for? Which alerts route to them, and what is their scope of action?
- What happens if they don't respond? The escalation policy.
Rotation Models
| Model | Structure | Best for |
|---|---|---|
| Follow-the-sun | Teams in different time zones cover their own business hours | Global teams spread across more than three time zones |
| Weekly primary | One engineer takes all pages for one week | Small teams; predictable schedule |
| Weekly primary + secondary | Second engineer takes escalations; rotates independently | Most common; provides backup |
| Bi-weekly shift | Two weeks on, compensated time off | Services with low page volume |
| Squad-based | Entire sub-team rotates; individuals cover sub-week shifts | High-load services needing deep domain knowledge |
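As a rough illustration of the weekly primary + secondary model, the sketch below generates a schedule in Python. The function name `weekly_rotation`, the engineer names, and the `secondary_offset` parameter are assumptions made for the example. The table describes a secondary that rotates independently; here that is simplified to a fixed offset from the primary.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks, secondary_offset=1):
    """Yield (week_start, primary, secondary) for each week.

    The secondary is the engineer `secondary_offset` places after the
    primary in the list, so nobody holds both roles in the same week as
    long as the offset is not a multiple of the team size.
    """
    n = len(engineers)
    if n < 2 or secondary_offset % n == 0:
        raise ValueError("need at least two engineers and a non-zero offset")
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + secondary_offset) % n]
        yield start + timedelta(weeks=week), primary, secondary


# Hypothetical four-person team starting on a Monday.
schedule = list(weekly_rotation(["ana", "ben", "chen", "dee"], date(2026, 6, 1), 4))
for week_start, primary, secondary in schedule:
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

With an offset of 1, this week's secondary becomes next week's primary. That gives continuity but also two consecutive weeks on call; a larger offset spaces the two duties apart.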
Rotation Size Rules of Thumb
A sustainable rotation requires enough engineers to keep each person on-call infrequently enough to recover between shifts. Google's SRE book derives a minimum of 8 engineers for a single-site 24×7 rotation, assuming week-long shifts with a primary and a secondary always on call, so that each engineer is on call roughly one week in four. In practice:
- <4 engineers: high burnout risk; consider shared rotation with an adjacent team or an on-call vendor supplement
- 5–6 engineers: acceptable for business-hours-only or very low page volume
- 8–12 engineers: sustainable 24×7 with weekly shifts and adequate recovery
If your team is too small for a safe rotation, that is a hiring or architectural problem — not an on-call scheduling problem.
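The arithmetic behind these rules of thumb can be sketched in a few lines. This assumes week-long shifts shared evenly across the team; the function name `on_call_load` and the `roles_per_week` parameter are illustrative, not taken from any tool.

```python
def on_call_load(team_size, roles_per_week=2):
    """Return (weeks_between_shifts, fraction_of_time_on_call) for one engineer.

    Assumes week-long shifts shared evenly, with `roles_per_week` people
    on call at once (for example one primary and one secondary).
    """
    if team_size < roles_per_week:
        raise ValueError("team is smaller than the number of on-call roles")
    fraction = roles_per_week / team_size
    weeks_between = team_size / roles_per_week
    return weeks_between, fraction


for size in (3, 6, 8, 12):
    weeks_between, fraction = on_call_load(size)
    print(f"{size} engineers: on call 1 week in {weeks_between:g} ({fraction:.0%} of the time)")
```

With 8 engineers and two roles, each person is on call one week in four. With 3 engineers, each is on call two weeks in every three, which is why very small rotations carry a high burnout risk.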
Alert Quality: The Most Important Variable
The single greatest determinant of on-call quality is alert actionability. Every alert that pages an engineer at 3am must meet all three criteria:
- Urgency: it requires action that cannot wait until business hours
- Actionability: the responder can take a specific action to mitigate or resolve it
- Accuracy: it reflects a real problem, not a spurious signal
Alerts that fail these criteria are noise. Track:
- Actionability rate: what percentage of pages required actual action? Target: >90%
- Pages per on-call shift: Google's SRE book sets a working limit of at most 2 incidents per 12-hour shift; >5 is a reliability emergency
- Out-of-hours pages: how many pages fire between 11pm and 7am local time? Each one is a sleep disruption
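The three metrics above can be computed from a simple list of pages. The sketch below is one way to do it, assuming each page records a local timestamp and a responder-supplied `was_actionable` flag; the `Page` class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    fired_at: datetime      # assumed to be in the responder's local time
    was_actionable: bool    # recorded by the responder after the fact


def is_out_of_hours(ts, start_hour=23, end_hour=7):
    """True if ts falls in the overnight window, which wraps past midnight."""
    return ts.hour >= start_hour or ts.hour < end_hour


def shift_summary(pages):
    """Summarise page count, actionability rate, and out-of-hours pages."""
    total = len(pages)
    actionable = sum(p.was_actionable for p in pages)
    return {
        "pages": total,
        "actionability_pct": round(100.0 * actionable / total, 1) if total else None,
        "out_of_hours": sum(is_out_of_hours(p.fired_at) for p in pages),
    }


# Illustrative data only.
pages = [
    Page(datetime(2026, 5, 27, 3, 12), True),
    Page(datetime(2026, 5, 27, 14, 5), False),
    Page(datetime(2026, 5, 27, 23, 40), True),
]
summary = shift_summary(pages)
print(summary)
```

Note that the overnight window needs an `or` test rather than a single range check, because 11pm to 7am crosses midnight.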
"An alert that fires when there is nothing actionable to do is not an alert — it is noise wearing a pager."
Escalation Policies
An escalation policy defines what happens automatically when the primary responder doesn't acknowledge within N minutes:
# Example PagerDuty escalation policy: Payments Service
Layer 1: Primary on-call — ack within 5 min
Layer 2: Secondary on-call — if no ack after 5 min
Layer 3: Team lead — if no ack after 10 min
Layer 4: Engineering manager — if no ack after 15 min (SEV1/2 only)
Repeat: 3 cycles, then page the entire on-call channel
Key design decisions:
- Acknowledgement vs resolution. Escalation triggers on lack of acknowledgement — not lack of resolution. Once someone acknowledges, they own it until they explicitly hand off.
- Escalation scope by severity. SEV4 and SEV5 should never escalate beyond Layer 1. SEV1 may need to reach VP/CEO for communication purposes.
- Holiday and override coverage. Define who covers during planned absences, and record that cover as an explicit schedule override in your paging tool rather than as an informal agreement.
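The timing and severity rules can be sketched as a small lookup. This is not how any paging tool implements escalation; `POLICY` and `layers_paged` are hypothetical names, the thresholds mirror the example policy, and applying Layers 2 and 3 to SEV3 is an assumption the example does not state. Repeat cycles are not modelled.

```python
# Hypothetical mirror of the example policy: (layer name, minutes without
# acknowledgement before this layer is paged, severities it applies to).
POLICY = [
    ("primary on-call", 0, {1, 2, 3, 4, 5}),
    ("secondary on-call", 5, {1, 2, 3}),   # SEV3 coverage is an assumption
    ("team lead", 10, {1, 2, 3}),          # SEV3 coverage is an assumption
    ("engineering manager", 15, {1, 2}),   # SEV1/2 only, as in the example
]


def layers_paged(minutes_unacked, severity, policy=POLICY):
    """Return the layers that should have been paged so far.

    Only meaningful while the alert is unacknowledged: once someone
    acknowledges, escalation stops and that person owns the incident.
    """
    return [
        name
        for name, after_minutes, severities in policy
        if minutes_unacked >= after_minutes and severity in severities
    ]


print(layers_paged(12, severity=1))
print(layers_paged(60, severity=4))
```

Keeping severity scope in the same table as the timings makes it easy to check that SEV4 and SEV5 never reach beyond the primary.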
On-Call Compensation
On-call work is work. Standard models:
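- Flat stipend per rotation. Simple and predictable, but pay does not track load: a noisy week earns the same as a quiet one.
- Per-page compensation. Ties pay to actual disruption and gives the organisation a cost incentive to cut noise, but can disincentivise responders from fixing noisy alerts (more pages = more pay) and invite gaming.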
- Compensatory time off. Common in European organisations subject to working time regulations. For example, one hour of after-hours paging earns one hour of comp time.
- Career recognition. On-call experience reflected in promotion criteria and performance review.
The right answer depends on your employment agreements, jurisdiction, and culture. The wrong answer is treating on-call as a free obligation.
On-Call Handoffs
The end of an on-call shift is a knowledge transfer moment. A good handoff:
- Summarises any open or recently resolved incidents
- Notes any known fragile systems or upcoming deployments
- Documents any temporary mitigations still in place (a feature flag that was toggled but not explained is a trap for the next responder)
- Passes ownership explicitly — the outgoing responder should not feel responsible once the handoff is complete
# On-Call Handoff Template
Shift: 2026-05-27 08:00 to 2026-05-28 08:00
Handing off to: @alice
Open incidents: None
Recent incidents:
- INC-2934 (SEV3) — Search API slowdown. Resolved at 14:32. Root cause: memory leak in indexer;
workaround: rolling restart scheduled daily at 03:00 until fix deployed (INC-2934 action item).
Temporary states:
- Feature flag 'new-checkout-flow' is OFF in production. DO NOT enable without Product sign-off.
- us-east-2 RDS has elevated CPU (85%). Being investigated by DB team; alert threshold raised to 95% temporarily.
Upcoming changes:
- Payments service deploy at 10:00 (Alice's shift). Runbook: /runbooks/payments-deploy.md
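A handoff note is easy to leave incomplete at the end of a long shift. One possible safeguard, sketched below, is a check that every section of the template appears before the note is sent. The section labels come from the template above; the function name `missing_sections` and the idea of linting the note are assumptions.

```python
REQUIRED_SECTIONS = (
    "Handing off to",
    "Open incidents",
    "Recent incidents",
    "Temporary states",
    "Upcoming changes",
)


def missing_sections(handoff_text, required=REQUIRED_SECTIONS):
    """Return required section labels that do not start any line of the note."""
    lines = [line.strip() for line in handoff_text.splitlines()]
    return [
        label
        for label in required
        if not any(line.startswith(label + ":") for line in lines)
    ]


# A deliberately incomplete note for illustration.
note = """Shift: 2026-05-27 08:00 to 2026-05-28 08:00
Handing off to: @alice
Open incidents: None
Temporary states:
- Feature flag 'new-checkout-flow' is OFF in production.
"""
print(missing_sections(note))
```

This only checks that a section is present, not that its content is useful, so it complements a conversation between the two responders and does not replace one.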
Onboarding to On-Call
Dropping a new engineer into primary on-call without preparation creates incidents of its own. A graduated programme:
- Shadow rotation: the new engineer shadows the primary responder for 1–2 rotations, following each response in the same tooling but observing without acting
- Reverse shadow: the experienced engineer shadows the new engineer, who leads responses with backup immediately available
- Graduated primary: new engineer takes primary with an experienced secondary available on rapid escalation
- Full primary: standard rotation once competence is verified
Document the required readiness criteria: familiarity with runbooks, completed incident simulation exercises, access to all relevant dashboards and tools.
Measuring On-Call Health
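# Weekly on-call metrics dashboard (example queries)
-- Pages per engineer per week
-- Out-of-hours is 23:00 to 06:59. The window wraps past midnight, so it needs OR;
-- BETWEEN 23 AND 7 would never match. hour() uses the stored time zone, so convert
-- fired_at to the responder's local time first if the two differ.
SELECT engineer,
       week(fired_at) AS week_no,
       COUNT(*) AS pages,
       SUM(CASE WHEN hour(fired_at) >= 23 OR hour(fired_at) < 7 THEN 1 ELSE 0 END) AS oop_pages
FROM pagerduty_incidents
GROUP BY engineer, week(fired_at);
-- Alert actionability over the last 30 days
SELECT COUNT(*) AS total,
       SUM(CASE WHEN was_actionable THEN 1 ELSE 0 END) AS actionable,
       ROUND(100.0 * SUM(CASE WHEN was_actionable THEN 1 ELSE 0 END) / COUNT(*), 1) AS pct
FROM pagerduty_incidents
WHERE created_at > NOW() - INTERVAL 30 DAY;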
Review these metrics monthly. If pages-per-shift or out-of-hours pages trend upward, treat it as a production incident of its own — it requires investigation and action items.
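What counts as "trending upward" is a judgement call. As one possible rule, the sketch below compares the mean of the most recent weeks with the mean of the weeks before them. The function name `trending_up`, the 4-week window, and the 10% tolerance are assumptions chosen for illustration, not recommended thresholds.

```python
from statistics import mean


def trending_up(weekly_counts, window=4, tolerance=0.10):
    """True if the mean of the last `window` weeks exceeds the mean of the
    preceding `window` weeks by more than `tolerance` (a fraction)."""
    if len(weekly_counts) < 2 * window:
        raise ValueError("need at least two full windows of data")
    recent = mean(weekly_counts[-window:])
    previous = mean(weekly_counts[-2 * window:-window])
    return recent > previous * (1 + tolerance)


# Illustrative weekly page counts, oldest first.
print(trending_up([3, 4, 3, 4, 5, 6, 6, 7]))
print(trending_up([5, 5, 5, 5, 5, 5, 5, 5]))
```

Comparing two windows rather than two single weeks smooths out one-off spikes, so a single bad week does not trigger an investigation on its own.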
With rotation and escalation defined, the next lesson addresses how to actually run the response when an incident fires.