Incident management tooling has consolidated around a few dominant platforms. Understanding how to configure and use them well is a practical skill — misconfigured escalation policies cause missed alerts; a poorly maintained Statuspage erodes customer trust. This lesson covers the operational essentials.
The Tooling Stack
| Layer | Tools | Purpose |
|---|---|---|
| Observability | Prometheus, Datadog, New Relic, CloudWatch | Emit alerts based on metrics/logs/traces |
| Alert routing | PagerDuty, Opsgenie, VictorOps (now Splunk On-Call) | Route alerts to the right on-call person; escalate; deduplicate |
| Incident coordination | Slack + PagerDuty, Incident.io, Rootly, FireHydrant | Incident channel management; timeline; task tracking |
| Customer communication | Atlassian Statuspage, Instatus, Cachet | Public-facing status page with incident updates |
| Postmortem | Confluence, Notion, Jira, FireHydrant | Document and track postmortem action items |
PagerDuty Core Concepts
Services
A Service in PagerDuty represents a logical system or team. Alerts from your monitoring stack integrate with a service. Each service has:
- An integration URL — where alerts are posted
- An escalation policy — who gets paged and in what order
- An alert grouping configuration — how alerts are deduplicated
# Example: Terraform configuration for a PagerDuty service
resource "pagerduty_service" "payments" {
  name                    = "Payments Service"
  auto_resolve_timeout    = 14400 # 4 hours: incidents still open after this long are resolved automatically
  acknowledgement_timeout = 600   # 10 minutes: an acknowledged but unresolved incident returns to "triggered"
  escalation_policy       = pagerduty_escalation_policy.payments.id
  alert_creation          = "create_alerts_and_incidents"
  alert_grouping_parameters {
    type = "intelligent" # PagerDuty's machine-learning-based grouping
  }
}
# Look up the vendor ID that the integration below refers to
data "pagerduty_vendor" "prometheus" {
  name = "Prometheus"
}
resource "pagerduty_service_integration" "prometheus" {
  name    = "Prometheus"
  service = pagerduty_service.payments.id
  vendor  = data.pagerduty_vendor.prometheus.id
}
Escalation Policies
resource "pagerduty_escalation_policy" "payments" {
  name      = "Payments On-Call"
  num_loops = 3 # repeat entire policy 3 times before stopping
  rule {
    escalation_delay_in_minutes = 5 # time before escalating to next level
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.payments_primary.id
    }
  }
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.payments_secondary.id
    }
  }
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = data.pagerduty_user.eng_manager.id
    }
  }
}
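To make the timing of this policy concrete, the notification schedule can be worked out with a few lines of Python. This is only a sketch, not PagerDuty code: `escalation_timeline` is a hypothetical helper, and it assumes that nobody acknowledges the incident and that `num_loops` counts repeats after the first pass (so `num_loops = 3` allows up to four passes through the rules).

```python
from typing import List, Tuple


def escalation_timeline(delays_minutes: List[int], num_loops: int) -> List[Tuple[int, int, int]]:
    """Return (minute, pass_number, level) for each notification of an
    incident that is never acknowledged.

    Assumption: num_loops is the number of repeats after the first pass,
    so the rules are walked num_loops + 1 times in total.
    """
    timeline = []
    minute = 0
    for pass_number in range(1, num_loops + 2):
        for level, delay in enumerate(delays_minutes, start=1):
            timeline.append((minute, pass_number, level))
            # The incident stays at this level for `delay` minutes before moving on.
            minute += delay
    return timeline


if __name__ == "__main__":
    # Delays taken from the three rules in the policy above.
    for minute, pass_number, level in escalation_timeline([5, 10, 15], num_loops=3):
        print(f"t+{minute:>3} min  pass {pass_number}  level {level}")
```

Under these assumptions the secondary schedule is paged 5 minutes after the incident triggers, the engineering manager at 15 minutes, and the primary on-call is paged again at 30 minutes when the policy loops.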
Schedules
resource "pagerduty_schedule" "payments_primary" {
  name      = "Payments Primary On-Call"
  time_zone = "UTC"
  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-01-06T00:00:00+00:00"
    rotation_virtual_start       = "2026-01-06T00:00:00+00:00"
    rotation_turn_length_seconds = 604800 # 1 week
    users = [
      data.pagerduty_user.alice.id,
      data.pagerduty_user.bob.id,
      data.pagerduty_user.carol.id,
      data.pagerduty_user.dave.id,
    ]
  }
}
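The rotation arithmetic behind this schedule is simple enough to check by hand. The sketch below is a hypothetical helper rather than anything PagerDuty ships, and it assumes a single layer with no overrides or time-of-day restrictions. It counts how many whole turns have elapsed since the rotation start and indexes into the user list.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence

# Values mirror the schedule above; the plain names stand in for user IDs.
ROTATION_START = datetime(2026, 1, 6, tzinfo=timezone.utc)
TURN_LENGTH_SECONDS = 604800  # 1 week
USERS = ["alice", "bob", "carol", "dave"]


def on_call_user(
    at: datetime,
    rotation_start: datetime = ROTATION_START,
    turn_length_seconds: int = TURN_LENGTH_SECONDS,
    users: Sequence[str] = USERS,
) -> str:
    """Return who is on call at `at` for a single-layer rotation."""
    if at < rotation_start:
        raise ValueError("time is before the rotation starts")
    # Number of complete turns since the rotation began.
    turns_elapsed = (at - rotation_start) // timedelta(seconds=turn_length_seconds)
    return users[turns_elapsed % len(users)]


if __name__ == "__main__":
    print(on_call_user(datetime(2026, 1, 20, 9, 0, tzinfo=timezone.utc)))
```

With four users and one-week turns, each engineer is on call one week in four, and the rotation returns to the first user 28 days after the start.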
Opsgenie — Key Differences from PagerDuty
| | PagerDuty | Opsgenie |
|---|---|---|
| Parent company | PagerDuty (independent) | Atlassian |
| Integration | Broad ecosystem; many native integrations | Deep Jira/Confluence integration |
| Routing rules | Event Orchestration (formerly Event Rules / Rulesets) | Alert policies |
| Maintenance windows | Maintenance windows on service level | Maintenance on alert policy level |
| Pricing | Per user; scales steeply | More affordable at scale; bundled with Atlassian |
If your organisation already uses Jira and Confluence, Opsgenie often integrates more naturally. Choose based on ecosystem fit, not feature lists — both are mature and capable.
Alert Routing and Noise Reduction
The most impactful configuration work is reducing noise:
Deduplication
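# PagerDuty Event Orchestration rules (the successor to Event Rules / Rulesets)
# Simplified rule fragments: in the Terraform provider, each rule sits inside the
# `set` block of an event orchestration resource (router, global, or service).
# Global or service orchestration rule: if the event source contains "staging" → suppress
rule {
  condition {
    expression = "event.source matches part 'staging'"
  }
  actions {
    suppress = true # don't page anyone for staging alerts
  }
}
# Router rule: if the event summary contains "PaymentsService" → route to the Payments service
rule {
  condition {
    expression = "event.summary matches part 'PaymentsService'"
  }
  actions {
    route_to = pagerduty_service.payments.id
  }
}
# Service orchestration rule: give those events priority P2
rule {
  condition {
    expression = "event.summary matches part 'PaymentsService'"
  }
  actions {
    priority = data.pagerduty_priority.p2.id # expects a priority ID, not the label "P2"
  }
}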
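Deduplication itself depends on the `dedup_key` field of the PagerDuty Events API v2: events that arrive with the same key are attached to the same open alert instead of creating a new one, and a later `resolve` event with that key closes it. The sketch below builds such payloads without sending them. Deriving the key from a hash of the service and alert name is an assumption made for illustration, and `make_dedup_key` and `build_event` are hypothetical helper names.

```python
import hashlib
from typing import Dict

# Events API v2 endpoint; the payloads built below would be POSTed here as JSON.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

VALID_ACTIONS = {"trigger", "acknowledge", "resolve"}


def make_dedup_key(service: str, alert_name: str) -> str:
    """Derive a stable key so repeated firings of one alert collapse together.

    Assumption: service + alert name identifies "the same problem".
    """
    raw = f"{service}:{alert_name}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


def build_event(
    routing_key: str,
    service: str,
    alert_name: str,
    summary: str,
    severity: str = "error",
    action: str = "trigger",
) -> Dict[str, object]:
    """Build an Events API v2 payload (not sent anywhere by this function)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown event_action: {action}")
    return {
        "routing_key": routing_key,  # the service integration key
        "event_action": action,
        "dedup_key": make_dedup_key(service, alert_name),
        "payload": {
            "summary": summary,
            "source": service,
            "severity": severity,  # critical, error, warning, or info
        },
    }


if __name__ == "__main__":
    first = build_event("ROUTING_KEY", "payments-api", "HighErrorRate", "5xx rate above 2%")
    repeat = build_event("ROUTING_KEY", "payments-api", "HighErrorRate", "5xx rate above 4%")
    print(first["dedup_key"] == repeat["dedup_key"])
```

Choosing the key deliberately matters: a key that is too specific (for example, one that includes a timestamp) defeats deduplication, while one that is too broad merges unrelated problems into a single alert.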
Maintenance Windows
# Suppress alerts during planned maintenance
resource "pagerduty_maintenance_window" "deploy_window" {
  start_time  = "2026-06-01T02:00:00+00:00"
  end_time    = "2026-06-01T04:00:00+00:00"
  description = "Scheduled database migration"
  services    = [pagerduty_service.payments.id]
}
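A deploy script can guard against running outside the agreed window with a check like the one below. This is a minimal sketch under stated assumptions: `in_maintenance` is a hypothetical helper, the window is hard-coded to match the example above rather than read from PagerDuty, and the end time is treated as exclusive.

```python
from datetime import datetime
from typing import Iterable, Tuple

Window = Tuple[datetime, datetime]

# Mirrors the maintenance window defined above (hard-coded for illustration).
WINDOWS = [
    (
        datetime.fromisoformat("2026-06-01T02:00:00+00:00"),
        datetime.fromisoformat("2026-06-01T04:00:00+00:00"),
    )
]


def in_maintenance(at: datetime, windows: Iterable[Window] = WINDOWS) -> bool:
    """Return True if `at` falls inside any window (start inclusive, end exclusive)."""
    return any(start <= at < end for start, end in windows)


if __name__ == "__main__":
    check_time = datetime.fromisoformat("2026-06-01T03:15:00+00:00")
    print(in_maintenance(check_time))
```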
Statuspage Configuration
Atlassian Statuspage (and alternatives like Instatus) provides the customer-facing view of your system health.
Components
Map your services to Statuspage components. Use customer-visible terminology, not internal names:
- Not: "payments-service-prod-us-east-1" → Yes: "Payment Processing"
- Not: "api-gateway-v2" → Yes: "API"
- Not: "cdn-cloudfront-distribution" → Yes: "Website"
Automated Updates
# PagerDuty + Statuspage integration
# When an incident is declared in PagerDuty, automatically update Statuspage
# via the Statuspage API.
# Note: impact_override takes an incident impact (none, maintenance, minor, major, critical);
# values such as partial_outage are component statuses and belong under "components".
curl -X POST https://api.statuspage.io/v1/pages/PAGE_ID/incidents \
  -H "Authorization: OAuth API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "name": "Payment Processing Degraded",
      "status": "investigating",
      "impact_override": "major",
      "body": "We are investigating an issue affecting payment processing.",
      "component_ids": ["COMPONENT_ID"],
      "components": {
        "COMPONENT_ID": "partial_outage"
      }
    }
  }'
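When the update is triggered from code, for example a small handler that reacts to a PagerDuty webhook, the same request can be assembled in Python. The sketch below only builds the request and does not send it. `build_incident_request` is a hypothetical helper, the endpoint and field names mirror the curl call above, and the default impact of `major` is an assumption about how a partial outage should be classified.

```python
import json
import urllib.request


def build_incident_request(
    page_id: str,
    api_key: str,
    component_id: str,
    name: str,
    body: str,
    component_status: str = "partial_outage",
    impact: str = "major",
) -> urllib.request.Request:
    """Build (but do not send) a Statuspage 'create incident' request."""
    url = f"https://api.statuspage.io/v1/pages/{page_id}/incidents"
    payload = {
        "incident": {
            "name": name,
            "status": "investigating",
            "impact_override": impact,
            "body": body,
            "component_ids": [component_id],
            # Per-component status shown on the public page.
            "components": {component_id: component_status},
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_incident_request(
        "PAGE_ID",
        "API_KEY",
        "COMPONENT_ID",
        "Payment Processing Degraded",
        "We are investigating an issue affecting payment processing.",
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to unit-test without touching the live status page.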
Status Values and When to Use Them
| Status | Use when |
|---|---|
| Operational | Normal; no known issues |
| Degraded Performance | Service is slow or experiencing elevated errors; workaround available |
| Partial Outage | Core feature unavailable for a subset of users |
| Major Outage | Core feature completely unavailable or data loss occurring |
| Under Maintenance | Planned maintenance window in progress |
Update Statuspage before customers contact support. If your support channel is seeing reports, Statuspage should already reflect the issue. Customers who know you are aware tolerate outages significantly better than customers who must discover the problem for themselves.
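The status table above can be encoded as a small decision function so that every responder picks the same status for the same situation. This is one possible reading of the table rather than an official mapping: the function name, its flags, and the order of precedence (data loss first, then planned maintenance) are assumptions. The returned strings are the component status values used by the Statuspage API.

```python
def component_status(
    *,
    planned_maintenance: bool = False,
    data_loss: bool = False,
    core_feature_down: bool = False,
    all_users_affected: bool = False,
    degraded: bool = False,
) -> str:
    """Pick a Statuspage component status from a few yes/no observations."""
    if data_loss:
        # Data loss is treated as a major outage regardless of anything else.
        return "major_outage"
    if planned_maintenance:
        return "under_maintenance"
    if core_feature_down and all_users_affected:
        return "major_outage"
    if core_feature_down:
        # Core feature unavailable, but only for a subset of users.
        return "partial_outage"
    if degraded:
        # Slow or elevated errors, with a workaround available.
        return "degraded_performance"
    return "operational"


if __name__ == "__main__":
    print(component_status(core_feature_down=True))
```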
Incident.io and FireHydrant
Modern incident management platforms (Incident.io, FireHydrant, Rootly) extend PagerDuty/Opsgenie with:
- Automatic incident Slack channel creation
- Timeline tracking from the Slack channel (every message timestamped)
- Automated Statuspage updates based on declared severity
- Postmortem drafting from the incident timeline
- Task and action item tracking within the incident
The value proposition: reduce the cognitive overhead of running an incident by automating the coordination work. For teams handling more than five SEV1 or SEV2 incidents per month, these tools can pay for themselves through reduced MTTR alone.
Integrating the Stack
A mature tooling integration looks like:
Prometheus alert fires
  → PagerDuty (via Alertmanager webhook)
      → Pages on-call engineer (mobile + SMS)
      → Creates Jira ticket (via PagerDuty + Jira integration)
      → Posts to #incidents Slack channel
  → Engineer triggers incident.io command in Slack:
      /incident declare SEV2 "Payments degraded"
      → Creates #inc-XXXX-payments-degraded channel
      → Pins runbook link (from PagerDuty service config)
      → Updates Statuspage to "Investigating"
      → Notifies @payments-team-lead
  → Resolution:
      /incident resolve
      → Closes PagerDuty incident
      → Updates Statuspage to "Resolved"
      → Creates postmortem draft in Confluence
      → Schedules postmortem meeting invite
Choosing Your Stack
The right tooling depends on scale and budget:
- Small team (<10 eng): PagerDuty (Free/Team tier) + Statuspage (Starter) + manual coordination in Slack
- Mid-size (10–50 eng): PagerDuty Business + Statuspage + incident.io or FireHydrant
- Large (>50 eng): Full PagerDuty/Opsgenie enterprise + dedicated incident management platform + Statuspage Enterprise with automations
The next — and final — lesson zooms out to the broader SRE framework: how error budgets and SLOs connect incident management to engineering strategy.