6 min read·Lesson 7 of 8

Tooling: PagerDuty, Opsgenie, and Statuspage

Configuring and using the core incident management toolchain — alert routing, escalation policies, on-call schedules, and customer communication.

Incident management tooling has consolidated around a few dominant platforms. Understanding how to configure and use them well is a practical skill — misconfigured escalation policies cause missed alerts; a poorly maintained Statuspage erodes customer trust. This lesson covers the operational essentials.

The Tooling Stack

| Layer | Tools | Purpose |
|---|---|---|
| Observability | Prometheus, Datadog, New Relic, CloudWatch | Emit alerts based on metrics/logs/traces |
| Alert routing | PagerDuty, Opsgenie, VictorOps | Route alerts to the right on-call person; escalate; deduplicate |
| Incident coordination | Slack + PagerDuty, Incident.io, Rootly, FireHydrant | Incident channel management; timeline; task tracking |
| Customer communication | Atlassian Statuspage, Instatus, Cachet | Public-facing status page with incident updates |
| Postmortem | Confluence, Notion, Jira, FireHydrant | Document and track postmortem action items |

PagerDuty Core Concepts

Services

A Service in PagerDuty represents a logical system or team. Alerts from your monitoring stack integrate with a service. Each service has:

  • An integration URL — where alerts are posted
  • An escalation policy — who gets paged and in what order
  • An alert grouping configuration — how alerts are deduplicated

# Example: Terraform configuration for a PagerDuty service
resource "pagerduty_service" "payments" {
  name                    = "Payments Service"
  auto_resolve_timeout    = 14400  # auto-resolve open incidents after 4 hours
  acknowledgement_timeout = 600    # an acknowledged incident returns to triggered after 10 minutes
  escalation_policy       = pagerduty_escalation_policy.payments.id
  alert_creation          = "create_alerts_and_incidents"

  alert_grouping_parameters {
    type = "intelligent"  # PagerDuty AI-based grouping
  }
}

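# Look up the built-in Prometheus vendor so the integration can reference it
data "pagerduty_vendor" "prometheus" {
  name = "Prometheus"
}

resource "pagerduty_service_integration" "prometheus" {
  name    = "Prometheus"
  service = pagerduty_service.payments.id
  vendor  = data.pagerduty_vendor.prometheus.id
}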
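
The integration above gives monitoring tools an endpoint to post alerts to. As a rough illustration of what such a post contains, the sketch below builds a PagerDuty Events API v2 trigger body as a plain dictionary. The function name, the placeholder routing key, and the example summary are assumptions made for illustration; nothing is sent over the network.

```python
import json

# Public Events API v2 endpoint; shown for context, not called in this sketch.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build an Events API v2 'trigger' body (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,  # the integration key of the service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same open alert.
        event["dedup_key"] = dedup_key
    return event


body = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    summary="PaymentsService p99 latency above 2s",
    source="payments-prod",
    dedup_key="payments-latency",
)
print(json.dumps(body, indent=2))
```

Reusing the same `dedup_key` for repeat firings of one alert is what keeps a flapping check from opening a new incident each time.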

Escalation Policies

resource "pagerduty_escalation_policy" "payments" {
  name      = "Payments On-Call"
  num_loops = 3  # if nobody acknowledges, repeat the whole policy up to 3 more times

  rule {
    escalation_delay_in_minutes = 5  # time before escalating to next level
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.payments_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.payments_secondary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = data.pagerduty_user.eng_manager.id
    }
  }
}
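
To see what the delays in this policy mean in practice, the following sketch computes when each level would be notified if nobody acknowledges. It assumes that each rule's delay is the wait before moving to the next rule, and that `num_loops` counts repeats after the first pass (so the policy runs `num_loops + 1` times in total). The function name is hypothetical.

```python
def escalation_timeline(delays_minutes, num_loops):
    """Return (minute, pass_number, level) tuples for an unacknowledged incident.

    Assumption: passes = num_loops + 1, and after the last rule's delay
    the policy starts again from level 1.
    """
    timeline = []
    minute = 0
    for pass_number in range(1, num_loops + 2):
        for level, delay in enumerate(delays_minutes, start=1):
            timeline.append((minute, pass_number, level))
            minute += delay  # wait this long before the next notification
    return timeline


for minute, pass_number, level in escalation_timeline([5, 10, 15], num_loops=3):
    print(f"t+{minute:>3} min  pass {pass_number}  level {level}")
```

Under these assumptions the secondary is paged at 5 minutes, the engineering manager at 15 minutes, and the primary is paged again at 30 minutes. Working the numbers out like this is a quick check on whether the policy matches your response-time targets.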

Schedules

resource "pagerduty_schedule" "payments_primary" {
  name      = "Payments Primary On-Call"
  time_zone = "UTC"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-01-06T00:00:00+00:00"
    rotation_virtual_start       = "2026-01-06T00:00:00+00:00"
    rotation_turn_length_seconds = 604800  # 1 week

    users = [
      data.pagerduty_user.alice.id,
      data.pagerduty_user.bob.id,
      data.pagerduty_user.carol.id,
      data.pagerduty_user.dave.id,
    ]
  }
}
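
A weekly rotation like this one is plain modular arithmetic over the user list. The sketch below works out who is on call at a given moment. It assumes a single layer with no overrides or restrictions, and the user names stand in for the IDs referenced above.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(when, rotation_start, turn_length, users):
    """Return the user whose turn covers `when` (single layer, no overrides)."""
    if when < rotation_start:
        raise ValueError("rotation has not started yet")
    turns_elapsed = (when - rotation_start) // turn_length  # whole turns since start
    return users[turns_elapsed % len(users)]


start = datetime(2026, 1, 6, tzinfo=timezone.utc)
week = timedelta(seconds=604800)
users = ["alice", "bob", "carol", "dave"]

print(on_call_at(datetime(2026, 1, 20, tzinfo=timezone.utc), start, week, users))  # carol
```

With four people on a weekly turn, each engineer carries the pager one week in four, and the cycle returns to the first user after 28 days.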

Opsgenie — Key Differences from PagerDuty

| | PagerDuty | Opsgenie |
|---|---|---|
| Parent company | PagerDuty (independent) | Atlassian |
| Integration | Broad ecosystem; many native integrations | Deep Jira/Confluence integration |
| Routing rules | Event Orchestration (formerly Event Rules / Rulesets) | Alert policies |
| Maintenance windows | Maintenance windows on service level | Maintenance on alert policy level |
| Pricing | Per user; scales steeply | More affordable at scale; bundled with Atlassian |

If your organisation already uses Jira and Confluence, Opsgenie often integrates more naturally. Choose based on ecosystem fit, not feature lists — both are mature and capable.

Alert Routing and Noise Reduction

The most impactful configuration work is reducing noise:

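Deduplication

PagerDuty merges events that share a dedup_key into one alert; event rules then suppress or route whatever remains.

# PagerDuty Event Orchestration rules (simplified sketch, not a complete Terraform resource)
# If the event source contains "staging" → suppress
# If the event summary contains "PaymentsService" → route to the Payments service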
rule {
  condition {
    expression = "event.source matches part 'staging'"
  }
  actions {
    suppress = true  # don't page anyone for staging alerts
  }
}

rule {
  condition {
    expression = "event.summary matches part 'PaymentsService'"
  }
  actions {
    route_to = pagerduty_service.payments.id
    priority = "P2"  # simplified; the Terraform provider expects a priority ID here
  }
}
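
The rules above are evaluated in order and the first match decides what happens to the event. The sketch below models that behaviour with plain Python predicates. It is a toy model of first-match evaluation, not PagerDuty's condition language, and the `"unrouted"` default is an assumption for events that match nothing.

```python
def evaluate(event, rules):
    """Return the action of the first rule whose predicate matches the event."""
    for predicate, action in rules:
        if predicate(event):
            return action
    return {"route_to": "unrouted"}  # assumed fallback when no rule matches


# Order matters: staging suppression is checked before payments routing.
rules = [
    (lambda e: "staging" in e["source"], {"suppress": True}),
    (lambda e: "PaymentsService" in e["summary"], {"route_to": "payments", "priority": "P2"}),
]

staging_event = {"source": "payments-staging", "summary": "PaymentsService latency high"}
prod_event = {"source": "payments-prod", "summary": "PaymentsService latency high"}

print(evaluate(staging_event, rules))  # {'suppress': True}
print(evaluate(prod_event, rules))     # {'route_to': 'payments', 'priority': 'P2'}
```

Placing the suppression rule first means a staging alert never reaches the routing rule, even when its summary mentions the payments service.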

Maintenance Windows

# Suppress alerts during planned maintenance
resource "pagerduty_maintenance_window" "deploy_window" {
  start_time  = "2026-06-01T02:00:00+00:00"
  end_time    = "2026-06-01T04:00:00+00:00"
  description = "Scheduled database migration"
  services    = [pagerduty_service.payments.id]
}
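
If your own tooling needs to know whether paging is currently suppressed (for example, before a deploy script raises an alert), the check is a simple interval test. The sketch below assumes windows are half-open intervals, with the start included and the end excluded. The helper name is hypothetical.

```python
from datetime import datetime


def in_maintenance(when, windows):
    """True if `when` falls inside any (start, end) window; end is exclusive."""
    return any(start <= when < end for start, end in windows)


windows = [
    (
        datetime.fromisoformat("2026-06-01T02:00:00+00:00"),
        datetime.fromisoformat("2026-06-01T04:00:00+00:00"),
    )
]

print(in_maintenance(datetime.fromisoformat("2026-06-01T03:15:00+00:00"), windows))  # True
print(in_maintenance(datetime.fromisoformat("2026-06-01T04:00:00+00:00"), windows))  # False
```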

Statuspage Configuration

Atlassian Statuspage (and alternatives like Instatus) provides the customer-facing view of your system health.

Components

Map your services to Statuspage components. Use customer-visible terminology, not internal names:

  • Not: "payments-service-prod-us-east-1" → Yes: "Payment Processing"
  • Not: "api-gateway-v2" → Yes: "API"
  • Not: "cdn-cloudfront-distribution" → Yes: "Website"
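
One way to keep the naming consistent in automation is a single lookup table from internal service names to customer-facing component names. The sketch below uses the three examples above. The helper name, and the choice to fail loudly on unmapped services, are assumptions.

```python
# Internal service name -> customer-facing Statuspage component name.
COMPONENT_NAMES = {
    "payments-service-prod-us-east-1": "Payment Processing",
    "api-gateway-v2": "API",
    "cdn-cloudfront-distribution": "Website",
}


def customer_facing(internal_name):
    """Translate an internal name, refusing to leak unmapped names to customers."""
    try:
        return COMPONENT_NAMES[internal_name]
    except KeyError:
        raise LookupError(f"no Statuspage component mapped for {internal_name!r}") from None


print(customer_facing("api-gateway-v2"))  # API
```

Raising an error for an unmapped service is deliberate. It is safer than publishing an internal hostname on a public page.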

Automated Updates

# PagerDuty + Statuspage integration
# When an incident is declared in PagerDuty, update Statuspage automatically,
# for example by calling the Statuspage API:
curl -X POST https://api.statuspage.io/v1/pages/PAGE_ID/incidents \
  -H "Authorization: OAuth API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "name": "Payment Processing Degraded",
      "status": "investigating",
      "impact_override": "major",
      "body": "We are investigating an issue affecting payment processing.",
      "component_ids": ["COMPONENT_ID"],
      "components": {
        "COMPONENT_ID": "partial_outage"
      }
    }
  }'
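
The same request can be assembled from code, which is how an incident bot would typically do it. The sketch below builds the request with the Python standard library but does not send it. The function name and its parameters are assumptions, and `PAGE_ID`, `API_KEY`, and `COMPONENT_ID` remain placeholders.

```python
import json
import urllib.request


def build_incident_request(page_id, api_key, name, body, component_statuses,
                           status="investigating"):
    """Build (but do not send) a Statuspage create-incident request."""
    payload = {
        "incident": {
            "name": name,
            "status": status,
            "body": body,
            "component_ids": list(component_statuses),
            "components": dict(component_statuses),  # component ID -> new status
        }
    }
    return urllib.request.Request(
        url=f"https://api.statuspage.io/v1/pages/{page_id}/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_incident_request(
    page_id="PAGE_ID",
    api_key="API_KEY",
    name="Payment Processing Degraded",
    body="We are investigating an issue affecting payment processing.",
    component_statuses={"COMPONENT_ID": "partial_outage"},
)
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would perform the call. Keeping construction separate from sending makes the payload easy to check in a unit test.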

Status Values and When to Use Them

| Status | Use when |
|---|---|
| Operational | Normal; no known issues |
| Degraded Performance | Service is slow or experiencing elevated errors; workaround available |
| Partial Outage | Core feature unavailable for a subset of users |
| Major Outage | Core feature completely unavailable or data loss occurring |
| Under Maintenance | Planned maintenance window in progress |
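
When several internal services feed one customer-facing component, automation has to pick a single status to show. A common approach is to show the worst one. The sketch below uses the Statuspage API's component status values. The severity ordering, in particular where `under_maintenance` sits, is an assumption you may want to change.

```python
# Least to most severe. The position of "under_maintenance" is an assumption.
SEVERITY_ORDER = [
    "operational",
    "under_maintenance",
    "degraded_performance",
    "partial_outage",
    "major_outage",
]


def worst_status(statuses):
    """Return the most severe status, or 'operational' if none are reported."""
    return max(statuses, key=SEVERITY_ORDER.index, default="operational")


print(worst_status(["operational", "partial_outage", "degraded_performance"]))  # partial_outage
```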

Update Statuspage before customers contact support. If your support channel is seeing reports, Statuspage should already reflect the issue. Customers who know you are aware tolerate outages significantly better than customers who must discover the problem for themselves.

Incident.io and FireHydrant

Modern incident management platforms (Incident.io, FireHydrant, Rootly) extend PagerDuty/Opsgenie with:

  • Automatic incident Slack channel creation
  • Timeline tracking from the Slack channel (every message timestamped)
  • Automated Statuspage updates based on declared severity
  • Postmortem drafting from the incident timeline
  • Task and action item tracking within the incident

The value proposition is reduced cognitive overhead: the platform automates the coordination work so responders can focus on the fix. Teams running more than five SEV1/SEV2 incidents per month often find that the reduction in MTTR alone justifies the cost.

Integrating the Stack

A mature tooling integration looks like:

Prometheus alert fires
    → PagerDuty (via Alertmanager)
        → Pages on-call engineer (mobile + SMS)
        → Creates Jira ticket (via PagerDuty + Jira integration)
        → Posts to #incidents Slack channel
        → Engineer triggers incident.io command in Slack:
              /incident declare SEV2 "Payments degraded"
                  → Creates #inc-XXXX-payments-degraded channel
                  → Pins runbook link (from PagerDuty service config)
                  → Updates Statuspage to "Investigating"
                  → Notifies @payments-team-lead
    → Resolution:
        /incident resolve
            → Closes PagerDuty incident
            → Updates Statuspage to "Resolved"
            → Creates postmortem draft in Confluence
            → Schedules postmortem meeting invite
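
The channel created in the flow above follows a predictable naming pattern. The sketch below shows one way such a name could be derived from an incident number and title. The `inc-` prefix mirrors the example above, while the slug rules and the 80-character cap (Slack's channel name limit) are assumptions about how a tool might implement it.

```python
import re


def incident_channel_name(incident_number, title, max_length=80):
    """Derive a Slack-safe channel name such as 'inc-1234-payments-degraded'."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Truncate to the limit and drop any hyphen left dangling by the cut.
    return f"inc-{incident_number}-{slug}"[:max_length].rstrip("-")


print(incident_channel_name(1234, "Payments degraded"))  # inc-1234-payments-degraded
```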

Choosing Your Stack

The right tooling depends on scale and budget:

  • Small team (<10 eng): PagerDuty (Free/Team tier) + Statuspage (Starter) + manual coordination in Slack
  • Mid-size (10–50 eng): PagerDuty Business + Statuspage + incident.io or FireHydrant
  • Large (>50 eng): Full PagerDuty/Opsgenie enterprise + dedicated incident management platform + Statuspage Enterprise with automations

The next — and final — lesson zooms out to the broader SRE framework: how error budgets and SLOs connect incident management to engineering strategy.

Key Takeaways

  • PagerDuty and Opsgenie solve the same core problem — reliable alert routing with on-call scheduling.
  • Services, escalation policies, and schedules are the three pillars of PagerDuty configuration.
  • Alert noise is reduced through deduplication, suppression windows, and intelligent routing.
  • Statuspage is the customer-facing communication layer — update it before your customers tweet.
  • On-call tooling must integrate with your observability stack for a single pane of glass.
