Tooling: PagerDuty, Opsgenie, and Statuspage — Incident Management & On-Call Engineering | CertQnA

Incident management tooling has consolidated around a few dominant platforms. Understanding how to configure and use them well is a practical skill — misconfigured escalation policies cause missed alerts; a poorly maintained Statuspage erodes customer trust. This lesson covers the operational essentials.

The Tooling Stack

Layer	Tools	Purpose
Observability	Prometheus, Datadog, New Relic, CloudWatch	Emit alerts based on metrics/logs/traces
Alert routing	PagerDuty, Opsgenie, Splunk On-Call (formerly VictorOps)	Route alerts to the right on-call person; escalate; deduplicate
Incident coordination	Slack + PagerDuty, Incident.io, Rootly, FireHydrant	Incident channel management; timeline; task tracking
Customer communication	Atlassian Statuspage, Instatus, Cachet	Public-facing status page with incident updates
Postmortem	Confluence, Notion, Jira, FireHydrant	Document and track postmortem action items

PagerDuty Core Concepts

Services

A Service in PagerDuty represents a logical system or the team that owns it. Your monitoring stack sends alerts to a service through an integration. Each service has:

One or more integrations, each with its own key or URL: where alerts are posted
An escalation policy: who gets paged and in what order
An alert grouping configuration: how related alerts are combined into a single incident

# Example: Terraform configuration for a PagerDuty service
resource "pagerduty_service" "payments" {
  name                    = "Payments Service"
  auto_resolve_timeout    = 14400  # auto-resolve open incidents after 4 hours
  acknowledgement_timeout = 600    # an acknowledged incident re-triggers after 10 minutes
  escalation_policy       = pagerduty_escalation_policy.payments.id
  alert_creation          = "create_alerts_and_incidents"

  alert_grouping_parameters {
    type = "intelligent"  # PagerDuty's machine-learning-based grouping
  }
}

# Look up the vendor ID that the integration below refers to
data "pagerduty_vendor" "prometheus" {
  name = "Prometheus"
}

resource "pagerduty_service_integration" "prometheus" {
  name    = "Prometheus"
  service = pagerduty_service.payments.id
  vendor  = data.pagerduty_vendor.prometheus.id
}
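
To make the integration concrete, here is a minimal Python sketch of the event a monitoring tool would post to that service. It assumes the integration accepts Events API v2 events. The routing key, alert text, and host name are placeholders, and `build_trigger_event` is a hypothetical helper, not part of any PagerDuty SDK.

```python
import json
from typing import Optional

# Events API v2 endpoint; the routing key identifies the service integration.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical",
                        dedup_key: Optional[str] = None) -> dict:
    """Build the JSON body for a trigger event (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Later events with the same dedup_key attach to the same open alert.
        event["dedup_key"] = dedup_key
    return event


if __name__ == "__main__":
    body = build_trigger_event(
        routing_key="YOUR_INTEGRATION_KEY",   # placeholder
        summary="PaymentsService p99 latency above 2s",
        source="payments-api-01",             # placeholder host
        dedup_key="payments-latency",
    )
    # Sending is left out on purpose: POST this JSON to EVENTS_API_URL.
    print(json.dumps(body, indent=2))
```

In practice Prometheus Alertmanager or your monitoring vendor builds this payload for you; the sketch only shows which fields tie an alert to a service.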

Escalation Policies

resource "pagerduty_escalation_policy" "payments" {
  name      = "Payments On-Call"
  num_loops = 3  # after the first pass, repeat the whole policy up to 3 more times

  rule {
    escalation_delay_in_minutes = 5  # time before escalating to next level
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.payments_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.payments_secondary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = data.pagerduty_user.eng_manager.id
    }
  }
}
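
It helps to work out when each level is actually paged if nobody acknowledges. The sketch below is a simplified model, not PagerDuty's implementation. It assumes level 1 is notified at minute 0, each rule's delay is the wait before the next rule fires, the last rule's delay is the wait before the policy starts over, and `num_loops` counts repeats after the first pass.

```python
def escalation_timeline(delays_min, num_loops):
    """Return (minute, level) pairs for an incident that is never acknowledged.

    Simplified model: the policy runs num_loops + 1 times in total, and
    each rule's delay is added before the next notification goes out.
    """
    timeline = []
    minute = 0
    for _ in range(num_loops + 1):
        for level, delay in enumerate(delays_min, start=1):
            timeline.append((minute, level))
            minute += delay
    return timeline


if __name__ == "__main__":
    # Delays and loop count taken from the escalation policy above.
    for minute, level in escalation_timeline([5, 10, 15], num_loops=3):
        print(f"t+{minute:>3} min: notify level {level}")
```

Under these assumptions the engineering manager is first paged 15 minutes after the incident triggers, and the final notification goes out at minute 105. If 15 minutes is too long for your most critical service, shorten the first two delays rather than adding more levels.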

Schedules

resource "pagerduty_schedule" "payments_primary" {
  name      = "Payments Primary On-Call"
  time_zone = "UTC"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2026-01-06T00:00:00+00:00"
    rotation_virtual_start       = "2026-01-06T00:00:00+00:00"
    rotation_turn_length_seconds = 604800  # 1 week

    users = [
      data.pagerduty_user.alice.id,
      data.pagerduty_user.bob.id,
      data.pagerduty_user.carol.id,
      data.pagerduty_user.dave.id,
    ]
  }
}
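
The rotation arithmetic behind a single-layer schedule like this is simple enough to check by hand. The sketch below is a minimal illustration that ignores overrides, restrictions, and multiple layers; `on_call_at` and the lowercase user names are hypothetical stand-ins for the users referenced above.

```python
from datetime import datetime, timezone

ROTATION = ["alice", "bob", "carol", "dave"]
VIRTUAL_START = datetime(2026, 1, 6, tzinfo=timezone.utc)
TURN_LENGTH_SECONDS = 604800  # 1 week, as in the schedule above


def on_call_at(users, virtual_start, turn_length_seconds, when):
    """Return who is on call at `when` for a simple fixed-length rotation."""
    if when < virtual_start:
        raise ValueError("rotation has not started yet")
    elapsed = (when - virtual_start).total_seconds()
    turns_elapsed = int(elapsed // turn_length_seconds)
    return users[turns_elapsed % len(users)]


if __name__ == "__main__":
    when = datetime(2026, 1, 20, 12, 0, tzinfo=timezone.utc)
    print(on_call_at(ROTATION, VIRTUAL_START, TURN_LENGTH_SECONDS, when))
```

With four people on a weekly turn, each engineer is on call one week in four, and the cycle restarts every 28 days.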

Opsgenie — Key Differences from PagerDuty

Aspect	PagerDuty	Opsgenie
Parent company	PagerDuty (independent)	Atlassian
Integration	Broad ecosystem; many native integrations	Deep Jira/Confluence integration
Routing rules	Event Orchestration (successor to Event Rules / Rulesets)	Alert policies
Maintenance windows	Set per service	Set per alert policy
Pricing	Per user; scales steeply	More affordable at scale; bundled with Atlassian

If your organisation already uses Jira and Confluence, Opsgenie often integrates more naturally. Choose based on ecosystem fit, not feature lists — both are mature and capable.

Alert Routing and Noise Reduction

The most impactful configuration work is reducing noise:

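Suppression and Routing Rules

# PagerDuty Event Orchestration (the successor to Event Rules / Rulesets)
# Router: events whose summary contains "PaymentsService" go to the Payments service
resource "pagerduty_event_orchestration_router" "monitoring" {
  event_orchestration = pagerduty_event_orchestration.monitoring.id  # defined elsewhere
  set {
    id = "start"
    rule {
      condition {
        expression = "event.summary matches part 'PaymentsService'"
      }
      actions {
        route_to = pagerduty_service.payments.id
      }
    }
  }
  catch_all {
    actions {
      route_to = "unrouted"
    }
  }
}

# Service orchestration: suppress staging events, mark everything else P2
resource "pagerduty_event_orchestration_service" "payments" {
  service                                = pagerduty_service.payments.id
  enable_event_orchestration_for_service = true
  set {
    id = "start"
    rule {
      condition {
        expression = "event.source matches part 'staging'"
      }
      actions {
        suppress = true  # don't page anyone for staging alerts
      }
    }
  }
  catch_all {
    actions {
      priority = data.pagerduty_priority.p2.id  # a priority ID, not the label "P2"
    }
  }
}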
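
Deduplication itself hinges on the deduplication key that each event carries: repeated trigger events with the same key attach to the alert that is already open instead of paging again. The sketch below is a simplified in-memory model of that behaviour, not PagerDuty's implementation. The event shape is reduced to three fields and `deduplicate` is a hypothetical name.

```python
def deduplicate(events):
    """Fold a stream of events into open alerts, one per dedup_key.

    A trigger for a key that is already open only increments its count;
    a resolve closes the alert so the next trigger opens a fresh one.
    """
    open_alerts = {}
    for event in events:
        key = event["dedup_key"]
        if event["event_action"] == "trigger":
            alert = open_alerts.setdefault(
                key, {"summary": event["summary"], "count": 0}
            )
            alert["count"] += 1
        elif event["event_action"] == "resolve":
            open_alerts.pop(key, None)
    return open_alerts


if __name__ == "__main__":
    stream = [
        {"dedup_key": "payments-latency", "event_action": "trigger", "summary": "p99 above 2s"},
        {"dedup_key": "payments-latency", "event_action": "trigger", "summary": "p99 above 2s"},
        {"dedup_key": "disk-db-01", "event_action": "trigger", "summary": "disk 95% full"},
        {"dedup_key": "disk-db-01", "event_action": "resolve", "summary": "disk 95% full"},
    ]
    print(deduplicate(stream))
```

The practical takeaway is to choose keys that are stable per problem (for example alert name plus affected component), so a flapping check produces one alert rather than dozens.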

Maintenance Windows

# Suppress alerts during planned maintenance
resource "pagerduty_maintenance_window" "deploy_window" {
  start_time  = "2026-06-01T02:00:00+00:00"
  end_time    = "2026-06-01T04:00:00+00:00"
  description = "Scheduled database migration"
  services    = [pagerduty_service.payments.id]
}
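
Because the window's start and end carry explicit UTC offsets, a deploy script can check that it is running inside the approved window before it begins. A minimal sketch using only the standard library; `in_maintenance` is a hypothetical helper, and treating the end time as exclusive is an assumption.

```python
from datetime import datetime


def in_maintenance(window_start: str, window_end: str, at: str) -> bool:
    """Return True if `at` falls inside [window_start, window_end).

    All three arguments are ISO 8601 strings with a UTC offset, so the
    comparison stays correct even when the offsets differ.
    """
    start = datetime.fromisoformat(window_start)
    end = datetime.fromisoformat(window_end)
    moment = datetime.fromisoformat(at)
    return start <= moment < end


if __name__ == "__main__":
    # Window taken from the maintenance window defined above.
    print(in_maintenance(
        "2026-06-01T02:00:00+00:00",
        "2026-06-01T04:00:00+00:00",
        "2026-06-01T05:00:00+02:00",  # 03:00 UTC, written with a +02:00 offset
    ))
```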

Statuspage Configuration

Atlassian Statuspage (and alternatives like Instatus) provides the customer-facing view of your system health.

Components

Map your services to Statuspage components. Use customer-visible terminology, not internal names:

Not: "payments-service-prod-us-east-1" → Yes: "Payment Processing"
Not: "api-gateway-v2" → Yes: "API"
Not: "cdn-cloudfront-distribution" → Yes: "Website"
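
One way to keep internal names from leaking onto the status page is to hold the mapping in one place and refuse to publish anything that is not in it. The sketch below assumes the three mappings listed above; `customer_facing_components` is a hypothetical helper.

```python
# Internal service name -> customer-visible Statuspage component name
COMPONENT_NAMES = {
    "payments-service-prod-us-east-1": "Payment Processing",
    "api-gateway-v2": "API",
    "cdn-cloudfront-distribution": "Website",
}


def customer_facing_components(internal_names):
    """Translate internal names, keeping order and dropping duplicates.

    Raises KeyError for an unmapped service so that an internal name is
    never published by accident.
    """
    public_names = []
    for name in internal_names:
        if name not in COMPONENT_NAMES:
            raise KeyError(f"no Statuspage component mapped for {name!r}")
        public = COMPONENT_NAMES[name]
        if public not in public_names:
            public_names.append(public)
    return public_names


if __name__ == "__main__":
    print(customer_facing_components(
        ["api-gateway-v2", "payments-service-prod-us-east-1"]
    ))
```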

Automated Updates

# PagerDuty + Statuspage integration
# When an incident is declared in PagerDuty, automatically update Statuspage

# Via Statuspage API
# impact_override accepts none, minor, major, or critical;
# component statuses such as partial_outage belong in "components"
curl -X POST https://api.statuspage.io/v1/pages/PAGE_ID/incidents \
  -H "Authorization: OAuth API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "incident": {
      "name": "Payment Processing Degraded",
      "status": "investigating",
      "impact_override": "major",
      "body": "We are investigating an issue affecting payment processing.",
      "component_ids": ["COMPONENT_ID"],
      "components": {
        "COMPONENT_ID": "partial_outage"
      }
    }
  }'
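
If the update is triggered from a script or webhook handler rather than a shell, the same request can be assembled with the Python standard library. This is a sketch under the same assumptions as the curl call: `PAGE_ID`, `API_KEY`, and `COMPONENT_ID` are placeholders and `build_incident_request` is a hypothetical helper. It only builds the request; sending is left to the caller.

```python
import json
import urllib.request


def build_incident_request(page_id, api_key, name, body, component_id,
                           component_status="partial_outage"):
    """Build (but do not send) a request that opens a Statuspage incident."""
    payload = {
        "incident": {
            "name": name,
            "status": "investigating",
            "body": body,
            "component_ids": [component_id],
            "components": {component_id: component_status},
        }
    }
    return urllib.request.Request(
        url=f"https://api.statuspage.io/v1/pages/{page_id}/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_incident_request(
        page_id="PAGE_ID",            # placeholder
        api_key="API_KEY",            # placeholder
        name="Payment Processing Degraded",
        body="We are investigating an issue affecting payment processing.",
        component_id="COMPONENT_ID",  # placeholder
    )
    # To send it: urllib.request.urlopen(request)
    print(request.get_method(), request.full_url)
```

Keeping the builder separate from the network call makes the payload easy to unit-test without touching the live status page.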

Status Values and When to Use Them

Status	Use when
Operational	Normal; no known issues
Degraded Performance	Service is slow or experiencing elevated errors; workaround available
Partial Outage	Core feature unavailable for a subset of users
Major Outage	Core feature completely unavailable or data loss occurring
Under Maintenance	Planned maintenance window in progress
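
The table can be turned into a small decision function so that whoever is on call picks a status consistently. The sketch below returns the component status strings used by the Statuspage API. The input flags, the function name `component_status`, and the order in which the checks are applied are assumptions for illustration.

```python
def component_status(*, feature_unavailable=False, affected_fraction=0.0,
                     data_loss=False, degraded=False,
                     planned_maintenance=False):
    """Pick a Statuspage component status from a few yes/no observations.

    affected_fraction is the share of users affected, from 0.0 to 1.0.
    """
    if data_loss:
        return "major_outage"
    if planned_maintenance:
        return "under_maintenance"
    if feature_unavailable and affected_fraction >= 1.0:
        return "major_outage"
    if feature_unavailable and affected_fraction > 0.0:
        return "partial_outage"
    if degraded:
        return "degraded_performance"
    return "operational"


if __name__ == "__main__":
    # Checkout is down for roughly a fifth of users.
    print(component_status(feature_unavailable=True, affected_fraction=0.2))
```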

Update Statuspage before customers contact support. If your support channel is seeing reports, Statuspage should already reflect the issue. Customers who know you are aware tolerate outages significantly better than customers who must discover the problem for themselves.

Incident.io and FireHydrant

Modern incident management platforms (Incident.io, FireHydrant, Rootly) extend PagerDuty/Opsgenie with:

Automatic incident Slack channel creation
Timeline tracking from the Slack channel (every message timestamped)
Automated Statuspage updates based on declared severity
Postmortem drafting from the incident timeline
Task and action item tracking within the incident

The value proposition: reduce the cognitive overhead of running an incident by automating the coordination work. For teams running >5 SEV1/2 incidents per month, these tools often pay for themselves in reduced MTTR alone.
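
To illustrate the kind of coordination work being automated, here is a minimal sketch of turning timestamped channel messages into the timeline section of a postmortem draft. It is not how any of these products work internally; the `Message` type and `draft_timeline` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    sent_at: datetime
    author: str
    text: str


def draft_timeline(messages):
    """Render messages as a chronological plain-text timeline."""
    lines = ["Timeline"]
    for msg in sorted(messages, key=lambda m: m.sent_at):
        lines.append(f"{msg.sent_at:%H:%M} {msg.author}: {msg.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(draft_timeline([
        Message(datetime(2026, 6, 1, 2, 15), "bob", "Rolled back the deploy"),
        Message(datetime(2026, 6, 1, 2, 5), "alice", "Error rate spiking on payments"),
    ]))
```

Doing this by hand after every incident is exactly the overhead these platforms remove.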

Integrating the Stack

A mature tooling integration looks like:

Prometheus alert fires
    → PagerDuty (via Alertmanager's pagerduty_configs receiver)
        → Pages on-call engineer (mobile + SMS)
        → Creates Jira ticket (via PagerDuty + Jira integration)
        → Posts to #incidents Slack channel
        → Engineer triggers incident.io command in Slack:
              /incident declare SEV2 "Payments degraded"
                  → Creates #inc-XXXX-payments-degraded channel
                  → Pins runbook link (from PagerDuty service config)
                  → Updates Statuspage to "Investigating"
                  → Notifies @payments-team-lead
    → Resolution:
        /incident resolve
            → Closes PagerDuty incident
            → Updates Statuspage to "Resolved"
            → Creates postmortem draft in Confluence
            → Schedules postmortem meeting invite
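
The `#inc-XXXX-payments-degraded` channel in the flow above follows a predictable naming scheme, which is worth standardising even if you create channels by hand. A minimal sketch, assuming a numeric incident ID and Slack's rules of lowercase names up to 80 characters; `incident_channel_name` is a hypothetical helper, not an incident.io function.

```python
import re


def incident_channel_name(incident_id: int, title: str, max_length: int = 80) -> str:
    """Build a Slack-safe channel name such as inc-1234-payments-degraded."""
    # Lowercase, then collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Truncate to the length limit and avoid ending on a hyphen.
    return f"inc-{incident_id}-{slug}"[:max_length].rstrip("-")


if __name__ == "__main__":
    print(incident_channel_name(1234, "Payments degraded"))
```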

Choosing Your Stack

The right tooling depends on scale and budget:

Small team (<10 eng): PagerDuty (Free or Professional plan) + Statuspage (Startup plan) + manual coordination in Slack
Mid-size (10–50 eng): PagerDuty Business + Statuspage + incident.io or FireHydrant
Large (>50 eng): Full PagerDuty/Opsgenie enterprise + dedicated incident management platform + Statuspage Enterprise with automations

The next — and final — lesson zooms out to the broader SRE framework: how error budgets and SLOs connect incident management to engineering strategy.