
Incident Management & On-Call Engineering

Respond, resolve, and learn — build the systems and culture that keep production reliable.

Intermediate · 0.8 hours · 8 lessons

What You'll Learn

  • Define severity levels and incident classification criteria
  • Design fair, sustainable on-call rotations with escalation policies
  • Lead an incident response with a clear communication structure
  • Write effective runbooks and playbooks that accelerate mitigation
  • Facilitate blameless postmortems that generate lasting improvements
  • Measure and act on error budgets within an SRE framework
  • Configure PagerDuty and Opsgenie for real-world alerting
  • Build an incident management culture — not just processes

Prerequisites

  • Experience deploying and operating cloud services
  • Familiarity with observability concepts (metrics, logs, traces)
  • Basic understanding of SLIs, SLOs, and SLAs is helpful

Course Curriculum

Practice for the Real Exam

After completing this course, test yourself with exam-style practice questions.