7 min read·Lesson 8 of 8

SRE, Error Budgets, and Reliability Engineering

Connecting incident management to the broader SRE framework — SLIs, SLOs, error budgets, and how they drive engineering decisions.

Incident management tells you how to respond when things go wrong. SRE tells you how to define "wrong" quantitatively and make engineering decisions based on data rather than intuition. The two practices are inseparable in high-performing organisations.

The SRE Vocabulary

Service Level Indicator (SLI)

An SLI is a quantitative measure of some aspect of the service that matters to users. Good SLIs:

  • Measure what users experience, not internal system state
  • Are expressed as a ratio: good events / total events
  • Are measurable at reasonable cost

| Service type | Common SLIs |
| --- | --- |
| HTTP API | Request success rate (non-5xx responses / total requests), p99 latency |
| Data pipeline | Pipeline freshness (data age), row accuracy rate |
| Storage | Read/write durability, object availability |
| Batch job | Completion rate, execution time |
| Queue | Message processing latency, consumer lag |
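
The ratio form (good events / total events) can be sketched in a few lines of Python. The `Request` record and its field names are hypothetical, and treating every status below 500 as "good" is an assumption that mirrors the HTTP API row above, not a universal rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative only."""
    status: int
    latency_ms: float


def availability_sli(requests: list[Request]) -> float:
    """Return good events / total events, where 'good' means a non-5xx response."""
    if not requests:
        raise ValueError("SLI is undefined with no events")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Return the fraction of requests served within threshold_ms."""
    if not requests:
        raise ValueError("SLI is undefined with no events")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 640.0),
    Request(503, 30.0),
    Request(404, 80.0),
]
print(availability_sli(sample))    # 0.75 (3 of 4 responses are non-5xx)
print(latency_sli(sample, 500.0))  # 0.75 (3 of 4 responses within 500 ms)
```

Expressing latency as "fraction of requests under a threshold" keeps it in the same good/total shape as availability, which makes both usable with the same SLO and error budget arithmetic.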

Service Level Objective (SLO)

An SLO is the target value for an SLI over a rolling time window:

  • Payments API success rate ≥ 99.9% over a rolling 28-day window
  • Search API latency ≤ 500ms for 99.5% of requests
  • Checkout page availability ≥ 99.95%

SLOs are internal targets. They are deliberately set tighter than SLAs to give an early warning buffer before contractual obligations are breached.

Service Level Agreement (SLA)

An SLA is a contractual commitment to customers that carries penalties for breach (refunds, credits). SLAs should always be weaker than SLOs:

SLO: 99.9% availability (our internal target)
SLA: 99.5% availability (our customer commitment)

Buffer: 0.4 percentage points of headroom to absorb incidents without breaching the SLA

If the SLA is as tight as or tighter than the SLO, there is no buffer: an incident that pushes you past the internal target has already breached the contract, which is financially and reputationally damaging.

Error Budgets

The error budget is the permitted amount of unreliability in a given window:

SLO = 99.9% success rate over 28 days
Error budget = 100% - 99.9% = 0.1%

28 days = 28 × 24 × 60 = 40,320 minutes
0.1% × 40,320 ≈ 40.3 minutes of full downtime permitted (or the equivalent volume of failed requests)
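
As a minimal sketch, the same arithmetic in Python. The function names are illustrative, and the minutes figure assumes a total outage (every request failing); for a request-based SLI the budget is more naturally counted in failed requests.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total outage that would consume the whole budget."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def error_budget_requests(slo: float, total_requests: int) -> float:
    """Failed requests the SLO tolerates for a given request volume."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * total_requests


print(round(error_budget_minutes(0.999, 28), 2))         # 40.32
print(round(error_budget_minutes(0.999, 30), 2))         # 43.2
print(round(error_budget_requests(0.999, 1_000_000)))    # 1000
```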

Every incident burns budget. The question is not "did an incident occur?" but "how much of the error budget did it consume?"

Error Budget Policy

The error budget policy defines what happens as the budget is consumed:

| Budget remaining | Policy |
| --- | --- |
| > 50% | Normal operations; feature velocity is fine |
| 25–50% | Heightened awareness; review upcoming risky deploys |
| 10–25% | Feature freeze for high-risk changes; focus sprint on reliability |
| < 10% | Emergency: stop new feature deploys; all engineering on reliability work |
| 0% (exhausted) | Full freeze until SLO is restored; potential postmortem on reliability strategy |

The error budget converts an engineering culture question ("should we slow down?") into a data-driven decision ("our budget is at 8%; the policy says we freeze").
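
One way to make the policy mechanical is a small lookup, sketched below. The function name and the returned labels are hypothetical, and the handling of exact boundaries (25% counts as "heightened awareness", 10% as "freeze high-risk changes") is an assumption, since the tiers above overlap at their edges.

```python
def policy_for(budget_remaining: float) -> str:
    """Map remaining budget (a fraction; zero or negative once exhausted) to a tier."""
    if budget_remaining <= 0.0:
        return "full freeze"
    if budget_remaining < 0.10:
        return "emergency: stop feature deploys"
    if budget_remaining < 0.25:
        return "freeze high-risk changes"
    if budget_remaining <= 0.50:
        return "heightened awareness"
    return "normal operations"


print(policy_for(0.08))  # emergency: stop feature deploys
print(policy_for(0.60))  # normal operations
```

Encoding the tiers once and calling them from dashboards or deploy tooling keeps the decision consistent with the written policy.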

Calculating Error Budget Burn Rate

Burn rate measures how quickly you are consuming the budget relative to the time window:

Burn rate = observed error rate / error rate allowed by the SLO

Burn rate = 1 → consuming the budget at exactly the rate that depletes it at window end
Burn rate > 1 → consuming faster than the window allows (danger)
Burn rate < 1 → consuming slower than the window (fine)

Example:
SLO: 99.9%, window: 30 days, budget: 43.2 minutes of full downtime
A full outage lasting 2 hours = 120 minutes
Burn rate during the outage = 100% / 0.1% = 1,000×
Budget consumed = 120 / 43.2 = 2.78× the entire 30-day budget

Google's recommended approach alerts on burn rate rather than on a static error-rate threshold. A sustained 2× burn rate would exhaust a 30-day budget in 15 days, so an alert that fires after an hour at that rate leaves plenty of time to act before the budget is gone.
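
A short sketch of the burn-rate arithmetic, taking burn rate as the observed error rate divided by the error rate the SLO allows. The function names are illustrative, and the figures assume a 99.9% SLO over a 30-day window.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def budget_consumed(rate: float, duration_hours: float, window_days: float) -> float:
    """Fraction of the window's budget spent by burning at `rate` for a duration."""
    return rate * duration_hours / (window_days * 24)


# 0.2% of requests failing against a 99.9% SLO is a 2x burn.
print(round(burn_rate(0.002, 0.999), 2))         # 2.0
print(days_to_exhaustion(2.0, 30))               # 15.0

# A 2-hour total outage burns at about 1000x and spends ~2.78 budgets.
outage_rate = burn_rate(1.0, 0.999)
print(round(budget_consumed(outage_rate, 2, 30), 2))  # 2.78
```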

SLO-Based Alerting

# Multi-window, multi-burn-rate alert (Google's recommended approach)

groups:
  - name: slo-alerts
    rules:
      # Fast burn — fires quickly for high-urgency situations
      - alert: PaymentsSLOFastBurn
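        expr: |
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="payments"}[1h]))
          ) > 14.4 * 0.001  # 14.4× burn rate; 0.001 = 1 - SLO; 14.4 assumes a 30-day window
          and
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payments"}[5m]))
          ) > 14.4 * 0.001  # short window lets the alert clear soon after recovery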
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Fast error budget burn"
          description: "Burning error budget 14.4× faster than sustainable"
          runbook_url: "https://wiki.acme.io/runbooks/PB-012"

      # Slow burn — fires for sustained elevated errors
      - alert: PaymentsSLOSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="payments"}[6h]))
          ) > 6 * 0.001
          and
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="payments"}[30m]))
          ) > 6 * 0.001
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Slow sustained error budget burn"
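
The factors 14.4 and 6 in the rules above are not arbitrary. In the SRE Workbook's scheme they correspond to spending 2% of a 30-day budget in 1 hour and 5% in 6 hours. The sketch below derives them; the function names are illustrative, and the 28-day figure is included only to show that the factor changes if the SLO window does.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(burn: float, slo: float) -> float:
    """Error-rate value to compare against in the alert expression."""
    return burn * (1.0 - slo)


fast = burn_rate_threshold(0.02, 1)    # 2% of budget in 1 hour
slow = burn_rate_threshold(0.05, 6)    # 5% of budget in 6 hours
print(round(fast, 2), round(slow, 2))  # 14.4 6.0

# The same 2%-in-1-hour target over a 28-day window gives a different factor.
print(round(burn_rate_threshold(0.02, 1, slo_window_days=28), 2))  # 13.44

# Threshold used in the fast-burn expression for a 99.9% SLO.
print(round(error_rate_threshold(fast, 0.999), 4))  # 0.0144
```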

Toil: What SREs Reduce

Toil is the operational work that:

  • Is manual — a human must do it
  • Is repetitive — the same task, over and over
  • Is automatable — a computer could do it
  • Has no lasting value — it doesn't improve the system, just maintains status quo
  • Scales linearly with service growth

Examples of toil:

  • Manually restarting a service every Monday because it has a memory leak
  • Rotating AWS credentials by logging into consoles across 15 accounts
  • Manually reviewing and approving certificate renewals
  • Copying data from one system to another via a script you run manually

Google SRE caps toil at 50% of an SRE's time. Above that, the role devolves into operations work with no engineering improvement.

The Reliability Virtuous Cycle

The full cycle:

  1. Define SLOs — what does "good" mean?
  2. Measure SLIs — are we good right now?
  3. Incidents exhaust budget — something bad happened
  4. Postmortem produces action items — we understand why
  5. Action items reduce toil and improve reliability — we fix the system
  6. SLO compliance improves — we are measurably better
  7. Repeat

Practical SLO Implementation

Starting from scratch:

  1. Identify your critical user journeys. What are the 3–5 things users must be able to do? Login, search, checkout, API call.
  2. Define one SLI per journey. Express as: good requests / total requests.
  3. Set initial SLOs conservatively. Start at what you already achieve minus a small buffer. If your current availability is 99.7%, start at 99.5%.
  4. Measure for 30 days. Don't act on SLO data until you trust it.
  5. Review with stakeholders. Align engineering, product, and business on what the SLO means.
  6. Tighten over time. As reliability improves, raise the SLO to match.
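
Step 3 can be sketched as follows. The 0.2-point buffer and the rounding down to one decimal place are assumptions taken from the 99.7% to 99.5% example above, not a general rule; `Decimal` is used so percentages such as 99.5 compare exactly.

```python
from decimal import Decimal, ROUND_FLOOR


def achieved_percent(good: int, total: int) -> Decimal:
    """Availability actually achieved, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return Decimal(good) / Decimal(total) * 100


def initial_slo(achieved: Decimal, buffer: Decimal = Decimal("0.2")) -> Decimal:
    """Subtract a small buffer, then round down to one decimal place."""
    return (achieved - buffer).quantize(Decimal("0.1"), rounding=ROUND_FLOOR)


current = achieved_percent(997_000, 1_000_000)  # 99.7% achieved
print(initial_slo(current))                     # 99.5
```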

SRE Team Models

| Model | Structure | Best for |
| --- | --- | --- |
| Embedded SRE | 1–2 SREs within a product team; org-wide SRE chapter | Enterprises where product ownership is distributed |
| Central SRE | Dedicated SRE team serving all product teams | Organisations with shared infrastructure teams |
| SRE as service | Product team applies to SRE for coverage; SRE owns the bar for acceptance | Google's model; requires mature product teams |
| Platform SRE | SREs own the platform (K8s, CI/CD, observability); product teams own their services | Platform engineering organisations |

Bringing It Together

The practices in this course form a complete system:

  • SLOs define the goal — clarity on what reliability means for each service
  • Alerting fires when budget is burning — earlier, more accurate signals
  • On-call responds using runbooks — fast, consistent mitigation
  • Incident management coordinates the response — reduces chaos, improves MTTR
  • Postmortems generate improvements — the system gets better over time
  • Error budget policy governs feature vs reliability tradeoffs — data-driven decisions

Teams that operate this way — measurably, with accountability and learning — are the ones that achieve high reliability at high deployment frequency. Reliability and velocity are not in conflict when the infrastructure for learning is in place.

Further Reading

  • Site Reliability Engineering — Beyer, Jones, Petoff, Murphy (O'Reilly; free online)
  • The Site Reliability Workbook — Google SRE team (O'Reilly; free online)
  • ITIL 4 Foundation — AXELOS official guide
  • Seeking SRE — Blank-Edelman (O'Reilly) — essays from SRE practitioners
  • PagerDuty Incident Response Guide — response.pagerduty.com
  • Postmortem examples: github.com/danluu/post-mortems

Key Takeaways

  • SLIs measure what you care about; SLOs define acceptable performance; SLAs are contractual commitments.
  • An error budget is 100% minus the SLO — the permitted failure allocation in a given period.
  • Burning through the error budget triggers a reliability focus: slow deployments, freeze features, reduce toil.
  • Toil is manual, repetitive operational work that does not produce lasting improvement — SRE aims to reduce it.
  • CRE (customer reliability engineering) is SRE applied externally: helping customers achieve their reliability goals.