SRE, Error Budgets, and Reliability Engineering

Incident management tells you how to respond when things go wrong. SRE tells you how to define "wrong" quantitatively and make engineering decisions based on data rather than intuition. The two practices are inseparable in high-performing organisations.

The SRE Vocabulary

Service Level Indicator (SLI)

An SLI is a quantitative measure of some aspect of the service that matters to users. Good SLIs:

Measure what users experience, not internal system state
Are expressed as a ratio: good events / total events
Are measurable at reasonable cost

Service type	Common SLIs
HTTP API	Request success rate (non-5xx responses / total requests), p99 latency
Data pipeline	Pipeline freshness (data age), row accuracy rate
Storage	Read/write durability, object availability
Batch job	Completion rate, execution time
Queue	Message processing latency, consumer lag
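
The "good events / total events" form can be sketched in a few lines. This is a minimal illustration, not a production collector: the function name `success_rate_sli` and the sample data are hypothetical, and it assumes that any response with a status below 500 counts as good.

```python
from typing import Iterable


def success_rate_sli(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that did not fail with a 5xx status."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed; the SLI is undefined")
    # Assumption: client errors (4xx) are the caller's fault and count as good.
    good = sum(1 for code in codes if code < 500)
    return good / len(codes)


# Hypothetical sample: 998 successes, 1 client error, 1 server error.
sample = [200] * 998 + [404] + [503]
print(f"SLI = {success_rate_sli(sample):.4f}")
```

Whether 4xx responses count as good is a design choice; some teams exclude them from both the numerator and the denominator instead.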

Service Level Objective (SLO)

An SLO is the target value for an SLI over a rolling time window:

Payments API success rate ≥ 99.9% over a rolling 28-day window
Search API latency ≤ 500 ms for 99.5% of requests over a rolling 28-day window
Checkout page availability ≥ 99.95%

SLOs are internal targets. They are deliberately set tighter than SLAs to give an early warning buffer before contractual obligations are breached.
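
A latency objective can be checked the same way as a success-rate objective once "good" is defined as "served within the threshold". The sketch below assumes a hypothetical `Slo` class and made-up latency samples; it only shows the shape of the comparison.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # required fraction of good events, e.g. 0.995

    def is_met(self, good_events: int, total_events: int) -> bool:
        if total_events == 0:
            return True  # assumption: no traffic means nothing violated the SLO
        return good_events / total_events >= self.target


def count_fast_requests(latencies_ms: Sequence[float], threshold_ms: float = 500.0) -> int:
    """Count requests served within the latency threshold."""
    return sum(1 for latency in latencies_ms if latency <= threshold_ms)


# Hypothetical window: 996 fast requests and 4 slow ones.
latencies = [120.0] * 996 + [900.0] * 4
search_slo = Slo(name="search-latency", target=0.995)
good = count_fast_requests(latencies)
print(search_slo.is_met(good, len(latencies)))
```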

Service Level Agreement (SLA)

An SLA is a contractual commitment to customers that carries penalties for breach (refunds, credits). SLAs should always be weaker than SLOs:

SLO: 99.9% availability (our internal target)
SLA: 99.5% availability (our customer commitment)

Buffer: 0.4% — headroom to absorb incidents without breaching the SLA

If the SLA is tighter than the SLO, a service can meet its internal target and still breach the contract, which is financially and reputationally damaging.

Error Budgets

The error budget is the permitted amount of unreliability in a given window:

SLO = 99.9% success rate over 28 days
Error budget = 100% - 99.9% = 0.1%

28 days = 28 × 24 × 60 = 40,320 minutes
0.1% × 40,320 ≈ 40.3 minutes of permitted full downtime (or the equivalent in partial failures)

Every incident burns budget. The question is not "did an incident occur?" but "how much of the error budget did it consume?"
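
The arithmetic above, and the "how much did it consume?" question, can be sketched as two small functions. The function names are hypothetical, and the sketch assumes a full outage with evenly distributed traffic so that budget can be expressed in minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full outage permitted by the SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_consumed_fraction(outage_minutes: float, slo_target: float, window_days: int) -> float:
    """Fraction of the window's budget that a full outage of this length uses."""
    return outage_minutes / error_budget_minutes(slo_target, window_days)


budget = error_budget_minutes(0.999, 28)
print(round(budget, 2))  # 40.32
# A hypothetical 10-minute full outage uses roughly a quarter of the budget.
print(round(budget_consumed_fraction(10, 0.999, 28), 3))  # 0.248
```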

Error Budget Policy

The error budget policy defines what happens as the budget is consumed:

Budget remaining	Policy
> 50%	Normal operations; feature velocity is fine
25–50%	Heightened awareness; review upcoming risky deploys
10–25%	Feature freeze for high-risk changes; focus sprint on reliability
< 10%	Emergency: stop new feature deploys; all engineering on reliability work
0% (exhausted)	Full freeze until SLO is restored; potential postmortem on reliability strategy

The error budget converts an engineering culture question ("should we slow down?") into a data-driven decision ("our budget is at 8%; the policy says we freeze").
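
A policy table like this is simple enough to encode directly, which keeps the decision mechanical. The sketch below is one possible encoding; the function name and the handling of exact boundaries (for example, treating exactly 25% as "heightened awareness") are assumptions, since the table does not say which tier a boundary value belongs to.

```python
def policy_for(budget_remaining: float) -> str:
    """Map the remaining budget (a fraction, 1.0 = untouched) to a policy tier."""
    if budget_remaining <= 0.0:
        return "full freeze"
    if budget_remaining < 0.10:
        return "emergency"
    if budget_remaining < 0.25:
        return "feature freeze for high-risk changes"
    if budget_remaining <= 0.50:
        return "heightened awareness"
    return "normal operations"


print(policy_for(0.08))  # emergency
```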

Calculating Error Budget Burn Rate

Burn rate measures how quickly you are consuming the budget relative to the time window:

Burn rate = observed error rate / error rate the SLO allows (1 − SLO)

Burn rate = 1  → consuming the budget at exactly the rate that depletes it at window end
Burn rate > 1 → consuming faster than the window allows (danger)
Burn rate < 1 → consuming slower than the window allows (fine)

Example:
SLO: 99.9%, window: 30 days, budget: 43.2 minutes of full outage
A full outage lasting 2 hours = 120 minutes
Budget consumed = 120 / 43.2 = 2.78× the entire 30-day budget
Burn rate during the outage = 100% / 0.1% = 1,000×

Google's SRE guidance recommends alerting on burn rate rather than on raw error thresholds alone. A sustained 2× burn rate exhausts a 30-day budget in 15 days, so an alert that fires after 1 hour at that rate leaves ample time to act before the budget runs out.
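
As a rough sketch, burn rate can be computed as the observed error rate divided by the error rate the SLO permits, and the time to exhaustion follows from it. The function names and the 0.5% error rate are hypothetical, and the projection assumes the error rate stays constant.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    return error_rate / (1.0 - slo_target)


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is gone if the burn rate holds steady."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical: 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.005, slo_target=0.999)
print(round(rate, 2))                          # 5.0
print(round(days_to_exhaustion(rate, 30), 2))  # 6.0
```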

SLO-Based Alerting

# Multi-window, multi-burn-rate alert (Google's recommended approach)

groups:
  - name: slo-alerts
    rules:
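      # Fast burn: fires quickly for high-urgency situations.
      # The 5m window confirms errors are still occurring, so the alert resets promptly.
      - alert: PaymentsSLOFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="payments"}[1h]))
          ) > 14.4 * 0.001  # 14.4× burn rate over 1h window
          and
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payments"}[5m]))
          ) > 14.4 * 0.001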
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Fast error budget burn"
          description: "Burning error budget 14.4× faster than sustainable"
          runbook_url: "https://wiki.acme.io/runbooks/PB-012"

      # Slow burn — fires for sustained elevated errors
      - alert: PaymentsSLOSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="payments"}[6h]))
          ) > 6 * 0.001
          and
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="payments"}[30m]))
          ) > 6 * 0.001
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Slow sustained error budget burn"
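
The 14.4 and 6 multipliers in the rules above are not arbitrary. In Google's SRE Workbook they correspond to consuming 2% of a 30-day budget in 1 hour and 5% in 6 hours. The sketch below reproduces that arithmetic; the function name is hypothetical, and the thresholds would differ for a 28-day window.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that consumes budget_fraction of the budget within the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


fast = burn_rate_threshold(0.02, 1)  # 2% of budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # 5% of budget in 6 hours
print(round(fast, 2), round(slow, 2))  # 14.4 6.0

# The error-rate threshold in the alert expression is the burn rate
# multiplied by the budgeted error rate (1 - SLO), here 0.001.
print(round(fast * 0.001, 4))  # 0.0144
```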

Toil: What SREs Reduce

Toil is the operational work that:

Is manual — a human must do it
Is repetitive — the same task, over and over
Is automatable — a computer could do it
Has no lasting value — it doesn't improve the system, just maintains status quo
Scales linearly with service growth

Examples of toil:

Manually restarting a service every Monday because it has a memory leak
Rotating AWS credentials by logging into consoles across 15 accounts
Manually reviewing and approving certificate renewals
Copying data from one system to another via a script you run manually

Google SRE caps toil at 50% of an SRE's time. Above that, the role devolves into operations work with no engineering improvement.
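
Checking against the 50% cap requires some record of where time goes. The sketch below assumes a hypothetical weekly time log keyed by activity; the category names and hours are invented for illustration.

```python
def toil_fraction(hours_by_activity: dict[str, float], toil_activities: set[str]) -> float:
    """Share of logged hours spent on activities classified as toil."""
    total = sum(hours_by_activity.values())
    if total == 0:
        return 0.0
    toil = sum(hours for name, hours in hours_by_activity.items()
               if name in toil_activities)
    return toil / total


# Hypothetical 40-hour week for one SRE.
week = {
    "manual restarts": 6,
    "credential rotation": 4,
    "ticket triage": 12,
    "automation project": 14,
    "design reviews": 4,
}
toil = {"manual restarts", "credential rotation", "ticket triage"}
share = toil_fraction(week, toil)
print(f"{share:.0%} toil, over cap: {share > 0.5}")
```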

The Reliability Virtuous Cycle

The full cycle:

Define SLOs — what does "good" mean?
Measure SLIs — are we good right now?
Incidents consume budget — something bad happened
Postmortem produces action items — we understand why
Action items reduce toil and improve reliability — we fix the system
SLO compliance improves — we are measurably better
Repeat

Practical SLO Implementation

Starting from scratch:

Identify your critical user journeys. What are the 3–5 things users must be able to do? Login, search, checkout, API call.
Define one SLI per journey. Express as: good requests / total requests.
Set initial SLOs conservatively. Start at what you already achieve minus a small buffer. If your current availability is 99.7%, start at 99.5%.
Measure for 30 days. Don't act on SLO data until you trust it.
Review with stakeholders. Align engineering, product, and business on what the SLO means.
Tighten over time. As reliability improves, raise the SLO to match.
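
The "what you already achieve minus a small buffer" step can be sketched as follows. The function name and the default buffer of 0.2 percentage points are assumptions taken from the 99.7% to 99.5% example above, not a recommended value.

```python
def initial_slo(good_events: int, total_events: int, buffer: float = 0.002) -> float:
    """Suggest a starting SLO: measured achievement minus a safety buffer."""
    if total_events <= 0:
        raise ValueError("need at least one measured event")
    achieved = good_events / total_events
    # Round to four decimals so the target reads as a clean percentage.
    return round(max(0.0, achieved - buffer), 4)


# Hypothetical 30-day measurement: 99.7% of requests were good.
print(initial_slo(997_000, 1_000_000))  # 0.995
```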

SRE Team Models

Model	Structure	Best for
Embedded SRE	1–2 SREs within a product team; org-wide SRE chapter	Enterprises where product ownership is distributed
Central SRE	Dedicated SRE team serving all product teams	Organisations with shared infrastructure teams
SRE as service	Product team applies to SRE for coverage; SRE owns the bar for acceptance	Google's model; requires mature product teams
Platform SRE	SREs own the platform (K8s, CI/CD, observability); product teams own their services	Platform engineering organisations

Bringing It Together

The practices in this course form a complete system:

SLOs define the goal — clarity on what reliability means for each service
Alerting fires when budget is burning — earlier, more accurate signals
On-call responds using runbooks — fast, consistent mitigation
Incident management coordinates the response — reduces chaos, improves MTTR
Postmortems generate improvements — the system gets better over time
Error budget policy governs feature vs reliability tradeoffs — data-driven decisions

Teams that operate this way — measurably, with accountability and learning — are the ones that achieve high reliability at high deployment frequency. Reliability and velocity are not in conflict when the infrastructure for learning is in place.