Incident management tells you how to respond when things go wrong. SRE tells you how to define "wrong" quantitatively and make engineering decisions based on data rather than intuition. The two practices are inseparable in high-performing organisations.
The SRE Vocabulary
Service Level Indicator (SLI)
An SLI is a quantitative measure of some aspect of the service that matters to users. Good SLIs:
- Measure what users experience, not internal system state
- Are expressed as a ratio: good events / total events
- Are measurable at reasonable cost
| Service type | Common SLIs |
|---|---|
| HTTP API | Request success rate (non-5xx responses / total requests), p99 latency |
| Data pipeline | Pipeline freshness (data age), row accuracy rate |
| Storage | Read/write durability, object availability |
| Batch job | Completion rate, execution time |
| Queue | Message processing latency, consumer lag |
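The "good events / total events" form can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `Request` record, the function names, and the choice to treat an empty window as fully healthy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # end-to-end latency observed for the request


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no failures
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not requests:
        return 1.0  # same assumption as above
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(404, 35.0),   # client error: still "good" for availability
    Request(503, 900.0),  # server error: counts against availability
]
print(availability_sli(sample))    # 0.75
print(latency_sli(sample, 500.0))  # 0.75
```

Expressing latency as "fraction of requests under a threshold" keeps it in the same ratio form as availability, which makes both usable with the same SLO and error budget arithmetic.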
Service Level Objective (SLO)
An SLO is the target value for an SLI over a rolling time window:
- Payments API success rate ≥ 99.9% over a rolling 28-day window
- Search API: 99.5% of requests served in ≤ 500 ms
- Checkout page availability ≥ 99.95%
SLOs are internal targets. They are deliberately set tighter than SLAs to give an early warning buffer before contractual obligations are breached.
Service Level Agreement (SLA)
An SLA is a contractual commitment to customers that carries penalties for breach (refunds, credits). SLAs should always be weaker than SLOs:
SLO: 99.9% availability (our internal target)
SLA: 99.5% availability (our customer commitment)
Buffer: 0.4 percentage points of headroom to absorb incidents without breaching the SLA
Setting SLAs tighter than SLOs means the first incident that reaches customers also breaches the SLA — financially and reputationally damaging.
Error Budgets
The error budget is the permitted amount of unreliability in a given window:
SLO = 99.9% success rate over 28 days
Error budget = 100% - 99.9% = 0.1%
28 days = 28 × 24 × 60 = 40,320 minutes
0.1% × 40,320 = 40.3 minutes of permitted downtime
Every incident burns budget. The question is not "did an incident occur?" but "how much of the error budget did it consume?"
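The arithmetic above can be sketched as a pair of helpers. The function names are hypothetical, and the sketch assumes a time-based budget where every minute of an incident counts as full downtime.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full downtime the SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


print(round(error_budget_minutes(0.999, 28), 2))     # 40.32
print(round(budget_remaining(0.999, 28, 10.08), 2))  # 0.75
```

A 10.08-minute incident against a 40.32-minute budget spends a quarter of it, which is the figure an error budget policy would act on.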
Error Budget Policy
The error budget policy defines what happens as the budget is consumed:
| Budget remaining | Policy |
|---|---|
| > 50% | Normal operations; feature velocity is fine |
| 25–50% | Heightened awareness; review upcoming risky deploys |
| 10–25% | Feature freeze for high-risk changes; focus sprint on reliability |
| < 10% | Emergency: stop new feature deploys; all engineering on reliability work |
| 0% (exhausted) | Full freeze until SLO is restored; potential postmortem on reliability strategy |
The error budget converts an engineering culture question ("should we slow down?") into a data-driven decision ("our budget is at 8%; the policy says we freeze").
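One way to make the policy mechanical is a small lookup that mirrors the table. This is a sketch only: `policy_for` is a hypothetical name, and the handling of exact boundary values (for example, exactly 25% or exactly 50% remaining) is an assumption, since the table leaves those cases open.

```python
def policy_for(budget_remaining_pct: float) -> str:
    """Map remaining error budget (in percent) to a policy tier from the table."""
    if budget_remaining_pct <= 0:
        return "Full freeze until SLO is restored"
    if budget_remaining_pct < 10:
        return "Emergency: stop new feature deploys"
    if budget_remaining_pct < 25:
        return "Feature freeze for high-risk changes"
    if budget_remaining_pct <= 50:  # assumption: exactly 50% is treated as heightened awareness
        return "Heightened awareness"
    return "Normal operations"


print(policy_for(8))   # Emergency: stop new feature deploys
print(policy_for(72))  # Normal operations
```

Keeping the thresholds in code (or config) next to the dashboards means the "should we freeze?" answer is read off, not debated.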
Calculating Error Budget Burn Rate
Burn rate measures how quickly you are consuming the budget relative to the time window:
Burn rate = 1 → consuming the budget at exactly the rate that depletes it at window end
Burn rate > 1 → consuming faster than the window allows (danger)
Burn rate < 1 → consuming slower than the window (fine)
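Example:
SLO: 99.9%, window: 30 days, budget: 0.1% of requests (43.2 minutes of full downtime)
Burn rate = observed error rate / error rate the SLO allows
A full outage has a 100% error rate, so burn rate during the incident = 100% / 0.1% = 1,000×
If it lasts 2 hours (120 minutes), it consumes 120 / 43.2 ≈ 2.78× the entire 30-day budget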
Google's SRE guidance recommends alerting on burn rate rather than on a fixed error-rate threshold alone. For example, an alert that fires once the burn rate has exceeded 2× for 1 hour leaves time to act: at a sustained 2× burn, a 30-day budget lasts 15 days.
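A minimal sketch of the calculation, assuming burn rate is computed as the observed error rate divided by the error rate the SLO allows (1 − SLO). The function names are illustrative and not taken from any library.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    return window_days * 24 / rate


print(round(burn_rate(0.002, 0.999), 2))  # 2.0
print(hours_to_exhaustion(2.0, 30))       # 360.0
```

A 0.2% error rate against a 99.9% SLO is a 2× burn, which would spend a full 30-day budget in 360 hours (15 days) if nothing changed.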
SLO-Based Alerting
# Multi-window, multi-burn-rate alerts (Google's recommended approach).
# The 14.4× and 6× multipliers assume a 30-day SLO window; 0.001 is the error budget for a 99.9% SLO.
groups:
  - name: slo-alerts
    rules:
      # Fast burn: fires quickly for high-urgency situations
      - alert: PaymentsSLOFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="payments"}[1h]))
          ) > 14.4 * 0.001  # 14.4× burn rate over the 1h window
          and
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payments"}[5m]))
          ) > 14.4 * 0.001  # short window confirms the burn is still ongoing
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Fast error budget burn"
          description: "Burning error budget 14.4× faster than sustainable"
          runbook_url: "https://wiki.acme.io/runbooks/PB-012"
      # Slow burn: fires for sustained elevated errors
      - alert: PaymentsSLOSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="payments"}[6h]))
          ) > 6 * 0.001  # 6× burn rate over the 6h window
          and
          (
            sum(rate(http_requests_total{service="payments",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="payments"}[30m]))
          ) > 6 * 0.001
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Slow sustained error budget burn"
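The 14.4× and 6× multipliers in the rules above are not arbitrary. In the Google SRE Workbook they correspond to spending 2% of a 30-day budget within 1 hour and 5% within 6 hours. The sketch below reproduces that arithmetic; the function names are hypothetical, and the 30-day default is an assumption that must match your own SLO window.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


def error_rate_threshold(rate: float, slo: float) -> float:
    """Error ratio to compare against in the alert expression."""
    return rate * (1.0 - slo)


fast = burn_rate_threshold(0.02, 1)  # 2% of the budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # 5% of the budget in 6 hours
print(round(fast, 2))  # 14.4
print(round(slow, 2))  # 6.0
print(round(error_rate_threshold(fast, 0.999), 4))  # 0.0144
```

With a 28-day window the same budget fractions give slightly different multipliers (13.44× and 5.6×), so the constants in the alert rules should be recomputed if the window changes.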
Toil: What SREs Reduce
Toil is the operational work that:
- Is manual — a human must do it
- Is repetitive — the same task, over and over
- Is automatable — a computer could do it
- Has no lasting value — it doesn't improve the system, just maintains status quo
- Scales linearly with service growth
Examples of toil:
- Manually restarting a service every Monday because it has a memory leak
- Rotating AWS credentials by logging into consoles across 15 accounts
- Manually reviewing and approving certificate renewals
- Copying data from one system to another via a script you run manually
Google SRE caps toil at 50% of an SRE's time. Above that, the role devolves into operations work with no engineering improvement.
The Reliability Virtuous Cycle
The full cycle:
- Define SLOs — what does "good" mean?
- Measure SLIs — are we good right now?
- Incidents exhaust budget — something bad happened
- Postmortem produces action items — we understand why
- Action items reduce toil and improve reliability — we fix the system
- SLO compliance improves — we are measurably better
- Repeat
Practical SLO Implementation
Starting from scratch:
- Identify your critical user journeys. What are the 3–5 things users must be able to do? Login, search, checkout, API call.
- Define one SLI per journey. Express as: good requests / total requests.
- Set initial SLOs conservatively. Start at what you already achieve minus a small buffer. If your current availability is 99.7%, start at 99.5%.
- Measure for 30 days. Don't act on SLO data until you trust it.
- Review with stakeholders. Align engineering, product, and business on what the SLO means.
- Tighten over time. As reliability improves, raise the SLO to match.
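The "start at what you already achieve minus a small buffer" step can be sketched as follows. The helper names, the daily `(good, total)` input shape, and the default buffer of 0.2 percentage points (taken from the 99.7% to 99.5% example above) are assumptions for illustration.

```python
def achieved_sli(daily_counts: list[tuple[int, int]]) -> float:
    """Aggregate daily (good, total) request counts into one ratio for the period."""
    good = sum(g for g, _ in daily_counts)
    total = sum(t for _, t in daily_counts)
    return good / total if total else 1.0


def initial_slo(achieved: float, buffer: float = 0.002) -> float:
    """Conservative starting SLO: measured performance minus a small buffer."""
    return round(achieved - buffer, 4)


# Hypothetical 30 days of measurements: 99,700 good requests out of 100,000 per day.
history = [(99_700, 100_000)] * 30
achieved = achieved_sli(history)
print(achieved)               # 0.997
print(initial_slo(achieved))  # 0.995
```

Summing counts across the period, instead of averaging daily ratios, weights busy days more heavily, which usually matches what users experienced.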
SRE Team Models
| Model | Structure | Best for |
|---|---|---|
| Embedded SRE | 1–2 SREs within a product team; org-wide SRE chapter | Enterprises where product ownership is distributed |
| Central SRE | Dedicated SRE team serving all product teams | Organisations with shared infrastructure teams |
| SRE as service | Product team applies to SRE for coverage; SRE owns the bar for acceptance | Google's model; requires mature product teams |
| Platform SRE | SREs own the platform (K8s, CI/CD, observability); product teams own their services | Platform engineering organisations |
Bringing It Together
The practices in this course form a complete system:
- SLOs define the goal — clarity on what reliability means for each service
- Alerting fires when budget is burning — earlier, more accurate signals
- On-call responds using runbooks — fast, consistent mitigation
- Incident management coordinates the response — reduces chaos, improves MTTR
- Postmortems generate improvements — the system gets better over time
- Error budget policy governs feature vs reliability tradeoffs — data-driven decisions
Teams that operate this way — measurably, with accountability and learning — are the ones that achieve high reliability at high deployment frequency. Reliability and velocity are not in conflict when the infrastructure for learning is in place.
Further Reading
- Site Reliability Engineering — Beyer, Jones, Petoff, Murphy (O'Reilly; free online)
- The Site Reliability Workbook — Google SRE team (O'Reilly; free online)
- ITIL 4 Foundation — AXELOS official guide
- Seeking SRE — Blank-Edelman (O'Reilly) — essays from SRE practitioners
- PagerDuty Incident Response Guide — response.pagerduty.com
- Postmortem examples: github.com/danluu/post-mortems