Reliability without targets is wishful thinking. The Service Level Indicator (SLI) / Objective (SLO) / Error Budget framework, popularised by Google's SRE book, gives you a numeric, agreed contract between operations, product, and the business.
Definitions
| Term | Definition | Example |
|---|---|---|
| SLI | A measurement of user-experienced service quality | % of HTTP requests that succeed in <500ms |
| SLO | A target for an SLI over a window | 99.9% over 28 days |
| SLA | A contractual SLO with consequences (refunds, credits) | "99.9% uptime or you get 10% credit" |
| Error budget | 1 − SLO. The amount of unreliability you are allowed. | 0.1% ≈ 40 minutes per 28 days |
SLIs are technical; SLOs are agreements; SLAs involve lawyers.
Picking Good SLIs
An SLI should reflect what users care about. Heuristics:
- Measured at the boundary closest to the user (load balancer, edge, browser).
- Expressed as a ratio: good events / total events.
- Capturing both availability (does it work?) and latency (is it fast enough?).
For a typical request/response service, two SLIs cover most needs:
```promql
# Availability: % of requests that returned a non-5xx
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency: % of requests served in under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```
For batch jobs: % of jobs completed within their deadline. For streaming: % of time (or of records) with processing lag below a threshold.
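To make the "good events / total events" form concrete, here is a minimal Python sketch that computes both SLIs from raw request records. The `Request` record and its field names are hypothetical stand-ins for whatever your access logs or tracing data provide, and treating an empty window as 100% good is an assumption, not a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are assumptions for illustration."""
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_s: float = 0.5) -> float:
    """Fraction of requests served within threshold_s seconds."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing was slow
    good = sum(1 for r in requests if r.duration_s <= threshold_s)
    return good / len(requests)


sample = [
    Request(200, 0.12),
    Request(200, 0.48),
    Request(503, 0.05),  # fast but failed: bad for availability only
    Request(200, 0.90),  # succeeded but slow: bad for latency only
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Like the latency query, this sketch counts a fast 5xx response as "good" for latency. If that is not what you want, filter to successful requests before computing the latency ratio.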
Picking Realistic SLOs
The biggest mistake teams make is setting SLOs at 99.99% just because it sounds impressive. Aim where the business genuinely needs you to be — and where you can plausibly run.
| SLO | Allowed downtime / 30 days | Practical implication |
|---|---|---|
| 99% | ~7 hours | Internal tools, batch systems |
| 99.5% | ~3.6 hours | Most consumer apps |
| 99.9% | ~43 minutes | Important customer-facing services |
| 99.95% | ~22 minutes | High-availability systems |
| 99.99% | ~4.3 minutes | Critical infrastructure (DNS, payments) |
As a rule of thumb, each additional "9" cuts the allowed downtime by a factor of 10 and multiplies the engineering and infrastructure cost by roughly the same amount. Picking the right number is a business conversation, not a technical one.
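The downtime column is plain arithmetic: the window length multiplied by (100% − SLO). A short Python sketch, where the function name `allowed_downtime_minutes` is just an illustrative choice:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime that an SLO permits over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Reproduce the table for a 30-day window.
for slo in (99.0, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.1f} min ({minutes / 60:.2f} h)")
```

Changing `window_days` to 28 gives about 40 minutes for 99.9%, which is the figure used in the definitions table.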
Error Budget = License to Innovate
If your SLO is 99.9% over 30 days, your error budget is 0.1%: about 43 minutes of full downtime, or an equivalent volume of failed requests spread across the window. Spend that budget on:
- Risky deploys (new architecture, library upgrades)
- Chaos engineering experiments
- Maintenance windows
- Aggressive feature rollouts
The budget creates a healthy contract: SRE will not block deploys as long as the budget is healthy. When the budget is exhausted, deploys freeze until reliability work is done.
This converts reliability from a vague aspiration into a tradable resource. Product gets predictable velocity; SRE gets agreement on when to slow down.
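As a rough illustration of treating the budget as a resource, the sketch below tracks how much of it is left. It assumes the budget is counted in events rather than minutes and that you can estimate the total traffic for the window; `budget_remaining` and `deploys_allowed` are hypothetical helper names, not part of any tool.

```python
def budget_remaining(slo: float, bad_events: int, expected_total_events: int) -> float:
    """Fraction of the window's error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if expected_total_events <= 0:
        raise ValueError("expected_total_events must be positive")
    allowed_bad = (1 - slo) * expected_total_events
    return 1 - bad_events / allowed_bad


def deploys_allowed(remaining: float) -> bool:
    """The simplest version of the contract: deploy while any budget is left."""
    return remaining > 0


# 99.9% SLO, 1,000,000 requests expected this window, 600 failures so far.
remaining = budget_remaining(0.999, bad_events=600, expected_total_events=1_000_000)
print(f"{remaining:.0%} of budget left, deploys allowed: {deploys_allowed(remaining)}")
```

With these numbers the budget is 1,000 failed requests, so 600 failures leave roughly 40% of it.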
Burn Rate Alerting
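Alerting on a raw threshold ("error rate > 1%") is noisy and says nothing about how much budget is at stake. Alert instead on the burn rate: how fast you are consuming the budget, relative to the pace that would use it up exactly at the end of the window.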
Multi-window, multi-burn-rate alerts (Google's recommendation):
| Burn rate | Time to exhaust 30-day budget | Alert |
|---|---|---|
| 14.4× | ~2 days (50 hours) | Page |
| 6× | 5 days | Page |
| 3× | 10 days | Ticket |
| 1× | 30 days (steady) | None |
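Alert only when both a long window (1 hour) and a short window (5 minutes) exceed the burn-rate threshold. The long window filters out brief blips; the short window confirms the problem is still happening, so the alert clears soon after recovery.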
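```yaml
# slo:error_rate:rate5m and slo:error_rate:rate1h are assumed to be
# recording rules defined elsewhere.
- alert: ApiSloFastBurn
  # 0.001 is the error budget (1 - SLO) for a 99.9% SLO;
  # 14.4 is the burn-rate threshold.
  expr: |
    (
      slo:error_rate:rate5m > 14.4 * 0.001
    ) and (
      slo:error_rate:rate1h > 14.4 * 0.001
    )
  for: 2m
  labels: { severity: page }
```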
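The logic behind the rule above can be sketched in a few lines of Python. Burn rate is the observed error rate divided by the error budget (1 − SLO), and at a constant burn rate a full budget lasts the window length divided by that rate. The function names are illustrative, and the defaults (99.9% SLO, 14.4× threshold, 30-day window) are taken from the examples in this section.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return error_rate / (1 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_page(err_5m: float, err_1h: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both the short and the long window exceed the threshold."""
    return (burn_rate(err_5m, slo) > threshold
            and burn_rate(err_1h, slo) > threshold)


# Sustained incident: 2% errors over 5 minutes, 1.6% over the last hour.
print(should_page(err_5m=0.02, err_1h=0.016))  # True
# Brief spike: 5% errors for a few minutes, but only 0.2% over the hour.
print(should_page(err_5m=0.05, err_1h=0.002))  # False
```

The 14.4 figure is not arbitrary: it is the burn rate at which 2% of a 30-day budget disappears in one hour (0.02 × 720 hours = 14.4).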
Running on Error Budget Policy
Write the policy down. Example:
- Healthy budget (>30% remaining): normal velocity. Risky changes allowed.
- Warning (10–30%): reduce risk in deploys, prioritise reliability work.
- Critical (<10%): freeze new features, focus on burning down errors.
- Exhausted: freeze all non-critical changes; postmortem and structural fixes.
An agreed policy makes the rule impersonal — "the budget says so" — which is much easier than ad-hoc arguments per incident.
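One way to keep the rule impersonal is to encode the tiers so that dashboards and deploy tooling read the same answer. The sketch below maps the fraction of budget remaining to the example tiers above. Treating exactly 10% and exactly 30% as "warning" is an assumption, since the example policy does not say which side the boundaries fall on.

```python
# Actions mirror the example policy; the wording is illustrative.
POLICY_ACTIONS: dict[str, str] = {
    "healthy": "normal velocity; risky changes allowed",
    "warning": "reduce deploy risk; prioritise reliability work",
    "critical": "freeze new features; focus on burning down errors",
    "exhausted": "freeze all non-critical changes; postmortem and structural fixes",
}


def policy_tier(remaining: float) -> str:
    """Map the fraction of error budget remaining (1.0 = full) to a tier."""
    if remaining <= 0:
        return "exhausted"
    if remaining < 0.10:
        return "critical"
    if remaining <= 0.30:
        return "warning"
    return "healthy"


for remaining in (0.65, 0.20, 0.04, 0.0):
    tier = policy_tier(remaining)
    print(f"{remaining:.0%} left -> {tier}: {POLICY_ACTIONS[tier]}")
```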
Common Mistakes
- Choosing SLIs that do not reflect users. "CPU below 80%" is not an SLI. "Requests answered correctly within 500ms" is.
- Setting SLOs at 100%. Removes the budget mechanic and is unreachable.
- Too many SLOs. Two or three per service is usually enough; five is a sensible upper limit.
- Reporting but not using. If product velocity does not change when the budget is empty, the framework is theatre.
- Forgetting maintenance windows. Planned downtime still burns budget unless you carve out an exclusion in the SLI.
For Cert Exams
The Google Professional Cloud Architect, AWS DevOps Pro, and Azure AZ-400 all reference SLI/SLO/error-budget concepts. Common exam patterns:
- Pick a metric that measures user experience (not infrastructure utilisation).
- Compute downtime allowed for a given SLO over a window.
- Choose multi-burn-rate alerting over static thresholds.
- Recognise that lowering an SLO from 99.99% to 99.9% can save enormous engineering cost, and is the right answer when the business does not actually need the higher number.