5 min read·Lesson 9 of 10

SLIs, SLOs, and Error Budgets

The Google SRE framework for reliability that the team will actually use. How to pick SLIs, set realistic SLOs, and run on error budgets.

Reliability without targets is wishful thinking. The Service Level Indicator (SLI) / Objective (SLO) / Error Budget framework, popularised by Google's SRE book, gives you a numeric, agreed contract between operations, product, and the business.

Definitions

| Term | Definition | Example |
| --- | --- | --- |
| SLI | A measurement of user-experienced service quality | % of HTTP requests that succeed in <500ms |
| SLO | A target for an SLI over a window | 99.9% over 28 days |
| SLA | A contractual SLO with consequences (refunds, credits) | "99.9% uptime or you get 10% credit" |
| Error budget | 1 − SLO. The amount of unreliability you are allowed. | 0.1% = 40 minutes per 28 days |

SLIs are technical; SLOs are agreements; SLAs involve lawyers.

Picking Good SLIs

An SLI should reflect what users care about. Heuristics:

  • Measured at the boundary closest to the user (load balancer, edge, browser).
  • Expressed as a ratio: good events / total events.
  • Capturing both availability (does it work?) and latency (is it fast enough?).

For a typical request/response service, two SLIs cover most needs:

# Availability: % of requests that returned a non-5xx
sum(rate(http_requests_total{status!~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# Latency: % of requests served in under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  /
sum(rate(http_request_duration_seconds_count[5m]))

For batch jobs: % of jobs completed within their deadline. For streaming: lag below threshold.
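
The same two ratios can be checked by hand against raw request records. The following is a minimal sketch, assuming a hypothetical `Request` record with a status code and a duration in seconds; the field names and sample data are illustrative and not tied to any particular log format. The 500 ms threshold mirrors the latency query above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; the field names are assumptions."""
    status: int
    duration_s: float


def availability_sli(requests: Sequence[Request]) -> float:
    """Good events / total events, where good means a non-5xx status."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_s: float = 0.5) -> float:
    """Good events / total events, where good means served within threshold_s."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if r.duration_s <= threshold_s)
    return good / len(requests)


sample = [
    Request(200, 0.12),
    Request(200, 0.70),  # too slow for the latency SLI
    Request(503, 0.20),  # server error, counts against availability
    Request(404, 0.30),  # client error, still counts as available
]
print(f"availability: {availability_sli(sample):.2%}")
print(f"latency:      {latency_sli(sample):.2%}")
```

Both functions share the "good events / total events" shape, so every SLI lands on the same 0 to 1 scale and can be compared directly with an SLO.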

Picking Realistic SLOs

The biggest mistake teams make is setting SLOs at 99.99% just because it sounds impressive. Aim where the business genuinely needs you to be — and where you can plausibly run.

| SLO | Allowed downtime / 30 days | Practical implication |
| --- | --- | --- |
| 99% | ~7 hours | Internal tools, batch systems |
| 99.5% | ~3.6 hours | Most consumer apps |
| 99.9% | ~43 minutes | Important customer-facing services |
| 99.95% | ~22 minutes | High-availability systems |
| 99.99% | ~4.3 minutes | Critical infrastructure (DNS, payments) |

A common rule of thumb is that each additional "9" multiplies the engineering and infrastructure cost by roughly 10. Picking the right number is a business conversation, not a technical one.
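
The downtime figures in the table follow from one line of arithmetic: the budget fraction (100% minus the SLO) multiplied by the length of the window. A small sketch that reproduces them; the function name `allowed_downtime_minutes` is illustrative.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime that a given SLO allows over the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    budget_fraction = (100 - slo_percent) / 100
    return budget_fraction * window_days * 24 * 60


for slo in (99, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.1f} min ({minutes / 60:.2f} h)")
```

Passing `window_days=28` gives the figure for a 28-day window instead, which is why the same 99.9% SLO is quoted as roughly 40 minutes in one place and 43 minutes in another.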

Error Budget = License to Innovate

If your SLO is 99.9% over 30 days, your error budget is 0.1% — about 43 minutes of "downtime" or equivalent error rate. Spend that budget on:

  • Risky deploys (new architecture, library upgrades)
  • Chaos engineering experiments
  • Maintenance windows
  • Aggressive feature rollouts

The budget creates a healthy contract: SRE will not block deploys as long as the budget is healthy. When the budget is exhausted, deploys freeze until reliability work is done.

This converts reliability from a vague aspiration into a tradable resource. Product gets predictable velocity; SRE gets agreement on when to slow down.
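
To treat the budget as a resource, it helps to express it in events rather than minutes. A minimal sketch, assuming the SLI is a ratio of good to total events and that the counts for the current window are already available; `budget_remaining` is a hypothetical helper name.

```python
def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    slo is a fraction such as 0.999. A negative result means the budget
    has been overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, exclusive")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo) * total_events  # the budget, in events
    return 1 - bad_events / allowed_bad


# 1,000,000 requests at a 99.9% SLO allow about 1,000 bad requests.
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```

Counting in events rather than wall-clock downtime also covers partial outages, where only a share of requests fail.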

Burn Rate Alerting

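Alerting on a raw threshold ("error rate > 1%") is too noisy. Alert on the burn rate instead: how fast you are consuming the budget, relative to the steady rate that would spend exactly the whole budget over the SLO window (a burn rate of 1×).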

Multi-window, multi-burn-rate alerts (Google's recommendation):

| Burn rate | Time to exhaust 30-day budget | Alert |
| --- | --- | --- |
| 14.4× | ~50 hours | Page |
|  | 5 hours | Page |
|  | 1 day | Ticket |
| 1× | 30 days (steady) | None |

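Alert only when both a 1-hour and a 5-minute window show a high burn rate. The long window confirms that the burn is sustained rather than a brief blip; the short window confirms that it is still happening, so the alert clears quickly once the problem stops.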

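# Page when both the 5-minute and 1-hour error rates exceed 14.4x the budget rate.
# 0.001 is the error budget for a 99.9% SLO (1 - 0.999).
# slo:error_rate:rate5m and slo:error_rate:rate1h are recording rules
# assumed to be defined elsewhere.
- alert: ApiSloFastBurn
  expr: |
    (
      slo:error_rate:rate5m > 14.4 * 0.001
    ) and (
      slo:error_rate:rate1h > 14.4 * 0.001
    )
  for: 2m
  labels:
    severity: page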
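
The arithmetic behind the alert rule can be sketched in a few lines. This is an illustration under stated assumptions, not a replacement for the alerting rule: error rates are fractions of failed requests over each window, the SLO defaults to 99.9% over 30 days, and the function names are hypothetical.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo)


def hours_to_exhaust(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if this burn rate is sustained."""
    return window_days * 24 / rate


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


def should_page(err_5m: float, err_1h: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Mirror of the two-window condition: both windows must exceed the threshold."""
    return burn_rate(err_5m, slo) > threshold and burn_rate(err_1h, slo) > threshold


print(f"burn rate at 1.44% errors: {burn_rate(0.0144, 0.999):.1f}x")
print(f"budget used per hour at 14.4x: {budget_consumed(14.4, 1):.0%}")
print(f"page on a 5-minute spike only: {should_page(0.05, 0.005)}")
```

Under these definitions, a sustained 14.4× burn spends 2% of a 30-day budget every hour, which is why it is treated as page-worthy, while a spike that shows up only in the 5-minute window does not page.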

Running on Error Budget Policy

Write the policy down. Example:

  • Healthy budget (>30% remaining): normal velocity. Risky changes allowed.
  • Warning (10–30%): reduce risk in deploys, prioritise reliability work.
  • Critical (<10%): freeze new features, focus on burning down errors.
  • Exhausted: freeze all non-critical changes; postmortem and structural fixes.

An agreed policy makes the rule impersonal — "the budget says so" — which is much easier than ad-hoc arguments per incident.
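
A written policy is simple enough to encode, which lets a deploy pipeline or dashboard report the current tier automatically. A minimal sketch using the example thresholds above; the function name, tier names, and the treatment of the exact 10% and 30% boundaries as "warning" are assumptions.

```python
def budget_policy(remaining: float) -> str:
    """Map the fraction of error budget remaining to a policy tier."""
    if remaining <= 0:
        return "exhausted"  # freeze all non-critical changes
    if remaining < 0.10:
        return "critical"   # freeze new features
    if remaining <= 0.30:
        return "warning"    # reduce deploy risk, prioritise reliability work
    return "healthy"        # normal velocity, risky changes allowed


for remaining in (0.75, 0.30, 0.05, 0.0):
    print(f"{remaining:.0%} remaining -> {budget_policy(remaining)}")
```

Keeping the thresholds in one function means a change to the policy is a reviewed code change rather than a per-incident negotiation.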

Common Mistakes

  • Choosing SLIs that do not reflect users. "CPU below 80%" is not an SLI. "Requests answered correctly within 500ms" is.
  • Setting SLOs at 100%. Removes the budget mechanic and is unreachable.
  • Too many SLOs. Two or three per service are usually enough; treat five as an upper limit.
  • Reporting but not using. If product velocity does not change when the budget is empty, the framework is theatre.
  • Forgetting maintenance windows. Planned downtime still burns budget unless you carve out an exclusion in the SLI.

For Cert Exams

The Google Professional Cloud Architect, AWS DevOps Pro, and Azure AZ-400 all reference SLI/SLO/error-budget concepts. Common exam patterns:

  • Pick a metric that measures user experience (not infrastructure utilisation).
  • Compute downtime allowed for a given SLO over a window.
  • Choose multi-burn-rate alerting over static thresholds.
  • Recognise that lowering an SLO from 99.99% to 99.9% saves enormous engineering cost, and is the right answer when the business does not actually need the higher number.

Key Takeaways

  • An SLI is a metric measuring user-perceived service quality.
  • An SLO is a target for that SLI over a window (e.g. 99.9% over 30 days).
  • The error budget is 100% minus the SLO; it is your allowance for risk.
  • When the budget is healthy, ship faster; when it is exhausted, freeze and stabilise.
  • 100% reliability is the wrong target — it is unreachable and prevents change.
