SLIs, SLOs, and Error Budgets — Observability and Monitoring | CertQnA

Reliability without targets is wishful thinking. The Service Level Indicator (SLI) / Objective (SLO) / Error Budget framework, popularised by Google's SRE book, gives you a numeric, agreed contract between operations, product, and the business.

Definitions

Term	Definition	Example
SLI	A measurement of user-experienced service quality	% of HTTP requests that succeed in <500ms
SLO	A target for an SLI over a window	99.9% over 28 days
SLA	A contractual SLO with consequences (refunds, credits)	"99.9% uptime or you get 10% credit"
Error budget	1 − SLO. The amount of unreliability you are allowed.	0.1% ≈ 40 minutes per 28 days

SLIs are technical; SLOs are agreements; SLAs involve lawyers.

Picking Good SLIs

An SLI should reflect what users care about. Heuristics:

Measured at the boundary closest to the user (load balancer, edge, browser).
Expressed as a ratio: good events / total events.
Capturing both availability (does it work?) and latency (is it fast enough?).

For a typical request/response service, two SLIs cover most needs:

# Availability: % of requests that returned a non-5xx status
sum(rate(http_requests_total{status!~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# Latency: % of requests served in under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  /
sum(rate(http_request_duration_seconds_count[5m]))

For batch jobs: % of jobs completed within their deadline. For streaming: lag below threshold.
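The same good-events-over-total-events ratio applies to batch work. Below is a minimal sketch of the batch-job SLI; the `JobRun` record, its field names, and the sample durations are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One batch job execution (hypothetical record shape)."""
    duration_s: float  # how long the job took
    deadline_s: float  # how long it was allowed to take
    succeeded: bool    # whether it finished without error


def batch_sli(runs: list[JobRun]) -> float:
    """Return good events / total events as a fraction in [0, 1]."""
    if not runs:
        return 1.0  # assumption: no runs in the window means no failures
    good = sum(1 for r in runs if r.succeeded and r.duration_s <= r.deadline_s)
    return good / len(runs)


runs = [
    JobRun(duration_s=540, deadline_s=600, succeeded=True),   # good
    JobRun(duration_s=720, deadline_s=600, succeeded=True),   # missed deadline
    JobRun(duration_s=300, deadline_s=600, succeeded=False),  # failed
    JobRun(duration_s=590, deadline_s=600, succeeded=True),   # good
]
print(f"Batch SLI: {batch_sli(runs):.1%}")
```

A run that succeeds but misses its deadline counts as a bad event here, which keeps availability and timeliness in a single ratio. Splitting them into two SLIs is an equally valid design.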

Picking Realistic SLOs

The biggest mistake teams make is setting SLOs at 99.99% just because it sounds impressive. Aim where the business genuinely needs you to be — and where you can plausibly run.

SLO	Allowed downtime / 30 days	Practical implication
99%	~7 hours	Internal tools, batch systems
99.5%	~3.6 hours	Most consumer apps
99.9%	~43 minutes	Important customer-facing services
99.95%	~22 minutes	High-availability systems
99.99%	~4.3 minutes	Critical infrastructure (DNS, payments)

A common rule of thumb is that each additional "9" multiplies the engineering and infrastructure cost by roughly 10. Picking the right number is a business conversation, not a purely technical one.
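The downtime figures for any SLO and window can be derived with one multiplication. The sketch below assumes the SLO is given as a fraction (0.999 rather than 99.9) and that the whole budget is spent as full downtime; the function name is illustrative.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by `slo` over `window_days`."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.2%}  {minutes:7.1f} min / 30 days")
```

Passing `window_days=28` gives the budget for a 28-day window, where 99.9% allows roughly 40 minutes.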

Error Budget = License to Innovate

If your SLO is 99.9% over 30 days, your error budget is 0.1% — about 43 minutes of "downtime" or equivalent error rate. Spend that budget on:

Risky deploys (new architecture, library upgrades)
Chaos engineering experiments
Maintenance windows
Aggressive feature rollouts

The budget creates a healthy contract: SRE will not block deploys as long as the budget is healthy. When the budget is exhausted, deploys freeze until reliability work is done.

This converts reliability from a vague aspiration into a tradable resource. Product gets predictable velocity; SRE gets agreement on when to slow down.
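To treat the budget as a resource you need a number for how much is left. Here is a minimal sketch for a request-based SLO, assuming you can count total and bad events over the SLO window; the function name and the sample counts are hypothetical.

```python
def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exactly exhausted, and a negative
    value means the SLO has been violated for this window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # assumption: no traffic spends no budget
    allowed_bad = (1.0 - slo) * total_events
    return 1.0 - bad_events / allowed_bad


# 99.9% SLO, 10 million requests so far, 4,000 of them bad:
# the budget allows 10,000 bad requests, so 60% is left.
remaining = budget_remaining(0.999, 10_000_000, 4_000)
print(f"Budget remaining: {remaining:.0%}")
```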

Burn Rate Alerting

Alerting on a raw threshold ("error rate > 1%") is too noisy. Alert on the burn rate instead: how fast you are consuming the budget, relative to the rate that would spend it exactly over the SLO window.

Multi-window, multi-burn-rate alerts (Google's recommendation):

Burn rate	Time to exhaust 30-day budget	Alert
14.4×	~50 hours (2% of budget per hour)	Page
6×	5 days (5% of budget per 6 hours)	Page
3×	10 days	Ticket
1×	30 days (steady)	None

Alert when both a 1-hour and 5-minute window show high burn — fast enough to catch real incidents, immune to brief blips.

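# Page when a 99.9% SLO burns budget at 14.4x over both windows.
# Assumes recording rules slo:error_rate:rate5m and slo:error_rate:rate1h
# already exist and hold the bad-request ratio for each window.
- alert: ApiSloFastBurn
  expr: |
    (
      slo:error_rate:rate5m > 14.4 * 0.001   # 0.001 = 1 - SLO
    ) and (
      slo:error_rate:rate1h > 14.4 * 0.001
    )
  for: 2m
  labels: { severity: page }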
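The relationship between burn rate, time to exhaustion, and budget consumed per alert window is simple arithmetic. The sketch below spells it out, assuming a 30-day SLO window and a constant error rate; all function names are illustrative.

```python
def observed_burn_rate(error_rate: float, slo: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows."""
    return error_rate / (1.0 - slo)


def hours_to_exhaust(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the budget spent during one alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


for rate, alert_hours in ((14.4, 1), (6.0, 6), (1.0, 72)):
    print(
        f"{rate:>5}x: exhausts in {hours_to_exhaust(rate):6.1f} h, "
        f"spends {budget_consumed(rate, alert_hours):.0%} "
        f"of budget in {alert_hours} h"
    )
```

For a 99.9% SLO, a sustained error ratio of 1.44% corresponds to a burn rate of 14.4, which is why the alert expression compares against `14.4 * 0.001`.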

Running on an Error Budget Policy

Write the policy down. Example:

Healthy budget (>30% remaining): normal velocity. Risky changes allowed.
Warning (10–30%): reduce risk in deploys, prioritise reliability work.
Critical (<10%): freeze new features, focus on burning down errors.
Exhausted: freeze all non-critical changes; postmortem and structural fixes.

An agreed policy makes the rule impersonal — "the budget says so" — which is much easier than ad-hoc arguments per incident.
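Writing the policy as code is one way to keep it impersonal. The following is a minimal sketch of the example thresholds above, assuming the remaining budget is supplied as a fraction (0.25 for 25%); the function name and return labels are hypothetical, and real policies usually carry more nuance.

```python
def policy_state(remaining: float) -> str:
    """Map the remaining error budget (as a fraction) to a policy state."""
    if remaining <= 0.0:
        return "exhausted"  # freeze all non-critical changes
    if remaining < 0.10:
        return "critical"   # freeze new features
    if remaining <= 0.30:
        return "warning"    # reduce deploy risk, prioritise reliability
    return "healthy"        # normal velocity


for remaining in (0.60, 0.25, 0.05, 0.0):
    print(f"{remaining:>4.0%} remaining -> {policy_state(remaining)}")
```

The boundary values (exactly 10% and exactly 30%) are treated as "warning" here; the written policy should state which side each boundary falls on.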

Common Mistakes

Choosing SLIs that do not reflect users. "CPU below 80%" is not an SLI. "Requests answered correctly within 500ms" is.
Setting SLOs at 100%. Removes the budget mechanic and is unreachable.
Too many SLOs. Two or three SLIs per service are usually enough; treat five as an upper limit.
Reporting but not using. If product velocity does not change when the budget is empty, the framework is theatre.
Forgetting maintenance windows. Planned downtime still burns budget unless you carve out an exclusion in the SLI.

For Cert Exams

The Google Professional Cloud Architect, AWS DevOps Pro, and Azure AZ-400 all reference SLI/SLO/error-budget concepts. Common exam patterns:

Pick a metric that measures user experience (not infrastructure utilisation).
Compute downtime allowed for a given SLO over a window.
Choose multi-burn-rate alerting over static thresholds.
Recognise that lowering an SLO from 99.99% to 99.9% can save substantial engineering cost, and is the right answer when the business does not actually need the higher number.