
SRE Principles: SLIs, SLOs, and Error Budgets

Understand the Site Reliability Engineering model — how SLIs, SLOs, and error budgets provide a data-driven framework for balancing reliability and feature velocity.

Site Reliability Engineering (SRE) was created at Google in 2003 by Ben Treynor Sloss. The core premise: reliability is a feature, and it should be managed with the same rigour as any other software feature — defined, measured, and traded off against competing priorities.

What is SRE?

SRE is what happens when you ask a software engineer to design an operations function. SREs write software to manage systems, automate toil, and build tooling that makes the system self-healing. They are often embedded in product teams, or partnered closely with them, and own the reliability of the services they support.

Key SRE responsibilities:

  • Defining and tracking SLOs
  • On-call rotation and incident response
  • Postmortem analysis and follow-through
  • Reducing toil through automation
  • Capacity planning and performance engineering

The SLI / SLO / SLA Hierarchy

SLI — Service Level Indicator

An SLI is a quantitative measure of service behaviour. Common SLIs:

  • Availability: Proportion of requests that succeed (HTTP 2xx / total requests)
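  • Latency: Proportion of requests served within a threshold (e.g., the share of requests completing in under 200 ms). The same goal is often stated as a percentile target, such as p99 latency < 200 ms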
  • Throughput: Requests processed per second
  • Error rate: Proportion of requests returning errors
  • Freshness: For data pipelines — how recently data was updated
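As a minimal sketch of how the first two SLIs in this list could be computed from raw request records: the `Request` type, its field names, and the sample data are hypothetical and chosen only for illustration. Real systems usually derive these ratios from monitoring counters rather than in-memory lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Proportion of requests that returned an HTTP 2xx status."""
    if not requests:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Proportion of requests served faster than threshold_ms."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Hypothetical sample: four successes, one server error, one slow request.
sample = [
    Request(200, 120.0),
    Request(200, 180.0),
    Request(201, 90.0),
    Request(500, 40.0),
    Request(200, 650.0),
]

availability = availability_sli(sample)
fast_share = latency_sli(sample, 200.0)
print(f"availability SLI: {availability:.1%}")
print(f"latency SLI (<200 ms): {fast_share:.1%}")
```

Both SLIs are expressed as "good events divided by total events", which keeps them on the same 0 to 100% scale and makes them easy to compare against a target.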

SLO — Service Level Objective

An SLO is an internal target for an SLI over a time window. For example:

SLI                     | SLO           | Window
------------------------|---------------|---------------------
Request success rate    | ≥ 99.9%       | Rolling 30 days
p99 latency             | < 500 ms      | Rolling 30 days
Pipeline data freshness | ≤ 2 hours old | At any point in time

SLOs are internal targets. They should be slightly tighter than any external SLA to give you room to detect and fix problems before breaching customer commitments.
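The sketch below checks measured values against the two request-based example targets above (success rate ≥ 99.9%, p99 latency < 500 ms). The function names and sample numbers are assumptions for illustration, and the nearest-rank percentile is one of several common percentile definitions, so a monitoring system may report slightly different p99 values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def check_slos(success_count, total_count, latencies_ms):
    """Compare measured SLIs with the example SLO targets."""
    success_rate = success_count / total_count
    p99 = percentile(latencies_ms, 99)
    return {
        "success_rate_ok": success_rate >= 0.999,
        "p99_latency_ok": p99 < 500,
    }


# Hypothetical window that meets both targets.
healthy = check_slos(9995, 10000, [100.0] * 995 + [900.0] * 5)

# Hypothetical window that misses both targets.
degraded = check_slos(9980, 10000, [100.0] * 980 + [900.0] * 20)

print(healthy)
print(degraded)
```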

SLA — Service Level Agreement

An SLA is a contractual commitment to customers, usually with financial penalties for breach. An SLA should be looser than the corresponding SLO, and the gap between the two is your safety margin.

Error Budgets

The error budget is the allowable unreliability defined by the SLO:

Error budget = 100% − SLO target

For a 99.9% availability SLO over 30 days:

  • Error budget = 0.1%
  • Allowed downtime = 0.001 × 30 days × 24 hours × 60 minutes = 43.2 minutes
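The same arithmetic can be wrapped in a small helper to compare targets. This is a sketch: the function name is an assumption, and it treats the budget purely as time, which presumes every minute of the window carries equal weight.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime permitted by an availability SLO."""
    budget_fraction = (100 - slo_percent) / 100  # error budget as a fraction
    return budget_fraction * window_days * 24 * 60


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days -> {minutes:.2f} minutes")
```

Over a 30-day window this gives roughly 432 minutes at 99%, 43.2 minutes at 99.9%, and 4.32 minutes at 99.99%: each extra nine cuts the budget by a factor of ten.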

The error budget is the key management tool in SRE:

  • If budget is healthy (plenty remaining): Take risks. Deploy frequently. Run chaos experiments. Ship features faster.
  • If budget is nearly exhausted: Slow down. Freeze non-critical changes. Focus engineering on reliability work.
  • If budget is fully spent: Stop feature work. The SRE and product team must agree on a reliability sprint before resuming normal velocity.
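The three states above can be sketched as a simple policy function. The 20% "nearly exhausted" threshold, the function names, and the request counts are assumptions made for this example; in practice each team agrees its own thresholds in an error budget policy. The SLO target is passed as a string so that `Fraction` can parse it exactly and the "fully spent" boundary is not blurred by floating-point rounding.

```python
from fractions import Fraction


def budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    slo = Fraction(slo_target)  # e.g. "0.999", parsed exactly
    allowed_bad = (1 - slo) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return float(1 - actual_bad / allowed_bad)


def release_policy(remaining, low_water_mark=0.2):
    """Map remaining budget to one of the three states described above."""
    if remaining <= 0:
        return "stop feature work; agree a reliability sprint"
    if remaining < low_water_mark:
        return "slow down; freeze non-critical changes"
    return "ship normally; room for risk and experiments"


# Hypothetical 30-day window with one million requests and a 99.9% SLO,
# so up to 1,000 failed requests fit inside the budget.
for failed in (600, 900, 1000):
    remaining = budget_remaining("0.999", 1_000_000 - failed, 1_000_000)
    print(failed, f"{remaining:.0%}", release_policy(remaining))
```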

Error budgets remove the adversarial dynamic between Dev (wants to deploy) and Ops (wants stability). Both teams agree on the SLO, and the budget is a shared resource they manage together.

Toil

Google SRE defines toil as work that is:

  • Manual — requires a human to do it
  • Repetitive — done over and over
  • Automatable — a computer could do it
  • Tactical — reactive, not building long-term value
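  • Without enduring value (the service is no better off once the task is done)
  • Linear in scale (the work grows in step with the service's size or traffic)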

Examples: manually restarting a service that crashes weekly, manually scaling a fleet before a known traffic spike, copying data between systems by hand.

Google's guideline: SREs should spend no more than 50% of their time on toil. The rest should be engineering work that reduces future toil. If toil exceeds 50%, the team is operating as a firefighting team, not an SRE team.
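One way to make the 50% guideline measurable is to track time by activity and compute the toil share. The sketch below assumes a hypothetical weekly time log; the activity names, hours, and the choice of which activities count as toil are invented for illustration.

```python
TOIL_CEILING = 0.5  # the 50% guideline described above


def toil_fraction(hours_by_activity, toil_activities):
    """Share of logged hours spent on activities classified as toil."""
    total = sum(hours_by_activity.values())
    if total == 0:
        return 0.0
    toil = sum(
        hours
        for name, hours in hours_by_activity.items()
        if name in toil_activities
    )
    return toil / total


# Hypothetical 40-hour week for one engineer.
week = {
    "manual restarts": 6,
    "hand-run scaling": 5,
    "ticket triage": 11,
    "automation project": 14,
    "capacity modelling": 4,
}
toil_names = {"manual restarts", "hand-run scaling", "ticket triage"}

fraction = toil_fraction(week, toil_names)  # 22 of 40 hours
over_ceiling = fraction > TOIL_CEILING
print(f"toil share: {fraction:.0%}, over ceiling: {over_ceiling}")
```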

Blameless Postmortems

After an incident, such as an outage that breaches the SLO or consumes a large share of the error budget, a blameless postmortem analyses what happened without attributing fault to individuals. The focus is on system and process improvements that prevent recurrence. A good postmortem includes:

  • Timeline of events
  • Root cause analysis (the system causes, not "human error")
  • Impact quantification (how much error budget was consumed)
  • Concrete action items with owners and deadlines
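For the impact quantification item, a rough estimate of the budget consumed by a single incident can be sketched as follows. The function name and the incident figures are hypothetical, and weighting outage time by the share of affected requests is an approximation that assumes traffic was roughly uniform; counting failed requests directly is more accurate when that data is available.

```python
def budget_consumed(incident_minutes, impact_fraction, slo_percent,
                    window_days=30):
    """Approximate share of the window's error budget used by one incident.

    impact_fraction is the share of requests that failed during the
    incident (1.0 for a full outage).
    """
    budget_minutes = (100 - slo_percent) / 100 * window_days * 24 * 60
    bad_minutes = incident_minutes * impact_fraction
    return bad_minutes / budget_minutes


# Hypothetical incident: 36 minutes long, 30% of requests failing,
# measured against a 99.9% SLO over 30 days (43.2 budget minutes).
share = budget_consumed(36, 0.3, 99.9)
print(f"error budget consumed: {share:.0%}")
```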

Key Takeaways

  • SRE applies software engineering practices to operations — it is an implementation of DevOps.
  • SLI (Service Level Indicator) is a quantitative metric of service behaviour (e.g., request success rate).
  • SLO (Service Level Objective) is an internal target for an SLI (e.g., 99.9% success rate over 30 days).
  • Error budget is the allowed unreliability: 100% minus the SLO. Spend it on risk-taking; protect it when low.
  • Toil is repetitive, manual, automatable operational work — SRE aims to keep toil below 50% of engineering time.
