SRE Principles: SLIs, SLOs, and Error Budgets — DevOps and SRE Fundamentals | CertQnA

Site Reliability Engineering (SRE) was created at Google in 2003 by Ben Treynor Sloss. The core premise: reliability is a feature, and it should be managed with the same rigour as any other software feature — defined, measured, and traded off against competing priorities.

What is SRE?

SRE is what happens when you ask a software engineer to design an operations function. SREs write software to manage systems, automate toil, and build tooling that makes the system self-healing. They are embedded in product teams and own the reliability of the services they support.

Key SRE responsibilities:

Defining and tracking SLOs
On-call rotation and incident response
Postmortem analysis and follow-through
Reducing toil through automation
Capacity planning and performance engineering

The SLI / SLO / SLA Hierarchy

SLI — Service Level Indicator

An SLI is a quantitative measure of service behaviour. Common SLIs:

Availability: Proportion of requests that succeed (HTTP 2xx / total requests)
Latency: Proportion of requests served within a threshold (e.g., under 200 ms), or a percentile of response time (e.g., p99)
Throughput: Requests processed per second
Error rate: Proportion of requests returning errors
Freshness: For data pipelines — how recently data was updated
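As a rough illustration, the availability and latency SLIs above can be computed from a list of request records. This is a minimal sketch: the `Request` record shape, the function names, and the sample data are assumptions made for the example, and the definition of a "good" request (HTTP 2xx, under 200 ms) simply follows the list above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Proportion of requests that returned an HTTP 2xx status."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if 200 <= r.status < 300)
    return good / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Proportion of requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Made-up sample data: four successes, one server error, three fast responses.
sample = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(500, 30.0),
    Request(200, 450.0),
    Request(204, 250.0),
]

print(f"availability: {availability_sli(sample):.1%}")  # availability: 80.0%
print(f"latency:      {latency_sli(sample):.1%}")       # latency:      60.0%
```

Both SLIs are expressed as "good events / total events", which makes them directly comparable with a percentage target.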

SLO — Service Level Objective

An SLO is an internal target for an SLI over a time window. For example:

SLI	SLO	Window
Request success rate	≥ 99.9%	Rolling 30 days
p99 latency	< 500ms	Rolling 30 days
Pipeline data freshness	≤ 2 hours old	At any point in time

Because SLOs are internal, they should be slightly tighter than any external SLA. The difference gives you room to detect and fix problems before breaching customer commitments.
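The example targets in the table can be checked mechanically once the SLIs have been measured. The sketch below is one possible way to do that; the function names, the nearest-rank percentile method, and the sample measurements are assumptions for illustration, not part of any particular monitoring tool.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_slos(success_rate, latencies_ms, data_age_hours):
    """Compare measured SLIs with the example targets from the table."""
    return {
        "request success rate >= 99.9%": success_rate >= 0.999,
        "p99 latency < 500ms": percentile(latencies_ms, 99) < 500,
        "data freshness <= 2 hours": data_age_hours <= 2,
    }


# Made-up measurements: 100 latency samples, two of them slow.
latencies = [120.0] * 98 + [650.0, 900.0]
results = evaluate_slos(success_rate=0.9995,
                        latencies_ms=latencies,
                        data_age_hours=1.5)

for name, met in results.items():
    print(f"{name}: {'met' if met else 'missed'}")
```

With these sample numbers the success-rate and freshness targets are met, while the p99 latency target is missed because the 99th of 100 sorted samples is 650 ms.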

SLA — Service Level Agreement

An SLA is a contractual commitment to customers, usually with financial penalties for breach. An SLA should be more relaxed than the corresponding SLO; the gap is your safety margin.

Error Budgets

The error budget is the allowable unreliability defined by the SLO:

Error budget = 100% − SLO target

For a 99.9% availability SLO over 30 days:

Error budget = 0.1%
Allowed downtime = 0.001 × 30 days × 24 hours/day × 60 minutes/hour = 43.2 minutes
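The same arithmetic generalises to other targets and windows. The short sketch below (the function name is an assumption for this example) treats the budget as minutes of complete downtime, exactly as in the calculation above.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of full downtime allowed by an availability SLO over the window."""
    budget_fraction = (100 - slo_percent) / 100
    window_minutes = window_days * 24 * 60
    return budget_fraction * window_minutes


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% over 30 days: {error_budget_minutes(slo, 30):.2f} minutes")

# 99.0% over 30 days: 432.00 minutes
# 99.9% over 30 days: 43.20 minutes
# 99.99% over 30 days: 4.32 minutes
```

Each additional "nine" cuts the budget by a factor of ten, which is why tightening an SLO is a significant engineering commitment.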

The error budget is the key management tool in SRE:

If budget is healthy (plenty remaining): Take risks. Deploy frequently. Run chaos experiments. Ship features faster.
If budget is nearly exhausted: Slow down. Freeze non-critical changes. Focus engineering on reliability work.
If budget is fully spent: Stop feature work. The SRE and product teams must agree on a reliability sprint before resuming normal velocity.
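The three cases above amount to a simple policy function. The sketch below is illustrative only: the names are hypothetical, and the 25% cut-off for "nearly exhausted" is an assumed value, since the actual threshold is something each team agrees for itself.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "take risks and ship features"
    NEARLY_EXHAUSTED = "slow down and freeze non-critical changes"
    SPENT = "stop feature work and run a reliability sprint"


def budget_state(budget_minutes, consumed_minutes, low_watermark=0.25):
    """Classify the error budget for the current window.

    low_watermark is an assumed cut-off: below this remaining fraction
    the budget counts as nearly exhausted.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining = 1 - consumed_minutes / budget_minutes
    if remaining <= 0:
        return BudgetState.SPENT
    if remaining < low_watermark:
        return BudgetState.NEARLY_EXHAUSTED
    return BudgetState.HEALTHY


# 43.2 minutes is the 30-day budget for a 99.9% availability SLO.
for consumed in (10.0, 40.0, 50.0):
    state = budget_state(43.2, consumed)
    print(f"{consumed} min consumed -> {state.name}: {state.value}")
```

Encoding the policy this way keeps the decision tied to the agreed SLO rather than to whoever argues hardest in a release meeting.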

Error budgets remove the adversarial dynamic between Dev (wants to deploy) and Ops (wants stability). Both teams agree on the SLO, and the budget is a shared resource they manage together.

Toil

Google SRE defines toil as work that is:

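Manual: requires a human to do it
Repetitive: done over and over
Automatable: a computer could do it
Tactical: reactive and interrupt-driven, not building long-term value
Without enduring value: the service is no better off once the task is done
Scaling linearly with the service: the work grows as traffic or users grow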

Examples: manually restarting a service that crashes weekly, manually scaling a fleet before a known traffic spike, copying data between systems by hand.

Google's guideline: SREs should spend no more than 50% of their time on toil. The rest should be engineering work that reduces future toil. If toil exceeds 50%, the team is operating as a firefighting team, not an SRE team.
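A team can check itself against the 50% guideline by tallying where its hours go. The sketch below assumes a hypothetical time log keyed by activity; the category names and hours are made up for the example.

```python
def toil_fraction(hours_by_activity, toil_activities):
    """Share of total logged hours spent on activities classed as toil."""
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(hours for activity, hours in hours_by_activity.items()
               if activity in toil_activities)
    return toil / total


# Hypothetical week for one engineer (40 hours in total).
week = {
    "manual service restarts": 6,
    "manual pre-spike scaling": 10,
    "automation project": 14,
    "capacity planning": 10,
}
toil = {"manual service restarts", "manual pre-spike scaling"}

fraction = toil_fraction(week, toil)
print(f"toil: {fraction:.0%}")  # toil: 40%
if fraction > 0.5:
    print("Over the 50% guideline: prioritise automation work")
```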

Blameless Postmortems

After a significant incident, such as one that breaches the SLO or burns a large share of the error budget, a blameless postmortem analyses what happened without attributing fault to individuals. The focus is on system and process improvements that prevent recurrence. A good postmortem includes:

Timeline of events
Root cause analysis (the system causes, not "human error")
Impact quantification (how much error budget was consumed)
Concrete action items with owners and deadlines
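For the impact quantification item, one common approach is to express the incident as a share of the window's error budget. The sketch below assumes a request-based availability SLO; the function name and the traffic figures are made up for illustration.

```python
def budget_share_consumed(failed_requests, window_requests, slo_target=0.999):
    """Fraction of the window's error budget used up by one incident."""
    allowed_failures = (1 - slo_target) * window_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget")
    return failed_requests / allowed_failures


# Hypothetical numbers: 10 million requests in the 30-day window,
# of which 2,500 failed during the incident.
share = budget_share_consumed(failed_requests=2_500,
                              window_requests=10_000_000)
print(f"Incident consumed {share:.0%} of the 30-day error budget")
# Incident consumed 25% of the 30-day error budget
```

Stating impact this way lets readers of the postmortem compare incidents of different kinds on a single scale.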