Site Reliability Engineering (SRE) was created at Google in 2003 by Ben Treynor Sloss. The core premise: reliability is a feature, and it should be managed with the same rigour as any other software feature — defined, measured, and traded off against competing priorities.
What is SRE?
SRE is what happens when you ask a software engineer to design an operations function. SREs write software to manage systems, automate toil, and build tooling that helps systems recover from failure without manual intervention. They are often embedded in product teams and own the reliability of the services they support.
Key SRE responsibilities:
- Defining and tracking SLOs
- On-call rotation and incident response
- Postmortem analysis and follow-through
- Reducing toil through automation
- Capacity planning and performance engineering
The SLI / SLO / SLA Hierarchy
SLI — Service Level Indicator
An SLI is a quantitative measure of service behaviour. Common SLIs:
- Availability: Proportion of requests that succeed (for example, HTTP 2xx responses / total requests)
- Latency: Proportion of requests served within a threshold (e.g., under 200 ms), or a percentile of response time such as p99
- Throughput: Requests processed per second
- Error rate: Proportion of requests returning errors
- Freshness: For data pipelines — how recently data was updated
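The ratio-style SLIs above can be computed directly from request records. The following is a minimal sketch, assuming each request is logged with an HTTP status code and a latency in milliseconds. The `Request` type, the function names, the sample data, and the choice to count only 5xx responses as errors are illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (assumed log field)
    latency_ms: float  # time taken to serve the request (assumed log field)


def _fraction(count: int, total: int) -> float:
    """Return count / total, rejecting an empty measurement window."""
    if total == 0:
        raise ValueError("cannot compute an SLI over zero requests")
    return count / total


def availability(requests: list[Request]) -> float:
    """Proportion of requests that returned a 2xx status."""
    ok = sum(1 for r in requests if 200 <= r.status < 300)
    return _fraction(ok, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Proportion of requests served faster than threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return _fraction(fast, len(requests))


def error_rate(requests: list[Request]) -> float:
    """Proportion of requests that failed (assumption: only 5xx counts)."""
    errors = sum(1 for r in requests if r.status >= 500)
    return _fraction(errors, len(requests))


# Hypothetical sample window of five requests.
sample = [
    Request(200, 120.0),
    Request(200, 180.0),
    Request(201, 90.0),
    Request(500, 30.0),
    Request(200, 450.0),
]

print(f"availability: {availability(sample):.1%}")         # 80.0%
print(f"under 200 ms: {latency_sli(sample, 200.0):.1%}")   # 80.0%
print(f"error rate:   {error_rate(sample):.1%}")           # 20.0%
```

In practice these counts would usually come from a metrics system rather than an in-memory list, but the ratios are the same.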
SLO — Service Level Objective
An SLO is an internal target for an SLI over a time window. For example:
| SLI | SLO | Window |
|---|---|---|
| Request success rate | ≥ 99.9% | Rolling 30 days |
| p99 latency | < 500ms | Rolling 30 days |
| Pipeline data freshness | ≤ 2 hours old | At any point in time |
Because SLOs are internal, set them slightly tighter than any external SLA. That leaves room to detect and fix problems before breaching customer commitments.
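The example SLOs in the table can be expressed as data and checked against measured values. This is a rough sketch: the dictionary keys, the `evaluate` function, and the measured numbers are hypothetical, and a real system would also need to handle the measurement window.

```python
import operator

# Each SLO is a comparison plus a target, mirroring the table above.
# The key names are illustrative assumptions.
SLOS = {
    "request_success_rate": (operator.ge, 0.999),   # >= 99.9%
    "p99_latency_ms": (operator.lt, 500.0),         # < 500 ms
    "pipeline_freshness_hours": (operator.le, 2.0), # <= 2 hours old
}


def evaluate(measured: dict[str, float]) -> dict[str, bool]:
    """Return, for each SLO, whether the measured SLI meets its target."""
    return {
        name: compare(measured[name], target)
        for name, (compare, target) in SLOS.items()
    }


# Hypothetical measurements for one window.
measured = {
    "request_success_rate": 0.9995,
    "p99_latency_ms": 520.0,
    "pipeline_freshness_hours": 1.5,
}

for name, met in evaluate(measured).items():
    print(f"{name}: {'met' if met else 'MISSED'}")
```

Here the success-rate and freshness objectives are met, while the latency objective is missed because 520 ms is not below 500 ms.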
SLA — Service Level Agreement
An SLA is a contractual commitment to customers, usually with financial penalties for breach. An SLA should be looser than the corresponding internal SLO; the gap is your safety margin.
Error Budgets
The error budget is the allowable unreliability defined by the SLO:
Error budget = 100% − SLO target
For a 99.9% availability SLO over 30 days:
- Error budget = 0.1%
- Allowed downtime = 0.001 × 30 days × 24 hours × 60 minutes = 43.2 minutes
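The same arithmetic generalises to other targets and windows. The sketch below assumes a time-based availability SLO; the function name `allowed_downtime_minutes` is illustrative.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_budget * window_days * 24 * 60


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days -> {minutes:.2f} minutes")
```

Each additional nine cuts the budget by a factor of ten: roughly 432 minutes at 99%, 43.2 minutes at 99.9%, and 4.32 minutes at 99.99%.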
The error budget is the key management tool in SRE:
- If budget is healthy (plenty remaining): Take risks. Deploy frequently. Run chaos experiments. Ship features faster.
- If budget is nearly exhausted: Slow down. Freeze non-critical changes. Focus engineering on reliability work.
- If budget is fully spent: Stop feature work. The SRE and product team must agree on a reliability sprint before resuming normal velocity.
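The three cases above amount to a simple policy function over the remaining budget. The sketch below assumes a request-based SLO, and it assumes that "nearly exhausted" means less than 10% of the budget remains. That threshold, like the names `Policy`, `budget_remaining`, and `policy_for`, is a hypothetical choice that each team would set for itself.

```python
from enum import Enum


class Policy(Enum):
    SHIP = "take risks, deploy frequently, ship features"
    SLOW_DOWN = "freeze non-critical changes, focus on reliability"
    STOP = "stop feature work until a reliability sprint is agreed"


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO leaves no error budget to measure against")
    return 1.0 - failed_requests / allowed_failures


def policy_for(remaining: float, low_water_mark: float = 0.10) -> Policy:
    """Map remaining budget to an action; low_water_mark is an assumed threshold."""
    if remaining <= 0.0:
        return Policy.STOP
    if remaining < low_water_mark:
        return Policy.SLOW_DOWN
    return Policy.SHIP


# Hypothetical window: 1,000,000 requests against a 99.9% SLO
# gives a budget of about 1,000 failed requests.
for failed in (250, 950, 1200):
    remaining = budget_remaining(0.999, 1_000_000, failed)
    print(f"{failed} failures -> {policy_for(remaining).name}")
```

With these inputs the loop prints SHIP, SLOW_DOWN, and STOP in turn.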
Error budgets remove the adversarial dynamic between Dev (wants to deploy) and Ops (wants stability). Both teams agree on the SLO, and the budget is a shared resource they manage together.
Toil
Google SRE defines toil as work that is:
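- Manual: requires a human to do it
- Repetitive: done over and over
- Automatable: a computer could do it
- Tactical: interrupt-driven and reactive rather than planned
- Devoid of enduring value: the service is no better off once the work is done
- Scaling linearly with the service: the work grows as traffic or fleet size grows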
Examples: manually restarting a service that crashes weekly, manually scaling a fleet before a known traffic spike, copying data between systems by hand.
Google's guideline: SREs should spend no more than 50% of their time on toil. The rest should be engineering work that reduces future toil. If toil exceeds 50%, the team is operating as a firefighting team, not an SRE team.
Blameless Postmortems
After an incident, for example when a service breaches its SLO or burns a large share of its error budget, a blameless postmortem analyses what happened without attributing fault to individuals. The focus is on system and process improvements that prevent recurrence. A good postmortem includes:
- Timeline of events
- Root cause analysis (the system causes, not "human error")
- Impact quantification (how much error budget was consumed)
- Concrete action items with owners and deadlines
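The four elements above can be captured in a small record, which also makes the impact figure easy to compute. This is only a sketch: the `Postmortem` and `ActionItem` types, the field names, and the incident data are all hypothetical, and the impact calculation assumes a time-based 99.9% SLO over 30 days.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date


@dataclass
class Postmortem:
    title: str
    timeline: list[tuple[datetime, str]]  # (timestamp, what happened)
    root_causes: list[str]                # system causes, not "human error"
    downtime_minutes: float
    action_items: list[ActionItem] = field(default_factory=list)

    def budget_consumed(self, slo_target: float = 0.999,
                        window_days: int = 30) -> float:
        """Fraction of the window's error budget this incident used."""
        budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
        return self.downtime_minutes / budget_minutes

    def is_complete(self) -> bool:
        """True once a timeline, root causes, and action items are recorded."""
        return bool(self.timeline and self.root_causes and self.action_items)


# Entirely made-up incident for illustration.
pm = Postmortem(
    title="Checkout API outage",
    timeline=[
        (datetime(2024, 1, 10, 14, 2), "Alert fired: elevated 5xx rate"),
        (datetime(2024, 1, 10, 14, 13), "Rollback completed, errors cleared"),
    ],
    root_causes=["Config change deployed without canary stage"],
    downtime_minutes=10.8,
    action_items=[
        ActionItem("Add canary stage to config rollout", "team-platform",
                   date(2024, 2, 1)),
    ],
)

print(f"error budget consumed: {pm.budget_consumed():.0%}")  # 25%
print(f"complete: {pm.is_complete()}")                       # True
```

Here 10.8 minutes of downtime against a 43.2-minute budget comes to about a quarter of the budget for the window.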