Master SLOs
An interactive deep-dive into Google's SRE approach to risk, reliability, and Service Level Objectives. Based on the SRE Book and SRE Workbook.
📘 What you'll learn
This tutorial covers three core chapters from Google's SRE canon:
Chapter 1: Embracing Risk — Why 100% reliability is the wrong goal, how to think about risk as a continuum, and the revolutionary concept of error budgets.
Chapter 2: Service Level Objectives — The SLI → SLO → SLA hierarchy, how to pick the right indicators, and how to set targets that actually work.
Chapter 3: Implementing SLOs — The practical playbook: from specification to measurement, dashboards, alerts, and decision-making.
🎮 How it works
This isn't a passive read. You'll encounter:
Interactive calculators to build intuition with real numbers. Quizzes after each chapter to test understanding. A simulation game where you manage a real error budget under pressure. An SLI Builder Lab where you design SLIs for different service types.
Points are tracked throughout. Aim for 100%.
Chapter 1: The 100% Reliability Trap
Why pursuing perfection will destroy your product
Here's the core insight that drives everything in SRE: you should never aim for 100% reliability. Not because you can't get close — but because the cost of the last fraction of reliability grows exponentially, while the user can't even tell the difference.
The reliability stack
Think of it this way: a user's end-to-end experience passes through many layers, each with its own failure rate. Your service is just one layer.
Even if your service achieves five 9s (99.999%), the user's effective availability is still capped by the weakest link — often their own network. So over-investing in that last 0.009 percentage points of reliability (the gap between four 9s and five) is money spent where users can't feel it.
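To see why the weakest link dominates, multiply the layers' availabilities together. A back-of-envelope sketch (the layer numbers below are made up, and it assumes layers fail independently):

```python
# End-to-end availability is (roughly) the product of each layer's
# availability, assuming the layers fail independently.
def end_to_end_availability(layer_availabilities):
    result = 1.0
    for a in layer_availabilities:
        result *= a
    return result

# Hypothetical stack: user's ISP, CDN, load balancer, your service (five 9s)
stack = [0.99, 0.9999, 0.9999, 0.99999]
print(f"{end_to_end_availability(stack):.4%}")  # dominated by the 99% ISP link
```

Even with a five-9s service, the user sees roughly two 9s end to end: the 99% ISP link caps the whole stack.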
What happens when you chase 100%
❌ The 100% mindset
- Every release is terrifying
- Feature velocity grinds to zero
- Engineers are afraid to deploy
- Change is the enemy
- Product stagnates
✓ The SRE mindset
- Target = "reliable enough"
- Planned risk budget for innovation
- Deploy frequently with confidence
- Change is managed, not avoided
- Product evolves rapidly
Chapter 1: Measuring Service Risk
Putting numbers on reliability
Google primarily measures service risk as unplanned downtime. There are two ways to express this: time-based and aggregate availability.
Time-based availability
availability = uptime / (uptime + downtime)
This works for simple systems, but breaks down for globally distributed services that are never fully "up" or "down." A service might be degraded for some users while fine for others.
Aggregate availability (request success rate)
For request-serving systems, Google uses a more practical formula:
availability = successful requests / total requests
This captures partial failures naturally. If your service handles 2.5 million requests/day and you want 99.99% availability, you can afford 250 failed requests per day. Not zero. Two hundred and fifty.
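The arithmetic, as a quick sketch:

```python
def allowed_failures(total_requests, availability_target):
    """Failed requests permitted by an aggregate availability target."""
    return total_requests * (1 - availability_target)

# 2.5 million requests/day at four 9s:
print(round(allowed_failures(2_500_000, 0.9999)))  # 250
```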
The table of nines
Hover over any row to see what that level of availability means in practice:
| Availability | Nines | Downtime / Year | Downtime / Quarter | Failed Req / Million |
|---|---|---|---|---|
| 90% | One 9 | 36.5 days | 9.1 days | 100,000 |
| 99% | Two 9s | 3.65 days | 21.9 hours | 10,000 |
| 99.9% | Three 9s | 8.76 hours | 2.19 hours | 1,000 |
| 99.95% | Three and a half 9s | 4.38 hours | 1.09 hours | 500 |
| 99.99% | Four 9s | 52.6 minutes | 13.1 minutes | 100 |
| 99.999% | Five 9s | 5.26 minutes | 1.31 minutes | 10 |
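The downtime columns fall out of a one-line conversion; here is a sketch you can use to check the table yourself:

```python
def allowed_downtime_minutes(availability, period_hours):
    """Minutes of downtime permitted at a given availability over a period."""
    return (1 - availability) * period_hours * 60

YEAR_HOURS = 365 * 24  # 8,760 hours, ignoring leap years

for nines in [0.9, 0.99, 0.999, 0.9999, 0.99999]:
    minutes = allowed_downtime_minutes(nines, YEAR_HOURS)
    print(f"{nines:.3%}: {minutes:,.1f} min/year")
```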
Visualize it
Each block below represents one request out of 1,000. Use the slider to change the availability target and see how many failures (red blocks) are allowed:
Chapter 1: The Cost of Reliability
More nines costs exponentially more — and eventually isn't worth it
Increasing reliability has two major cost dimensions:
💰 Cost of redundant infrastructure
More servers, more replicas, multi-region deployment, automatic failover, better hardware. Each additional nine can multiply infrastructure cost by roughly an order of magnitude.
🐌 Cost of reduced velocity
To hit very high reliability targets, you need more testing, slower rollouts, fewer changes, longer review cycles. Your ability to ship features drops. This is the opportunity cost and it's often the bigger cost.
The exponential curve
The relationship between reliability and cost is not linear — it's exponential. Going from 99.9% to 99.99% might cost 10× what it cost to go from 99% to 99.9%.
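A toy model of that curve (the base cost and the 10× factor per nine are illustrative assumptions, not figures from the SRE book):

```python
# Toy cost model: each additional nine multiplies cost by a constant factor.
def reliability_cost(nines, base_cost=100_000, factor_per_nine=10):
    return base_cost * factor_per_nine ** (nines - 1)

for n in range(1, 5):  # 90%, 99%, 99.9%, 99.99%
    print(f"{n} nine(s): ${reliability_cost(n):>13,}")
```

The point of the model: the step from three 9s to four costs as much as all the previous steps combined, many times over.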
Try it: Cost calculator
How much does one more nine cost you?
Chapter 1: Error Budgets
The concept that aligns product and SRE teams
If a service's SLO is 99.9% availability, then the error budget is the remaining 0.1%. That's the amount of unreliability you're explicitly allowed. It reframes "downtime" from a failure into a budget to be spent.
What can you spend the error budget on?
🎯 Planned spending
The error budget can be "spent" on things that cause small amounts of unreliability but deliver other value:
Launching new features — new code has bugs, that's normal.
Risky experiments — A/B tests, canary deployments.
Planned maintenance — database migrations, infrastructure changes.
Faster release cycles — more deploys = more risk, but also more innovation.
The alignment mechanism
This is the real genius. Without error budgets, product teams and SRE teams are in constant tension: product is rewarded for launching features quickly, while SRE is rewarded for stability and wants to slow change down.
With an error budget, both teams share the same objective metric. If there's budget remaining, product can push features. If the budget is spent, everyone focuses on reliability. No arguments, no politics — just data.
Error budget policies
When the error budget is exhausted, the team enacts pre-agreed policies:
Budget remaining ✓
- Ship features aggressively
- Allow risky experiments
- Fast-track deployments
- Toil reduction is optional
Budget exhausted ✗
- Freeze feature launches
- Focus on reliability fixes
- Increase testing requirements
- Slow down release cadence
Try it: Error budget calculator
Calculate your error budget
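The calculator's core math fits in a few lines (a sketch; the function names are mine):

```python
def error_budget(slo, total_requests):
    """Failed requests allowed per measurement window."""
    return (1 - slo) * total_requests

def budget_status(slo, total_requests, failed_so_far):
    budget = error_budget(slo, total_requests)
    return {"budget": budget,
            "consumed": failed_so_far / budget,
            "remaining": budget - failed_so_far}

# 99.9% SLO, 1M requests/day over 30 days:
status = budget_status(0.999, 1_000_000 * 30, failed_so_far=12_000)
print(f"{status['consumed']:.0%} of the monthly budget spent")  # 40%
```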
Chapter 1 Quiz: Embracing Risk
Test your understanding before moving on
Why does Google argue against targeting 100% availability?
Your service handles 10 million requests per day with a 99.99% SLO. How many failed requests per day are within budget?
What happens when the error budget is exhausted?
Which formula does Google prefer for measuring availability of request-serving systems?
Chapter 2: The Service Level Indicator (SLI)
Measuring what users actually care about
An SLI (Service Level Indicator) is a carefully defined quantitative measure of some aspect of the level of service that is provided. Think of it as the raw signal — the thermometer reading, not the temperature target.
Common SLI types
⚡ Availability
The proportion of requests that succeed. Could a user successfully complete their action?
Example: proportion of HTTP requests returning 2xx or 3xx out of all requests (excluding 401s and 403s).
🏎️ Latency
The proportion of requests that are fast enough. Not average latency — but the proportion that complete within a threshold.
Example: proportion of requests completing in under 300ms.
✅ Quality / Correctness
The proportion of responses that are correct and complete — not degraded.
Example: proportion of search results served from full index (not stale cache).
📊 Freshness (for data processing)
The proportion of data records processed within an expected time window.
Example: proportion of records updated within the last 10 minutes.
🛡️ Durability (for storage)
The proportion of written data that can be successfully read back.
Example: proportion of objects stored that return successfully on read.
Chapter 2: The Service Level Objective (SLO)
Setting the target — not too high, not too low
An SLO (Service Level Objective) is a target value or range for an SLI. It's the line in the sand that says "this is good enough."
Anatomy of a good SLO
An SLO has four components:
The four parts of an SLO
1. The SLI — What you're measuring (e.g., request latency)
2. The threshold — What "good" means (e.g., under 300ms)
3. The target — What percentage must be good (e.g., 99.9%)
4. The window — Over what time period (e.g., 30-day rolling)
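The four parts map naturally onto a small data structure. A sketch (the field names are my own, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str           # 1. what you measure
    threshold: str     # 2. what "good" means
    target: float      # 3. fraction of events that must be good
    window_days: int   # 4. measurement window

    def is_met(self, good_events: int, total_events: int) -> bool:
        return good_events / total_events >= self.target

latency = SLO("request latency", "< 300ms", target=0.999, window_days=30)
print(latency.is_met(good_events=999_200, total_events=1_000_000))  # True
```

Writing the SLO down as data like this also makes it easy to check compliance mechanically in dashboards and alerts.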
Choosing the right target
SLOs should be set based on user expectations, business needs, and current performance — not aspirations. Google suggests working from these principles:
⚠️ Don't pin SLOs to current performance
If your service currently runs at 99.99%, don't set the SLO at 99.99%. That leaves zero room for the inevitable regression. Set it slightly below where you are — maybe 99.95% — so you have budget to ship features. Set it too tight and you're back in the 100% trap.
💡 Start with user happiness
Ask: "At what point would users start to notice and complain?" That's your lower bound. "At what point would further improvement not change user behavior?" That's your upper bound. Your SLO should sit somewhere in between.
Rolling vs calendar windows
Rolling window (recommended)
- e.g. "last 30 days"
- Aligns with user experience
- No "reset" incentive
- Bad week hurts for 30 days
Calendar window
- e.g. "per quarter"
- Aligns with business planning
- Fresh start each period
- Risk of "spending it all at month end"
Chapter 2: The Service Level Agreement (SLA)
When reliability becomes a contractual promise
An SLA (Service Level Agreement) is an explicit or implicit contract with your users that includes consequences for meeting or missing the SLOs it contains. The key word is consequences — usually financial (credits, refunds). Because of those consequences, teams keep their internal SLO stricter than the published SLA, so problems trigger internal alarms well before they trigger contractual penalties.
⚠️ The Chubby example
Google's Chubby (a distributed lock service) was so reliable that teams started depending on it being 100% available — far beyond its actual SLO. When it did have downtime, dozens of services failed catastrophically. Google's solution? They started deliberately taking Chubby down to burn through the error budget and force teams to handle failures properly. Being too reliable created hidden fragility.
Chapter 2: Choosing the Right SLIs
Service type determines what to measure
Different types of services need different SLIs. The SRE Workbook provides a framework for choosing SLIs based on your service type:
Request-driven services (APIs, web apps, microservices)
Availability SLI: Proportion of requests that succeed
Latency SLI: Proportion of requests faster than threshold
Quality SLI: Proportion of requests served without degradation
Example: An API gateway. Measure: successful responses / all responses, and responses under 200ms / all responses.
Data processing pipelines (batch jobs, ETL, stream processing)
Freshness SLI: Proportion of data updated recently enough
Correctness SLI: Proportion of records processed correctly
Coverage SLI: Proportion of expected data that was actually processed
Example: A daily analytics pipeline. Measure: proportion of reports generated within 1 hour of scheduled time.
Storage systems (databases, blob stores, file systems)
Durability SLI: Proportion of stored objects readable
Throughput SLI: Proportion of operations completing within expected time
Availability SLI: Proportion of read/write attempts that succeed
Example: A cloud object store. Measure: proportion of objects stored that can be read back within 100ms.
Chapter 2 Quiz: SLIs, SLOs & SLAs
Prove you know the hierarchy
What is the correct form of an SLI?
Why should your internal SLO be stricter than your external SLA?
Why did Google deliberately take Chubby offline?
For a data processing pipeline, which SLI is most appropriate?
Chapter 3: The SLO Implementation Process
From theory to practice — the step-by-step playbook
The SRE Workbook lays out a concrete process for implementing SLOs. It's iterative — you don't need to get it perfect the first time.
🔄 The iteration principle
The SRE Workbook emphasizes repeatedly: don't try to get SLOs perfect from day one. Start with something reasonable, measure real user experience against it, and refine. A mediocre SLO that exists is infinitely more useful than a perfect SLO that doesn't.
Chapter 3: SLI Specification
From vague idea to precise definition
An SLI specification is the formal description of what you're measuring — in plain language, before you worry about how to actually measure it. The Workbook separates specification (what) from implementation (how).
Example: The online shop
The SRE Workbook walks through a fictional company to illustrate. Let's follow along:
🛒 Scenario: "Game of SLOs" online shop
An e-commerce site with an API, a web frontend, and a data pipeline for generating product recommendations. The team needs SLOs for all three.
Step 1: List your user journeys
Before picking SLIs, identify what users actually do:
Click each journey as you think about what SLIs it needs.
Step 2: Map journeys to SLI types
| User Journey | SLI Type | SLI Specification |
|---|---|---|
| Browse catalog | Availability, Latency | Proportion of page loads that succeed and complete within 1s |
| Search | Availability, Latency, Quality | Proportion of searches returning results within 500ms from full index |
| Checkout | Availability | Proportion of payment requests that complete successfully |
| Recommendations | Freshness | Proportion of users seeing recommendations updated within 24h |
Chapter 3: From SLI Specification to Implementation
Where to actually measure, and the tradeoffs of each approach
Once you know what to measure, you need to decide where in your stack to measure it. Each measurement point has tradeoffs:
🎯 Implementation example
SLI Specification: "The proportion of valid HTTP requests that return a successful response within 300ms."
SLI Implementation: "Count of load balancer log entries with a non-5xx status code AND duration < 300ms, divided by the count of all log entries excluding 4xx responses, measured over 1-minute windows and aggregated into a Prometheus metric."
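A sketch of that implementation over parsed log entries (the field names status and duration_ms are illustrative, not a real load balancer schema):

```python
def availability_latency_sli(entries):
    """Good events (non-5xx, under 300ms) over valid events (non-4xx)."""
    valid = [e for e in entries if not (400 <= e["status"] < 500)]
    good = [e for e in valid
            if e["status"] < 500 and e["duration_ms"] < 300]
    return len(good) / len(valid) if valid else 1.0

window = [
    {"status": 200, "duration_ms": 120},   # good
    {"status": 301, "duration_ms": 90},    # good (3xx counts as success)
    {"status": 200, "duration_ms": 450},   # too slow
    {"status": 404, "duration_ms": 30},    # 4xx: excluded from the SLI
    {"status": 500, "duration_ms": 80},    # server error
]
print(availability_latency_sli(window))  # 2 good / 4 valid = 0.5
```

Note how excluding 4xx shrinks the denominator as well: a client's bad request neither helps nor hurts the SLI.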
Chapter 3: Setting SLOs in Practice
Achievable vs aspirational, and getting buy-in
The Workbook distinguishes between two types of SLOs:
Achievable SLOs
- Based on current system performance
- You can meet them today
- Used for error budget enforcement
- Drive day-to-day decisions
Aspirational SLOs
- Require engineering investment
- Can't meet them yet
- Used for roadmap planning
- Drive quarterly/annual goals
The SLO document
Every service should have a short document that captures its SLOs. Here's what to include:
📄 SLO Document Template
1. Service overview — One paragraph: what the service does and who uses it.
2. SLIs — For each SLI: specification, implementation details, and measurement point.
3. SLOs — For each SLI: the target percentage and measurement window.
4. Error budget — How much budget exists per window, and current consumption.
5. Error budget policy — What happens when the budget is exhausted: who gets paged, what gets frozen, what the escalation path is.
6. Rationale — Why these specific targets were chosen. This helps future you understand the tradeoffs.
Alerting on SLOs
The Workbook recommends burn-rate alerting over simple threshold alerting:
🔔 Burn rate alerts
Instead of alerting when the error rate exceeds a threshold, alert when you're consuming error budget faster than you can afford. A "burn rate" of 1 means you'll exactly exhaust your 30-day budget in 30 days. A burn rate of 10 means you'll exhaust it in 3 days.
Multi-window: Use a fast window (e.g. 5 minutes) and a slow window (e.g. 1 hour) together. This catches both sudden spikes and slow burns while reducing false positives.
If SLO = 99.9% over 30 days → allowed error rate = 0.1%
If current error rate = 1% → burn rate = 1% / 0.1% = 10×
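Both rules fit in a few lines. A sketch (the 10× threshold and the two window sizes are the examples above, not universal constants):

```python
def burn_rate(error_rate, slo):
    """1.0 means the error budget lasts exactly one full window."""
    return error_rate / (1 - slo)

def should_page(fast_window_rate, slow_window_rate, slo, threshold=10.0):
    # Multi-window check: both windows must burn hot, which filters out
    # short spikes (fast-only) and stale averages (slow-only).
    return (burn_rate(fast_window_rate, slo) >= threshold
            and burn_rate(slow_window_rate, slo) >= threshold)

print(f"{burn_rate(0.01, 0.999):.0f}x")       # the 10x example above
print(should_page(0.02, 0.012, slo=0.999))    # both windows hot: page
```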
Chapter 3 Quiz: Implementing SLOs
The practical stuff
Where does the SRE Workbook recommend measuring SLIs as a practical starting point?
What does a burn rate of 10× mean?
What's the difference between an achievable SLO and an aspirational SLO?
What should be in an error budget policy?
Final Challenge: Error Budget Simulator
You're the SRE lead. Manage your team's error budget over a 30-day window.
Error Budget Manager
You are the SRE lead at an e-commerce company. Your SLO is 99.9% availability over 30 days. With 1M requests/day, that gives you 30,000 failed requests as your monthly budget.
Lab: SLI Builder
Design SLIs for real-world services
Build an SLI specification
💡 Tips for your SLI specification
Be specific about exclusions. Health check endpoints, admin routes, and expected 4xx errors (like 404s for missing pages) are usually excluded from availability SLIs.
Use proportions, not averages. "99th percentile latency under 300ms" is better expressed as "99% of requests complete within 300ms."
Document the rationale. Why 300ms and not 500ms? Why 99.9% and not 99.99%? Future you will want to know.
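The "proportions, not averages" tip in code (a sketch with made-up latency samples):

```python
def proportion_within(latencies_ms, threshold_ms):
    """Fraction of requests completing within the latency threshold."""
    if not latencies_ms:
        return 1.0
    return sum(t <= threshold_ms for t in latencies_ms) / len(latencies_ms)

samples = [120, 180, 250, 290, 310, 150, 900, 210, 260, 140]
print(f"{proportion_within(samples, 300):.0%} within 300ms")  # 80% within 300ms
```

Unlike an average, this form is robust to a single 900ms outlier and reads directly as an SLI: good events over total events.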
SLO Mastery Complete
You've covered the three core chapters of Google's SRE approach to reliability.
Key takeaways
1. 100% reliability is the wrong target. Identify the optimal reliability level where cost and user happiness intersect.
2. Error budgets transform the reliability conversation from tribal politics into data-driven decision-making.
3. SLIs are ratios (good/total), SLOs are targets on those ratios, SLAs add contractual consequences.
4. Measure at the edge, close to the user. Start at the load balancer.
5. Start with rough SLOs and iterate. A mediocre SLO that exists beats a perfect SLO that doesn't.
6. Use burn-rate alerting with multi-window checks for the best signal-to-noise ratio.