
Master SLOs

An interactive deep-dive into Google's SRE approach to risk, reliability, and Service Level Objectives. Based on the SRE Book and SRE Workbook.

📘 What you'll learn

This tutorial covers three core chapters from Google's SRE canon:

Chapter 1: Embracing Risk — Why 100% reliability is the wrong goal, how to think about risk as a continuum, and the revolutionary concept of error budgets.

Chapter 2: Service Level Objectives — The SLI → SLO → SLA hierarchy, how to pick the right indicators, and how to set targets that actually work.

Chapter 3: Implementing SLOs — The practical playbook: from specification to measurement, dashboards, alerts, and decision-making.

🎮 How it works

This isn't a passive read. You'll encounter:

Interactive calculators to build intuition with real numbers. Quizzes after each chapter to test understanding. A simulation game where you manage a real error budget under pressure. An SLI Builder Lab where you design SLIs for different service types.

Points are tracked throughout. Aim for 100%.

Chapter 1: The 100% Reliability Trap

Why pursuing perfection will destroy your product

Here's the core insight that drives everything in SRE: you should never aim for 100% reliability. Not because you can't get close — but because the cost of the last fraction of reliability grows exponentially, while the user can't even tell the difference.

Key Insight: Users don't experience your service in isolation. Their experience is bounded by the reliability of everything between them and you — their ISP, their WiFi, their device. If a user's laptop crashes 1% of the time, they can't tell the difference between 99.99% and 99.999% availability from your service.

The reliability stack

Think of it this way: a user's end-to-end experience passes through many layers, each with its own failure rate. Your service is just one layer.

  • User's Device & OS (~99%)
  • Local Network / WiFi (~99.9%)
  • ISP / Internet (~99.9%)
  • ⭐ Your Service (99.99%?)

Effective availability ≈ product of all layers ≈ 98.8%

Even if your service achieves five 9s (99.999%), the user's effective availability is still capped by the weakest link — often their own network. So over-investing in that last 0.009% of reliability is money spent where users can't feel it.
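A quick sketch of that multiplication. The layer figures are the rough estimates from the stack above, and treating layer failures as independent is a simplifying assumption:

```python
# Effective end-to-end availability is roughly the product of each layer's
# availability (assuming independent failures -- a simplification).
from math import prod

layers = {
    "device_os": 0.99,       # user's device & OS
    "local_wifi": 0.999,     # local network / WiFi
    "isp": 0.999,            # ISP / internet
    "your_service": 0.9999,  # the only layer you control
}

effective = prod(layers.values())
print(f"effective availability ≈ {effective:.2%}")  # ≈ 98.79%
```

Raising `your_service` to five 9s barely moves the product, which is the point: the weakest layer dominates.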

What happens when you chase 100%

❌ The 100% mindset

  • Every release is terrifying
  • Feature velocity grinds to zero
  • Engineers are afraid to deploy
  • Change is the enemy
  • Product stagnates

✓ The SRE mindset

  • Target = "reliable enough"
  • Planned risk budget for innovation
  • Deploy frequently with confidence
  • Change is managed, not avoided
  • Product evolves rapidly

Chapter 1: Measuring Service Risk

Putting numbers on reliability

Google primarily measures service risk as unplanned downtime. There are two ways to express this: time-based and aggregate availability.

Time-based availability

availability = uptime / (uptime + downtime)

This works for simple systems, but breaks down for globally distributed services that are never fully "up" or "down." A service might be degraded for some users while fine for others.

Aggregate availability (request success rate)

For request-serving systems, Google uses a more practical formula:

availability = successful requests / total requests

This captures partial failures naturally. If your service handles 2.5 million requests/day and you want 99.99% availability, you can afford 250 failed requests per day. Not zero. Two hundred and fifty.
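As a sketch, the allowed-failure count is just the total request volume times the budget fraction:

```python
# Error budget in failed requests: total * (1 - SLO target).
def allowed_failures(requests: int, slo: float) -> int:
    return round(requests * (1 - slo))

print(allowed_failures(2_500_000, 0.9999))   # 250 failed requests/day
print(allowed_failures(10_000_000, 0.9999))  # 1000 failed requests/day
```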

The table of nines

Here is what each level of availability means in practice:

| Availability | Nines | Downtime / Year | Downtime / Quarter | Failed Req / Million |
|---|---|---|---|---|
| 90% | One 9 | 36.5 days | 9.1 days | 100,000 |
| 99% | Two 9s | 3.65 days | 21.9 hours | 10,000 |
| 99.9% | Three 9s | 8.76 hours | 2.19 hours | 1,000 |
| 99.95% | Three and a half 9s | 4.38 hours | 1.09 hours | 500 |
| 99.99% | Four 9s | 52.6 minutes | 13.1 minutes | 100 |
| 99.999% | Five 9s | 5.26 minutes | 1.31 minutes | 10 |

Visualize it

Picture 1,000 blocks, one per request. At a 99.0% availability target, 10 out of every 1,000 requests can fail; at 99.9%, just one.

Chapter 1: The Cost of Reliability

More nines costs exponentially more — and eventually isn't worth it

Increasing reliability has two major cost dimensions:

💰 Cost of redundant infrastructure

More servers, more replicas, multi-region deployment, automatic failover, better hardware. Each additional nine roughly multiplies infrastructure cost by an order of magnitude.

🐌 Cost of reduced velocity

To hit very high reliability targets, you need more testing, slower rollouts, fewer changes, longer review cycles. Your ability to ship features drops. This is the opportunity cost and it's often the bigger cost.

The exponential curve

The relationship between reliability and cost is not linear — it's exponential. Going from 99.9% to 99.99% might cost 10× what it cost to go from 99% to 99.9%.

[Chart: cost and revenue benefit vs. reliability target (99% → 99.999%). The cost curve rises exponentially while the revenue-benefit curve flattens; the "$ sweet spot" is where they cross.]
The sweet spot: The optimal reliability target is where the cost curve crosses the revenue-benefit curve. Beyond that point, you're spending more on reliability than you're gaining from it. This is where the error budget concept comes from — it quantifies exactly how much unreliability you can tolerate.

Try it: Cost calculator

How much does one more nine cost you?

Inputs: monthly revenue, current availability (%), target availability (%).
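A toy version of this calculator. The "10× cost per additional nine" multiplier and the base cost are illustrative assumptions (extrapolated from the curve above), not figures from the SRE books:

```python
# Toy cost-of-a-nine model: infra cost assumed to grow ~10x per nine,
# revenue benefit bounded by the availability delta itself.
from math import log10

def nines(availability: float) -> float:
    """Convert availability to a count of nines: 99.9% -> ~3.0."""
    return -log10(1 - availability)

def extra_nine_cost(base_monthly_cost: float, current: float, target: float) -> float:
    """Extra infra spend, assuming cost grows ~10x per additional nine."""
    return base_monthly_cost * (10 ** (nines(target) - nines(current)) - 1)

def revenue_protected(monthly_revenue: float, current: float, target: float) -> float:
    """Upper bound on revenue saved by the extra availability."""
    return monthly_revenue * (target - current)

cost = extra_nine_cost(10_000, 0.999, 0.9999)          # ~$90,000/month extra
benefit = revenue_protected(1_000_000, 0.999, 0.9999)  # ~$900/month protected
print(f"extra cost ≈ ${cost:,.0f}, benefit ≤ ${benefit:,.0f}")
```

At these (made-up) numbers, the extra nine costs vastly more than it protects, which is exactly the sweet-spot argument.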

Chapter 1: Error Budgets

The concept that aligns product and SRE teams

If a service's SLO is 99.9% availability, then the error budget is the remaining 0.1%. That's the amount of unreliability you're explicitly allowed. It reframes "downtime" from a failure into a budget to be spent.

error budget = 1 - SLO = 1 - 0.999 = 0.001 = 0.1%

What can you spend the error budget on?

🎯 Planned spending

The error budget can be "spent" on things that cause small amounts of unreliability but deliver other value:

Launching new features — new code has bugs, that's normal.
Risky experiments — A/B tests, canary deployments.
Planned maintenance — database migrations, infrastructure changes.
Faster release cycles — more deploys = more risk, but also more innovation.

The alignment mechanism

This is the real genius. Without error budgets, product teams and SRE teams are in constant tension:

Product Team ("Ship faster!", "More features!", "Move fast!") ⇄ TENSION ⇄ SRE Team ("Don't break things!", "More testing!", "Slow down!")

The error budget sits between them as the objective arbiter.

With an error budget, both teams share the same objective metric. If there's budget remaining, product can push features. If the budget is spent, everyone focuses on reliability. No arguments, no politics — just data.

Error budget policies

When the error budget is exhausted, the team enacts pre-agreed policies:

Budget remaining ✓

  • Ship features aggressively
  • Allow risky experiments
  • Fast-track deployments
  • Toil reduction is optional

Budget exhausted ✗

  • Freeze feature launches
  • Focus on reliability fixes
  • Increase testing requirements
  • Slow down release cadence

Try it: Error budget calculator

Calculate your error budget

Inputs: SLO target (%), requests per day, window (days).
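A sketch of the calculation behind this calculator: the budget expressed as a fraction, as failed requests, and as equivalent full-outage downtime over the window:

```python
# Error budget over a window: fraction, failed requests, and the
# equivalent minutes of total outage.
def error_budget(slo: float, requests_per_day: int, window_days: int = 30) -> dict:
    budget_fraction = 1 - slo
    total_requests = requests_per_day * window_days
    return {
        "budget_fraction": budget_fraction,
        "failed_requests_allowed": round(total_requests * budget_fraction),
        "downtime_minutes_allowed": window_days * 24 * 60 * budget_fraction,
    }

b = error_budget(slo=0.999, requests_per_day=1_000_000, window_days=30)
print(b["failed_requests_allowed"])             # 30000 failed requests
print(round(b["downtime_minutes_allowed"], 1))  # 43.2 minutes of full outage
```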

Chapter 1 Quiz: Embracing Risk

Test your understanding before moving on

Question 1

Why does Google argue against targeting 100% availability?

A It's technically impossible to achieve
B The marginal cost exceeds the marginal benefit, and users can't tell the difference due to other reliability bottlenecks
C Google's infrastructure isn't reliable enough for it
D It would make the service too expensive for end users
Question 2

Your service handles 10 million requests per day with a 99.99% SLO. How many failed requests per day are within budget?

A 10
B 100
C 1,000
D 10,000
Question 3

What happens when the error budget is exhausted?

A The SRE team gets penalized
B The SLO is relaxed to compensate
C Feature launches are frozen and the team prioritizes reliability improvements
D The service is taken offline for maintenance
Question 4

Which formula does Google prefer for measuring availability of request-serving systems?

A uptime / (uptime + downtime)
B successful requests / total requests
C 1 - (errors / total requests × time)
D mean time between failures / mean time to recovery

Chapter 2: The Service Level Indicator (SLI)

Measuring what users actually care about

An SLI (Service Level Indicator) is a carefully defined quantitative measure of some aspect of the level of service that is provided. Think of it as the raw signal — the thermometer reading, not the temperature target.

Definition: An SLI is a ratio of two numbers: the number of good events divided by the total number of events. This yields a value between 0% and 100%, which maps naturally to an availability-like percentage.
SLI = good events / total events × 100%
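As code, the definition is a one-liner:

```python
# An SLI is a ratio: good events over total events, as a percentage.
def sli(good_events: int, total_events: int) -> float:
    return 100.0 * good_events / total_events

print(sli(999_850, 1_000_000))  # 99.985
```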

Common SLI types

Availability

The proportion of requests that succeed. Could a user successfully complete their action?
Example: proportion of HTTP requests returning 2xx or 3xx out of all requests (excluding 401s and 403s).

🏎️ Latency

The proportion of requests that are fast enough. Not average latency — but the proportion that complete within a threshold.
Example: proportion of requests completing in under 300ms.

Quality / Correctness

The proportion of responses that are correct and complete — not degraded.
Example: proportion of search results served from full index (not stale cache).

📊 Freshness (for data processing)

The proportion of data records processed within an expected time window.
Example: proportion of records updated within the last 10 minutes.

🛡️ Durability (for storage)

The proportion of written data that can be successfully read back.
Example: proportion of objects stored that return successfully on read.

The golden rule: Your SLIs should measure what the user experiences, not what your infrastructure reports. Measure at the edge, close to the user, not deep in your backend.

Chapter 2: The Service Level Objective (SLO)

Setting the target — not too high, not too low

An SLO (Service Level Objective) is a target value or range for an SLI. It's the line in the sand that says "this is good enough."

SLO: 99.9% of requests will complete successfully within 300ms, measured over a 30-day rolling window

Anatomy of a good SLO

An SLO has four components:

The four parts of an SLO

1. The SLI — What you're measuring (e.g., request latency)
2. The threshold — What "good" means (e.g., under 300ms)
3. The target — What percentage must be good (e.g., 99.9%)
4. The window — Over what time period (e.g., 30-day rolling)

Choosing the right target

SLOs should be set based on user expectations and business needs — informed by, but not dictated by, current performance. Google suggests working from these principles:

⚠️ Don't set SLOs based on current performance

If your service currently runs at 99.99%, don't set the SLO at 99.99%. That leaves you zero room for the inevitable regression. Set it slightly below where you are — maybe 99.95% — so you have budget to ship features. If you set it too tight, you're back in the 100% trap.

💡 Start with user happiness

Ask: "At what point would users start to notice and complain?" That's your lower bound. "At what point would further improvement not change user behavior?" That's your upper bound. Your SLO should sit somewhere in between.

Rolling vs calendar windows

Rolling window (recommended)

  • e.g. "last 30 days"
  • Aligns with user experience
  • No "reset" incentive
  • Bad week hurts for 30 days

Calendar window

  • e.g. "per quarter"
  • Aligns with business planning
  • Fresh start each period
  • Risk of "spending it all at month end"

Chapter 2: The Service Level Agreement (SLA)

When reliability becomes a contractual promise

An SLA (Service Level Agreement) is an explicit or implicit contract with your users that includes consequences for meeting or missing the SLOs it contains. The key word is consequences — usually financial (credits, refunds).

SLI (the measurement: "request latency") → SLO (the target: "99.9% under 200ms") → SLA (the promise: "...or a 10% credit")

The SLI feeds the SLO; the SLO is stricter than the SLA. Always keep margin.
Critical rule: Your internal SLOs should always be stricter than your external SLAs. If your SLA promises 99.9%, your internal SLO should be 99.95%. This gives you a buffer — you'll catch and fix problems before they become contractual violations.

⚠️ The Chubby example

Google's Chubby (a distributed lock service) was so reliable that teams started depending on it being 100% available — far beyond its actual SLO. When it did have downtime, dozens of services failed catastrophically. Google's solution? They started deliberately taking Chubby down to burn through the error budget and force teams to handle failures properly. Being too reliable created hidden fragility.

Chapter 2: Choosing the Right SLIs

Service type determines what to measure

Different types of services need different SLIs. The SRE Workbook provides a framework for choosing SLIs based on your service type:


Request-driven services (APIs, web apps, microservices)

Availability SLI: Proportion of requests that succeed
Latency SLI: Proportion of requests faster than threshold
Quality SLI: Proportion of requests served without degradation

Example: An API gateway. Measure: successful responses / all responses, and responses under 200ms / all responses.

Data processing pipelines (batch jobs, ETL, stream processing)

Freshness SLI: Proportion of data updated recently enough
Correctness SLI: Proportion of records processed correctly
Coverage SLI: Proportion of expected data that was actually processed

Example: A daily analytics pipeline. Measure: proportion of reports generated within 1 hour of scheduled time.

Storage systems (databases, blob stores, file systems)

Durability SLI: Proportion of stored objects readable
Throughput SLI: Proportion of operations completing within expected time
Availability SLI: Proportion of read/write attempts that succeed

Example: A cloud object store. Measure: proportion of objects stored that can be read back within 100ms.

Start simple: Don't try to cover every possible SLI. Pick 1-3 SLIs that best capture user happiness for your service type. You can always add more later. Too many SLIs creates noise and makes it hard to know what to act on.

Chapter 2 Quiz: SLIs, SLOs & SLAs

Prove you know the hierarchy

Question 5

What is the correct form of an SLI?

A A target number like "99.9%"
B A ratio of good events to total events, expressed as a percentage
C A contract with financial penalties
D An average value like "200ms mean latency"
Question 6

Why should your internal SLO be stricter than your external SLA?

A To impress customers with better numbers
B So you can detect and fix problems before they become contractual violations
C Because SLAs are legally meaningless
D To give the SRE team more work
Question 7

Why did Google deliberately take Chubby offline?

A For routine maintenance
B It was too expensive to run
C Teams had grown dependent on reliability far beyond the SLO, creating hidden fragility
D To test their disaster recovery plan
Question 8

For a data processing pipeline, which SLI is most appropriate?

A Request latency at p99
B Freshness: proportion of data updated within expected time
C HTTP 200 response ratio
D Mean time between failures

Chapter 3: The SLO Implementation Process

From theory to practice — the step-by-step playbook

The SRE Workbook lays out a concrete process for implementing SLOs. It's iterative — you don't need to get it perfect the first time.

STEP 1: Choose SLIs
STEP 2: Define SLI specifications
STEP 3: Implement SLIs
STEP 4: Set SLO targets
STEP 5: Agree on an error budget policy
STEP 6: Build dashboards and alerts
↻ ITERATE: Refine SLIs, adjust SLOs based on data, improve alerting.

Tip: This is iterative. Start with a rough draft, measure for a few weeks, then refine. Perfection is the enemy of good SLOs.

🔄 The iteration principle

The SRE Workbook emphasizes repeatedly: don't try to get SLOs perfect from day one. Start with something reasonable, measure real user experience against it, and refine. A mediocre SLO that exists is infinitely more useful than a perfect SLO that doesn't.

Chapter 3: SLI Specification

From vague idea to precise definition

An SLI specification is the formal description of what you're measuring — in plain language, before you worry about how to actually measure it. The Workbook separates specification (what) from implementation (how).

Example: The online shop

The SRE Workbook walks through a fictional company to illustrate. Let's follow along:

🛒 Scenario: "Game of SLOs" online shop

An e-commerce site with an API, a web frontend, and a data pipeline for generating product recommendations. The team needs SLOs for all three.

Step 1: List your user journeys

Before picking SLIs, identify what users actually do:

Browse product catalog
Search for products
Add to cart
Complete checkout (payment)
View order history
See personalized recommendations


Step 2: Map journeys to SLI types

| User Journey | SLI Type | SLI Specification |
|---|---|---|
| Browse catalog | Availability, Latency | Proportion of page loads that succeed and complete within 1s |
| Search | Availability, Latency, Quality | Proportion of searches returning results within 500ms from the full index |
| Checkout | Availability | Proportion of payment requests that complete successfully |
| Recommendations | Freshness | Proportion of users seeing recommendations updated within 24h |
Notice: Each journey maps to 1-2 SLI types. Not every journey needs every type. Checkout doesn't need a latency SLI (users expect it to take a few seconds). Recommendations don't need sub-second latency but do need freshness.

Chapter 3: From SLI Specification to Implementation

Where to actually measure, and the tradeoffs of each approach

Once you know what to measure, you need to decide where in your stack to measure it. Each measurement point has tradeoffs:

Where to measure SLIs, from closest to the user experience to easiest to implement:

  • Client-side (JS instrumentation, mobile SDK): ✓ most accurate, but hardest to collect
  • Load balancer (access logs, ingress metrics): ★ best balance
  • Server-side (app metrics, middleware logs): easiest to set up, but misses client-side issues
  • Synthetic (probes/pings, canary requests): good for baselines

★ Recommendation: Start at the load balancer level. It captures most user-facing issues without requiring client instrumentation.

🎯 Implementation example

SLI Specification: "The proportion of valid HTTP requests that return a successful response within 300ms."

SLI Implementation: "Count of load balancer log entries with status code != 5xx AND duration < 300ms, divided by count of all log entries excluding 4xx, measured over 1-minute windows, aggregated into a Prometheus metric."
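Under an assumed log-entry shape (the `LogEntry` fields here are for illustration, not a real load-balancer format), that implementation might look like:

```python
# Sketch of the SLI implementation: good = non-5xx AND fast enough,
# total = all requests excluding client errors (4xx).
from dataclasses import dataclass

@dataclass
class LogEntry:
    status: int        # HTTP status code
    duration_ms: float

def availability_latency_sli(entries: list[LogEntry]) -> float:
    # Total = valid requests: exclude client errors (4xx).
    valid = [e for e in entries if not 400 <= e.status < 500]
    # Good = valid, not a server error, and under the latency threshold.
    good = [e for e in valid if e.status < 500 and e.duration_ms < 300]
    return 100.0 * len(good) / len(valid) if valid else 100.0

entries = [
    LogEntry(200, 120), LogEntry(200, 450),  # second one is too slow
    LogEntry(503, 30),                       # server error: not good
    LogEntry(404, 15),                       # client error: excluded entirely
]
print(availability_latency_sli(entries))  # 1 good of 3 valid ≈ 33.3
```

In production this counting would typically run over 1-minute windows and feed a metrics system such as Prometheus, as the specification notes.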

Chapter 3: Setting SLOs in Practice

Achievable vs aspirational, and getting buy-in

The Workbook distinguishes between two types of SLOs:

Achievable SLOs

  • Based on current system performance
  • You can meet them today
  • Used for error budget enforcement
  • Drive day-to-day decisions

Aspirational SLOs

  • Require engineering investment
  • Can't meet them yet
  • Used for roadmap planning
  • Drive quarterly/annual goals

The SLO document

Every service should have a short document that captures its SLOs. Here's what to include:

📄 SLO Document Template

1. Service overview — One paragraph: what the service does and who uses it.

2. SLIs — For each SLI: specification, implementation details, and measurement point.

3. SLOs — For each SLI: the target percentage and measurement window.

4. Error budget — How much budget exists per window, and current consumption.

5. Error budget policy — What happens when the budget is exhausted: who gets paged, what gets frozen, what the escalation path is.

6. Rationale — Why these specific targets were chosen. This helps future you understand the tradeoffs.

Alerting on SLOs

The Workbook recommends burn-rate alerting over simple threshold alerting:

🔔 Burn rate alerts

Instead of alerting when the error rate exceeds a threshold, alert when you're consuming error budget faster than you can afford. A "burn rate" of 1 means you'll exactly exhaust your 30-day budget in 30 days. A burn rate of 10 means you'll exhaust it in 3 days.

Multi-window: Use a fast window (e.g. 5 minutes) and a slow window (e.g. 1 hour) together. This catches both sudden spikes and slow burns while reducing false positives.

burn rate = (error rate observed) / (error rate allowed by SLO)

If SLO = 99.9% over 30 days → allowed error rate = 0.1%
If current error rate = 1% → burn rate = 1% / 0.1% = 10×
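A minimal sketch of the burn-rate formula plus a multi-window check (the threshold and the sample error rates are illustrative):

```python
# Burn rate: how fast the error budget is being consumed relative to the SLO.
def burn_rate(observed_error_rate: float, slo: float) -> float:
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate

def multiwindow_alert(fast_rate: float, slow_rate: float, slo: float,
                      threshold: float = 10.0) -> bool:
    # Page only when BOTH the fast window (e.g. 5 min) and the slow window
    # (e.g. 1 h) burn above threshold -- filters out brief spikes.
    return (burn_rate(fast_rate, slo) >= threshold
            and burn_rate(slow_rate, slo) >= threshold)

print(round(burn_rate(0.01, 0.999), 1))        # 10.0 -- the example above
print(multiwindow_alert(0.02, 0.012, 0.999))   # True: sustained fast burn
print(multiwindow_alert(0.02, 0.0005, 0.999))  # False: just a brief spike
```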

Chapter 3 Quiz: Implementing SLOs

The practical stuff

Question 9

Where does the SRE Workbook recommend measuring SLIs as a practical starting point?

A Client-side JavaScript instrumentation
B At the load balancer / ingress layer
C At the database query level
D Using synthetic monitoring exclusively
Question 10

What does a burn rate of 10× mean?

A The service is 10 times faster than normal
B 10% of requests are failing
C The error budget is being consumed 10× faster than sustainable — it will be exhausted in 1/10th of the window
D The SLO target needs to be reduced by 10×
Question 11

What's the difference between an achievable SLO and an aspirational SLO?

A Achievable SLOs are internal; aspirational SLOs are customer-facing
B Achievable SLOs have consequences; aspirational SLOs don't
C Achievable SLOs reflect current capability and drive daily decisions; aspirational SLOs require investment and drive roadmap planning
D There is no meaningful difference; they're interchangeable terms
Question 12

What should be in an error budget policy?

A A list of engineers who will be fired if the budget runs out
B Pre-agreed actions for when budget is exhausted: feature freezes, reliability focus, escalation paths
C The financial cost of each percentage point of downtime
D A histogram of past outages

Final Challenge: Error Budget Simulator

You're the SRE lead. Manage your team's error budget over a 30-day window.


Error Budget Manager

You are the SRE lead at an e-commerce company. Your SLO is 99.9% availability over 30 days. With 1M requests/day, that gives you 30,000 failed requests as your monthly budget.

🚀Approve features to ship value — but each launch costs error budget
🔥Incidents happen. Write postmortems to prevent costs from compounding
📊Balance velocity and reliability — unspent budget is wasted opportunity

Lab: SLI Builder

Design SLIs for real-world services

Build an SLI specification

Inputs: service type, SLI category, good event definition, total event definition, measurement point, SLO target (%), window.
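A sketch of what the builder produces from those fields; the field names mirror the lab's inputs and the example values are hypothetical:

```python
# Assemble an SLI/SLO specification sentence from the lab's input fields.
def build_spec(good: str, total: str, point: str,
               target: float, window: str) -> str:
    return (f"{target}% of {total} will be {good}, "
            f"measured at the {point}, over a {window} window.")

spec = build_spec(
    good="served successfully in under 300 ms",
    total="valid HTTP requests",
    point="load balancer",
    target=99.9,
    window="30-day rolling",
)
print(spec)
```

This produces a sentence like "99.9% of valid HTTP requests will be served successfully in under 300 ms, measured at the load balancer, over a 30-day rolling window."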

💡 Tips for your SLI specification

Be specific about exclusions. Health check endpoints, admin routes, and expected 4xx errors (like 404s for missing pages) are usually excluded from availability SLIs.

Use proportions, not averages. "99th percentile latency under 300ms" is better expressed as "99% of requests complete within 300ms."

Document the rationale. Why 300ms and not 500ms? Why 99.9% and not 99.99%? Future you will want to know.

🎓

SLO Mastery Complete

You've covered the three core chapters of Google's SRE approach to reliability.


Key takeaways

1. 100% reliability is the wrong target. Identify the optimal reliability level where cost and user happiness intersect.

2. Error budgets transform the reliability conversation from tribal politics into data-driven decision-making.

3. SLIs are ratios (good/total), SLOs are targets on those ratios, SLAs add contractual consequences.

4. Measure at the edge, close to the user. Start at the load balancer.

5. Start with rough SLOs and iterate. A mediocre SLO that exists beats a perfect SLO that doesn't.

6. Use burn-rate alerting with multi-window checks for the best signal-to-noise ratio.