Assuming you have read the Primer on Service Level Objectives (SLOs) available here, let’s talk about the benefits of adopting SLOs.
Short answer
Engineers and engineering time are finite resources. So would you spend them on new features or on fixing existing problems?
Long answer (keep reading)
📊 Let’s illustrate the need using a sample service’s per-minute metrics from 18:00 to 18:18.
3️⃣ Three questions that come to mind
Timestamp | Throughput | 5xx |
---|---|---|
18:00 | 500 | 10 |
18:01 | 600 | 0 |
18:02 | 500 | 0 |
18:03 | 2000 | 20 |
18:04 | 150 | 5 |
18:05 | 150 | 10 |
18:06 | 150 | 15 |
18:07 | 600 | 12 |
18:08 | 600 | 5 |
18:09 | 700 | 5 |
18:10 | 2500 | 20 |
18:11 | 250 | 5 |
18:12 | 700 | 7 |
18:13 | 800 | 12 |
18:14 | 1250 | 5 |
18:15 | 1500 | 10 |
18:16 | 1500 | 5 |
18:17 | 2500 | 10 |
18:18 | 1000 | 10 |
📈 Traditionally, your favorite dashboard tool would show you something like this.
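🕡 There are spikes at 6:03 PM and 6:06 PM, but look at the magnitude of those errors relative to traffic: 20 failures out of 2000 requests (1%) at 6:03 PM, and 15 out of 150 (10%) at 6:06 PM.
🪂 SREs often fall into this trap of chasing misleading spikes.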
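To make the point about misleading spikes concrete, here is a minimal sketch that ranks the minutes from the table above twice: once by raw 5xx count and once by error ratio. The names `samples` and `error_ratio` are made up for illustration, and treating a zero-traffic minute as a ratio of 0.0 is an assumption.

```python
# (timestamp, requests, 5xx errors), copied from the table above
samples = [
    ("18:00", 500, 10), ("18:01", 600, 0), ("18:02", 500, 0),
    ("18:03", 2000, 20), ("18:04", 150, 5), ("18:05", 150, 10),
    ("18:06", 150, 15), ("18:07", 600, 12), ("18:08", 600, 5),
    ("18:09", 700, 5), ("18:10", 2500, 20), ("18:11", 250, 5),
    ("18:12", 700, 7), ("18:13", 800, 12), ("18:14", 1250, 5),
    ("18:15", 1500, 10), ("18:16", 1500, 5), ("18:17", 2500, 10),
    ("18:18", 1000, 10),
]


def error_ratio(requests, errors):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


# The same data ranked two ways: by raw error count and by error ratio.
by_count = sorted(samples, key=lambda s: s[2], reverse=True)
by_ratio = sorted(samples, key=lambda s: error_ratio(s[1], s[2]), reverse=True)

for ts, requests, errors in samples:
    print(f"{ts}  {errors:>3} errors  {error_ratio(requests, errors):6.2%}")

print("Most errors:        ", by_count[0][0])
print("Highest error ratio:", by_ratio[0][0])
```

The two rankings disagree. The minute with the most errors is a high-traffic minute where only a small share of requests failed, while the minute with the highest error ratio shows up as a much smaller bump on a count-based chart.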
100% uptime is not possible. Some downtime, however small, is inevitable: you need some time off to provide for the following, so some requests are bound to fail.
<aside> 💡 It is fair to conclude that the only way to make a service reliable is to allow it to fail a little.
</aside>
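"Failing a little" can be written down as a number. The sketch below assumes a hypothetical 99% success target (the post does not state one) and uses the column totals from the sample table to work out how many requests were allowed to fail in that window. The function name `error_budget` is illustrative only.

```python
from fractions import Fraction


def error_budget(total_requests, slo_target):
    """Whole number of requests that may fail while still meeting the target.

    slo_target is a decimal string such as "0.99"; Fraction keeps the
    arithmetic exact, so the result does not depend on float rounding.
    """
    return int(total_requests * (1 - Fraction(slo_target)))


total_requests = 17950  # sum of the Throughput column in the table above
failed_requests = 166   # sum of the 5xx column in the table above
slo_target = "0.99"     # ASSUMPTION: a hypothetical 99% success target

budget = error_budget(total_requests, slo_target)
remaining = budget - failed_requests

print(f"Allowed failures: {budget}")
print(f"Actual failures:  {failed_requests}")
print(f"Budget remaining: {remaining}")
```

Under that assumed target the service stays within its allowance for the window, even though the count-based view shows several spikes.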
But this downtime, albeit small, must NOT:
<aside> 💡 The key is to stop chasing every single error. Instead, look at the health of the service as a whole.
</aside>
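One way to look at service health rather than individual errors is to aggregate over a rolling window. The following is a minimal sketch, assuming a 5-minute window (an arbitrary choice for illustration) and the per-minute samples from the table above. `rolling_availability` is a hypothetical helper, not part of any monitoring tool.

```python
from collections import deque

# (timestamp, requests, 5xx errors), copied from the table above
samples = [
    ("18:00", 500, 10), ("18:01", 600, 0), ("18:02", 500, 0),
    ("18:03", 2000, 20), ("18:04", 150, 5), ("18:05", 150, 10),
    ("18:06", 150, 15), ("18:07", 600, 12), ("18:08", 600, 5),
    ("18:09", 700, 5), ("18:10", 2500, 20), ("18:11", 250, 5),
    ("18:12", 700, 7), ("18:13", 800, 12), ("18:14", 1250, 5),
    ("18:15", 1500, 10), ("18:16", 1500, 5), ("18:17", 2500, 10),
    ("18:18", 1000, 10),
]


def rolling_availability(samples, window=5):
    """Yield (timestamp, availability) over the last `window` samples."""
    recent = deque(maxlen=window)  # old samples fall off automatically
    for ts, requests, errors in samples:
        recent.append((requests, errors))
        total = sum(r for r, _ in recent)
        failed = sum(e for _, e in recent)
        # With no traffic in the window, report full availability.
        yield ts, (1 - failed / total) if total else 1.0


for ts, availability in rolling_availability(samples):
    print(f"{ts}  {availability:.2%}")
```

Each printed value answers the question "what share of requests succeeded over the last few minutes?", which is the kind of signal an SLO is compared against.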