Assuming that you have read the Primer on Service Level Objectives (SLOs) available here, let’s talk about the benefits of adopting SLOs.

SLOs Eased

How do Service Level Objectives help over regular (manual or per-interval threshold-based) alerting?

Short answer

Engineers and engineering time are finite resources. So would you rather spend them on new features or on fixing existing problems?

Long answer (keep reading)

📊 Let’s illustrate the need using a sample service’s metrics over 15 minutes.

3️⃣ Three questions that come to mind

  1. Is this Service reliable?
  2. Should it alert and when?
  3. How severe is the situation?
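
| Timestamp | Throughput | 5xx |
| --- | --- | --- |
| 18:00 | 500 | 10 |
| 18:01 | 600 | 0 |
| 18:02 | 500 | 0 |
| 18:03 | 2000 | 20 |
| 18:04 | 150 | 5 |
| 18:05 | 150 | 10 |
| 18:06 | 150 | 15 |
| 18:07 | 600 | 12 |
| 18:08 | 600 | 5 |
| 18:09 | 700 | 5 |
| 18:10 | 2500 | 20 |
| 18:11 | 250 | 5 |
| 18:12 | 700 | 7 |
| 18:13 | 800 | 12 |
| 18:14 | 1250 | 5 |
| 18:15 | 1500 | 10 |
| 18:16 | 1500 | 5 |
| 18:17 | 2500 | 10 |
| 18:18 | 1000 | 10 |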

📈 Traditionally, your favorite dashboard tool would show you something like this.


🕡 There are spikes at 6:03 PM and 6:06 PM, but look at the actual volume of those errors.
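To put numbers on those two spikes, here is a minimal Python sketch that relates the 5xx count to the throughput for the same minute, using the rows from the sample table. The helper name `error_ratio` is illustrative only and not part of any monitoring tool.

```python
# (timestamp, throughput, 5xx count) for the two minutes called out above,
# copied from the sample table.
spikes = [("18:03", 2000, 20), ("18:06", 150, 15)]


def error_ratio(throughput, errors):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / throughput if throughput else 0.0


for ts, throughput, errors in spikes:
    ratio = error_ratio(throughput, errors)
    print(f"{ts}: {errors} errors out of {throughput} requests = {ratio:.1%}")
```

The larger spike in absolute terms (20 errors at 18:03) is only 1% of that minute's traffic, while the smaller one (15 errors at 18:06) is 10% of it. A raw error count alone does not tell you which of the two deserves attention.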


🪂 SREs often fall into the trap of chasing misleading spikes.

100% uptime is not possible. Some downtime, no matter how small, is inevitable. Because you need some time off to provide for the following, some requests are bound to fail.

<aside> 💡 It is fair to conclude that the only way to make a service reliable is to allow it to fail a little.

</aside>

But this downtime, albeit small, must NOT:

<aside> 💡 The key is to stop chasing every single error. Instead, look at the overall service health.

</aside>
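As a rough illustration of looking at service health rather than individual errors, the sketch below aggregates the whole sample table into one availability figure and compares it with a target. The 99% target (`SLO_TARGET`) and the function name `service_health` are assumptions made for this example; the article does not state a target.

```python
# (timestamp, throughput, 5xx count) rows copied from the sample table.
SAMPLES = [
    ("18:00", 500, 10), ("18:01", 600, 0), ("18:02", 500, 0),
    ("18:03", 2000, 20), ("18:04", 150, 5), ("18:05", 150, 10),
    ("18:06", 150, 15), ("18:07", 600, 12), ("18:08", 600, 5),
    ("18:09", 700, 5), ("18:10", 2500, 20), ("18:11", 250, 5),
    ("18:12", 700, 7), ("18:13", 800, 12), ("18:14", 1250, 5),
    ("18:15", 1500, 10), ("18:16", 1500, 5), ("18:17", 2500, 10),
    ("18:18", 1000, 10),
]

# Hypothetical objective: 99% of requests succeed over the window.
SLO_TARGET = 0.99


def service_health(samples, slo_target):
    """Summarise the window as a whole instead of minute by minute."""
    total_requests = sum(throughput for _, throughput, _ in samples)
    total_errors = sum(errors for _, _, errors in samples)
    availability = 1 - total_errors / total_requests
    # Number of failures the target tolerates for this much traffic.
    allowed_errors = (1 - slo_target) * total_requests
    return {
        "total_requests": total_requests,
        "total_errors": total_errors,
        "availability": availability,
        "budget_used": total_errors / allowed_errors,
        "within_slo": availability >= slo_target,
    }


health = service_health(SAMPLES, SLO_TARGET)
print(f"availability: {health['availability']:.2%}")
print(f"share of allowed failures used: {health['budget_used']:.1%}")
print(f"within SLO: {health['within_slo']}")
```

Under the assumed 99% target, the window adds up to 17,950 requests and 166 errors, so the service stays within its objective even though individual minutes look alarming on a chart.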