Service Level Objectives

Assuming that you have read the Primer on Service Level Objectives (SLOs) available here, let’s talk about the benefits of adopting SLOs.

SLOs Eased

How do Service Level Objectives help over regular (manual or per-interval threshold-based) alerting?

Short answer

Engineers and engineering time are finite resources. So would you rather spend them on new features or on fixing existing problems?

Long answer (keep reading)

📊 Let’s illustrate the need using a sample service’s metrics, sampled once a minute from 18:00 to 18:18.

3️⃣ Three questions that come to mind

Is this Service reliable?
Should it alert and when?
How severe is the situation?

Timestamp	Throughput	5xx
18:00	500	10
18:01	600	0
18:02	500	0
18:03	2000	20
18:04	150	5
18:05	150	10
18:06	150	15
18:07	600	12
18:08	600	5
18:09	700	5
18:10	2500	20
18:11	250	5
18:12	700	7
18:13	800	12
18:14	1250	5
18:15	1500	10
18:16	1500	5
18:17	2500	10
18:18	1000	10

📈 Traditionally, your favorite dashboard tool would show you something like this.


🕡 There are spikes at 6:03 PM and 6:06 PM, but look at how many errors those spikes actually represent.


🪂 SREs often fall into this trap of misleading spikes.
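To make the point about misleading spikes concrete, here is a minimal Python sketch that puts each minute's raw 5xx count next to its error ratio (5xx divided by throughput). The data is copied from the table above; the names `SAMPLES` and `error_ratio` are made up for this illustration and do not come from any particular monitoring tool.

```python
# Per-minute samples copied from the table above: (timestamp, throughput, 5xx count).
SAMPLES = [
    ("18:00", 500, 10), ("18:01", 600, 0), ("18:02", 500, 0),
    ("18:03", 2000, 20), ("18:04", 150, 5), ("18:05", 150, 10),
    ("18:06", 150, 15), ("18:07", 600, 12), ("18:08", 600, 5),
    ("18:09", 700, 5), ("18:10", 2500, 20), ("18:11", 250, 5),
    ("18:12", 700, 7), ("18:13", 800, 12), ("18:14", 1250, 5),
    ("18:15", 1500, 10), ("18:16", 1500, 5), ("18:17", 2500, 10),
    ("18:18", 1000, 10),
]


def error_ratio(throughput, errors):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / throughput if throughput else 0.0


# Minute with the most errors in absolute terms (what a raw-count chart highlights).
# max() returns the first entry when several minutes tie.
worst_by_count = max(SAMPLES, key=lambda s: s[2])

# Minute with the highest share of failed requests.
worst_by_ratio = max(SAMPLES, key=lambda s: error_ratio(s[1], s[2]))

for ts, throughput, errors in SAMPLES:
    print(f"{ts}  {errors:>3} errors  {error_ratio(throughput, errors):6.2%}")

print("Most errors by count:", worst_by_count[0])
print("Highest error ratio: ", worst_by_ratio[0])
```

On this data the largest raw count is at 18:03 (20 errors out of 2000 requests, or 1%, tied with 18:10), while the highest error ratio is at 18:06 (15 out of 150, or 10%). A chart of raw counts and a chart of ratios tell different stories about the same minutes.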

100% uptime is not possible. Some downtime, no matter how small, is inevitable: you need some time off to provide for the following, so some requests are bound to fail.

🔺 Upgrades
🔧 Maintenance
🏖️ Comfort to Engineers

<aside> 💡 It is fair to conclude that the only way to make a service reliable is to allow it to fail a little.

</aside>

But this downtime, albeit small, must NOT:

Bring down other services in the system.
Impact a large majority of the users.
Become a trend.

<aside> 💡 The key is to stop chasing every single error. Instead, look at the overall health of the service.

</aside>
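One way to look at service health rather than individual errors is to aggregate over the whole window and compare the result against a target. The sketch below does that for the sample data above. The 99% target (`SLO_TARGET`) is purely an assumption for illustration, since this article does not state one, and the variable names are hypothetical.

```python
# (throughput, 5xx count) per minute, copied from the sample table (18:00 to 18:18).
SAMPLES = [
    (500, 10), (600, 0), (500, 0), (2000, 20), (150, 5), (150, 10),
    (150, 15), (600, 12), (600, 5), (700, 5), (2500, 20), (250, 5),
    (700, 7), (800, 12), (1250, 5), (1500, 10), (1500, 5), (2500, 10),
    (1000, 10),
]

# Hypothetical objective: 99% of requests succeed over the window.
# This number is an assumption for the example, not taken from the article.
SLO_TARGET = 0.99

total_requests = sum(throughput for throughput, _ in SAMPLES)
total_errors = sum(errors for _, errors in SAMPLES)

# Share of requests that succeeded across the whole window.
success_ratio = 1 - total_errors / total_requests

# How many failures the target tolerates for this much traffic,
# and what fraction of that allowance has been used.
allowed_errors = (1 - SLO_TARGET) * total_requests
budget_used = total_errors / allowed_errors

print(f"Requests: {total_requests}, errors: {total_errors}")
print(f"Success ratio: {success_ratio:.2%} (target {SLO_TARGET:.0%})")
print(f"Share of allowed failures used: {budget_used:.0%}")
print("Within target" if success_ratio >= SLO_TARGET else "Target missed")
```

For this window the service handled 17,950 requests with 166 failures, a success ratio of about 99.08%. Against the assumed 99% target that is still within bounds, although roughly 92% of the allowed failures have been used, which says more about the service than any single spike does.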