How do you measure SLOs?

The previous post explained why alerts based on Service Level Objectives (SLOs) work better than regular alerting.

This post is a 3️⃣-step guide to choosing objectives effectively.

1️⃣ Step 1: Identify the type of Service

Then choose the Indicators that apply. Here are the types of Indicators that apply to each kind of Service.

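| Type 👇🏽 / Indicator 👉🏼 | Availability | Latency | Throughput | Correctness | TAT/Lag |
| --- | --- | --- | --- | --- | --- |
| Customer-Facing | ✅ | ✅ | ✅ | | |
| Stateful | | ✅ | ✅ | ✅ | |
| Asynchronous | | | ✅ | ✅ | ✅ |
| Operational | ✅ | | | ✅ | ✅ |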
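As a rough sketch, the table above can be kept as a small lookup so that tooling can ask which Indicators to track for a given Service. The lowercase keys and the `indicators_for` helper are hypothetical names chosen for this illustration; only the ticks come from the table.

```python
# Indicator applicability per service type, transcribed from the table above.
# The key spellings are an assumption made for this sketch.
INDICATORS_BY_SERVICE_TYPE = {
    "customer-facing": {"availability", "latency", "throughput"},
    "stateful": {"latency", "throughput", "correctness"},
    "asynchronous": {"throughput", "correctness", "tat/lag"},
    "operational": {"availability", "correctness", "tat/lag"},
}


def indicators_for(service_type: str) -> set[str]:
    """Return the indicators that apply to a service type (case-insensitive)."""
    key = service_type.strip().lower()
    if key not in INDICATORS_BY_SERVICE_TYPE:
        raise ValueError(f"unknown service type: {service_type!r}")
    return INDICATORS_BY_SERVICE_TYPE[key]


print(sorted(indicators_for("Asynchronous")))
# ['correctness', 'tat/lag', 'throughput']
```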

🛒 Customer-Facing Services

A Service running HTTP or gRPC workloads, where the caller expects an immediate response to the request it submits.

💾 Stateful Services

Services like a database. In a Microservices environment where multiple services call the same database, it is a common mistake not to treat that database as a Service in its own right.

If you are unable to decide, try answering this straightforward question:

Does my Service HAVE a database, or does my Service CALL a database? If it calls a database that other services also call, that database is a Stateful Service of its own.

🔭 Asynchronous Services

Any Service that does not respond with the result of the request, but instead queues the request to be processed later. The only immediate response is an acknowledgement of whether the Service accepted the task; the actual result becomes available later.
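To make the acknowledge-now, process-later behaviour concrete, here is a minimal in-memory sketch. `AsyncService`, `Ack`, the `capacity` limit, and the explicit `now` timestamps are all assumptions made for illustration; a real Service would sit on a message queue and use a real clock. The recorded lag is the time between acceptance and processing, which is what a TAT/Lag Indicator would measure.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Ack:
    task_id: int
    accepted: bool


class AsyncService:
    """Toy asynchronous service: acknowledge now, process later."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue = deque()  # holds (task_id, payload, accepted_at)
        self.lags = []        # seconds between acceptance and processing
        self._next_id = 0

    def submit(self, payload: str, now: float) -> Ack:
        """Return only an acknowledgement; no result is computed here."""
        task_id = self._next_id
        self._next_id += 1
        if len(self.queue) >= self.capacity:
            return Ack(task_id, accepted=False)
        self.queue.append((task_id, payload, now))
        return Ack(task_id, accepted=True)

    def process_one(self, now: float) -> str:
        """Process the oldest queued task and record its lag."""
        _task_id, payload, accepted_at = self.queue.popleft()
        self.lags.append(now - accepted_at)
        return payload.upper()  # stand-in for the real work


svc = AsyncService(capacity=1)
print(svc.submit("encode video", now=0.0).accepted)  # True
print(svc.submit("send email", now=1.0).accepted)    # False, queue is full
print(svc.process_one(now=30.0), svc.lags)           # ENCODE VIDEO [30.0]
```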

🛂 Operational Services

Operational Services are usually internal to an organization and deal with jobs like reconciliation, infrastructure bring-up, and tear-down. These jobs are typically asynchronous, but with a greater focus on accuracy than on throughput. The job may run late, but it must be as correct as possible.

2️⃣ Step 2: Identify the right type of SLO

📞 Request Based

A Request-based SLO is the ratio of good requests to total requests, aggregated over the compliance period.

For an Availability SLO with a compliance period of 15 minutes, we simply count the total number of requests and the number of failed requests across those 15 minutes.

Request-based availability is 1 - (failed_requests / total_requests)
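The formula above can be sketched in a few lines. The function name and the sample counts are made up for illustration, and treating a period with no traffic as fully available is an assumption, not something the formula dictates.

```python
def request_based_availability(total: int, failed: int) -> float:
    """Availability over one compliance period: 1 - (failed / total)."""
    if total < 0 or failed < 0 or failed > total:
        raise ValueError("need 0 <= failed <= total")
    if total == 0:
        # Assumption: no requests in the period counts as fully available.
        return 1.0
    return 1 - failed / total


# Hypothetical counts collected over one 15-minute compliance period.
availability = request_based_availability(total=12_000, failed=60)
print(f"{availability:.3%}")  # 99.500%
```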

🪟 Window Based

A Window-based SLO is the ratio of good time intervals to total time intervals.

For an Availability SLO with a compliance period of 15 minutes, we split the compliance period into smaller windows of, say, 1 minute each.

Good windows: where failed_requests/total_requests ≤ 1%

Window-based availability is (good_windows / total_windows)
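Here is a minimal sketch of the window-based calculation, using the 1% threshold from the definition of a good window. The function name, the sample traffic, and the choice to count an empty window as good are assumptions for illustration. The example also computes the request-based figure for the same traffic, to show that the two can disagree.

```python
def window_based_availability(
    windows: list[tuple[int, int]], max_error_ratio: float = 0.01
) -> float:
    """windows holds one (total_requests, failed_requests) pair per window."""
    if not windows:
        raise ValueError("need at least one window")
    good = 0
    for total, failed in windows:
        # Assumption: a window with no traffic counts as good.
        if total == 0 or failed / total <= max_error_ratio:
            good += 1
    return good / len(windows)


# 15 one-minute windows: 14 healthy ones, plus one full outage during low traffic.
windows = [(1000, 5)] * 14 + [(100, 100)]

window_based = window_based_availability(windows)
total = sum(t for t, _ in windows)
failed = sum(f for _, f in windows)
request_based = 1 - failed / total

print(f"window-based:  {window_based:.2%}")   # 93.33%
print(f"request-based: {request_based:.2%}")  # 98.79%
```

In this made-up traffic, the one-minute outage costs a full window out of fifteen but only about 1.2% of the requests, so the window-based number drops much further.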

<aside> 💡 You can learn more about 🪟 Window-Based SLOs here

🪟 Window Based SLOs ...

</aside>

Imagine we are a media streaming company with two kinds of Service under consideration:

🏦 Payment Service

This Service cares about successful payments. A Request-based SLO would be ideal.

A sample objective looks like this:

<aside> 💡 Over the last seven days, 99% of the requests should be served without errors.

</aside>

📺 HD Streaming Service

This Service cares about uninterrupted users: users who can continue watching for long sessions. Window-based SLOs are ideal for this.

A sample objective looks like this:

<aside> 💡 Over the last seven days, 99% of the 15-minute intervals should be good. An interval is good if at least 95% of the users did not receive an error.

</aside>
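To get a feel for what the streaming objective above allows, here is a small worked sketch. The helper names and the per-interval user counts are hypothetical; the 15-minute interval, the 95% per-interval threshold, and the 99% target over seven days come from the sample objective.

```python
INTERVAL_MINUTES = 15
intervals_per_week = 7 * 24 * 60 // INTERVAL_MINUTES


def is_good_interval(
    users_total: int, users_with_error: int, min_ok_ratio: float = 0.95
) -> bool:
    """An interval is good if enough users saw no error."""
    if users_total == 0:
        return True  # assumption: an interval with no users counts as good
    return (users_total - users_with_error) / users_total >= min_ok_ratio


def meets_objective(
    good_intervals: int, total_intervals: int, target_percent: int = 99
) -> bool:
    """Integer arithmetic avoids float rounding right at the target."""
    return good_intervals * 100 >= target_percent * total_intervals


# How many bad intervals can a week contain while still meeting 99%?
allowed_bad = intervals_per_week * (100 - 99) // 100

print(intervals_per_week)  # 672
print(allowed_bad)         # 6
print(meets_objective(intervals_per_week - 6, intervals_per_week))  # True
print(meets_objective(intervals_per_week - 7, intervals_per_week))  # False
```

Under these numbers, the Service can have at most six bad 15-minute intervals, or 90 minutes, in a week.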

3️⃣ Step 3: Set the Objectives

Availability
Latency
Correctness
Lag