The previous post explained why alerts based on Service Level Objectives (SLOs) work better than regular alerting.
The next step is choosing the right Indicators. Here are the types of Indicators that apply to each kind of Service.
| Type 👇🏽 / Indicator 👉🏼 | Availability | Latency | Throughput | Correctness | TAT/Lag |
|---|---|---|---|---|---|
| Consumer-Facing | ✅ | ✅ | ✅ | | |
| Stateful | ✅ | ✅ | ✅ | | |
| Asynchronous | ✅ | ✅ | ✅ | | |
| Operational | ✅ | ✅ | ✅ | | |
A Consumer-Facing service runs HTTP/gRPC workloads where the caller expects an immediate response to the request they submit.
Stateful services are services like a database. In a microservices environment where multiple services call the same database, it is easy to forget that the database is itself a service.
The next time you are unable to decide, try answering this straightforward question:
Does my service HAVE a database, or does my service CALL a database?
An Asynchronous service does not respond with the result of the request; instead, it queues the request to be processed later. The only immediate response is an acknowledgement of whether the service accepted the task; the actual result becomes available later.
Operational services are usually internal to an organization and deal with jobs like reconciliation, infrastructure bring-up, and tear-down. These jobs are typically asynchronous, but with a greater focus on accuracy than on throughput. A job may run late, but it must be as correct as possible.
A Request-based SLO is the ratio of good requests to total requests, aggregated over the compliance period.
For an Availability SLO with a compliance period of 15 minutes, we simply count the total number of requests and the number of failed requests across those 15 minutes.
Request-based availability is 1-(failed/total)
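As a minimal sketch of the formula above, assuming the two counters for the compliance period are already available. The function name, the sample numbers, and the choice to treat a period with no traffic as fully available are illustrative assumptions, not part of any specific monitoring tool.

```python
def request_based_availability(total_requests: int, failed_requests: int) -> float:
    """Return 1 - (failed / total) for one compliance period."""
    if failed_requests < 0 or failed_requests > total_requests:
        raise ValueError("counts must satisfy 0 <= failed <= total")
    if total_requests == 0:
        # Assumption: with no traffic nothing failed, so report 100%.
        return 1.0
    return 1 - failed_requests / total_requests


# Hypothetical numbers: 12,000 requests in 15 minutes, 60 of them failed.
print(f"{request_based_availability(12_000, 60):.2%}")  # prints 99.50%
```

Every request carries the same weight here, so a short burst of failures during a busy minute and the same number of failures spread across the whole period produce the same result.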
A Window-based SLO is the ratio of good time intervals to total time intervals.
For an Availability SLO with a compliance period of 15 minutes, we split the compliance period into smaller windows of, say, 1 minute each.
Good windows: those where failed_requests/total_requests ≤ 1%
Window-based availability is (good_windows/total_windows)
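The window calculation can be sketched the same way. This is a rough illustration that assumes each 1-minute window is already summarized as a `(total_requests, failed_requests)` pair; the function name, the sample traffic, and the rule that an empty window counts as good are assumptions made for the example.

```python
from typing import Sequence, Tuple


def window_based_availability(
    windows: Sequence[Tuple[int, int]],
    max_error_ratio: float = 0.01,
) -> float:
    """Return good_windows / total_windows.

    Each entry in `windows` is (total_requests, failed_requests) for one window.
    """
    if not windows:
        raise ValueError("need at least one window")
    good_windows = 0
    for total, failed in windows:
        # Assumption: a window with no traffic counts as good.
        if total == 0 or failed / total <= max_error_ratio:
            good_windows += 1
    return good_windows / len(windows)


# Hypothetical 15-minute period: 13 clean minutes and 2 minutes with 5% errors.
windows = [(1000, 0)] * 13 + [(1000, 50)] * 2
print(f"{window_based_availability(windows):.1%}")  # prints 86.7%
```

For the same sample traffic, the request-based formula gives 1 - (100/15000), roughly 99.3%. The window-based number is much lower because each bad minute is counted in full, no matter how many requests inside it succeeded.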
<aside> 💡 You can learn more about 🪟 Window-Based SLOs here
</aside>
Imagine we are a media streaming company with two kinds of service to consider:
🏦 Payment Service
This service cares about Successful Payments. A Request-based SLO would be Ideal.
A sample objective looks like this:
<aside> 💡 Over the last seven days, 99% of the requests should be served without errors.
</aside>
📺 HD Streaming Service
This service cares about uninterrupted users: users who can continue watching for long sessions. Window-based SLOs are ideal for this.
A sample objective looks like this:
<aside> 💡 Over the last seven days, 99% of the 15-minute intervals should be good. An interval is good if at least 95% of the users did not receive an error.
</aside>
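To make the streaming objective above concrete, here is a hedged sketch of how it could be evaluated. It assumes each 15-minute interval is summarized as a hypothetical `(active_users, users_with_error)` pair; the function name, the data shape, and the sample week are illustrative only.

```python
from typing import Sequence, Tuple

INTERVALS_PER_WEEK = 7 * 24 * 4  # 672 intervals of 15 minutes in seven days


def streaming_slo_met(
    intervals: Sequence[Tuple[int, int]],
    good_user_ratio: float = 0.95,
    target: float = 0.99,
) -> bool:
    """Check whether enough intervals served enough users without errors.

    Each entry in `intervals` is (active_users, users_with_error).
    """
    if not intervals:
        raise ValueError("need at least one interval")
    good_intervals = sum(
        1
        for users, errored in intervals
        # Assumption: an interval with no active users counts toward the target.
        if users == 0 or (users - errored) / users >= good_user_ratio
    )
    return good_intervals / len(intervals) >= target


# Hypothetical week: 6 intervals where 15% of users saw an error, the rest at 2%.
week = [(200, 4)] * (INTERVALS_PER_WEEK - 6) + [(200, 30)] * 6
print(streaming_slo_met(week))  # prints True (666 of 672 intervals qualify)
```

With 672 intervals in a week, a 99% target leaves room for at most 6 failing intervals; a seventh would drop the ratio to 665/672, which is below 99%.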