SLO, Error Rate, Error Budget
A brief explanation of the key concepts of SRE (Site Reliability Engineering)
SLI (Service Level Indicator)
It's the ratio of good events to all valid events, expressed as a percentage.
A common choice is the count of requests returning response code 200 OK over the count of all requests.
For example, if 99 requests out of 100 returned 200 OK, then the SLI is 99%.
SLI = (Count of requests returning 200 OK / Count of all requests) * 100
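The SLI formula above can be sketched in a few lines of Python. This is only an illustration: the function name `sli_percent` and its parameters are assumptions made for this example, not part of any monitoring tool.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return good events as a percentage of all valid events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be a positive count")
    # Multiply first so that simple cases such as 99/100 stay exact.
    return good_events * 100 / valid_events


# 99 of 100 requests returned 200 OK
print(sli_percent(99, 100))  # 99.0
```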
SLO (Service Level Objective)
It's a target applied to the SLI and bounded by a time window.
In simple terms, we bind our SLI to a period and to the objective we want to achieve.
For example, if the SLO is defined as 99.9%, then 99.9% of requests should be successful over a window of 7, 28, or 30 days.
Out of 1 million (10,00,000) requests in 30 days, 9,99,000 requests should return 200 OK responses.
10,00,000 * 99.9% = 9,99,000
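As a rough sketch of the same arithmetic, the snippet below computes how many good requests a given SLO requires and checks a measured count against it. The helper names are hypothetical, and the SLO is passed as a string so that `Fraction` can keep the math exact.

```python
from fractions import Fraction


def required_good_requests(total_requests: int, slo_percent: str) -> Fraction:
    """Minimum number of successful requests needed to meet the SLO."""
    return Fraction(total_requests) * Fraction(slo_percent) / 100


def meets_slo(good_requests: int, total_requests: int, slo_percent: str) -> bool:
    """True when the measured good-request count satisfies the SLO for the window."""
    return good_requests >= required_good_requests(total_requests, slo_percent)


print(required_good_requests(1_000_000, "99.9"))  # 999000
print(meets_slo(999_000, 1_000_000, "99.9"))      # True
```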
Error Budget
It's the error percentage that we are allowed to consume without breaching our SLO.
For example, if the SLO is 99.9%, then the error budget is 0.1% (100% - 99.9%).
With an SLO of 99.9%, 9,99,000 out of 1 million (10,00,000) requests should return 200 OK responses in 30 days, so with an error budget of 0.1%, 1000 requests are allowed to fail in those 30 days.
10,00,000 * 0.1% = 1000
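The error budget calculation can be written out as follows. This is a minimal sketch with made-up function names, using exact fractions to avoid floating-point noise in the subtraction 100 - 99.9.

```python
from fractions import Fraction


def error_budget_percent(slo_percent: str) -> Fraction:
    """Error budget is whatever the SLO leaves over: 100% - SLO."""
    return 100 - Fraction(slo_percent)


def allowed_failures(total_requests: int, slo_percent: str) -> Fraction:
    """Number of requests that may fail in the window without breaching the SLO."""
    return total_requests * error_budget_percent(slo_percent) / 100


print(float(error_budget_percent("99.9")))  # 0.1
print(allowed_failures(1_000_000, "99.9"))  # 1000
```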
Error Rate
The error rate is the percentage of requests that fail, and it determines how quickly we consume our error budget. Let's try to understand this with a few different scenarios.
If the error rate is 1%, then we will consume our error budget in 3 days.
1% of 1 million is 10,000, meaning we will get 10,000 failed requests, which is more than the 1,000 allowed.
So, 10,000 requests fail in 30 days.
At that steady rate, 1,000 requests fail in 3 days.
It means we will consume our error budget in just 3 days.
If the error rate is 0.1%, which is equal to the error budget, then we will consume our error budget in 30 days.
0.1% of 1 million is 1,000, meaning we will get 1,000 failed requests.
So, 1,000 requests fail in 30 days.
It means we will consume our error budget in 30 days.
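Both scenarios follow one rule: days to exhaust the budget = window * error budget / error rate. The sketch below assumes traffic and failures are spread evenly across the window, as the scenarios above do; the function name is hypothetical.

```python
from fractions import Fraction


def days_to_exhaust_budget(error_rate_percent: str, slo_percent: str,
                           window_days: int = 30) -> Fraction:
    """Days until the error budget is used up at a constant error rate."""
    budget = 100 - Fraction(slo_percent)
    rate = Fraction(error_rate_percent)
    if rate <= 0:
        raise ValueError("error rate must be positive")
    return window_days * budget / rate


print(days_to_exhaust_budget("1", "99.9"))    # 3
print(days_to_exhaust_budget("0.1", "99.9"))  # 30
```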
Burn Rate
Burn rate gives an idea of how fast we are consuming our error budget. Let's try to understand it with the scenarios discussed in the Error Rate.
If the error rate is 0.1%, we consume our error budget in exactly 30 days, so the burn rate is 1.
If the error rate is 1%, we consume our error budget in 3 days, which is 10 times faster, so the burn rate is 10.
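Put another way, the burn rate is the error rate divided by the error budget. A small sketch, with an assumed function name:

```python
from fractions import Fraction


def burn_rate(error_rate_percent: str, slo_percent: str) -> Fraction:
    """How many times faster than 'exactly on budget' we are consuming the budget."""
    budget = 100 - Fraction(slo_percent)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return Fraction(error_rate_percent) / budget


print(burn_rate("0.1", "99.9"))  # 1
print(burn_rate("1", "99.9"))    # 10
```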
SLO Burn Rate Alerting
Alert condition: Error Rate >= SLO Threshold
We can choose a small window of 10 minutes for raising the alert.
If the SLO is 99.9%, the SLO threshold is 0.1%, so if the error rate is greater than or equal to 0.1% within the 10-minute window, the alert is raised.
1000 requests are allowed to fail in 30 days as per the 0.1% error budget.
1000 requests -> 30 days * 24 hours * 60 minutes
1000 requests -> 43200 minutes
1000/43200 requests -> 1 minute
(1000/43200)*10 requests -> 10 minutes
about 0.23 requests -> 10 minutes
It means about 0.23 requests are allowed to fail in 10 minutes, and 0.23 requests are about 0.023% of 1000.
It means we will consume only about 0.023% of our error budget in 10 minutes.
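The per-window arithmetic above can be checked with exact fractions. The helper names below are illustrative only, and the calculation assumes the budget is spent evenly over the 30 days (a burn rate of 1).

```python
from fractions import Fraction


def budget_share_for_window(window_minutes: int, period_days: int = 30) -> Fraction:
    """Share of the period covered by the window, i.e. budget used at burn rate 1."""
    period_minutes = period_days * 24 * 60  # 43200 minutes for 30 days
    return Fraction(window_minutes, period_minutes)


def allowed_failures_in_window(budget_requests: int, window_minutes: int,
                               period_days: int = 30) -> Fraction:
    """Failed requests the budget allows inside a single window."""
    return budget_requests * budget_share_for_window(window_minutes, period_days)


share = budget_share_for_window(10)
print(round(float(allowed_failures_in_window(1000, 10)), 2))  # 0.23
print(round(float(share) * 100, 3))                           # 0.023 (percent of the budget)
```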
As we previously discussed under Error Rate, if the error rate is 0.1% and the error budget is also 0.1%, then we will exhaust our error budget only on the last day of the 30-day window.
It means we get alerted even though we would complete our 30 days without breaching our SLO threshold.
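To make the problem concrete, here is a minimal sketch of the alert condition as described in this section. The function name and the sample request counts are hypothetical, not taken from any real alerting system.

```python
from fractions import Fraction


def should_alert(failed: int, total: int, slo_percent: str) -> bool:
    """Fire when the window's error rate is at or above the SLO threshold (100% - SLO)."""
    if total <= 0:
        return False  # no traffic in the window, nothing to evaluate
    threshold = 100 - Fraction(slo_percent)
    error_rate = Fraction(failed * 100, total)
    return error_rate >= threshold


# Hypothetical 10-minute window: 10,000 requests, 10 failures -> 0.1% error rate.
print(should_alert(10, 10_000, "99.9"))  # True, even though the burn rate is only 1
print(should_alert(9, 10_000, "99.9"))   # False
```

With a 0.1% error rate the alert fires, although at that pace the budget would last the full 30 days.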