In this article I want to share an idea behind Site Reliability Engineering (SRE) that has made a strong impression on me. I studied electrical engineering, even though I’ve probably forgotten most of it. One of the concepts you learn there is the “feedback loop”.
Think of it as follows: you’re driving your car with cruise control turned on. When you drive up a slight slope, your car loses speed. Cruise control sees the difference between your desired speed and your current speed. The bigger the difference, the more “gas” it gives to get back to the speed you’ve set. When I heard about the SRE concept of “error budgets”, I recognized the same mechanism and knew there was something special about this practice.
Measuring the user experience
The goal of SRE is to measure and improve the user experience. Historically, we would implement monitoring based on whatever technical measures were available out of the box. SRE explicitly sets out to measure the user experience instead.
“100%? Disk full? Don’t bother me, someone will look at it in the morning.”
“Are the users impacted? Wake everyone up and fix it.”
The goal of an application or service is to provide business value. This value is lost if the application does not function adequately. Within SRE we use Service Level Objectives (SLOs) to quantify whether business value is being lost. Or, from a user perspective, SLOs express how satisfied the users are. SRE uses Service Level Indicators (SLIs) to measure this. An SLI is commonly defined as the proportion of good events out of all valid events, expressed as a percentage.
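To make the percentage concrete, here is a minimal sketch that computes an SLI as good events divided by valid events. The function name and the request counts are hypothetical; what counts as a “good” event (for example, a request that succeeded quickly enough) is something you have to decide per service.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI: the percentage of valid events that were good."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return 100.0 * good_events / valid_events


# Hypothetical numbers: 1,000,000 requests in the window, 999,500 of them good.
sli = sli_percent(good_events=999_500, valid_events=1_000_000)
print(f"SLI: {sli:.2f}%")
```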
Defining what success looks like
Now that we have this percentage, the next thing we do is define how many “errors” we accept. This becomes our “error budget”. With an SLO of 99.9%, for example, the remaining 0.1% is the error budget.
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a time window. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
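As a rough sketch of the arithmetic, assuming an event-based SLO: the budget is the share of events the SLO allows to fail, and what remains is that budget minus the failures already seen. The function names and the numbers are hypothetical.

```python
def error_budget(slo_percent: float, valid_events: int) -> float:
    """Number of bad events the SLO tolerates for this many valid events."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return valid_events * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, good_events: int, valid_events: int) -> float:
    """Bad events still allowed in the window; negative means overspent."""
    bad_events = valid_events - good_events
    return error_budget(slo_percent, valid_events) - bad_events


# Hypothetical window: SLO of 99.9% over 1,000,000 requests, 500 of them bad.
budget = error_budget(99.9, 1_000_000)
left = budget_remaining(99.9, good_events=999_500, valid_events=1_000_000)
print(f"budget: {budget:.0f} bad requests, remaining: {left:.0f}")
```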
Error budgets as feedback loops
The goal of SRE is to put in place a feedback loop that lets you decide how much effort to put into improving the user experience. If you run out of error budget, you should spend more time and effort on improving the user experience. One way to implement this is to add an improvement story to your backlog whenever you’ve run out of budget. Other implementations block all releases until the system is back in budget. How strict this should be for your organization is something you will have to figure out for yourself.
There’s one more elegant aspect to this. “Underspending” on reliability means your user experience will suffer, but “overspending” is wasteful. Building a system that is more reliable than required might bring no noticeable improvement for the user. Going from 99.99% to 99.999% might well add an extra zero to your engineering cost, since it requires more monitoring, redundancy, automation and support. This is why, in my eyes, the feedback loop of the error budget is so elegant.
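The two policies mentioned here can be sketched as a small decision function. This is only an illustration under assumptions of my own: the names `Action`, `budget_consumed` and `decide` are hypothetical, the SLO is event-based, and “out of budget” is taken to mean that the bad events have reached or passed what the SLO allows.

```python
from enum import Enum


class Action(Enum):
    SHIP_FEATURES = "ship features as usual"
    ADD_IMPROVEMENT_STORY = "add an improvement story to the backlog"
    BLOCK_RELEASES = "block releases until back in budget"


def budget_consumed(slo_percent: float, good_events: int, valid_events: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if not 0.0 < slo_percent < 100.0:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    allowed_bad = valid_events * (100.0 - slo_percent) / 100.0
    actual_bad = valid_events - good_events
    return actual_bad / allowed_bad


def decide(consumed: float, strict: bool = False) -> Action:
    """Feedback step: the measured budget usage drives where effort goes."""
    if consumed < 1.0:
        return Action.SHIP_FEATURES
    return Action.BLOCK_RELEASES if strict else Action.ADD_IMPROVEMENT_STORY


# Hypothetical window: SLO of 99.9%, 1,500 bad requests out of 1,000,000.
consumed = budget_consumed(99.9, good_events=998_500, valid_events=1_000_000)
print(f"budget consumed: {consumed:.0%}")
print("lenient policy:", decide(consumed).value)
print("strict policy: ", decide(consumed, strict=True).value)
```

The `strict` flag stands in for the organizational choice: the measurement is the same, only the reaction differs.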
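To get a feel for what each extra nine buys, it helps to look at how little downtime it leaves. A rough sketch, assuming a time-based availability target measured over a 365-day year; the helper name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta

YEAR = timedelta(days=365)


def allowed_downtime(availability_percent: float, window: timedelta = YEAR) -> timedelta:
    """Downtime an availability target tolerates within the window."""
    if not 0.0 < availability_percent <= 100.0:
        raise ValueError("availability_percent must be in (0, 100]")
    return window * ((100.0 - availability_percent) / 100.0)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Going from 99.99% to 99.999% shrinks the yearly allowance from roughly 53 minutes to roughly 5, which is hard to meet without the extra monitoring, redundancy and automation.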