How Google Engineering Manages Downtime
Each Google product has a service level agreement (SLA) that dictates how much downtime the product can have in a given month or year. Take 99.9% uptime, for example: That allows for about 43 minutes of downtime in a 30-day month, or about 8 hours and 46 minutes per year. That annual allowance is what Treynor refers to as an “error budget.”
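The arithmetic behind an error budget is simple enough to sketch in a few lines. The snippet below is only an illustration, not Google's tooling: the function name `error_budget` is hypothetical, and it assumes a 30-day month and a 365-day year.

```python
from datetime import timedelta


def error_budget(uptime_target: float, period: timedelta) -> timedelta:
    """Return the downtime permitted in `period` for a given uptime target.

    `uptime_target` is a fraction, e.g. 0.999 for 99.9% uptime.
    (Hypothetical helper for illustration only.)
    """
    if not 0.0 < uptime_target <= 1.0:
        raise ValueError("uptime_target must be in the range (0, 1]")
    # The budget is the fraction of the period that may be spent down.
    return period * (1.0 - uptime_target)


# Assumed period lengths: 30-day month, 365-day year.
monthly = error_budget(0.999, timedelta(days=30))
yearly = error_budget(0.999, timedelta(days=365))

print(f"Monthly budget: {monthly.total_seconds() / 60:.1f} minutes")  # 43.2 minutes
print(f"Yearly budget:  {yearly.total_seconds() / 3600:.2f} hours")   # 8.76 hours
```

Tightening the target shrinks the budget quickly: under the same assumptions, 99.99% uptime leaves only about 4.3 minutes per month.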
If the product adheres to the SLA’s uptime promise, then the product team is allowed to launch new features. If the product is outside of its SLA, then no new features are allowed to be rolled out until the reliability improves.
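The launch rule described above amounts to a single comparison. Here is a minimal sketch of it, assuming downtime is tracked in minutes and that a product sitting exactly at its budget still counts as within its SLA; the names `remaining_budget` and `launches_allowed` are hypothetical and do not come from any Google system.

```python
def remaining_budget(budget_minutes: float, downtime_minutes: float) -> float:
    """Minutes of error budget left; a negative value means the SLA is breached."""
    return budget_minutes - downtime_minutes


def launches_allowed(budget_minutes: float, downtime_minutes: float) -> bool:
    """Allow new feature launches only while the product is within its budget."""
    return remaining_budget(budget_minutes, downtime_minutes) >= 0


# A 99.9% target over a 30-day month gives a budget of about 43.2 minutes.
print(launches_allowed(43.2, 30.0))  # True: 13.2 minutes of budget remain
print(launches_allowed(43.2, 50.0))  # False: launches freeze until reliability improves
```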
Google product managers don’t have to be perfect; they just have to do better than their SLA guarantee. Each product team at Google therefore has a “budget” of errors it can make, and it simply can’t accumulate more downtime than the SLA allows.
Treynor explains that in a traditional site reliability model there is a fundamental disconnect between site reliability engineers (SREs) and the product managers. Product managers want to keep adding services to their offerings, but the SREs don’t like changes because that opens the door to more potential problems. This “error budget” model addresses that issue, though, by uniting the priorities of the SREs and product teams.
Putting the onus on the product developers to architect reliable systems makes this a win-win for everyone. SREs get reliable systems, developers get to add features, and users don’t experience downtime (hopefully). A system of error budgets, rather than a mandate of 100% uptime, gives developers and engineers some leeway while more closely aligning the priorities of developers and site reliability workers.