Building Blocks for Site Reliability At Google

Video Statistics and Information

Captions
I work for Google in Zurich. I've been there for just over 10 years now, and I've worked on a wide variety of systems in my tenure at Google. I started out working on the web crawler, on the web search indexing system, the logs processing system, but also Google Maps, Google Calendar, Google Sites. Right now I manage the team that runs Google's payments and billing systems; I try very hard not to think about how much money is being moved through those systems every year.

So this talk is "Building Blocks for Site Reliability", and before we go into the building blocks we first have to talk about what site reliability engineering actually is. Actually, do any of you have this? We just wrote a book about site reliability engineering; anybody have this book? If you have it with you, I would be very happy to sign it for you; I do have a chapter in that book.

So what is site reliability engineering? I was at SREcon in Dublin this year, which tells you one thing about site reliability engineering: it's big enough to have its own USENIX conference. But also at SREcon there was a panel discussion entitled "What is SRE?", and there was one question that was not answered during that entire discussion, and that question is: what is SRE? So in that sense the discussion kind of failed its goal. If you look at Google's own material on the topic, you get these sound bites about what SRE is. There's one from Ben Treynor, our VP of operations: "SRE is what you get when you treat operations as if it's a software problem." There's one from Andrew Widdowson, who runs our education programs for SRE: "Our work is like being part of the world's most intense pit crew. We change the tires of a race car as it's going 100 miles per hour." You can tell that he's American, because he thinks 100 miles per hour is actually fast. There's another one that says "SREs engineer services instead of binaries." Me and a couple of co-workers have been trying to come up with a sort of brand statement for site reliability engineering, and the best we could come up with is: site reliability engineering is a specialized job function that focuses on the reliability and maintainability of large systems. So it's a job function that is specifically geared towards reliability.

As I said, Google kind of invented the term, which means that we also get to define it, but talking with people in other companies it turns out that how companies implement site reliability engineering is wildly different. There are a ton of different engagement models for how you can implement this job function. I've seen companies that have an embedded site reliability model, where you have one or two site reliability engineers embedded into product engineering teams. There are companies that don't really have SRE at all, that have something like production engineering that does not engage with the product at all, but rather writes infrastructure and platforms for achieving reliable systems. So I'm just going to talk about how Google implements SRE, and the way Google implements SRE is the following. SRE at Google is its own department; it's part of Google's technical infrastructure organization, and SRE is less than 10% of the engineering staff of Google; historically it's been five to ten percent of the engineering staff. Individual SRE teams partner with product engineering teams in a sort of co-ownership model of the services. Usually it's a one-to-many mapping, so you will have one SRE team that is responsible for working with a number of services.
The size of the SRE teams compared to the size of the product engineering teams also varies widely. The best team I've ever been on had a ratio of one to five between SREs and product engineers; that was incredible, that was really nice, because you basically knew everybody by name. But I've also been on teams that had a ratio of one to fifty between site reliability engineers and product engineers.

One of the things that makes it hard to implement SRE in the Google model is that SRE teams at Google come with a minimum size. If you have an SRE team in one single location, you need at least eight people; if you have an SRE team that is split across two locations, then we try to have at least six plus six, so six people in each location. The reason for that is that SRE does emergency response for the services supported by the team, and we try not to burn our people out on emergency response, on on-call response. If you want to have a sustainable rotation for emergency response, you need a certain number of people: you don't want people to be on call all the time, or one out of every two or three weeks. So we aim for at least the eight or six-plus-six model. This also means that SRE teams are very often not co-located with the product engineering teams. In a single-site SRE model you can still kind of achieve that, but as soon as you go to multiple sites, with product engineering in a single site and site reliability in at least two, you're not going to have colocation. For a lot of product engineering teams this is actually a big change in approach: they're used to basically having everybody they interact with be down the corridor from them, and suddenly they need to interact with people who are in a different time zone.

SRE recruits from both software engineering and systems engineering backgrounds. There are no organizational barriers between product engineering and SRE for software engineers, so if you get hired by Google as a software engineer you can freely move between product engineering and site reliability engineering. The other background that SRE recruits from is systems engineering, or systems administration, and we basically recruit on the entire spectrum between software engineering and systems engineering. We also recruit a lot of people with a mixed background, who maybe don't quite meet the software engineering bar at Google but make up for it by being really good systems engineers. The reason why we recruit for this background is that there's a mix of operational and engineering work in SRE, and there is a cap on operational work for a site reliability team. The official cap is 50 percent operational load; this is where we pull the emergency brake and offload operational work to the product engineering teams. Healthy teams have a much lower operational load than 50 percent; usually we aim for 10 to 15 percent operational load.
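To make those rules of thumb a bit more concrete, here is a minimal sketch in Python that encodes the numbers quoted above (the minimum team sizes, the 50% cap, the 10-15% target) as hard-coded assumptions; it is purely illustrative and not anything Google actually runs.

```python
# Rough sanity checks for SRE team staffing, encoding the rules of thumb
# quoted in the talk as assumptions. Illustrative only, not actual tooling.

MIN_SINGLE_SITE = 8           # minimum team size for a single-site team
MIN_PER_SITE_SPLIT = 6        # minimum per site for a two-site (6+6) team
MAX_ON_CALL_FRACTION = 1 / 3  # roughly "not more than one week in three"
MAX_OPS_LOAD = 0.50           # the hard cap on operational work
TARGET_OPS_LOAD = 0.15        # what a healthy team aims for (10-15%)


def rotation_is_sustainable(team_size: int, sites: int = 1) -> bool:
    """Check the minimum-size and on-call-frequency rules of thumb."""
    if sites == 1 and team_size < MIN_SINGLE_SITE:
        return False
    if sites == 2 and team_size < 2 * MIN_PER_SITE_SPLIT:
        return False
    # With one primary on call at a time, each engineer is on call
    # roughly 1/team_size of the time.
    return (1 / team_size) <= MAX_ON_CALL_FRACTION


def ops_load_status(ops_hours: float, total_hours: float) -> str:
    """Classify a team's operational load against the cap and the target."""
    load = ops_hours / total_hours
    if load > MAX_OPS_LOAD:
        return "over the 50% cap: pull the emergency brake, offload work"
    if load > TARGET_OPS_LOAD:
        return "above the 10-15% target: reduce operational load"
    return "healthy"


print(rotation_is_sustainable(8))            # True  (single site, 8 people)
print(rotation_is_sustainable(12, sites=2))  # True  (6 + 6)
print(ops_load_status(30, 160))              # above the 10-15% target
```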
One thing I also wanted to mention is: what about DevOps? That's a term that very often gets thrown into the mix once you start talking about site reliability engineering, and I've noticed that Google has an institutional blindness when it comes to the term DevOps. We look at the DevOps principles, we look at the DevOps toolchain, we look at the DevOps lifecycle, and we sort of tilt our heads and go: wait, isn't that how you do things? Because Google has no traditional IT; Google does not have ops; all of dev at Google is DevOps. We simply do not operate in a model where we have traditional IT, so pretty much all of the DevOps toolchain is stuff that we do anyway, and a large part of the DevOps toolchain is covered by site reliability engineering, especially when we're talking about automation, about monitoring, about releases; a lot of that is covered by site reliability.

That being said, let's look at the building blocks; let's look at how you get a reliable service. I said that site reliability engineering is a specialized job function focused on reliability, and that means SRE is likely to engage in the fields that make the biggest difference to the reliability of a system. These are basically the building blocks for SRE: what do you need to do if you want to get a reliable service? I'm going to talk about a couple of them in the rest of this talk; the others are in the book, if you want to read that.

First of all, if you want to have a reliable service, you need to know how reliable it actually is, and that means you need monitoring: you need something that tells you how reliable your service is at this point in time. The same goes for improving reliability: unless you know how reliable it is, you have nothing to improve. And unless you have alerting, you can't actually step in when things go wrong, unless people call you on the phone and tell you that your site is broken; but you don't really want to get to that point, you want to know that your site is broken before your customers notice. All of these things make monitoring and alerting a very attractive target for SRE. SRE typically writes the monitoring for their own services, writes the instrumentation, the black-box monitoring for the services, reviews and tunes alerts, and so on; it covers the whole telemetry and instrumentation part.

What else do you need? You need service level objectives. Monitoring tells you how reliable your service is; service level objectives tell you how reliable you want it to be. Service level objectives serve as the goal for the SRE team to strive for, and also as a reference point for how much we actually want to invest into reliability. A couple of words on service level objectives: typically, SLOs should be set based on customer expectations. One trap that you very often fall into when you try to set an SLO is that you look at how reliable the system is at that point in time, you go "well, that sounds like a number we can reach", and you set that as your SLO. Two years later you've forgotten where that number comes from. You'll be running a four-nines availability service, maybe your reliability is degrading, and you're starting to invest lots of effort into improving reliability; then you look at your customers and you realize: wait, they actually won't notice if we drop a nine or two, and all this work that we're doing is really for nothing. So you need to set SLOs based on customer requirements, and you need to document where they're coming from; otherwise they turn into these magic numbers that nobody is willing to change. You also need a certain amount of buffer in the service level objectives. Say you committed in your service level objectives to running a service at 200 milliseconds average latency: if you consistently run it at 190, then you have no room for error, and no room for error means there's no room to change. You can't change the system, because that might endanger you meeting your service level objectives.
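A minimal sketch of how those pieces fit together: monitoring produces the measurement (the SLI), the SLO is the documented target, and the difference is the remaining error budget. The request counts and the 99.9% availability target below are made-up numbers for illustration, not figures from the talk.

```python
# A minimal sketch of how monitoring, an SLO, and the remaining error budget
# relate to each other. All numbers are illustrative assumptions.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    return good_requests / total_requests


def error_budget_remaining(slo: float, good: int, total: int) -> int:
    """How many more requests may fail before the SLO is violated."""
    allowed_failures = (1 - slo) * total
    actual_failures = total - good
    return round(allowed_failures - actual_failures)


total = 10_000_000      # requests this quarter, as reported by monitoring
good = 9_996_500        # requests that met the availability criteria
slo = 0.999             # the documented target, set from customer needs

print(f"SLI: {availability_sli(good, total):.5f}")                      # 0.99965
print(f"budget left: {error_budget_remaining(slo, good, total)} requests")
```

This is also where the buffer point shows up: if the remaining budget is nearly spent even in normal operation, there is no room left to make changes.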
The next thing that we need is automation, and Demetri is going to give a whole talk about that just after me, I think. Automation, when we're talking about reliability engineering, has two facets. One of them is that at the scale we're talking about, automation is a sheer necessity: we cannot run our systems manually, we need automation, it doesn't work otherwise. So automation is what allows an operations team to scale. The other point is: how does automation interact with reliability? The way automation interacts with reliability is that automation takes the human error out of performing processes. It also introduces the possibility of computer error, and it turns out computers are far more efficient at making mistakes than humans are; they can make mistakes at a far larger scale, and also faster. Yes, but with the crucial difference that computerized automation is testable. Processes executed by a human are not testable; automation is. So you have the opportunity to address the problem of the reliability of your automation the same way you address any other software problem: you address it by testing.

Google had a phase in SRE's development where we focused a lot on automating existing processes. Basically, you went from a manual failover procedure that somebody wrote down and followed whenever we needed to fail over to a different data center, to a script, and then to a kind of fancier script, and a more fancy script, and frameworks around the scripts, and more tests, and more tests, and more tests. At some point we realized that automating processes designed for a human to execute does not really scale, and these days we're investing a lot more effort into building systems that are either designed with automation in mind from the beginning, or that don't need automation at all. To pick the example of the manual failover: we can design a system for manual failover and automate the steps needed for that failover, but we can also design a system that runs in a hot-hot configuration, where all instances of the system take part of the load. If you spread that around enough, then one of the instances going down is not a failover anymore, it's just a loss of capacity, and we do capacity adjustments all the time anyway; the system is just going to deal with that in the course of its normal operation. So at that point we end up with a system that does not need the automation, because what we used to automate is part of the normal operation of that system.
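As a rough illustration of the "automation is testable" point, here is a minimal sketch of a failover routine written against a trivial fake cluster object, together with a unit test for one of its failure modes. All of the names and the interface are hypothetical; this is the shape of the argument, not how Google's failover tooling works.

```python
# A tiny failover routine plus a test that exercises one failure mode against
# a fake. The cluster interface and names are hypothetical, for illustration.

from dataclasses import dataclass


@dataclass
class FakeCluster:
    name: str
    healthy: bool = True
    serving: bool = False


def fail_over(primary: FakeCluster, secondary: FakeCluster) -> None:
    """Move serving from the primary to the secondary cluster."""
    if not secondary.healthy:
        raise RuntimeError("refusing to fail over to an unhealthy cluster")
    secondary.serving = True   # bring up the new serving location first
    primary.serving = False    # then drain the old one


def test_fail_over_refuses_unhealthy_target() -> None:
    primary = FakeCluster("eu-west", serving=True)
    secondary = FakeCluster("us-east", healthy=False)
    try:
        fail_over(primary, secondary)
    except RuntimeError:
        assert primary.serving  # nothing changed; the old cluster still serves
    else:
        raise AssertionError("expected the failover to be refused")


test_fail_over_refuses_unhealthy_target()
print("failover automation behaves as expected")
```

A human following a written failover checklist cannot be tested like this; the script can be, which is the whole point.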
The next one: releases and config management. If releases and config changes are the thing that causes most of your outages, then that is an attractive target for site reliability engineering. Early on in Google SRE, site reliability engineers would often act as gatekeepers for changes to the production systems: we would get new binaries to be deployed from the product engineering teams, we would get configuration changes from the product engineering teams, and site reliability engineers would review these, test them, roll them out carefully, monitor, and make sure that everything went well. Early in the life cycle of an application, this is an incredibly effective way of improving reliability. It also doesn't scale: you burn a lot of manpower just on playing the gatekeeper for the production systems. So in recent years, what site reliability has focused on has been mostly building infrastructure that allows you to make these kinds of changes safely and reliably. Once we have the infrastructure up to a point where the changes are either safe or they will get rolled back automatically, then we can just give the responsibility for those changes back to the product engineering teams, which makes the site reliability engineers happy, because they don't have to burn their own manpower on making these changes, and it makes the product engineering teams happy, because they don't have to talk to site reliability every time they want to make a change, and everybody is happy.
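As a rough illustration of "changes are either safe or rolled back automatically", here is a sketch of a canary-style rollout check. The deploy and metrics hooks, the 5% canary fraction, and the regression threshold are all assumptions made up for this example, not a description of Google's release infrastructure.

```python
# A minimal sketch of a rollout that is either safe or rolled back
# automatically: push the candidate to a small canary fraction, compare its
# error rate against the stable version, and revert on regression.
# The hooks and thresholds below are illustrative assumptions.

from typing import Callable


def canary_rollout(
    deploy: Callable[[str, float], None],   # deploy(version, traffic_fraction)
    error_rate: Callable[[str], float],     # observed error rate for a version
    stable: str,
    candidate: str,
    canary_fraction: float = 0.05,
    max_regression: float = 2.0,            # candidate may be at most 2x worse
) -> bool:
    """Return True if the candidate was promoted, False if it was rolled back."""
    deploy(candidate, canary_fraction)

    if error_rate(candidate) > max_regression * error_rate(stable):
        deploy(candidate, 0.0)              # automatic rollback, nobody paged
        return False

    deploy(candidate, 1.0)                  # promote to full traffic
    return True
```

In practice you would soak the canary for a while and compare several metrics, but the shape is the same: the product engineering team pushes the change themselves, and the infrastructure, not a human gatekeeper, decides whether it sticks.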
The next one: on-call and emergency response. We talked about monitoring and alerting, which is what allows you to detect when something is going wrong in your systems. Once you've detected it, you need to fix the problem, and for the services supported by SRE at Google it is usually site reliability engineers who get paged and then do the emergency response. So basically any time something is wrong with Google web search, Gmail, whatever, some SRE's pager is probably going off somewhere, hopefully already. There are also some organizations that practice a shared on-call model, where you have members of site reliability engineering and product engineering sharing responsibility for emergency response. It turns out that because of the size mismatch between site reliability teams and product engineering teams, this is surprisingly hard to implement. If you have a small site reliability team, six to twelve people, then everybody is on call reasonably frequently, at least once every two months, and people are very well trained and very well versed in emergency response. If you try to scale this to a 600-person product engineering team, it is not going to work, because people are never going to gain the expertise they need for emergency response; they're never on call frequently enough. So this mostly works in very small product engineering teams. What a lot of product engineering organizations do instead is employ a secondary on-call rotation as an escalation point for site reliability engineering, so if there is a problem that site reliability can't solve, we can escalate to the product engineering team. Site reliability only tends to engage with services that need an on-call response time of 30 minutes or less; anything larger than 30 minutes is usually not worth staffing a site reliability team for, because of the minimum team size that I mentioned. SRE teams are incentivized to keep the number of incidents per shift to less than two; healthy teams, again, have a much lower number of incidents per shift, usually around 0.1 to 0.2 per shift. One of the advantages of having on-call emergency response with site reliability engineering is one I've already alluded to: you get a small team that is very well trained and very well practiced in this kind of emergency response. There is another, more subtle, advantage of having emergency response in a different team, which is the following: it forces the product engineering teams to get their services into a shape where they can actually reasonably be taken care of by somebody who is not a core product developer. So usually, when SRE starts engaging with a new service, that service goes through an intense period of service hardening and of improving the processes around running the service, so that a different team can actually meaningfully do on-call response for that service.

The next one: capacity planning. How does capacity planning interact with reliability? Well, if you run out of capacity and you overload your service, you're going to lose reliability really, really quickly. So capacity planning is traditionally also site reliability territory: site reliability engineering does the demand assessment and the forecasting, does capacity assessment via load testing, does provisioning, and also maintains all the infrastructure around doing this automatically.

Data integrity is possibly the one that doesn't quite seem to fit in here, because, well, if the serving systems are up and running, if they're producing valid responses, and if they have capacity and so on, then what does the data have to do with that? Well, from the user's perspective, data integrity problems and service reliability problems are indistinguishable from each other. The user doesn't really care whether they can't get their email because your front end is broken or because there was data corruption in your database; from the user's perspective the result is the same, they can't get their email. That's the reason why site reliability also engages with data integrity. Site reliability does things like making sure that all the data stores of a service are backed up, that we have restore procedures, and that the restore procedures are tested and executed regularly. Some services have automated continuous restore pipelines, where basically all the time we're restoring a small part of the data and making sure that what we're getting back from the backup system is what we're actually expecting. SRE also often maintains the pipelines that periodically verify the integrity of the data stores.
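Here is a minimal sketch of what such a continuous restore-verification check might look like. The store objects are hypothetical stand-ins that expose a read(key) method returning bytes; this is not a real backup API.

```python
# A minimal sketch of continuous restore verification: periodically restore a
# small random sample from the backup system and compare it, record by record,
# against the live data store. Store interfaces are hypothetical stand-ins.

import hashlib
import random


def fingerprint(record: bytes) -> str:
    return hashlib.sha256(record).hexdigest()


def verify_restore_sample(live_store, backup_store, keys, sample_size=100):
    """Return the keys whose restored bytes do not match the live data.

    In a real pipeline you would also account for records written after the
    backup was taken, so fresh writes don't show up as false mismatches.
    """
    sample = random.sample(list(keys), min(sample_size, len(keys)))
    return [
        key
        for key in sample
        if fingerprint(live_store.read(key)) != fingerprint(backup_store.read(key))
    ]
```

Any non-empty result from a run like this is itself something to alert on, long before a user ever asks for the corrupted record.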
So that's a small sample of the building blocks for SRE, some of the areas that SRE engages in. Let's talk about what's next for SRE, what the big challenges are. One of the problems for Google SRE right now is that we've solved most of the problems. Well, that's not quite the problem; the problem for Google SRE is that most of the problems we've already solved several times, in different parts of the company. So the big challenge is: how do we take the accumulated knowledge of more than 10 years of SRE and package it into something that scales across the entire organization, so that we don't need to solve the same problems over and over again?

I spent part of my career at Google in a department called launch coordination engineering. Launch coordination engineering is a part of site reliability that engages with new products and new services at launch time and basically makes sure that they're reliable from the start, that we don't launch trainwrecks from a reliability perspective. One part of that job, and this was four to five years ago, was looking at a new product's requirements, listening to what the engineering team wanted and what problems they were trying to solve, and then saying: ah yes, this engineering team in this other part of the company has already solved that, go talk with them; and this problem, yes, that engineering team has solved that one; and this one, that other engineering team; and for the others we have standard libraries you can just use.

This is the mode of operation that we want to get out of. We don't want to have this hands-on approach, this hands-on attention to every service that we run, because the number of services at Google is growing a lot faster than the number of engineers, and we want to accelerate that growth even further: we want to go to more microservices and automate more of the service management and service deployment. So we cannot afford to give this sort of individualized attention to every service. How do we make that scale? There are two ways we can make it scale. One of them is baking more than ten years of SRE experience into Google's frameworks and into Google's libraries, so that the engineer doesn't even have to look for solutions or write them themselves; we can just turn them on, because they're in the framework anyway, or, in the best case, they don't have to do anything, because these solutions are on by default everywhere and you don't have to think about them. That is the one approach: baking it into the basic libraries and the frameworks. The other approach is standardizing, standardizing, standardizing. We're at a point in SRE where it doesn't make sense for us anymore to solve problems for a single team; it doesn't even make sense anymore for us to solve problems for a single product area. If we want to solve problems, they have to be Google-sized, they have to be able to scale to the entire organization, to the entire company. One way of doing that is standardizing: basically providing a production platform for the product engineering teams to run their services on, where SRE provides support for the platform and the product engineering teams run their products on that platform.

And with that: I've mentioned a couple of times in this talk that site reliability used to do things a certain way and now we do things differently, that we have plans for the future, plans for how to scale our work. So in reality, "what is site reliability engineering" has no fixed answer, because the targets that are most valuable to site reliability engineering change over time, and over the years we're trying to adapt what site reliability engineering does to the changing environment that we operate in. And with that, I would like to conclude my talk.
Info
Channel: Jorge Cardoso
Views: 11,390
Rating: 4.8145695 out of 5
Keywords: reliability, resilience, cloud, cloud computing, SRE, Google, Site Reliability Engineering
Id: nQv9ySa8MTU
Length: 31min 53sec (1913 seconds)
Published: Mon Jan 16 2017