AWS re:Invent 2019: [REPEAT 1] Amazon's approach to building resilient services (DOP342-R1)

Captions
Hi, good morning. I'm Marc Brooker, I'm a Senior Principal Engineer at AWS. I've been with Amazon for about ten years, and over that time I've worked on EC2, EBS, IoT, and most recently serverless: I work on Lambda mostly, but also API Gateway and our messaging products. Today I'm going to be talking about Amazon's approach to building resilient services, and there's something I believe really deeply about this: to build resilient services, highly available services, and great products for customers, you need both great technology and great culture. You have to have both of these things to be successful. So this talk is broken into four pieces, two of them about culture and two of them about technology.

We're going to start off with DevOps, and specifically closing the loop between development and operations. There are a lot of different theories about what DevOps is. Some people will tell you it's a set of approaches: we do CI and CD, and therefore it's DevOps. Or it's a set of tools: we use pipelines and automation and infrastructure as code, and therefore it's DevOps. Or it's a set of organizational practices: we have a single team that contains both dev people and ops people, and therefore it's DevOps. I don't think any of these definitions are wrong, and I'm not going to argue about definitions, but for me, DevOps is a loop; it's a cycle.

Our cycle begins with building (excuse my scratchy voice). At Amazon, that building process always begins with our customers: what do our customers want, what problem are we trying to solve on their behalf? Then the team starts building the technology, the processes, and the systems required to solve that problem. Somewhere along the line we're going to get that wrong. There's going to be some kind of failure, and I'll talk about the kinds of failure in detail in a few minutes, but there's always something: the team is delivering slower than you would like, or it's delivering slightly the wrong thing. The next step after that, and what makes you successful, is how you recover from those failures. So the next step is analyzing the causes of failure: why did things go wrong? This is the really culturally uncomfortable moment. This is the moment where you have to look inside yourself and inside your team and honestly answer the question of what went wrong. It's really easy to do the easy-mode version of this, where you say, "What went wrong was we didn't type fast enough," or "we didn't work nights and weekends," or whatever. Those easy answers are never, or almost never, the right answers. You have to do the really hard emotional thing of saying what actually went wrong: what were the root causes, organizationally and technologically, that led to this failure? Then the most important next step is changing your practices: taking those hard lessons you learned from failure and turning them into a change of behavior, a change of technology, a change of team structure, a change of something, that's going to make you not have the same failure the next time around the loop. That feeds back into your builders and your operators and everybody else who is responsible for making sure the product is successful, and that's everybody.
For me, DevOps is about spinning this loop faster, making the loop tighter, and making sure we have high-quality signal throughout this loop of improvement. Where I see teams fail in the long term is that they don't have the connection back. They build things (everyone's building things), they don't succeed for one reason or another, and often they do the step of analyzing why: why did we fail, what went wrong? And then they don't do the step of changing their practices. They don't connect the lessons from that failure back into what they're going to do next time, and that means this becomes an open loop where you make the same mistakes over and over and over again. I think that was a big problem with a lot of traditional models, and for me, DevOps, Amazon's culture of ownership, and a lot of the cultural movement that's happened in our industry over the last 10 to 15 years have been about making sure this loop gets closed.

Let's talk in some detail about some important cultural parts of that, and let's use this quote from Richard Cook, a researcher in systems and systems failure: the risks that matter can't be seen from the office. What he means is that if you're sitting in your office with the door closed, or you're an architect drawing pictures on a whiteboard, you aren't going to see the risks that lead to failure. The risks that really matter to our businesses and to our teams, to the ability to successfully deliver software, technology, mechanisms, and process, all happen on the ground; they happen to the people who are actually doing the work, not to the people leading the organization from a distance. So you can't be disconnected. You have to be there, not literally at all times, but you have to be connected to the details. This is a hugely important part of great leadership, and unfortunately something I see a lot of senior technologists drift away from, because it becomes easy to close your office door and lose connection to the details.

So what are the risks? The easy one is outright project failure: we started doing something and we just couldn't do it; it didn't work; we didn't ship; we ran out of money; we ran out of time. That one is easy to understand, and often those are the easiest to analyze, because the outcome was very clear. But there are harder ones, and perhaps the hardest is moving too slowly. Why is my team moving slowly? Why are we delivering more slowly than we used to? Why did a feature that used to take us one month now take three months? Why are some people on the team not delivering as quickly as we would like? This kind of slow movement can be extremely hard to debug, and it is especially hard to debug if you are even one step removed from being a hands-on owner in the organization. The other one, the classic one, is outages: the system is going to break. Some of those are really easy: the system broke because there was something obviously wrong with the architecture or with an operational practice. But often these things are extremely subtle, and I'll get into some of that subtle stuff later in the talk.
Looking at outages after the fact, after you've had the outage, is easy enough, because it's happened and you can analyze it. Well, it's not actually that easy, because you have to get the data and you don't always have it, and you have to understand the dynamics of the system in the moment, which you don't always understand either. But it's not super difficult. What's really difficult is trying to find risks before they happen and fix them before there are outages. That's difficult technologically, because it requires great insight into the behavior of your system, and it's even more difficult organizationally, because it requires doing work against risks that haven't come true yet. That means you have to go to your boss, and when they ask what you're doing this week, you have to say, "What I'm doing this week is fixing something that isn't broken." It's critical work, but it's critical work that needs cultural support, a culture that constantly talks about why we are improving things, why we are fixing things, why we are building resilience into our architecture: because we think we can predict the next outage that's going to happen. That's a key part of resilience, and a key part of closing the loop.

So how do we do this at AWS? Before I go further: throughout this talk I'm going to talk about culture and things we do at AWS, but if there's one thing that's true about the culture of any company this size, it's that nothing is universally true. These observations are about the teams around me, the teams I've worked with, and you will find exceptions.

One thing that I think is very important about the culture we have at AWS is that teams run what they build. Teams don't typically write code and throw it over the wall to an ops team. The people who write the software are on call, they deploy it to production, they monitor it in production, and they operate it in production, and that applies both to the engineers themselves and to the managers. The second point, a very important one to me personally, is that principal engineers at Amazon, our most senior level of engineers, are hands-on builders: often writing code, doing hardware design, designing networks, designing components, testing systems. We try to avoid non-practitioner architects, because if all your principal engineer is doing is drawing blocks on a whiteboard, it's very likely they aren't solving the right problems for your customers; they aren't solving the problems that actually exist. The risks to your organization can't be seen through a closed office door, and they can't be seen by non-practitioner architects; they have to be seen by people on the ground, connected to the real details of the business.

Everybody at AWS, and I literally mean everybody, has operational responsibilities. Obviously that varies from person to person; not everyone is on the tech call as soon as an outage starts, and we don't want that. But everybody, at every layer of management, has some role: incident management, communication to customers, operating the business, auditing our decks. That means that at every layer of our business we deeply understand what it takes to operate our services. You will have heard at re:Invents in the past that one of the things we do, across multiple layers and a lot of teams, is hold operational metrics meetings.
In those meetings, folks get together with their senior leadership and look at their graphs, their dashboards, and their metrics, and we roll that all the way up to the top. We spend a huge amount of time and a huge amount of resource doing this, on having leaders really connect into the details, and it's worth every cent, because it means that even our executive leadership knows what the risks are on the ground and how the actual system is working.

Finally, and also very important: Corrections of Error (COEs), the post-mortems we do after outages, or even after process failures, connect broad sets of builders and operators. What happens when there's an outage or a problem in an Amazon service is that the team writes one of these post-mortems. They go through the difficult cultural moment of asking: what really went wrong inside my service? What is the real problem here? What are the actions I need to take, or need to ask others to take on my behalf, that will make sure this doesn't happen again? Then we take those documents, often very carefully prepared, and we share them broadly. We get together in those metrics meetings and talk about them with senior folks, with executive leadership, and make sure we are feeding those lessons broadly across the organization, and talking honestly about these difficult risks. This is culturally hard, and sometimes even emotionally hard, but it is critical to building resilient systems.

Organizationally, we focus on small teams with strong ownership. We tend to avoid ops teams, we tend to avoid QA teams, we tend to avoid teams that have non-operational ownership and are "just developers." Instead, we want small teams with strong ownership over a particular part of the service. This requires more than structuring the org chart that way (though that's important): you also have to structure your services that way. You have to build your architecture so that small teams can own well-isolated components, and that requires careful thinking about internal interfaces, internal APIs, and internal architecture design. One of the most important things technical leaders do is bring these pieces together, harmonizing organization design and architecture into an organization that can deliver that architecture.

Operationally, we deploy often and move fast. There's no software team as unhappy as a software team that can't deploy its code. How often obviously depends on what you're working on: we've got some microservices we deploy many times a day, and we've got some pieces of hardware where it's much harder and you have long, traditional deployment cycles, but largely we try to deploy as often as we can. I don't have the point on the slide here, but one of the things we do very broadly across the organization is what we call auto-rollback, and it's supported in our tooling: we deploy, and from the moment after deployment the tooling starts watching the service's metrics very carefully. If the metrics turn bad, the tooling takes the deployment away and rolls it back as quickly as it can, in some cases within seconds. That's a very powerful mechanism, because it lets us recover from mistakes.
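As a rough illustration of that watch-and-roll-back idea, here is a minimal sketch. It is not AWS's internal tooling: the `deployment` handle, the `get_error_rate` callback, the 1% threshold, and the five-minute watch window are all hypothetical stand-ins for whatever deployment system and metrics source you actually use.

```python
import time

ERROR_RATE_THRESHOLD = 0.01   # hypothetical: roll back if more than 1% of requests fail
WATCH_WINDOW_SECONDS = 300    # hypothetical: watch the first five minutes after deploy


def deploy_with_auto_rollback(deployment, get_error_rate):
    """Deploy, then watch a health metric and roll back automatically if it degrades.

    `deployment` is assumed to expose promote() and rollback(); `get_error_rate`
    is assumed to return the current fraction of failing requests. Both are
    placeholders, not real AWS APIs.
    """
    deployment.promote()
    deadline = time.time() + WATCH_WINDOW_SECONDS
    while time.time() < deadline:
        if get_error_rate() > ERROR_RATE_THRESHOLD:
            # Roll back first, ask questions later.
            deployment.rollback()
            return False
        time.sleep(5)  # poll the metric every few seconds
    return True
```

The key design choice the talk describes is that the rollback decision is made by the tooling watching the metrics, not by an optimistic human deciding whether their change is to blame.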
Testing is super important, a very important part of our quality picture, but there's always going to be stuff that sneaks through testing. The other important thing about auto-rollback is that it's culturally powerful. We software engineers are a very optimistic crowd of people. We deploy a piece of code, we see some of the metrics go bad, and our first thought is, "That's not my code, it can't be my code that broke; it must be the network, it must be a change in customer behavior, it must be something else." Sometimes we're right about that, but often we're wrong. So auto-rollback, "roll back first, ask questions later," is a powerful cultural mechanism to balance that little bit of denial we all have.

Again, with COEs, we dive deep on issues, and there's a top-down cultural push that says you have to understand very deeply what happened, and then you have to create mechanisms to drive the learning back into implementation. It's one of these Amazon-isms: good intentions don't matter. You can write a COE that says, "Why did we have this failure in production? Because a bug got through our testing. Why did the bug get through our testing? Our testing wasn't good enough. What's the action item? Try to do better next time." That's not going to work; that's a good intention. How are you going to enforce it? What is the mechanism you can put around that learning to make sure it happens next time, to make sure you actually deliver on it? Once you start building up these mechanisms, you start to build up a culture of resilience, a toolset of resilience. You build these things not only into your code but also into the tools that deploy your code, the organizations that build your code, and so on. That's where the experience of running things in production is so important, and that's where this loop gets you over time: more and more experience baked into tools, baked into metrics, into picking the right metrics and the right dashboarding practices. It takes time and it takes experience, but you have to do it, and you have to spin the loop to make it work.

One of the big benefits of having teams that are deep owners is that the loop on the technical side has natural connectivity. We build, we fail for some reason, we analyze the causes of the failure, we write the COE documents, and our engineering team, who are the ones running the service and therefore doing that learning, learn how to build better next time. A failure mode I see in a lot of places in the industry that have dedicated ops teams is this: the developers build, they get something wrong (software is hard, so something always will go wrong), something breaks in production, the ops team analyzes the causes of failure, and the ops team learns. They become super smart over time, which is fantastic; a super smart ops team is extremely valuable. But there's a step missing in this loop, and the step missing is connecting the learnings of that ops team back to the engineering team, so that the next time you build, you build it better and don't make those same mistakes again. This is where it becomes critical to have mechanisms that close the loop.
At Amazon we don't tend to have these dedicated ops teams, but you might choose to, and you might choose to with good reason. It's not a model I like much, but it's a model that makes it especially important to have deliberate mechanisms, tools, technology, and culture that close the loop between ops people and engineering people. That can't be only good intentions; it can't be just somebody standing up on stage once a year saying, "We learned these things over the course of the year, try not to make those mistakes again." There have to be real cultural mechanisms, real technical mechanisms, and they have to consider both technology, which is the easy case, and people, which is the hard case: we have to get these ideas into people's heads and make cultural change, make process change.

Next, I'm going to talk about operational safety, and in a bit more detail about how we think about our post-mortem process, and I'm going to talk about Chernobyl. Now, I'm not a nuclear physicist, so I'm not going to talk about exactly what happened and what went wrong; there are some great books on that. Instead, I'm going to talk about the control room. As you can see, there's a whole raft of buttons and knobs and dials and phones and cords and plugs, all kinds of fun stuff in there. We could talk about all of it, but that would take us weeks, and I'm not an expert on this either, so I'm going to make this easy for myself and talk about one button. This is the big red button (not actually red, unfortunately), the "oh no, something's gone horribly wrong" button. A lot of the systems we build have these big red buttons: roll back, something's gone wrong; stop scaling; start scaling. We build these emergency mechanisms into our tools and our technology, and that's extremely powerful, and it's extremely powerful for people to feel safe hitting that big red button, because what should you do first? You should hit the button and then worry about the aftermath.

That's exactly what happened, eventually, at Chernobyl. But unfortunately, as the International Atomic Energy Agency found a couple of years after the accident, the scram, the pushing of the big red button just before the sharp rise in power that destroyed the reactor, "may well have been the decisive contributory factor." Isn't that amazing? Everything's going wrong, the operators are saying, "I don't understand the state of the system, I'm going to push my big red button," and it becomes the decisive contributory factor. That's terrifying; it's a terrifying place for operators to be. If you've read books about Chernobyl, or watched the recent great HBO series, you'll know that one of the operators in the control room that day was a man by the name of Anatoly Dyatlov. Dyatlov has been made into a bit of a villain: he was made into a villain at the trial, and the HBO series unfortunately also made him into a bit of a villain. But after he got out of prison, Dyatlov wrote an article, which I enjoy very much, and in it he asked a key question, one of the most important cultural questions we should all ask ourselves: how, and why, should operators have compensated for design errors they did not know about? The operators in the control room, the people who chose to push the big red button, did not know about the design problems with the reactor that made pushing that button a decisive contributory factor in the accident.
How and why should our operators be able to compensate for properties of systems that they don't know about? We can't expect people to do that. We can't expect people to be perfect every time, and we can't expect them to know everything about the system. We've got to try to build our systems well and try to tell people about them, but we can't have unreasonable expectations of human operators.

A very powerful concept in this area, one I enjoy very much, comes from psychology research: the research on kind and wicked learning environments. A kind learning environment is one where the things we learn match the environment well: the lessons and experience we accumulate over time match the mental model we build up, which matches the behavior of the system. More experience means better predictions of what's going to happen if I push that button, and therefore better judgment. This doesn't mean things are easy. Chess is a great analogy: chess is an extremely difficult game to master, but it's a very kind learning environment; you can learn more, build up a great mental model, and that experience yields better predictions and better judgment. The other side is wicked learning environments, where the things we learn don't match the environment, where our mental models tend to lead us astray, where experience leads us to do the wrong things and push the wrong buttons. This is really hard for operators. How can we expect operators to operate systems in these wicked learning environments? How can we expect operators not to push the big red button if we say, "push the big red button," and then when they push it, the thing blows up? We can't expect that. So for me, a very important principle of system design is: build your system to be kind. Build your system so that the mental model people build up over time of how it works is accurate. Build your systems so that people can bring their experience from other systems and other tools and not be surprised. This is related to the principle of least surprise, but it's broader than that: build your system so that people who build up experience over time can use their better judgment to make better decisions, and don't put sharp edges on things.

How do we do this at AWS? First and most importantly, COEs, our post-mortems, do not settle for "operator error." Just don't accept that as a root cause. People do make mistakes, but that's not the root cause, because everybody is going to make mistakes. We also believe that reviewing tooling, especially operational tooling, is very important work for our most senior engineers. I spend a good amount of my time with my team talking about operational tooling, the tooling they build, and how to make it safe. Unfortunately, in a lot of the industry, because tooling is an easy thing to develop, we say, "We can have the intern build the tools." There are great interns in the world, but you have to build real experience into your tools, so I'd encourage you to have your most seasoned operators, your most senior engineers, the people with the most experience, look at those tools and think about how to make them safer and more powerful.

Let's talk about blame for a minute. Operators are very easy, culturally, to blame.
It's very easy to write the COE, the post-mortem, that says the operator did it wrong. It's emotionally easy because it doesn't require us to look inside ourselves and consider that maybe we were wrong. It doesn't require architects to look at their architecture and say maybe the architecture was wrong. It doesn't require leaders to look at the organization and say maybe we need to fix the organization to fix this problem. No, you can laser-focus on a single person and say that person typed the wrong thing. So it's easy, it's very attractive, it's culturally easy, it's emotionally easy, and we need to reject it very strongly. Why? "Well, that person made the mistake. I'm a manager, so I have no responsibility here anymore, because it was them. I'm an architect, so I have no responsibility anymore, because it was that person, not me. We fired them; issue's fixed." Very easy to fix, one action item: you don't have to write any more code, you don't have to build better tools, you don't have to build a better organization. We just fired someone, the issue is over, let's move on to the next thing; they no longer work here. When your senior leadership comes to you and says, "How are we going to make sure this doesn't happen again?" you say, "Well, we fired that operator, they don't work here anymore, the issue is fixed, it's never going to happen again." We don't honestly believe that, but it's such an easy answer, such a socially cohesive answer, the easiest answer to tell your boss: there was one bad person, and now they're gone. This kind of thinking is culturally easy and emotionally easy, and it has to be rejected at every level of the organization. Your most senior leadership has to understand how bad an idea it is and push back on it. Every layer of people writing post-mortems has to understand how bad an idea it is and push back on it. And unless your title starts with the word "chief," you probably don't have enough power to entirely push back on this yourself, so it has to become something of a cultural movement, where you talk to your boss about it. Yes, an operator made a mistake; we're all going to make mistakes, and I'm not denying that mistakes happen. But we have to understand how bad an idea it is, and how completely it breaks that improvement loop, if we just blame operators for problems. There's some really interesting reading material here, and I'll put some of these books up after the session. This point comes from Nancy Leveson, in a book I particularly recommend called Engineering a Safer World, which lays out a framework for thinking about safety in complex systems.

Let's move on a little and talk about technology. But before we talk about technology, I want to talk about a dog I had when I was a kid. It was a very energetic dog; it liked to run around, and no matter how many walks you took it on in a day, it was always bolting around the garden one way or another. One of the things it liked to do was leap up on the roof of the house and run around up there, and that served it pretty well; it seemed quite fun. So let's think about the free-body diagram of a dog standing on a roof. What are the forces in play? There's gravity pushing the dog toward the ground, and because the dog is standing on a sloped roof, this leads to two things: an amount of force due to gravity that's trying to push the dog down the roof and make it slide off, and an amount of force trying to push the dog up the roof, a force called friction.
For the most part, the dog is not sliding. Take μs, there in red, the coefficient of static friction: the amount of friction there is when things aren't sliding at all. The dog doesn't slide; it can stand there on the roof and walk around very safely. Then something happens: the dog starts to slide, it starts to move in the plane of the roof, and that μ becomes μk, the coefficient of kinetic friction. A fun thing about a lot of systems, most systems, and certainly most dog-on-roof systems, is that the coefficient of kinetic friction is lower than the coefficient of static friction. What does that mean? It means that as soon as the dog starts sliding, it's going to slide all the way off the roof. It's not going to slide to a stop; it's going to slide, and it's going to end up in the bushes. (I'll put your minds at ease: my dog survived this incident and was very healthy for the best part of a decade afterwards, but never got on the roof again.)

What's the point of telling this story? The dog on the roof is a bistable system. It's stable when the dog isn't moving, and it has another stable point, a kind of dynamic stable point, where the dog is sliding off the roof and ends up in the bushes. These two points don't really lead into each other: once the dog is still, it's going to stay still, it's not going to start sliding; but once the dog is sliding, it's going to slide all the way off and fall into the bushes. A lot of the systems we build, almost all of them, have similar behavior; they have these bimodalities, and we'll talk technically in a second about why that happens. This is culturally difficult, because my dog walked around on the roof for years with no drama. If I were asking, "What am I going to do to make sure my dog is safe?" I could look at all the metrics and say, "Number of times the dog has fallen off the roof equals zero," so it's fine. But we all know that, looking at the dynamics of the system, it's not fine. This is one of those risks that exists in the system: there's a behavior of the system where it folds over, where the dog slides off the roof, that essentially just doesn't happen until it happens, and once it starts happening, it doesn't stop happening. The system is down and is not going to recover without human intervention. This is why it requires engineering judgment, and not just metrics, to build resilient systems.

Let's talk about one of the technological reasons this happens. It starts with the system becoming overloaded, and there are lots of reasons that can happen: your business suddenly becomes popular, fantastic; it's Black Friday, fantastic; customers rush in, fantastic; you have a bunch of hosts fail, not as fantastic, but it also leads to overload. These things happen a lot. And as we know, in a lot of the systems we build, load increases latency. Load increases latency because of contention: on locks, on disk drives, on networks. There are things that have a fixed amount of capacity, and once you start driving them beyond that fixed capacity, latency goes up. And what does latency do? It increases the concurrency in the system, and therefore increases the amount of contention. So the loop is really tight.
The system becomes overloaded, latency increases, concurrency increases, and the system becomes more overloaded. But there's another loop, and this is the bigger one: timeouts. We're good distributed-systems engineers, so we built timeouts into our clients. What happens? Latency goes up, latency goes over the timeout threshold, and the client times out. And because we're smart people, we built retries into our clients too. What does the retry do after the timeout? It sends more load to the service, so the system becomes more overloaded; load increases latency, timeouts cause failures, retries add load, and round and round we go until the service collapses. You'll notice there's no recovery mode here. There's no mode that stops these clients from hammering the service; it's just down at that point. This is the mode where the dog is in the bushes at the bottom of the roof. There are lots of these kinds of loops in distributed systems, and a lot of them aren't as nice and clean as this one (it would be easy to design systems if they all were), but this is a key one that I see cause a lot of problems in real-world systems.

So what do we do, and how do we fix this kind of thing at AWS? One of the most important things is limiting queue size. People build queues into things because queues are great: queues help availability, they help throughput, they help distribute load. Queueing is a very powerful idea, but infinite-size queues, and even very large limited-size queues, can be a very bad idea. And there's a cultural thing here worth paying attention to. We have a service, it's got a queue, we've got some spiky traffic coming in, and we put a limit on the queue size because this guy at re:Invent told us to. The next week we go and review our metrics with our boss, as we should (because this guy at re:Invent told us to do that too), and occasionally there are these errors: rejections, because the queue is full. What does our boss say? "Please make those errors go away." How do we make the errors go away? We make the queue bigger. Traffic grows, the errors come back, we make the queue bigger again. This is why culture is so important: often the things that make a system resilient, like limited-size queues, aren't necessarily the things that make the system most available in its steady state. There's a lot of overlap there, but there's also some real tension.
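As a small sketch of the limited-queue idea under my own assumptions (the queue depth of 100, the `submit` and `worker` helpers, and the shape of the requests are all illustrative, not anything prescribed in the talk), the point is that when the bounded queue is full, the service rejects work quickly and explicitly rather than letting the queue, and therefore latency, grow without bound:

```python
import queue

MAX_QUEUE_DEPTH = 100  # illustrative limit; tune it to what your service can actually drain

work_queue = queue.Queue(maxsize=MAX_QUEUE_DEPTH)


def submit(request):
    """Accept a request if there is room, otherwise shed load explicitly."""
    try:
        work_queue.put_nowait(request)
        return True
    except queue.Full:
        # Rejecting here is the point: a fast, explicit "overloaded" answer
        # instead of an ever-growing queue and ever-growing latency.
        return False


def worker(handle):
    """Drain the queue; in a real service this would run in one or more worker threads."""
    while True:
        request = work_queue.get()
        handle(request)
        work_queue.task_done()
```

The rejections that `submit` returns are exactly the errors the boss in the story wants to make go away, which is why the cultural support matters as much as the code.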
Limit retries if possible. Don't have the client retry forever. Obviously, what your clients are doing and what your business is trying to achieve makes this possible or impossible, but if at all possible, limit retries. Exponentially back off on retries: don't retry in a tight loop; retry now, then one second later, two seconds later, four seconds later, and so on. But one of the risks with exponential back-off is that exponential functions, as we know, grow super fast. So if there is a real outage, one of the things that can happen is that your clients back way, way off: you have a ten-minute outage, and all your clients have backed off to retrying against you after twenty minutes. What does that do? It ruins your time to recovery, because even after you've fixed your service, it takes twenty minutes for your clients to try again against the healthy service. So what do you do about that? You cap your exponential back-off at some maximum amount of time. But that partly undoes the exponential back-off, because you still have some constant amount of traffic arriving from all your clients. That's something to pay real attention to: make sure your system is designed to handle the load in the mode where all of your clients are backed off but still sending traffic. Limit retries; hang on, I said that before, but do it, limit retries, especially now that it's on the slide twice.

And then, finally, end-to-end backpressure, which is a fantastic idea, and the one you want in a distributed architecture. We've got layers and layers of microservices, fantastic, nice clean lines that align well with our organization. What happens? One of the services at the bottom gets slow or overloaded, so the next layer up retries three times. And because the layer above that times out before those three retries are complete, it retries three times. And because the layer above that times out before it's through, it too retries three times. Suddenly that bottom service isn't getting one call, it's getting 27; add another layer and it's 27 times 3, and it just keeps multiplying. You get this layer cake of retries. What you want instead is for that bottom service to give backpressure, to say, "Hey, I'm overloaded, push this up the stack, don't retry it," because retrying against an overloaded service is a terrible idea. Push it all the way up the stack. Sometimes it can go all the way back to the customer, and sometimes it has to go to some top layer that knows, with your business logic, how to handle it. If it's a website, sending it back to the customer is probably fine; if it's a really critical IoT device, you might not get away with that. But you do want to make sure that services can explicitly say, "whoa, back off," and send that signal as far up your architecture as you can.
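Here is a minimal sketch of that propagation idea, under my own assumptions: the `OverloadedError` exception, the `call_downstream` stand-in, and the HTTP-style 503 response are illustrative names, not part of any particular framework. The middle layer deliberately does not retry; it pushes the overload signal up, and only the top layer (or the customer) decides what to do with it.

```python
class OverloadedError(Exception):
    """Signals that a service below us is shedding load."""


def middle_layer(request, call_downstream):
    """One middle layer in the stack: on overload, propagate instead of retrying.

    Retrying here is what turns one original call into 3, 9, 27 calls as each
    layer multiplies the load; re-raising keeps the multiplier at 1.
    """
    try:
        return call_downstream(request)
    except OverloadedError:
        raise  # backpressure: let the layer above decide what overload means


def top_layer(request, call_downstream):
    """The top of the stack is the one place with enough business context to decide:
    for a website, answering "busy" may be fine; a critical device may need a buffer."""
    try:
        return middle_layer(request, call_downstream)
    except OverloadedError:
        return {"status": 503, "body": "busy, please try again later"}
```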
Finally, our last technology topic: jitter, and the very powerful idea that randomness and pseudo-randomness add resilience to our services. But before that, I want to talk about cheesesteaks. Cheesesteaks are one of my favorite foods; if you don't eat cheesesteaks, just imagine one of your favorite foods up there. There's a great truck that comes around my office building and serves a fantastic cheesesteak; it's very popular. Folks get hungry around midday, and because we're these flawed animals who like round numbers, we look at our watch and say, "It's 12 o'clock, I'm going for a cheesesteak." So we stand up from our desk and walk to the cheesesteak truck, and the queue is really long, because everyone else likes 12 o'clock too; it's a nice round number. So we say, "The queue's really long, I'll come back a little later," and go back to our desk. 12:15 rolls around, because we like round numbers, we go back to the truck, and the queue is really long: "Oh man, I'll come back later." 12:30 rolls around, again a round number, we go back, the queue is really long: "I'm just not going to get a cheesesteak today." What's happened is that there's been a spike at 12:00, a spike at 12:15, and a spike at 12:30, and if you had gone to that cheesesteak truck at, say, 12:07 or 12:11, it's quite possible it would have been empty, there would have been no queue, and you would have gotten your cheesesteak way earlier.

Because we, and a lot of the systems we build (cron and all those automation things), like round numbers ("do it once a day," and once a day means just after midnight, sure, why not; "once an hour," at minute zero, second zero; "once a minute," on the second), systems end up with really spiky traffic. You can see this if you look carefully at the per-minute, per-second, and per-millisecond timestamps on basically anything that gets traffic from humans or from automated systems: you see spikes of traffic on particular round hour, minute, and second boundaries, because we like round numbers and we build them into our systems. And that's silly. Distributed systems have this problem of floods of traffic at certain times and being much less busy in between; we're essentially DoSing ourselves. So what do we do about it? We add jitter. (It's Vegas, although I think that's the wrong number of sides on those dice.) Jitter is just the idea of adding some amount of randomness to our systems.

Let's look at a simulation of the behavior of a set of exponentially backing-off clients talking to a capacity-limited system. What happens? Everybody tries at a time just after time zero, a big spike of traffic; everybody times out and tries again, and then again, and then again. You can see the exponential back-off happening here: each gap is twice as long as the last. So we've got that best practice down, but just like with the cheesesteak truck, we're overloading the service at particular times, and it's mostly idle in the gaps in between. We see this in real-world system logs all the time, from both humans and automated systems. So what do we do? If you can get away with it, add jitter: instead of sleeping for an exponential amount of time, sleep for a randomly selected but exponentially increasing amount of time. What happens? Those spikes go away, the service stays much more evenly busy, the maximum queue size goes way down, and we actually complete all of the work by time 1,000, instead of having these constant spikes that essentially go off into infinity. The simple idea of adding a little bit of randomness has made this system behave much, much better.

You might be thinking: if I can do this with randomness, why can't I coordinate it? Can I build a "when can I go" service? That's not a terrible idea, but as we all know, when we build distributed systems, coordination is expensive and coordination is slow, and choosing a good time slot is a coordination problem. What scales way better, and is nearly as good, is just picking a random or pseudo-random number and spreading things out randomly in time. There are other techniques, but this is a very powerful one to start with.
People often ask me, when I give talks like this, how to add that randomness: is it an additive amount, is it a multiplier, how do I calculate it? The simplest way, and very nearly the best way, is to take your exponential function, so that's 2 (or whatever base you're using) to the power of the attempt number, times a base delay; the base might be 100 milliseconds, and the attempt number starts at zero, so you get 100 milliseconds, 200 milliseconds, 400 milliseconds, and so on. Then just choose a random number between zero and that number. What you'll notice about that function is that some clients will retry immediately, some clients will wait for the entire back-off period, and they'll be uniformly distributed in between, which is exactly what we want: we want things to be spread out in time as much as they can be. Those of you deep into statistics will say there are actually better ways to do this, and I agree, but this is a really simple way; it's one line of code, and it's very, very good.

So what do we do at AWS? Always jitter when using back-off, and add as much jitter as you can. Try not to back off without jitter; in fact, if you're backing off, just always jitter, because it's going to make things better. Always jitter periodic work: add randomness to those timers, to those cron jobs, to anything that happens every hour, every day, every second, because that's going to spread your work out, and instead of spikes of load hitting your servers, you're going to have a much more evenly spread workload. There's something subtle here about the way we monitor systems. A lot of common monitoring tools put requests into time buckets, a one-second bucket, a one-minute bucket, a five-minute bucket, and produce a time series of how many requests were in each bucket. Often when you look at those, your graph says you're doing a thousand requests a second, and the problem with that bucketing is that it tends to lose the peaks. If you actually go and look at the logs, you'll see that things are way spikier than that: maybe there's a piece of automation sending 900 of those thousand requests in the first millisecond of the second, and the rest of the second is mostly empty, which means you have over-scaled your system by 10x. Those of you who are Lambda customers will have seen that we launched percentiles on concurrency metrics this week. It's actually one of my favorite operational tools ever, because you can look at a percentile, at the tail shape of how busy your service is; in a talk later today I'll be talking a lot about concurrency and why I love it so much as an operational metric, so if you're using Lambda, definitely check that out. And then consider adding jitter to all the work in your system: consider adding some randomness, some random delays, in all the places you can. You don't want to overdo it; you don't want to slow things down or block your threads on a bunch of sleeps, but within reason, adding randomness is going to make your system scale better. How often you can do it depends: if you go to amazon.com and the detail page takes 10 seconds to load because somebody added jitter on the back end, that's not a good customer experience, and I'm not encouraging that. I'm saying that where it doesn't matter, don't do work on round numbers. We're humans, we like round numbers, our computers like round numbers, and we build them into our systems, and we shouldn't, because it scales badly.
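Here is a small sketch of that calculation, matching the formula described above (a uniform draw between zero and base times two to the power of the attempt, often called "full jitter"). The 100 ms base is the example from the talk; the retry limit, the delay cap, and the `call` parameter are my own illustrative assumptions.

```python
import random
import time

BASE_DELAY_SECONDS = 0.1   # the 100 ms base from the example above
MAX_ATTEMPTS = 5           # illustrative: retries should always be bounded
MAX_DELAY_SECONDS = 10.0   # illustrative cap so the back-off can't grow forever


def call_with_backoff_and_jitter(call):
    """Retry `call` with capped exponential back-off and full jitter.

    The sleep is drawn uniformly between 0 and base * 2**attempt, so some clients
    retry almost immediately and some wait the full back-off window, spreading
    the retry load out in time instead of producing synchronized spikes.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # out of retries; surface the failure to the caller
            ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))
```

The same trick applies to periodic work: adding a random offset to a cron-style timer keeps every client from firing on the same round-number boundary.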
So what did we talk about? We talked about ownership and the loop of dev and ops, and how, if you structure your organization with ops teams separate from your dev teams, you have to be extremely intentional about creating mechanisms that close that DevOps loop: mechanisms that take what your ops team is learning and feed it back into your development practices and your development organization. We talked about operator safety: how to build systems that are kind, how kindness sets operators up for success, and how blaming operators is a foolish thing that just breaks the loop. We talked about service stability: what it means to be stable at scale, how systems can have these modes where they're stable while going along really nicely and stable in a mode where they've completely fallen over, and how that can be difficult, culturally and technologically, to monitor. And finally we talked about jitter, how jitter adds resilience, and how it's a good thing you should do all over.

Most importantly, my most important point in this whole talk is that succeeding in building resilient systems requires both culture and technology. You have to be great at both of these things to be successful. That can sometimes be difficult for engineers to hear, but you need the right culture, you need the right organization, and you need the right technology. I'm an engineer, I love technology, but technology can't succeed without the right culture.

If you're enjoying the DevOps track, AWS has some great DevOps training and certification opportunities; you can check them out on the website, and there are opportunities throughout re:Invent to learn a lot more about DevOps. Please fill in the speaker surveys; they help us make re:Invent better and help me as a speaker understand what you liked and didn't like about this talk. Thank you very much, I really enjoyed speaking today. Thank you. [Applause]
Info
Channel: AWS Events
Views: 3,944
Rating: 5 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, DOP342-R1, DevOps
Id: KLxwhsJuZ44
Length: 52min 21sec (3141 seconds)
Published: Thu Dec 05 2019