AWS re:Invent 2019: [REPEAT 1] Improving resiliency with chaos engineering (DOP309-R1)

Captions
Thank you very much for coming. I know it's been a long re:Invent and I hope your head is not exploding too much; if it is, I hope at least we're not going to explode it even more. Did you have a good re:Invent so far? Good. So today we're going to talk a little bit about chaos engineering. Just before we start, a show of hands: who currently practices chaos engineering? Only a few of you. How many of you are interested but find it very hard to start? Almost the same people, which is quite funny. And how many of you are trying to plan it or get into it? The rest of you will be after this session, maybe.

This session is co-presented with Olga Hall, who leads the resiliency team at Prime Video, and she's going to talk about the journey of Prime Video. I'm going to talk about chaos engineering in a more holistic way. I've been working on AWS for almost 12 years, I've been doing chaos for almost six years now, and lately I've worked with quite a few customers trying to help them develop chaos practices, so some of those learnings are in this deck and I want to share them with you.

I want to ask you a question: how many of you are firefighters? A few of you, that's really cool, and thank you for doing that. There's something very peculiar about firefighters: they spend an enormous amount of time training. Some studies say that professional firefighters have to train several thousand hours before they go fight a real fire. Do you know why? It's to build an intuition. They want fighting fire to become very natural, because if it's not natural the fire will be faster than you and you'll probably get killed. It's the same intuition we use when we walk: if you had to think "how do I balance myself when I walk," it's too late, you're already falling. It's very similar with firefighters.

There's something interesting here, because in the early 2000s a person called Jesse Robbins, who was a volunteer firefighter, was hired to lead some of the operations teams at Amazon on the retail side, especially to make the retail side more resilient to outages. It was the time when Amazon started to grow ridiculously fast, and failures happen all the time; the more you scale, the more failures you can possibly have. So Amazon started to look at new ways of preventing failures, and Jesse Robbins brought the firefighter idea of building an intuition: developing the capability to train before going to production and fighting fire in production.

I'm sure many of you have had outages in production, right? How many of you have been trained by outages in production? It's really not the right place to train yourself, because if you go into production and you have an outage, humans, or at least I, very often feel very stupid; I lose 50% of my ability to do anything. Under pressure you start sweating and you do really stupid things. I once deleted a production database while trying to fight a mid-sized outage, and I can tell you the result was a dramatic outage, simply because I was not trained. I was really panicking and didn't know what to do.
In the meantime I got disturbed; we didn't have a war room, so someone came and yelled at me while we had the small outage. I went back to my console, got into the wrong terminal, typed a DROP TABLE, and pressed Enter. The moment you press Enter and you hear that voice in your head, you already know it was wrong, but it's done. And that was simply because I was not trained enough.

So Jesse brought this idea to Amazon and wanted to practice outages. We started game days in 2004: he would randomly go into a data center and start unplugging servers or killing processes so that his team could practice recovering from failures, or at least build systems that would be resilient to failure; we call those partial failure modes. That practice grew within Amazon. Those massive game days, as we grew, turned into more mid-sized game days, but Prime Video, and we'll hear the story from Olga, nowadays runs game days in production several times a month or even every week. Jesse Robbins also had probably the best title in the world, "Master of Disaster"; I'm not sure how well it goes over on LinkedIn if you try to get hired later, but at least it sounds great.

Then around 2010-2011, Netflix made the move to AWS. They migrated their entire infrastructure to AWS and started to adopt a microservices architecture. At that time AWS was not what it is today; we've learned a lot and improved a lot, but at the beginning of the cloud there were fewer managed services, people had to do more things themselves, so it was also more prone to failures, and they wanted their system to be really resilient. So they started to build monkeys, like the Chaos Monkey, that would randomly kill instances in production to make sure their software was always stateless, for example, and always well architected across multiple Availability Zones. They even went as far as creating Chaos Kong, which took out a full region's services, and then they would practice flipping from one region to another, which they still do today.

Then in 2015 they formulated the field that is now called chaos engineering, and if you want to read about it, the Principles of Chaos Engineering website is very good. What is very important to realize is that chaos engineering is not just "let's go and randomly kill stuff in production." Don't go in on Monday and say "I heard a talk at re:Invent, chaos engineering is great, let's unplug the database"; it's not really like that. We do chaos engineering to avoid having to train for failures while they're happening in production. It's a way to build confidence for your team in your application, your tools, and your culture to withstand turbulent conditions. When you have an outage it doesn't only affect the application; it affects the culture, the people, the processes, the tools. Chaos engineering is a very good way to test all of that before problems happen in production.

However, you don't necessarily start chaos engineering just like this; there are some prerequisites. You can start chaos engineering very early on, but if you want to take it to the next level and go as close as possible to production, there are prerequisites, especially in terms of resiliency. Here is a list of some of the most common causes of outages I've seen in production.
Just a quick show of hands: how many of you have been victims of certificate expiration outages? Right. This is not a hard engineering problem compared to others, it's really just checking a date on a certificate, and yet I've had at least three of those; it's very common. So we need processes and automation to help us with it.

On the application side, having timeouts defined is especially important. Frameworks very often have default timeouts. Python developers in the room? A few of you. For the requests library that you use to make HTTP calls, any idea what the default timeout is? Infinite. By default it's not defined, so it can hold the connection forever. And it's not only Python: C# is the same, the JDBC driver for SQL is the same. There are whole collections of HTTP libraries with very, very long defaults; if it's not infinite, it's 30 minutes or something like that, and in a distributed system 30 minutes is effectively infinite. Do you have a process to look at all the timeouts in the libraries you use? When we do npm install, or pip install from a requirements file, do we look at those libraries and make sure the timeouts are not the defaults, that we assign them and know what they are? How many of you do this review? One, I see. This is a problem, because one day there's going to be a failure, the timeout is going to be infinite, and then you're going to do a post-mortem and say "maybe we should define timeouts." Chaos engineering will find this pretty early on, especially if you do latency injection.

Another one: retries with backoff. If you experience an issue and want to retry, systems very often do immediate retries: "are we there yet? are we there yet? are we there yet?" You've heard that before; you don't want it, kids will drive you crazy, and it's the same with software systems. If components keep asking and retrying when there's a failure, you create what we call retry storms: your network gets completely full of retry packets, and that very often creates cascading failures. So what you want to do is implement backoff: you make a request, and if you don't get an answer you back off two seconds, four, eight, sixteen, and especially you need a maximum number of retries. These are a few of the things that very often cause outages.

Monitoring as well, of course: if you want to do chaos engineering you need to monitor, you need to see what's happening. Even though you can start very early on, on your laptop, with docker stop or by killing processes, eventually you need a baseline to build confidence in your application, to say "all right, now I believe my application is solid, let's test it." That's really what chaos engineering is: you need confidence that your application has a very strong, resilient baseline. If you want to read about this, we actually launched the Amazon Builders' Library today during the keynote; I couldn't add the slides, but it's a collection of papers from our principal engineers at Amazon about how we build resilient systems. You can also read some of this in the Well-Architected papers. I'm not going into details here because we only have 60 minutes. So now we have a quite resilient application and our team is really confident; we can start chaos engineering.
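To make the timeout and backoff points concrete, here is a minimal sketch in Python, assuming a hypothetical endpoint and made-up retry parameters; it illustrates the idea, it is not a prescription:

    import time
    import requests

    MAX_RETRIES = 5          # hypothetical cap on retries
    BASE_DELAY = 2           # seconds; doubles on every attempt
    URL = "https://example.com/api/orders"   # hypothetical endpoint

    def call_with_timeout_and_backoff():
        for attempt in range(MAX_RETRIES):
            try:
                # Never rely on the library default: requests has no timeout unless you set one.
                return requests.get(URL, timeout=(3, 10))  # (connect, read) seconds
            except requests.exceptions.RequestException:
                if attempt == MAX_RETRIES - 1:
                    raise                              # give up: fail fast instead of retrying forever
                time.sleep(BASE_DELAY * 2 ** attempt)  # 2s, 4s, 8s, 16s...

In practice you would also add jitter to the sleep so that many clients do not retry in lockstep.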
So what is it? It's actually a cycle. It starts with understanding the steady state of an application. Then we move into the hypothesis phase, which is basically a scientific experiment: we have an idea, we run an experiment, and we verify it. And eventually we want the improve cycle, because if you find something you really want to improve it. Let's dive a little deeper into these phases.

What's the steady state? The most common problem here is that people focus only on operational metrics like CPU, memory, or concurrent connections. It's very important to attach the steady state to a business metric. I'll give you an example: for amazon.com a very good metric is the number of orders, because a failure at checkout is a bad user experience; you've done your search, it took some time, and at the moment you want to check out, it fails. So this steady state is very good: it signals both that the back end is broken and that the user experience is bad. A system doesn't have to have only one steady state; it can have several, and we'll hear some of them from Olga for Prime Video. Sometimes it's actually quite hard to understand what the steady state of your application is, but take the team and spend some time looking at it; if you combine customer experience with operational metrics, that usually gives the best results. It's very important to know the steady state, because when you run a chaos experiment you want to verify that it doesn't break the steady state.

After that you go into the hypothesis cycle, and this is one of my favorites, because what I always suggest to customers is: take the entire team that develops the application, not only the engineers or the back-end developers, which is often the case when people start with chaos engineering. Take everyone involved in designing the application, from the UI designer to the back end, the architect, the program managers. Then you make a hypothesis: what happens if, for example, we take the database down? What happens if I inject latency in the network? Take one hypothesis and start there. What you need to do next is very important: ask people to write down on paper what they think is going to happen. For example, I'm a back-end engineer and someone asks me what happens if the database goes down. As a back-end engineer I'd probably write something like: my back end will detect the failure through a timeout; the timeout should happen after 30 seconds; after 30 seconds the circuit breaker will notice; the database should flip to read-only mode after a minute, maybe after updating the database's domain name, and so on. Try to get the timings and what people think will happen and how long it will take. Why do we do it on paper? Because if you do it without paper and people just talk in a room, it creates convergence.
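Going back to the steady state for a moment, tying it to a business metric can be as simple as watching a single time series during the experiment. Here is a minimal sketch assuming a hypothetical CloudWatch custom metric named Orders in a namespace named Checkout; the names and the idea of a "normal band" are illustrative only:

    from datetime import datetime, timedelta, timezone
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def orders_per_minute(minutes=10):
        """Return recent order counts; the steady state is this series staying in its normal band."""
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="Checkout",            # hypothetical custom namespace
            MetricName="Orders",             # hypothetical business metric
            StartTime=now - timedelta(minutes=minutes),
            EndTime=now,
            Period=60,
            Statistics=["Sum"],
        )
        points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
        return [p["Sum"] for p in points]

During a game day you would compare this series against the same window on a normal day and stop the experiment if it drifts outside the band the team agreed on.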
If you take people in a room and ask out loud "what happens if the database goes down," I'll start saying something, the next person will say "yes, this and this," and you create convergence; there's no way to really find out whether there's a problem. If people write it down, you get divergence: everyone has different ideas, and then you can stop and ask why everyone has such different ideas about what happens in production. That's usually a problem with specifications or documentation, so it's a very good moment to say "whoa, let's recalibrate," roll back, fix the specification, maybe fix some part of the software. This is a very interesting phase; I always tell customers that this exercise is also a very good way to improve specifications, not only something to do during chaos engineering. Make it everyone's problem.

What's also important, and often hard for people starting with chaos engineering, is deciding which hypothesis to make. If you're on AWS, and you probably are, you're likely using some sort of Auto Scaling group, so I'd suggest something like testing the Auto Scaling group: you have an application across multiple AZs, you do CPU injection on the instances behind a load balancer, and you check whether the Auto Scaling group reacts exactly the way you expected. That's one thing you can start with. Or look at the history of your outages: what's the most common kind of outage you get? If you often get outages from deployments, a very good approach is to inject a failure into the deployment pipeline: deploy something that should fail the health check, and then verify that you get the rollback and all of that. That's quite simple; just look at your history. A third one, which I think is pretty nice: look at your APIs and the critical services behind each of them, focus on one API, list its critical services, start with the last one and work your way up.

But don't stop there: make combinations of experiments. If you just do a simple experiment like CPU injection, that's usually not enough; you want to do a CPU injection and, at the same moment, maybe a latency injection. This is where problems really surface, because outages never happen because of one problem; they happen because a collection of small things combine into a much bigger outage scenario. So focus on the kinds of scenarios you actually see and try to reproduce them.

Then you need to run the experiment, and in chaos engineering we usually run experiments by doing failure injection. There are plenty of ways to inject failures. The most common ones are at the application level: you might throw errors, generate exceptions, and see whether you catch them. Do you want to fail fast, or do you want a very complicated try/catch that hides the errors? Failing fast is often a very good solution, especially if you want a recovery-oriented architecture. At the host level it might be CPU injection, where you burn the CPU, or you take away some memory, or you remove disk space. How many of you have had outages because logs filled up the disk? Yeah, that's a common one, and it's actually a very good experiment to start with: just write a big file on the disk of the instance and see what happens, because if you don't have space on the instance you can't open sockets, you can't do anything.
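As a minimal sketch of that disk-fill experiment, assuming a throwaway test host (the path and size cap are made up), you could do something like this and watch how the service and its alarms behave:

    import os

    CHUNK = b"\0" * (1024 * 1024)        # 1 MiB of zeros
    TARGET = "/tmp/chaos_disk_fill.bin"  # hypothetical path on the instance under test

    def fill_disk(max_gib=5):
        """Grow a file until the disk is nearly full or the cap is reached, then leave it in place."""
        with open(TARGET, "wb") as f:
            for _ in range(max_gib * 1024):
                try:
                    f.write(CHUNK)
                    f.flush()
                    os.fsync(f.fileno())
                except OSError:          # disk is full: this is exactly the condition we want to observe
                    break

    def cleanup():
        """The rollback path: write it before you run the injection."""
        if os.path.exists(TARGET):
            os.remove(TARGET)

Note the cleanup function sitting next to the injection: you want the rollback ready before you start.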
This kind of experiment is usually a very good way to see whether you can deal with that situation. Then you can go all the way down to a bigger AZ attack, where you remove a subnet or deny all traffic in the subnet, or a regional attack, where you remove all the routing to that region, especially if you use DNS or something similar. And then you might do a people attack. Let me explain what a people attack is; I don't just go around and break somebody's neck, don't do that. What I love to do is go into teams and identify the 10x developers, the people who are really good at doing extremely complicated tasks while drinking their coffee and talking to other people. They are the gurus; they know everything, they fix everything, they are gods, and we all have them in our teams. How many of you have these kinds of gods in your team? So what happens if they get hit by a bus? Take the laptop away. I did that a few weeks ago: I went into a company, they had this god, I took the laptop and said "you go home." Within six hours we had to bring that person back urgently, because no one knew how to fix the problem, or they didn't have access to the MFA, or they didn't have the right credentials. It's actually quite scary, when you look at your teams, how little information is sometimes shared. So do these kinds of things as well; it's not necessarily a software attack. Culture attacks and people attacks are very interesting, so I highly recommend you look at them.

What is also very important: once you plan an experiment, have an idea of how you might want to stop it or roll back; it can go wrong. This is why we capture the steady state: you monitor it, you run your chaos experiment, and all of a sudden the steady state goes wild and you need to stop because something is wrong. If you don't have a stop button or a way to roll back, and you didn't think about it, well, you've just created an outage in production, and I'm telling you it's not going to be good for your team or for the reputation of chaos engineering internally. So think about it while you design your experiment, have a way to roll back, practice it, maybe test it in a test environment first. Keep in mind that a chaos experiment might corrupt data or store incorrect data, so find a way to delete it; for example, tag the data so you know it was synthetic, or anything else you don't want to keep in production. Sometimes that's actually quite hard, but if you think about it up front, you at least make it easier for yourself.

Then you go into the phase of verifying what you wanted to verify, and this is very, very important: the time to detect. How many of you have ever had an outage you didn't get an alarm for? Yeah, it's very common. For one of the biggest outages I've had, we got alerted from Twitter. You don't want to be alerted from Twitter, and you especially don't want your CTO to tell you "Twitter says your system is down, what's happening?" So practice the kind of chaos engineering where you don't tell everyone in the organization exactly when; say "this week we might run an experiment," or just "today" without a time, so that people can really check the time to detect and verify that the escalation path actually works.
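Going back to the stop-button idea from a moment ago: it can be as small as a loop that compares the steady-state metric against the agreed band and triggers the rollback. A minimal sketch, where the metric getter and the rollback are passed in (for instance the hypothetical orders_per_minute and cleanup functions sketched earlier) and the threshold is made up:

    import time

    NORMAL_LOW = 80    # hypothetical lower bound of orders per minute on a normal day

    def guard_experiment(get_metric, rollback, check_interval=30, max_minutes=15):
        """Abort the experiment as soon as the steady-state metric leaves its normal band.

        get_metric: callable returning recent per-minute values
        rollback:   callable that undoes the injection
        """
        deadline = time.time() + max_minutes * 60
        while time.time() < deadline:
            recent = get_metric()
            if recent and min(recent) < NORMAL_LOW:
                rollback()                 # roll back the injection immediately
                raise RuntimeError("steady state broken, experiment aborted")
            time.sleep(check_interval)
        rollback()                         # experiment window is over: clean up anyway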
I've had people leave teams and the escalation path didn't get updated, so nothing got escalated; yes, that happens. Verify everything, every time. Check your response time; the time to detect an outage is super important. And then go all the way to the time to recover, with everything in between: how long did it take you to notify the public? Do you know what you're going to say? Do you already have messaging in place, or do you need to ask your PR team to write something while you're having the outage? You need to communicate, so have that in place and practice it. It's not only engineering; very often when you have outages, a lot of it comes down to communication, internally and externally.

Keep in mind that you want to collect as much data as possible, because then you need to go into a post-mortem. This is where you capture what happened: you make a summary of the incident, you put all the data you've collected into the post-mortem, and you build a timeline of what happened. Then, and this is where people always complain when I say it, try to deep dive into the causes of the outage; in brackets, there is never one cause, there is no single root cause, we all know it's a collection of causes. At Amazon we always look at three different angles: processes, tools, and culture. Deep dive into them, because you'll find a lot of different things to improve and to think about. Try to get as many graphs as possible; humans respond really well to graphs, so they add a lot of value in a post-mortem. By the way, we call the deep dive the "five whys," but it's not literal; some of them literally have 40 or 50 whys in different parts of the document. It's just the name of a process that comes from Toyota, from back when we started doing it, and we've never changed the name, but the way we do it is quite different from literally asking "why" five times.

In particular, we have a non-blaming culture, so we never stop at the operational problem. We don't say "it was an operator making a mistake." If an operator makes a mistake, what do you do? You stay there and ask: why was he able to make that mistake? Why didn't we have a guardrail in place, why wasn't he warned? When I deleted that database I could have been fired; I was not. I was actually asked: how come you were able to run that against a production system? That's a very good question, and so we went on to create all sorts of tools and processes. To me that's not removing freedom; it's trying to keep me from making mistakes. I was still able to do all that work, but with guardrails: if I typed DROP DATABASE, I got an MFA prompt and had to confirm; the terminal in production and the one in test had different colors, red for production, blue for the test environment, all that kind of thing. It's about creating safety in your operations, not removing the capability or the freedom of your developers; create a safe environment for them. Try to capture the blast radius of what happened and the number of people affected by the outage or the potential outage, and try to find ways to limit it as much as possible. A few things to remember, and I've already said them: never blame people for mistakes, and there is never one cause for an outage.
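As an illustration of that guardrail idea, and only as a sketch with made-up names, a small wrapper can refuse destructive SQL against anything labelled production unless the operator explicitly confirms:

    DESTRUCTIVE = ("DROP", "TRUNCATE", "DELETE")

    def run_sql(statement, environment, execute, confirm_token=None):
        """Run `statement` through the supplied `execute` callable, with a production guardrail.

        `environment` is a label such as "prod" or "test"; `confirm_token` stands in for whatever
        second factor your tooling provides (here just a string, purely for illustration).
        """
        risky = statement.strip().upper().startswith(DESTRUCTIVE)
        if environment == "prod" and risky and confirm_token != "I-REALLY-MEAN-IT":
            raise PermissionError("destructive statement on prod requires explicit confirmation")
        return execute(statement)

The point is the shape, not the mechanism: the destructive action stays possible, but it needs one extra deliberate step in production.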
And then you improve. Here I don't have a silver bullet. What I can say is that for us to improve, we always go through what we call the weekly operations meetings: weekly meetings with all our operations teams from all the service teams, where we go through the post-mortems we've had, identify the actions, follow up on the actions, and try to share best practices more broadly. If there's a best practice, we love to automate it or create a tool, so the best practice is embedded in the automation and we don't need to repeat it. Those are the kinds of things that actually improve operational excellence in the long term. If you rely on people's good will, you're doing it wrong; people already have good intentions, and when you hire engineers they want to do good. Mechanisms are what actually help. You can also look at something called the andon cord, which is a very good mechanism for everyone in an organization to be able to stop a process when something is going wrong, and it fits very nicely into that improvement process.

There are some challenges. In my opinion, most of them are cultural, and I'll talk a little about that, but also chaos engineering is not the answer to all your problems; it will identify some of them. The very good thing about chaos engineering is that it starts to change the culture of a company in the long term. People who do chaos engineering start to feel a lot more humble; humility increases, because it surfaces a lot of different failures, and you realize that maybe some decisions you or other people on your team made are not as strong as you initially thought. Culturally, that is in my opinion the challenge, but it's also its greatest strength. So look into chaos engineering for all this goodness: technical goodness, processes, culture, and tools. And on that note I want to invite Olga to tell us about the journey of Prime Video from chaos to resilience. Thank you.

Thank you. My name is Olga Hall and I run the resilience engineering team at Prime Video. It's an honor to be here and share the story of my team. About six years ago, for those of you who raised your hands and said "I've just started thinking about chaos and resilience," I was in the same shoes, asking: where do I start? How do I structure it? How do I make it work? What does progress look like? Everything I'll be talking about comes from my team's lessons on that journey, so I'm happy to share.

Let me set the context. It's important to highlight that under the hood of Prime Video there are actually three different businesses. If you're a Prime subscriber, you have access to Amazon original content. You can also watch live content, such as music events or, as of two days ago when we launched it, the English Premier League for customers in the UK; that's sports. If you're in the mood for a movie, you can rent or buy. And the third business is that you can also watch your favorite channels. It's important to highlight this because that complexity of business rules, and of different services coming together, shows up when we talk about chaos and resilience.

OK, so let's talk a little bit about our journey and where we started. We started with game days in preparation for Q4.
Many of you know that when the holidays come there is a lot of holiday shopping, Black Friday, Cyber Monday, so preparing for Q4 is super important to all Amazonians. We started there, and what we quickly realized is that it's not enough. As Prime Video moved from one country to the 240+ territories we run in right now, we were seeing content go viral and become super popular in different countries and different time zones, and we realized we needed to be in a totally evergreen state: our services need to be always ready for those pleasant surprises. So what did we do at that point? I added an engineering team, and that team very quickly created a suite of products that I'll share with you as an example of how to approach chaos and resilience. Right now we're at the point where a lot of that technology runs automatically behind the scenes, and we test in production.

I often get asked "so what comes after that?", and I think about this as well: you have the technology, you run these game days in an automated fashion, so what comes next? What I'm finding is that, inevitably, communities take shape within the organization and within the larger industry: engineers who are curious and passionate about chaos, or resilience, or specific load-testing practices, and these are the engineers who start talking about going much deeper in the journey and building even better tooling.

When I started, and for those of you wondering where to start, I created the program along the lines of the Well-Architected Framework, which basically means we have a systematic set of programs that grow and improve availability, scalability, and resiliency. The only difference is that we also have such a thing as high-profile events, or as we also call them, high-velocity events: launching content as popular as Jack Ryan or The Marvelous Mrs. Maisel, and, still to come, live sports. On the corresponding team structure, I have a team of engineers and a team of technical program managers who run these programs; the engineers build the tools for the programs. An interesting insight here: at Amazon we have a culture of strong ownership, compared maybe to some other companies where you hear about site reliability engineers, people whose job it is to focus on performance. We don't do that; we make it every team's job to care about availability and resiliency. However, within each team there is inevitably a person who is super passionate, who steps in and says "I'm going to be the single point of contact," and we typically work with that single point of contact within the teams to structure those programs and accomplish our goals.

So what are the insights, what did we learn over the years? One really big thing we learned is that live streaming is super hard, and the bar for live streaming is so much higher. There's a phrase I hear very often, especially this year: nines do not matter if customers are not happy, and frankly, it's so true. We had a rough summer, especially in August during tennis matches, when customers were not happy, and we've learned that the expectation for live streaming especially is 100% uptime. Not only that: in addition to the 100% uptime, you need to be better than broadcast.
You need to make sure that all of the new features, like high definition, high frame rate, and Ultra HD, are available, and the user experience needs to be intuitive. Those were good lessons for us. So where did we start, and what was the mental model when we were thinking about live streaming? Our mental model, the frame of mind we stepped into, is that readiness to handle failure, or the unknown, is feature zero of the product, and I specifically emphasize "unknown." Why is that? Let me show you a couple of examples.

This one is super interesting: this is Bundesliga, a football match in Germany. The blue line shows the regular ramp-up of customers coming in, joining, and watching the game; workload coming in, typically much higher than your typical profile, that's wonderful. However, the match got tied in the middle; it was two high-profile teams going at it and it was not clear who was going to win. So what happened? We had a subscription spike, the black line, right in the middle of the match. Super counterintuitive, because you would think, and that's how we had planned, that with a high-profile match between two rivals the customers would join at the start. So we were learning: those of you thinking about workloads and how to prepare for them, this told us we also need to think through the arrival rate and a spike happening at any given point.

This second one was super cool and I wanted to share it with you. This is a popular TV show on our service, and what you see is this double hump, which is not normal: customers were watching the first episode when it got released, and what we know is that customers were rewatching the same episode a second time right after. Absolutely non-intuitive. For those of you wondering how to test this, it means you need to test your workload at double the duration; that was a bit of a surprise. And this one is my favorite, and I'm bringing it back; I talked about it a year ago: "last minute" is literally the last minute. This is what you see in digital media; these are subscriptions literally a minute before a popular show starts, and I call it the Eiffel Tower.

So let's talk about how to think about this unexpected behavior and what can help us deal with it. There is a law, Little's Law, that I really like when I think about framing hypotheses and working with my teams: how would we start? Adrian talked a lot about building hypotheses, and your question is probably "where do I start, what are my hypotheses?" Little's Law helps with that. It basically says that the average number of customers in the system is the product of the arrival rate and the mean time in the system. So let's break it up. You can have a hypothesis that says "I want to know what total load will come to my services," which is really asking what the total number of customers in the system will be, and to answer that I need to project and forecast the arrival rate and the time in the system; you can build a hypothesis and an experiment that prepares you for that overall load. Another thing you can do is say: the arrival rate itself is interesting, I need to plan for those sharp spikes in the middle of the load, so I need an experiment that lets me do that.
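A tiny worked example of Little's Law (L = lambda x W) with made-up numbers, just to show how a forecast turns into a load target:

    # Little's Law: L = lambda * W
    arrival_rate = 5000          # hypothetical: customers arriving per minute at the peak
    mean_time_in_system = 90     # hypothetical: minutes an average viewer stays for a match

    concurrent_customers = arrival_rate * mean_time_in_system
    print(concurrent_customers)  # 450000 concurrent customers to size the load test for

Two different hypotheses fall out of the same formula: one experiment targets the steady concurrent load L, another targets a sudden change in the arrival rate, like the mid-match subscription spike.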
I shared these customer examples on purpose, because all of you have seasonality within your services and within your business, and what's really good to start with is understanding your customer behavior and how customers interact with your systems. Start there, because it lets you construct your hypotheses and your goals: what do you want to test, where do you want to start? Once you do that, the conversations about technology and the choices you have become much, much easier, because you know what kind of tooling you need, and we'll look at recommendations for what goes into that tooling as we go along.

So let me share where my team started. We started by asking questions. The first thing we needed to understand was: can our services sustain the projected load? What you see here is our resilience framework, where the first step is basically load testing: we start by understanding our hardware profile, how many instances we might need, our auto scaling rules, all of that. It's mostly positive testing, and it happens in production; all of our load-testing experiments happen in the production environment at Prime Video. The second question we ask is: now that we have that full load for the projected number of customers, how does this load change performance, and what should we see between our dependencies and services in terms of latency, errors, and throughput? Let me phrase it slightly differently: a question you probably hear very often is about SLAs, how do we construct SLAs between our services and dependencies? Experiments like these help you identify what is acceptable. The third step is asking ourselves: when is the breaking point? We run stress-test experiments where we go a little higher than our forecast, and in some cases much higher. We typically recommend that teams do it in their beta and test environments first, but we also run these experiments in production. As a matter of fact, just last week, as we were preparing for the launch, one of the teams said "we've done all of the testing we need for the projected load, but we really want to know the signals of our systems in distress, why don't we double it?", and we constructed that stress test so we could understand how the systems would respond. We ran it, it was a super productive test, and we're really happy with it. Last in the resilience framework, but not least, are the chaos experiments: dependency failure, a super popular experiment here, or slow network, or CPU and memory maxed out; I'm not sure which is most popular for you, but dependency failure is the most popular one with Prime Video engineers.

So what answers did we get from those four questions in our resilience framework, and what did we learn? In digital media, when you read about big events, there is almost always an assumption about concurrent streams; it's almost like a currency by which you measure how popular the event was, and what we've learned is that parts of the playback systems absolutely scale with that concurrent-streams metric. Remember Adrian saying that when you construct your experiments you should tie them to your business metrics: for us, that concurrent-streams metric was super important for understanding the health of playback and for experiments on playback.
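Going back to the stress-test step of the framework, here is a deliberately toy, single-threaded sketch just to show the shape of a stepped test that ends at double the forecast; the endpoint and rates are made up, and a real test would use a distributed load generator rather than a loop like this:

    import time
    import requests

    FORECAST_RPS = 10                                # hypothetical forecast, requests per second
    URL = "https://example.com/api/playback/start"   # hypothetical endpoint
    STEPS = [0.5, 1.0, 1.5, 2.0]                     # fractions of the forecast, ending at double

    def stepped_stress_test(step_seconds=60):
        for fraction in STEPS:
            rate = FORECAST_RPS * fraction
            end = time.time() + step_seconds
            while time.time() < end:
                try:
                    requests.get(URL, timeout=(3, 10))
                except requests.exceptions.RequestException:
                    pass                     # errors are data here: the monitoring tells the story
                time.sleep(1.0 / rate)       # crude pacing; real tools spread load across many workers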
But we also learned something different: the arrival rate matters for the first part of the journey, where customers are subscribing, discovering content, and interacting with the systems before they hit the play button. Our insight was that we need to construct different experiments that line up with different business metrics, and the biggest takeaway I have for you is to think through not one but multiple business metrics; the two I shared with you, especially framed with Little's Law, are a good example of how to think about constructing these experiments.

OK, so let me share which tools we actually use. We're known for game days; like I said, we run game days in production, and a game day is the hypothesis plus the experiment. We have a few ways to construct a game day. One way is to ask yourself: I'm preparing for this high-profile event, I know what my workload peak will be, what do I need to do, and are my services ready for this workload? That's load testing. The second thing we do, as Adrian pointed out, is take an outage, take the logs from that outage, and replay them, so you have a bit more time to dig deeper into the systems and services and understand what went on. We also intentionally cause failures to determine whether our systems behave as expected. A little fun fact: when we were preparing for the English Premier League, we ran a dress rehearsal with a test stream from the EPL coming through, and our team sitting in the room did not know which of the issues we were discussing were real and which were test ones; we mixed them together on purpose, just to test the teams' responses. One of the chaos experiments we ran was host reboots: frankly, we rebooted 7,000 instances in one day. It was spread across multiple teams and multiple services, obviously, but it was really good to see that monitors and alarms were kicking in, and our friends at AWS were tapping us on the shoulder, "hey, what's going on with Prime Video?", and we were like "don't worry, we're just running chaos experiments." So as part of these game days, testing and running hypotheses, that alarm verification, making sure your configuration is set correctly, is a great by-product that we almost always find helpful.

On the process side, we have a process called the operational readiness review, where we ask teams that create new services or new features to go through a systematic set of questions, make sure the whole resilience framework I've just described has been addressed and all the tests have been done, and share the results with us. In terms of tooling, we've created a control plane for running game days. It's super flexible: it lets us put together a large number of load generators with hands-off execution; there's only one engineer pushing the button, and it runs across the entire platform, across all of our services and also dependencies in retail and with digital partners, so it's super cool and super helpful. We've also built additional products, internal so far, that help us do forecasting; the forecasting is done per individual service, because you need to know what that load will be. We run resilience experiments, those chaos experiments; sometimes we deploy them at full load automatically, and sometimes we run them manually, depending on complexity.
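The host-reboot experiment can be sketched with plain EC2 APIs. This is only an illustration, not Prime Video's tooling, and it assumes instances have opted in via a hypothetical tag chaos:reboot=allowed, rebooted in small batches so alarms and auto scaling have time to react:

    import time
    import boto3

    ec2 = boto3.client("ec2")

    def reboot_tagged_instances(batch_size=10, pause_seconds=300):
        """Reboot opted-in instances a few at a time and let monitoring react between batches."""
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag:chaos:reboot", "Values": ["allowed"]},   # hypothetical opt-in tag
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        ids = [i["InstanceId"] for r in resp["Reservations"] for i in r["Instances"]]
        for start in range(0, len(ids), batch_size):
            ec2.reboot_instances(InstanceIds=ids[start:start + batch_size])
            time.sleep(pause_seconds)       # give alarms and auto scaling time to show their behavior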
A super important point, as Adrian said: you have your hypothesis, you did the validation, but the next step is to learn and improve. How do you do that? We built a fairly rich reporting and analysis set, and what we're able to do with it is say: OK, what happened during the test? Let's compare it with what we always see in production, and let's also reason through what we expect in terms of load during the real event: how many customers will be on mobile devices, how many on living-room devices, who's going to be on the web. We talk through what that load distribution will look like, and having that reporting and analysis is super helpful.

These are the chaos experiments, organized by the most popular ones; as you can see, CPU hog, packet loss, and dependency failure are super popular within our engineering community. Let me give you a practical example. A very common outage scenario, besides certificate expiration, is dependency throttling: a dependency has failed or is throttling you, and you're dealing with an availability drop because of that dependency. The solution is fairly obvious: you need to think through fail-fast, a circuit breaker, or something similar. That's great, but then the question becomes: how do you test it? You really don't want to wait for the next outage, and this is where chaos experiments, doing failure and throttling injection, actually help you. These are real-life graphs: after we put the code fix in place we ran the chaos experiment, here you can see the throttling injection kicking in during the attack, and the latency stayed healthy with no spillovers. The overall learning is that when you think about your chaos tooling, it's really not just one platform, it's a tool set. Start with the best tooling available to you now, so that you can construct your own experiments, and take your needs into consideration: how do I do forecasting for the services, how do I do failure injection when I need it, and then having an experiment repository, so you keep learning and have a consistent learning cycle, is super important as well.

All right, let's do a bit of a summary. When you start, start with a very specific goal: what am I doing and why am I doing it? That goal should be tied to a business metric or to customer behavior, because then the conversation with your engineering team and your leaders across the group, explaining why we're doing these chaos experiments and what we're preparing for, becomes so much easier. We have to prepare for those high-profile events and the shows that you enjoy, and it was easy to get started because we know the delivery needs to be super smooth; so find the goals you want to start with. The first step is important. Why? Because you want to prove that it works, so think through which hypothesis you want to start with and what customer behavior you want to validate in your system to make sure your service is ready. Once you've done that first step, the rest of the journey becomes easier. Keep a programmatic focus, because what I've learned over the years is that our engineers are fantastic at asking questions: how can I do this better, what can I improve over time, how can I scale our services better, and that programmatic focus allows you to build the program over time.
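Going back to the dependency-throttling example, here is a minimal circuit-breaker sketch to make the fail-fast idea concrete; the thresholds and names are made up, and it only illustrates the pattern, not Prime Video's actual tooling:

    import time

    class CircuitBreaker:
        """Open after consecutive failures, then fail fast instead of piling onto a sick dependency."""

        def __init__(self, failure_threshold=5, reset_seconds=30):
            self.failure_threshold = failure_threshold
            self.reset_seconds = reset_seconds
            self.failures = 0
            self.opened_at = None

        def call(self, func, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_seconds:
                    raise RuntimeError("circuit open: failing fast")   # fail fast while the dependency recovers
                self.opened_at = None                                  # half-open: allow one trial call
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.time()
                raise
            self.failures = 0
            return result

A throttling-injection experiment then verifies that the breaker actually opens and that callers fail fast instead of queueing behind the throttled dependency.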
As we look at the entire framework, resiliency is really bigger than just chaos; there is much more going on, and it prepares you for the unknowns in customer behavior that are super hard to predict. First point: keep learning about your customers; have the feedback loop that lets you ask what has changed. Some systems are super seasonal, so having the discussion about what has changed with the seasonality, and bringing that back into your chaos experiments, is super important. And last but not least, chaos is super fun, just absolutely fun. Rebooting those 7,000 instances was a lot of fun, because we learned a lot about our auto scaling rules, about alarms and monitoring. It's one of those things: we've done it, so start your journey, and start your journey today. [Applause]

I'm sure you still wanted to do this part; this is yours. All right, the last part, apparently the last slide: meet chaos where you are. I started my story with me thinking, being in your shoes, sitting in a chair like the first couple of rows, just being interested. You have a lot of tools, and there is a lot of material at this conference that will help you start this journey; hopefully the presentation material here gives you an insight into how to structure the program and how to get started. My call to action to everybody: take chaos to your companies, within your groups, on your own journey.

Yeah, thank you, that's the most important thing. A lot of people think chaos has to happen in production; it doesn't have to. I'll tell you a story: in my previous company, when we hired developers, the first thing we did with them for the first week was to build systems and actually break them, to learn chaos-oriented development as second nature: when you build something, how to break it and how to automate that. That culture then develops, and they get so good that they build things, they develop the right tools, and then it goes to production. So don't always believe what you see on Twitter: chaos engineering starts on your laptop, then might move to a test environment, and eventually your team will bring it to production. On that note, thank you very much for your time. I hope you've had a good re:Invent, enjoy the party tonight if you don't have another talk after this, and feel free to ask us questions; we're going to be here, and feel free to ask us on Twitter as well, these are our handles. Thank you very much. [Applause]
Info
Channel: AWS Events
Views: 3,207
Rating: 5 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, DOP309-R1, DevOps, Not Applicable
Id: ztiPjey2rfY
Length: 58min 23sec (3503 seconds)
Published: Mon Dec 09 2019