AWS Nordics Office Hours - Prepare and protect your applications with AWS Resilience Hub

Captions

Hi everyone, and welcome to AWS Nordics Office Hours with me, Gunnar. I am a developer advocate here at AWS, and I'm fortunate enough to be with you every week to hopefully inspire you to build with AWS. This is our weekly show where I bring on guests to talk about a specific topic, and you, our viewers, are able to learn something new, hopefully, and get all of your questions answered. This week I am joined by Florian. He's a senior solutions architect at AWS, and we're going to look at how to prepare and protect your applications from disruptions using a new service, AWS Resilience Hub. So welcome to the show, Florian.

Thanks, happy to be here. Thanks for the invitation.

So tell people a bit about yourself and what you're doing.

Sure. So I'm a senior solutions architect with AWS. This pretty much means I own the technical relationship with our customers. I'm assigned to different accounts, so I work with some of our customers. I'm specifically in the digital native space, so that's essentially companies born in the cloud that are on their way to grow. And I'm pretty much invested in continuous resilience. I'm running some programs with my customers, and as such Resilience Hub, the new service, is a very interesting thing for me to share with you. So I'm really looking forward to that.

Yeah, and that's how you and I have gotten to know each other as well during my time at AWS, through resilience and chaos engineering. So let's start off with AWS Resilience Hub. It was released, is it a couple of weeks ago now? Tell us a bit about what AWS Resilience Hub is.

Right. So AWS Resilience Hub is essentially a central place where you can import applications, or resources connected to applications, and define what an application or workload is for you. And you can track the resilience of these applications on AWS. So essentially you can, on the one hand, you know,
look at what you have, what is part of an application, and create that segmentation, and create expectations in terms of resiliency: what are the recovery time objectives and recovery point objectives for your application? You define these as policies, associate the policies with the applications, and run assessments. So it's continuous resilience, pretty much like any other kind of testing. You wouldn't launch a new feature into production before testing it and making sure there's no bug there. It's pretty much the same thing with resilience. It also provides suggestions to improve your infrastructure layer and software layer: standard operating procedures and fault injection experiments. So it also integrates very nicely with an earlier service released this past year, Fault Injection Simulator, in the sense that you can artificially create disruptions in your infrastructure to really measure how resilient your application is. So that's it pretty much in a nutshell.

Yeah, so we can pretty much end this session now, you've covered it all in one answer. No, we're going to dive deep on the service, and you're going to show us a couple of demos of how to use it. A word you used several times there is resilience. So tell us what resilience is when we're talking about, well, basically building systems.

Right. So first of all, resilience is one of the core pillars of the Well-Architected Framework. That's essentially an aggregate of best practices and suggestions that we see out there in the field with different customers. And resilience is essentially the capability of your application to maintain availability, to be up and serving the correct kind of data that it needs to in the face of turbulent conditions and unexpected failure. And especially since we're looking at distributed systems, I guess we can all agree
that building distributed systems is inherently hard. Things fail all the time, and this is something that we need to embrace and live with, this chaos. Testing resilience is essentially a way to take your understanding of how well the app can perform from anecdotal evidence to an actual test that you can run, I don't know, before big events or something like that. And additionally, of course, investing in resilience, like tech debt or testing, may not be the most interesting or the nicest thing to do, but it's something critical. Resilience by itself doesn't guarantee that the apps are going to delight customers, but I think the absence of resilience can really deter them.

Right. And you mentioned the AWS Well-Architected Framework, and the pillar this falls under is the reliability pillar. AWS Resilience Hub is built around some of the best practices, or it uses parts of what's in the Well-Architected Framework, and we'll see that when you're doing the demo as well, I believe. So let me bring up your screen share so you can walk us through a bit of what AWS Resilience Hub is.

Sure. So first of all, just to take a quick look at the service page. This is essentially the landing page for more information, so you can get a snapshot of what the service is about and what it integrates with. So we have here Fault Injection Simulator, CloudFormation, Systems Manager, and CloudWatch, and what it offers. There are also the links to the pricing page. And essentially this is what I would recommend you have in mind as a first thing when you want to learn about a new service: there's always a service page, there are blog posts, and then ultimately there's the documentation to deep dive into it. So these are the use cases, as
described earlier. We are really trying to think about continuous resilience: see what kind of weaknesses are in the applications and protect these applications with policies that are really tailored to them. So again, resilience is a spectrum. Not every application needs to be mission critical or have mission-critical resilience, but we want to create that separation and, of course, for customers in financial services, for example, help them meet contractual or regulatory requirements in terms of resilience, and to have an overview and a dashboard that we're going to see later. So that's the intro to Resilience Hub. All of these links will be shared, I think at the end, so there's no need to...

Yeah, I'll post links in the chat throughout the session. When we talk about distributed systems, and perhaps the complexity there, I think it's pretty important to mention that it doesn't have to be services that have hundreds or thousands of microservices in use. A distributed system, or even a complex system, can be quite small given the way that we build systems today.

Sure, yeah, absolutely. And that also goes to say that reliability is not really just for these advanced use cases or for very mature workloads. That's why we actually refer to it as continuous resilience and continuous reliability. It's really something that can start even before the application is in production. If you think of operational readiness reviews, these kinds of checklists, it's really about the question of what you include in the definition of done to measure whether an application is ready to serve customers. And if it is an application that does have reliability and resilience as a core principle, maybe it's an ordering system, maybe it's a payment system, maybe it's the back end for a real-time video
game or something like that, resilience is really critical there. So the journey to resilience starts maybe even before going live.

Yeah, and it's not only about the technical details. Resiliency is also a cultural thing, an organizational thing, about your processes and the way that you build and operate your system. And I believe AWS Resilience Hub covers parts of that as well, which we'll probably see later on.

Yeah, definitely. I guess we'll dive deeper into them later, but it's a really good point. When we're looking at resilience, it's not an isolated technological thing that can be solved. If things were so easy, I guess we would have solved it a long time ago. It's also about practices, people, organizations, and different cultures within different organizations, and that's something that Resilience Hub really touches upon. So we're going to look at infrastructure layer recommendations, and we're going to look at standard operational practices, things like alarms, monitoring, and fault injection experiments. So I think we can...

Yeah, I was just going to say, if you just joined us, this is AWS Nordics Office Hours with me, Gunnar. I'm joined by Florian today, and we're having a look at AWS Resilience Hub, a new service that was launched a week or two ago with the aim of helping us prepare and protect our applications from the disruptions that we see happen in our distributed systems. So if you have any questions, just pop them in the chat and we'll do our best to answer them. So Florian, let's get to it.

Yeah, cool. So before we dive into Resilience Hub, let's take a quick look at what we're going to use for today's session as a backdrop. The Well-Architected Labs are essentially a collection of workshops that go through the different pillars of the Well
Architected Framework that we just mentioned. They essentially help you learn about these best practices and apply them in a hands-on way. We're going to be deploying this application, which is a simple three-tier application. We have a database in a Multi-AZ setup, then we have an Auto Scaling group with an instance in each Availability Zone, and we also have a NAT gateway. And that's pretty much what we're going to assess, for the purpose of keeping it simple and keeping the scope small, so that we can look at what Resilience Hub has to offer. Cool. So without further ado, I guess we can jump right into it.

This is the landing page that you're going to see the first time you go into Resilience Hub, if you don't have any applications. Here you have quick navigation to the main dashboard, applications, policies, and notifications. We're going to go through them step by step, but I guess the first thing to get started would be to just add an application. So, the way we can add applications: Resilience Hub essentially discovers the resources in the application and creates a reference, so essentially a record of what the application is and what it is composed of. We can have up to five CloudFormation stacks, or we can import an existing Resilience Hub application, or we can use one from AppRegistry. So if you're, for example, a platform team, and you're offering your internal customers pre-configured, predefined platforms, you may want to look at that template, include it in Resilience Hub, and measure its resilience to keep a check on that. Or you can have resource groups. Luckily, the application that we're looking at today is CloudFormation-based, so we're just going to choose the stacks. The stacks that we are interested in are the
web servers for resiliency testing, the MySQL layer, and then the VPC with the associated resources. There's also a way here to add stacks outside of this AWS Region, so we can essentially describe a multi-Region application. However, if we want to add a stack outside of this AWS Region, we need to provide the ARN.

Right, so you need to set up the permissions to be able to reach that stack.

Exactly, yeah. So once we've selected the stacks, the next thing to do is give a name and description to the application. Let's just call it web application office hours demo, and then we can add a description, something like mission critical web application. This just helps us identify what the application is, especially if we have a lot of them. Of course, as a best practice we also recommend tagging the resources, and also the Resilience Hub assessments and applications. It does make sense, because you can have an overview of what is really connected to one particular workload. So here I would say something like cost center, and just add a random cost center, and then I would add workload and just call it web app demo. Then once we hit next, we are going to have the resources identified, and this can take a while. It's going to describe these stacks and look at what's inside, and we're soon going to see a list of the resources. So we found eight resources, out of which six are included and supported. Two of them are not supported yet, but of course these are going to be supported at a later time. One thing that we need to remove from this particular one is this Lambda. The Lambda is actually not in scope of the infrastructure. It's just a custom resource that helps with orchestrating the deployment. So I would just
exclude it here at this step, out of the resilience VPC stack. Then it's just going to update the resources and redescribe them, and then we should be good to go. All right, so now you can see that...

Yeah, I was just going to say that I'm sharing the link to the documentation in the chat. I believe that also explains what resources are supported and so on as of now. But as Florian said, more resource types will be included as more and more customers start using the service.

Yeah, sure, absolutely. We're actually going to have a look at the resources soon, so let's just go through that. We will select the policy. I already created a policy. If you don't have one here the first time, you can just click on create resilience policy, and this will take us to the policies view, where we can create a policy based on business needs. We can give it a name and description and then classify what kind of application it is. Is it foundational IT core services, something like email, without which nobody could do any work? Is it mission critical? Is it non-critical? And then you can also do manual configurations: what is the RTO and what is the RPO. So maybe just to have the definitions out of the way: the recovery time objective is, from the moment of an outage, how much time it would take to bring the application back to normal operation. So that's the time objective of recovery. And the recovery point objective is essentially how much data we can lose, so before an outage, how much data is acceptable to lose. For some applications this may really be zero seconds. For some applications it may be minutes or hours. It really depends on the application itself.

And that basically means, if there is an outage and you have to revert to, for instance, a backup, it's a matter of how far
backwards you are able to go and lose the data in between. That was a strange explanation of what you actually explained a lot better.

Exactly. So if you think of the outage as being at one particular point, then backwards, how much data did we lose, that's the RPO, and forwards, how much time does it take to get back up to speed, that would be the RTO. So again, you can define these manually. We can look at application, so customer applications. These would be RTOs and RPOs related to things like software bugs, or some operator error, or a bad release, or something like that. So this is essentially something that would cause some kind of business failure, so APIs failing or something like that. Then we have the infrastructure itself, and for the infrastructure we look at individual components, we look at Availability Zone, and we also look at Region. In this session today we're not going to take a look at the Region, because our application is bound to a single Region. But of course, if we had imported a multi-Region application, we would have gotten recommendations in that way as well. Then there are tags, in case you want to tag the policies. But what I did is I used a suggested policy, which gives you an easier starting point, and you can look at some suggested policies. I classified this as a mission critical application, and I just created this one, and then I had to perform an edit. The edit that needed to be performed is here at the application layer, because we're using an RDS database and we're not using Aurora. So the actual failover and RTO is about 100 minutes, one hour 40 minutes, give or take. So I just edited the RTO to be one hour and 40 minutes, because in the scope of the demo I don't think we have time to really upgrade to Aurora right now. It
would take a bit. So you can use these as a starting point and adapt them to your business.

So Florian, in your experience as a solutions architect, do businesses have RTOs and RPOs in place for their applications and their infrastructure?

Sure, yeah. I would say most of them do. I've seen some that don't. I would highly recommend it, because I think it's important to really think about what our goals are, to really measure against them. It's hard to answer the question of whether we are meeting resilience or availability targets if we don't actually set a target. Of course, the first time around the target doesn't need to be hyper aggressive, unless, again, it's a critical system. But it's good to start measuring these things and turning them into metrics. For example, how many bad deployments have we had this month? Do we want to include that as a metric? Do we want to track it and then try to improve it, to measure where we are and see whether we're making progress in the direction we want to go? So I strongly recommend that they be included, and even better if our customers think about them before they go live with an application, because just thinking about these things makes you ask some really interesting questions. Should we use more Availability Zones? Should we back up our data? And with the way that we back up our data, can we recover within a specific time frame? Do we even need to right now? And I think this is not just something that the technology side can answer. If we just throw it over the fence to the infrastructure teams, it's not something that can be answered there, because there are business components in there. You also need to work with your stakeholders to understand what are their
expectations as well.

Having that discussion between the different organizations within your company, I think that's an interesting discussion to have, because it needs to somehow be related to what you're actually building, and to, as you mentioned, whether you need to use multiple Availability Zones, whether you're using RDS or Aurora, and so on, to be able to meet these. But what we're going to see is that AWS Resilience Hub will also help show us whether the RTOs and RPOs we're setting up are actually feasible.

Sure, yeah. So that's essentially the creation of policies. I will not create that one, because I already have this one right here, so I will just go to it. Actually, that's why I couldn't see the policies: my session has expired. So let me just log in again. Of course, the demo effect, as always.

It's there to show that it's a live demo. All right, you're back up again.

Yeah. So this is the web application policy that I created. I just mentioned that we changed the RTO to one hour and 40 minutes. And then if we go back to actually creating the application, we can see that it's not published. What we created already is still saved. We don't have a resilience policy attached, because we stopped at that step. So we would first need to attach a resilience policy, and I would just select this one to attach. And then we see here a summary of the application. We see what version we are looking at. It's actually in draft, because we haven't published the application yet. We're always working with two versions. Right now we only have a draft, because we didn't publish it yet, but we will always have a release version and a draft version to allow us to perform some changes. So one thing that needs to be
done here, and this is something that we've reported to the service team and will be fixed soon: the NAT gateway is detected as three components, because it is a highly available managed service, so by default it has three components. If we assess them as such, they will be considered as one resource per Availability Zone, so they will not pass the check. So what we need to do here is create a new networking component and just call it gateway. Here you can see the component types. We're talking about resources and components, so first of all, what is a component? Components are essentially groupings of resources that belong together. Here we have compute, database, networking, queuing, and storage app components. We want to create one for the networking app component, to reassign these NAT gateways to one component so that they belong together and are assessed together, and not as part of three different components as they are right now. The way I would do this is select each one of them and say change component. I would remove the component that it got auto-detected to, and just add the one that we just created. So I will add NAT gateway 1 and then do the same for the other ones before I actually publish the application. So again, this gives you that step to reevaluate the detection, how the resources were detected, if you want to perform logical regroupings, or maybe exclude some of them if you haven't excluded them in the first step, and make sure that everything is good to go. So I would also add this to the NAT gateway component, and then we have the third one.

And you talked about thinking about resilience, RTOs, and RPOs even before you actually start building
an application. But it's when you actually have it in place that you start using AWS Resilience Hub, is that the case? Since you're using a CloudFormation stack, for instance, you need to have infrastructure in place to be able to use it.

Yeah, that's a very good question. I guess before going to production, you're going to test the application or deploy it in a staging or pre-production environment. That's maybe somewhere you can already start using Resilience Hub and assessing the application. So if you have something that is similar to the way it will look in production, you can already start before going to production. Some of these recommendations, and a lot of the recommendations in the Well-Architected Framework, are applicable early in the development stage. And again, resilience is something that has an impact on the way you develop the software, the way you test it, what you test, and so on. So thinking about it as early as possible will help you catch some of these omissions or bugs earlier in the process.

So here I've added all of these to the aggregated component, so I will delete the generated ones and just remove those that don't actually have anything inside, and then we can publish our application. So now we have the resources, we have the NAT gateways belonging to the same component, and I would just click here, publish new version. This pop-up tells me that I need to be careful: the release version will always be the one that assessments are run against. As soon as I press publish, we see that we have a release version, and there's no difference right now. But I could start working on the draft and change the infrastructure as the application evolves, because I guess maybe if you're
not running resiliency assessments every day or every week, if you do it once a year or before business peaks, there may be some changes to the application's infrastructure or architecture, and then of course you can work on that draft and make sure that you assess the latest version. So with that in place, let me just get this green bar away from here. We have these four steps, which are the main workflow in Resilience Hub. We published the application, we described it, and we looked at the resources. The next one is to assess the resilience. Running an assessment will result in us getting a report. So I will say here report demo, let's just call it 2021-11-22, and then we just run the report. This will essentially look, in the Resilience Hub back end, at all of the resources, the way they're set up, and what kind of configurations they have. Does the database have a backup and continuous backup? Do we have an Auto Scaling group? Do we have configurations in place? And it comes back with recommendations, three types of recommendations, namely alarms, standard operating procedures, and fault injection experiments. Alarms are essentially things that we may want to be notified of, things we may want to continuously watch. Standard operating procedures are operational runbooks, how to do a particular task in our infrastructure. And then fault injection experiments: we're going to see things like taking down an Availability Zone whenever we want to, or refreshing instances in an Auto Scaling group, failing over a database, things like that. So things that could happen in day-to-day business that you want to check. So with that said, here we have the list of assessments. Here we would keep all of the assessments, in case you have a need to show them maybe to
external auditors or something like that. You can always go back in time. Maybe a smart thing would be, if you have a global versioning system, to include what the version of the application was at that particular time, so that you know version X of my application, with all of the dependencies, ran through this assessment and this was the result. So looking at the assessment report, we see that the policy is met. We have here the results in terms of RTO and RPO, and they are split across application, infrastructure, and Availability Zone, and the Region is not applicable because we don't have a multi-Region construct. If we expand all of these, we can see what the target RTO is and what the estimated RTO is, and we can look at breaches, if there are any. And here we can actually hover over and click on the estimated RTOs to see how we reached these conclusions. So for the compute one, it's how much time it takes to roll back or forward to a stable configuration. Of course, they're based on estimates, so the real time may really vary, and the more this gets used, I guess the better the recommendations will be. Then the database app component: again, we have RDS, so we don't have Aurora here, and this is the estimated RTO. The networking app component, we just mentioned it, is a NAT gateway, so it's fully managed and highly available, so we don't estimate that there's any RTO and RPO impact, and the same goes for cloud infrastructure and Availability Zone. All right. And then here all of them are documented: how do we get these suggestions, and in case there is a breach, which one is the breach. Luckily we don't have that. But what we do have, and it is very interesting, maybe at the top, is resiliency recommendations and operational recommendations. So if we
look at the first ones, just click on the resiliency recommendations, we can see that we have the three components that we just created: the compute one, the database one, and the networking one. They're essentially evaluated independently as part of a workload. And here we have three kinds of recommendations that we get. The first one would be optimizing for Availability Zone RTO and RPO for compute, the second one would be optimizing for cost, and the third one would be optimizing for minimal changes. So you can see that there are different changes in terms of how much we want to invest, either as money or as time spent, or whatever the goal is. Here the Availability Zone one is the most comprehensive. We could add an Auto Scaling group in a different Region and then do rerouting to this different Region using Route 53. It also includes the one which is optimizing for cost, to set min size to two and max size to four, so that we don't have maybe one instance in every Availability Zone, because of the traffic that we have. And in terms of minimal changes, we don't actually have anything in compute. But if you look at each one of them, you will see different recommendations.

Yeah, these recommendations are then basically built on a rule set that relies on the Well-Architected Framework's architectural best practices.

Yeah, and also of course on what we see in terms of usage and in terms of what our customers are doing. We just saw that we have the estimates, how we estimate these application RTOs, and of course the changes themselves also come with updates to the estimates. So here we see that we would have some improvements in terms of RTO and RPO if we made this change. And maybe in the compute one as well, we would again have infrastructure and application
impact across these two stacks. So those are the resilience recommendations we have. What we could do now is take this change, perform it, and then rerun the assessment, because I guess between running assessments you would make some of these changes. Every time you assess, we're looking at the infrastructure and the application as it stands. Additionally we have operational recommendations, and these are the ones that we mentioned. These recommendations are provided based on our best practices, as we see them in AWS, but again application resilience really varies from application to application. Even within the same department, within the same customer, they may have two kinds of applications they're responsible for, and of course they need to evaluate them independently. In terms of operational recommendations we have alarms. One of the very interesting ones, the one that we would create, is the AWS Resilience Hub synthetic canary. It's an alarm that we get as a recommendation, and it's common to everything. With it we could create a synthetic test against our API, so that we essentially ping our EC2 instance every minute to see whether it's still up and running. On that note, this is essentially what you will get with the infrastructure: it's just going to serve the metadata of the instance, and you're going to see the Availability Zone that is serving this request. So this is the one that we would create. What would happen is we would click on "Create CloudFormation template" and we would essentially create it in CloudFormation. I already have all of these created, so I will not create them right now in the interest of time, but I will show them to you, and
this really generates a CloudFormation template. You can just take all of them, go to CloudFormation and create them. And there are more that we don't have set up, for example a high CPU utilization ASG alarm, over-utilized CPU for the RDS instance, and so on and so forth. So for all of them we get these operational recommendations. Then we get standard operating procedures, and these are essentially how to scale up and scale down the ASG and how to restore RDS from a backup. Again it's the same thing: if we click it here and create a CloudFormation template, we get a generated template. Just for the sake of the demo I will create one here. I would call it "synthetic canary" and just say "web application". When we create it, it first of all will notify us here that it's being created, and it will also add this newly created template to the templates side. So these are the ones that we chose to actually go forward with. If we go ahead and click on this one, you can see that it created an S3 bucket where we have everything inside. So essentially we would go to this S3 bucket, go to the alarm folder, and we have a JSON file here which is essentially the CloudFormation template, and this would contain the alarm. We would download this, go to CloudFormation and just drop it in. So that's what we get out of the assessment report. Additional operational recommendations include fault injection experiments, and these are maybe the interesting ones, because now we can look at a holistic view of what my resilience is, what experiments I can run, when I ran them, what the impact was, and so on and so forth. To run the experiments, the actual execution of the experiments depends on some of these
alarms. For example, if we are looking at refreshing instances in the ASG or simulating an Availability Zone outage, the experiments come predefined, but they do require a synthetic canary alarm. Resilience Hub makes it easy because we can just import this one, so that we have it already created and everything is ready to go. So it then uses that as a stop condition in the experiment? Yeah, exactly. This is essentially the thing that will be monitored, and if something goes wrong with the alarm, then of course we will see it on the dashboard and the experiment will stop. So we go back to the dashboard. We see that we have this application, and currently we didn't set up the recommendations and we have a resilience score of zero percent. The way the resilience score is calculated, it's essentially a weighted, normalized value between zero and one, and you can find this in the documentation. We have the test coverage, the monitors coverage and the standard operating procedure coverage: out of all of these suggestions, how many did we implement? There's also a weight associated with all of them. If the recommendations are for a regional outage, maybe the weight is not so high, because it's a more unlikely event. However, if we're looking at a software disruption type, because you roll out a bug, that would have a higher weight, because changes in the software layer are more frequent and more likely than a region not being available. So this is the application. What would happen now is that I would set up the recommendations and then reassess the application, and Resilience Hub will detect that I have set up these recommendations, and
we can run the experiments. I will go to CloudFormation real quick so that we see that we have these set up. We have the stacks: here in the lower part we have the CloudFormation stacks for the other application that I prepared. Here I have the web application canary alarm, and then I have the two Fault Injection Simulator experiments: ASG instance replace and Availability Zone outage. So what does the canary alarm actually look like? It's going to create a CloudWatch alarm, and it is this one over here. This is the name that we will need when we actually run the experiment. As you can see (I'm not sure if you can see it well here), we have a success percentage of 100. The alarm itself alarms on whether the success percentage is lower than 100 for any two data points within three minutes. So we wouldn't accept any kind of outage for the experiments that we run, because we do have a multi-Availability Zone setup, we do have an Auto Scaling group, and we would expect the application to still be running, albeit maybe not serving from some Availability Zones. So this is what you get from Resilience Hub. You would need to edit this metric and specify the name of the canary. The canary itself is not defined, because this is something specific to your application. If you go back to CloudWatch, in Application monitoring under Synthetics Canaries you can create a new canary and essentially define how to measure the application. I have it already prepared, but the way you can measure that is you can just add an application or endpoint URL, and you will get a Puppeteer script that just tries to fetch the web page. This is essentially the actual data underneath the alarm. So that's something we have. I called it simply
"canary", and this is the name that I've also put in the alarm that comes from Resilience Hub. You can see that it's checking the health of the application all the time. With that up and running, if we look at the other application, which is essentially the same one with the recommendations set up, I can see here in the summary that we have improved the resilience score because we have added some of these recommendations: one of them is the alarm, and we have two of the experiments. We also have executions here, and these are actual executions. In Fault Injection Simulator we differentiate between an experiment template, which is how to run an experiment, and an experiment execution. Here we can go to this and see what happens, but we will actually just run our own experiment. Let's see: we have here the simulated Availability Zone outage or refreshing the instances in an Availability Zone. I don't know, do you have any preference which one we should run? AZ is fun, I think. Okay, so then let's try to start the experiment. Here we get notified that we're going to do something that may be disruptive, so we just get a confirmation. I would start the experiment. So then we have here the experiment being initiated. I will open the EC2 console just so that we can look at the instances, and in the meantime we go to FIS to see the console. Resilience Hub really set up everything: I got a CloudFormation template, and all of these templates were pre-created to simulate an Availability Zone outage. You can take a look at them and edit them, but it essentially uses the SSM document integration. This already has integration with things like EC2 and RDS, but for anything that is general purpose and is not covered, you can run an SSM document, and that's essentially what is being done here. So here
we can look at the targets and we can look at what kind of exports. The targets are not defined because everything is within the script. So we're going to take one Availability Zone down, and if we look here, the stop condition was already preset to this synthetic canary, the one that we have created. There's an IAM role created so that we make sure we enforce the principle of least privilege: we don't need to give the experiment more permissions than it needs to have. And then if we look at the timeline, it's just running this automation script. So we can go to Systems Manager, look at Automation, and see the execution that is currently in progress from this, and we can see what it is actually doing. Currently it's selecting an execution mode, it's terminating some instance. It was already pretty fast, so maybe we've already missed it. I think now, if I'm refreshing, we're only getting B and C; now A is back as well. If we're refreshing this over here, we see that C was actually terminated. So the instance is being terminated, and we can still see we just get A and B, so we don't get C anymore. I'm not sure if you can actually see this very well; we're looking here at the Availability Zone. Yeah. And in the meantime the canary is actually doing the same thing behind the scenes, and it's just getting all this data. We can go back to our alarm and see that it's still fine, because we still have two Availability Zones serving. So we're not in alarm, we're still green here, and everything is fine. So now, as part of the experiment, we have terminated the web server instance, and the Auto Scaling group will pick up the change and at some point
will bring it back up to speed. So that's what it looks like. That's one experiment, and that's a very quick introduction. After running this experiment, if we reassess the score, this would also be taken into consideration in assessing the resilience. So that's the "continuous" part in continuous resilience, where you measure, you essentially define what normal really looks like, what the normal steady state is for my application, and then what would happen if I did this or that, or whether I need to think about it. And then really turning them into kind of like school experiments, to really try to take down some of these instances and then to measure and see: is there any difference from your hypothesis? If there is, of course update the hypothesis, because it's always a living document, a living thing. The same thing goes for policies and the application itself. We can always go back to the application descriptions under Versions and start working on the draft and say: look, we're not using this MySQL anymore, so we can exclude the resource, and we can include some new resource. Here's, by the way, the list of the resources supported: you see API Gateway, Auto Scaling groups, DynamoDB, and so on and so forth. We could say that maybe we added an S3 bucket, or we just added DynamoDB and we don't want to include this MySQL database. So you can always change that and update it. So these are the experiment templates. If I go back to the summary, we're seeing that the experiment itself is still running, so I guess it's going to take a while, but again if we look at the alarm, everything is still okay. Now I think, from just that perspective, with the experiments, with chaos engineering,
Resilience Hub really shows what chaos engineering is about in a larger context: that it's not only about running an experiment to cause or inject failure, it's actually about this entire journey to make your system more resilient to failure and to actually assess what those experiments do. Sure, yeah. Adrian is actually one of the people I work with a lot; I think you've had him on the show as well. I like that he says that chaos engineering is more of a thinking thing than a doing thing. So it's more about assessing, evaluating, determining what the service level agreements are, or whatever we have that we are committing to the business or to other stakeholders, and the actual experiments are just a reflection of that. So it's not really about creating chaos; it's about measuring and trying to tame the chaos inherent in these systems. So that is not to say, just go ahead and take down three Availability Zones, because if it doesn't give you any kind of insight about your application and you're just going to cause an outage, you're going to upset some people. The end consumers are really going to be upset, but also business stakeholders, and that's not really a good experiment; the outcome is just obvious. Yeah, I know Casey Rosenthal, one of the creators of chaos engineering at Netflix, famously said, "I wouldn't have a job very long if we were only breaking things on purpose all day." So, yeah, absolutely. Cool. So you've now walked us through AWS Resilience Hub from the beginning: creating an application, then creating or attaching a policy, doing the assessment, looking at the result of that, the report you get from that assessment, and finding out whether you are in line with your RTOs and RPOs or if you need to
improve your application. How would you say that companies would use this? Is this something that would be used by the engineering teams themselves to create these reports, or is it something that would be done on a different level within the organization? It's a brand new service, so this is just a hypothesis right now. Yeah, of course. I would imagine, honestly, that both could do that. Depending on the kind of setup that our customers have, you could imagine engineering teams, which we also have at Amazon, that have a "you build it, you own it" approach to things. In that respect, of course, the things that they build they're responsible to assess, monitor and measure. However, there are sometimes platform teams or infrastructure teams, so maybe teams that are responsible for the data platform or the message streaming infrastructure that is provided as a service to other teams. So I guess for the core components that they would use, they would be responsible to assess, and the engineers using them would rely on them. But of course having this measurement and having this thought up front would help everyone to communicate better and align on the goals. If you provide something as a platform, and your consumers can look and see that your SLAs are maybe not matching what they expect, they can make better, more informed decisions than if it's just anecdotal evidence. So I could imagine both; I hope both. I guess it's never too early or too late to start talking about resilience, so I definitely hope that this is useful to more and more of our customers, especially as we add more services and more recommendations. Yeah, and having it as a basis for that discussion across team
boundaries, for instance, to be able to talk about what improvements we need to make. Because often the improvements we see, well, sometimes they have a cost associated with them, of course. Running services in a certain way might increase the cost, but instead you might then have a better chance of reaching that RTO or RPO goal. Sure, absolutely. And I like that you also see in Resilience Hub this kind of pragmatic advice, taking into consideration: are you really going for that extreme resilience use case, is it really a payment system or a health care system or whatever, or are you looking more for cost or for minimal change? Do you have a deadline that is really tight, and you're still trying to do something, but you also want to fit it in a backlog? So it's nice that you get all of these suggestions and you can pick and choose them or add them all to your backlog, but you already have some sort of priority predefined, and you're still able to change that. Yeah. So I see your experiment is still running. So what would happen after the experiment would be that we would just click here on "Reassess" and we would get another evaluation, and then we could look at the recommendations, and our score would also be recalculated. But I'm not sure that's something we're going to be able to do now. No, while it's still running I suppose we can't do that, but we're trusting you on that one. Yeah. Cool. So I've shared a bunch of links in the chat throughout the session, so take a look at those. Is there any I haven't shared? I haven't shared the API reference; I'll share that one as well so you have that. Any final things you want to add about AWS Resilience Hub, Florian? Yeah. Of course, as a call to action, take a look at the service, try to play with it, read the documentation, see what it
offers. And it's free for three months from the moment you start using it, so you can add some applications and really test them in a safe environment and see if it makes sense to you and your teams. And if you would like to see new features, or you think some things can be done better, definitely reach out to your AWS point of contact. We're happy to hear feedback and to take this in the right direction. Sure. And if you are interested in AWS Resilience Hub and you haven't checked it out yet, well, have a look at AWS Fault Injection Simulator as well. As you saw, Resilience Hub and AWS FIS are tightly integrated, so do check out both of those services. So we are approaching the top of the hour, and I want to thank you, Florian, for joining us today, and thanks to all the viewers for joining us as well in having a look at how AWS Resilience Hub is able to help us in preparing and protecting our applications from disruptions. I'm very interested to see how AWS Resilience Hub is used by our customers going forward, and hopefully, Florian, I'll have you on again in some time when we have more things to talk about in the resilience and chaos engineering space. Absolutely, absolutely. Thanks for the invitation. It was really a pleasure to give the demo and to have this resilience chat again. It's always a good time to talk about resilience, especially continuous resilience, so I'm looking forward to that. And if people want to reach you, Florian, you are available on LinkedIn, I believe? Yeah. So we can link with you there. And if you want to connect with me, you can see my Twitter handle on screen; I'm also on LinkedIn, of course. And with that, thank you all for watching. Thanks, Florian, once again. I just want to make a last-minute recommendation: next week is the big show of the year, AWS re:Invent. So if you're not in Vegas for re:Invent,
well, do sign up for re:Invent virtually and join us, and watch the keynotes, the leadership sessions and a bunch of sessions that will be available on demand as well. All right, thank you all, and have a good day ahead.
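
Editor's note: the resilience score is described in the demo as a weighted, normalized value between zero and one, based on how many of the recommended tests, alarms and standard operating procedures have been implemented, with more likely disruption types weighted higher. The sketch below only illustrates that idea. The weights, the disruption type names and the `Coverage` helper are assumptions made for this example; the formula Resilience Hub actually uses is in its documentation and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Mapping

# Hypothetical integer weights per disruption type (assumption for this
# sketch): frequent disruptions such as a bad software rollout weigh more
# than a rare regional outage.
DISRUPTION_WEIGHTS = {
    "software": 5,
    "availability_zone": 3,
    "region": 2,
}


@dataclass
class Coverage:
    """Implemented vs. recommended items (tests, alarms, SOPs) for one disruption type."""
    implemented: int
    recommended: int

    def ratio(self) -> float:
        # If nothing was recommended, treat the type as fully covered.
        if self.recommended == 0:
            return 1.0
        return self.implemented / self.recommended


def resilience_score(coverage: Mapping[str, Coverage]) -> float:
    """Return the weighted share of implemented recommendations, from 0.0 to 1.0."""
    total_weight = sum(DISRUPTION_WEIGHTS[name] for name in coverage)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        DISRUPTION_WEIGHTS[name] * cov.ratio() for name, cov in coverage.items()
    )
    return weighted / total_weight


# Nothing implemented yet, as at the start of the demo.
before = {
    "software": Coverage(0, 2),
    "availability_zone": Coverage(0, 2),
    "region": Coverage(0, 1),
}
# Some recommendations implemented; the numbers are invented for illustration.
after = {
    "software": Coverage(1, 2),
    "availability_zone": Coverage(2, 2),
    "region": Coverage(0, 1),
}
print(f"before: {resilience_score(before):.0%}")
print(f"after:  {resilience_score(after):.0%}")
```

With these made-up numbers the score moves from 0% to 55% after a reassessment, which mirrors the behaviour shown in the demo: implementing the recommended alarms and experiments raises the score, and items tied to more likely disruptions move it more.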

Info

Channel: Gunnar Grosch

Views: 66

Id: EC7yPhDf6cU

Length: 56min 45sec (3405 seconds)

Published: Mon Nov 22 2021