AWS Supports You - Driving Operational Excellence using AWS Well-Architected

Captions
[Music]

Padma: Hello everyone, I am Padma, an Enterprise Support Manager at AWS, based out of Austin, Texas. Welcome to AWS Supports You, where we share best practices and troubleshooting tips from AWS. Joining me today are John Steele and Manas from AWS. Can you give us a quick introduction? Let's start with you, Manas.

Manas: Thank you, Padma. Hello everyone, I am Manas, a Senior Technical Account Manager at AWS and a cloud operations specialist, helping our enterprise customers with operational excellence and cost optimization.

John: And I'm John. Thank you very much for having us here, Padma. I have been a Well-Architected SA for a while now; prior to that I started as a Technical Account Manager like Manas, working on operations, and I've been working with customers on Well-Architected. I've been at AWS for almost seven years.

Padma: Thank you, everyone. Today we'll be introducing you to how AWS helps customers drive operational excellence using the Well-Architected Framework. We will specifically focus on operational excellence design principles and best practices that have helped AWS customers and our internal AWS teams. Before we get into the details, a quick note to our attendees online: please use the chat window on the right-hand side of your screen to ask your questions and share your thoughts throughout the episode. We look forward to hearing from you. John, can you walk us through what we're going to be talking about today?

John: Absolutely, thank you very much for the intro, Padma. As we talk about driving operational excellence using Well-Architected, we really need to start by understanding what operational excellence is and what the AWS approach to it is. It begins by thinking about the way modern IT systems work in a new paradigm: there's not really a barrier anymore between the IT systems you use to provide infrastructure, the systems you use to manage your applications, and the operations used to support all of that. In a modern paradigm, we really want customers to adopt the Amazon philosophy of "you build it, you run it," often said by our CTO, Werner Vogels. What we really mean by that is that we want customers to think about operations as part of their designs, because when they design with operations in mind, they'll be much more effective at driving operational excellence in their environments and with their teams.

So we should start by talking about what we'd like you to take away from today. We definitely want to start with learning: we want everyone here to learn about operational excellence and the AWS approach to it, but we also hope that you'll be able to share what you learn with others and explain operational excellence both as a general industry term and as AWS approaches it. More than that, we are going to share a bunch of resources that can help you learn more about operational excellence, so today isn't the end point of your learning; it's really the starting point. We also want to make sure that you're able to define the design principles that we, as the Well-Architected team, have established for operational excellence, and to understand the importance, for you, your teams, and your workloads, of designing with ops in mind, because in the end we want you to understand how AWS can support you in doing that.

So what are we actually going to cover today? It starts with an introduction to Well-Architected, where I'll talk about what Well-Architected is and how you can use it. We're then going to dive a little deeper into operational excellence, both from the industry perspective and in terms of how AWS approaches it. After that, we'll look at how AWS Support can help you and your teams drive operational excellence in
your organization. Then we're going to talk about the design principles for the operational excellence pillar and give you an overview of some of its best practices, and, most importantly, as I said previously, you will get some resources that you can use to keep learning about operational excellence.

We should probably start with a little introduction to Well-Architected. If you're not familiar with it, let's talk about why we created it. AWS wanted customers to be able to answer a simple question when they looked at the workloads their teams were building: are you well-architected? Hopefully most of you would want to answer, "Well, yes, of course: we've got smart people, the best technology, and the best of intentions, so we're pretty confident we're following architectural best practices." The challenge we recognized with customers is that when we dig a little deeper and start asking how they design for security, or reliability, or performance efficiency, or cost optimization, that confidence sometimes gets a little shakier. So we designed the Well-Architected Framework to be a mechanism for your cloud journey. We want you to learn the AWS best practices, based both on what we learn from our customers and on what AWS service teams have learned, then measure your workloads and your teams against those best practices and identify opportunities for improvement, so you can continuously improve your workloads over time.

So what is the Well-Architected Framework actually made up of? It starts with pillars. Think of pillars much as you would the foundation of a building: if you build a very strong foundation, the building on top of it will be much stronger and more reliable. We think the same way about the pillars of Well-Architected; they are the foundations of good architecture. We also have design principles, the set of key topics and concepts we believe all customers should think about when designing and running workloads on AWS. Then, within the core Well-Architected Framework, there is a set of questions, 52 in total, which form the core of Well-Architected's key concepts, and for each question we also have best practices: the approaches that we, as AWS and together with our customers, have seen are most likely to make customers successful. The pillars of the Well-Architected Framework are operational excellence, security, reliability, performance efficiency, and cost optimization, and we believe that if customers think about these pillars, they're more likely to design architectures that run well: secure, reliable, high-performing, and optimized for cost.

Today we're really going to focus on operational excellence and dive into that pillar a little deeper. Before we do, there may be one last question on your mind: why would I even want to use the Well-Architected Framework? What does it get me? We designed it as a mechanism for customers' cloud journeys because we want them to be able to build and deploy faster. We know customers use AWS because they want to provide value to their own customers more quickly. To do that, you have to reduce your firefighting and focus on developing things that will actually benefit your customers; you have to first identify, and then lower or mitigate, your risks. In the end, it should leave you with the ability to make informed decisions about where you spend your resources, whether to achieve desired benefits or to reduce or remove identified risks. We really want customers to use the Well-Architected Framework to learn the AWS
best practices that we've gathered from the past couple of decades of running the cloud and, more importantly, from what customers have learned as well. We're going to pause here for a moment, actually: are there any questions from those joining us? Padma?

Padma: Thank you, John. We have one question here; thank you for your question. The question is: will this be a series where we cover all the pillars of Well-Architected? As part of AWS Supports You, I'll share a link here with a lot of past episodes, but that's great feedback about covering all of the pillars of the Well-Architected Framework, and we will take it into account. Anything you want to add to that, John?

John: Sure. That's great feedback; as a matter of fact, there have been some other series that focused on that, so please look for those, and we'll definitely take it into consideration. Again, today is much more about operational excellence and how it drives your ability to monitor and understand your workloads, but we would definitely consider diving into the other pillars in the future. Good feedback, thank you. Any other questions?

Padma: Back to you, John.

John: Great, all right, let's keep rolling then. You can see here the next topic we want to dive into. You've had an overview of Well-Architected and we've talked about the pillars, but since we're going to talk about operational excellence today, we first need to look at what operational excellence really means, and we're going to think about it from the industry-standard perspective. There are a lot of places you can look it up, but the key thing to understand about operational excellence is that it is a philosophy and a mentality. It's not a set of activities, and it's not some defined ITSM framework; it's the way you approach the work that you do. What you're trying to do is drive continuous improvement, and that's what operational excellence is really all about, meaning that it's not a destination. You don't just become operationally excellent; rather, it is a continuous journey of improvement. It also means that operational excellence is an outcome-focused mentality, not an input-focused one: we want to think about the kinds of outcomes we drive for our teams, our environments, our architectures, and our customers. It also requires more than just thinking about technology: operational excellence isn't just about your technology, it's about the people and the processes you use to support that technology. And in the end, operational excellence recognizes that operations teams exist to support the business. Too often there's a paradigm where operations teams are thought of as overhead: the break-fix org, a cost center where money gets spent. The reality is that operations exists to ensure that your business is able to provide value to its customers.

One last thing about operational excellence: because it covers such a broad set of concepts, there are many different ways to drive it; there is no single methodology that is the one way you must do it. So what are some of the industry methodologies out there that can support operational excellence? You might have heard of these; maybe your teams are using them. It doesn't matter whether it's ITIL, CMMI, DevOps, SRE, Six Sigma, Lean, or Agile: any of these frameworks, or even several of them combined, can be used to drive operational excellence in your organization. The thing is, most of these frameworks share a lot of common concepts, so they include things like
system-level thinking: it's not about components, but about the system as a whole. They're also about reducing defects, improving your flow of benefits into production, and identifying constraints in your environment. Many of these frameworks talk about creating and amplifying feedback loops; they think about continual experimentation and learning; and most of them focus on how to drive continuous improvement. If you sum up what you see across many of these frameworks, it's essentially that they are designed to create business value by delivering value to your customers. So that's the general industry perspective. What I want to do now is pass over to Manas. Manas, can you give us a little more background on how AWS thinks about operational excellence?

Manas: Thank you, John, and let me dive a little deeper into what operational excellence is. It is about how your organization supports your business objectives: your ability to run systems effectively, gain insight into your operations in order to deliver business value, and continuously improve your supporting processes and procedures. The core focus of operational excellence, from the AWS perspective, is ensuring that your workload, including the functionality of the technology, the team members who support it, and the procedures they use, is delivering business value. In other words, it is about running your workload effectively. If you have a workload that is secure, cost-optimized, reliable, and highly performant, that's fantastic, but if your teams cannot run it effectively, it is at risk of becoming an overhead to the business. The goal of the best practices in the operational excellence pillar is to help your organization and teams make informed decisions about how to apply resources to get desired benefits or reduce identified risks.

Let's talk about how AWS Support helps our customers drive operational excellence. There are three segments where AWS Support can help, and depending on your support level you get these entitlements; all the features mentioned on this slide are available to our Enterprise Support customers. First, there are the teams who support our customers, starting with Cloud Support Engineers, available 24/7 to dive deep into any issues you may face. Enterprise Support customers also have access to designated Technical Account Managers, or TAMs, senior technical advisors who proactively support your projects, including technical inquiries and escalations. Your TAMs can help you engage and dive deep into product features or other technical discussions with AWS service teams. Secondly, we have a series of processes and frameworks. AWS has a flywheel methodology we use when engaging to support our customers; flywheels are virtuous cycles that drive continuous improvement. These flywheels apply to engagements like the operations review, which focuses on organization-level operational excellence for Enterprise Support customers. Customers can also use the AWS Well-Architected Framework to do self-service, partner-led, or TAM-led Well-Architected reviews, where you can get help building secure, high-performing, resilient, and efficient cloud infrastructure for your applications and workloads. Finally, there is a series of tools our customers can use. For example, Trusted Advisor continually inspects your AWS environment and makes recommendations where opportunities exist to save money, improve system availability and performance, and close security gaps. Then there is AWS Health and its dashboards, which give you ongoing visibility into your resource performance: the AWS Personal Health Dashboard provides alerts and guidance for AWS events that might affect your environment, and you can use the AWS Health API and CloudWatch Events to automate many of your operations activities, such as automated remediation actions for specific events.
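As a small sketch of the kind of event-driven automation just described, a CloudWatch Events (now Amazon EventBridge) rule can match AWS Health events and hand them to a target. This is illustrative only, not part of the episode's demo: the topic name and the choice of SNS as the target are assumptions.

```yaml
# Hedged sketch: forward AWS Health events to an (assumed) operations SNS topic.
Resources:
  OpsNotificationsTopic:
    Type: AWS::SNS::Topic

  HealthEventsRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Match AWS Health events and notify the operations team
      EventPattern:
        source:
          - aws.health        # all AWS Health events; narrow by detail-type in practice
      Targets:
        - Arn: !Ref OpsNotificationsTopic
          Id: NotifyOperations
```

In practice the SNS topic would also need a resource policy allowing events.amazonaws.com to publish to it; that wiring is omitted here for brevity.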
Before we look at the design principles, let's talk about the traditional way you used to do things before the cloud. Most changes were made by human beings following documents or runbooks that were often out of date. It was easy to become very focused on technology metrics rather than business outcomes. Because making change was difficult and risky, we tended not to do it often, and therefore we tended to batch changes into large releases. We rarely simulated failures or events because we were too busy fighting fires from real failures, and we were so busy reacting to situations that it was hard to take the time to extract learnings. It was hard to keep information current when you were changing everything to fight fires, and every server was a snowflake.

Let's walk through the design principles specific to the operational excellence pillar that represent the new approach AWS has seen successful customers adopt. They are: perform operations as code; make frequent, small, reversible changes; refine operations procedures frequently; anticipate failure; and learn from all operational failures. We are going to dive into each of these a little deeper, including giving you an example of how AWS services can be used to build architectures that follow these design principles.

Let's look at the first design principle, operations as code. It is not just infrastructure as code; while it has the same benefits, the focus is much larger. It includes not just your infrastructure deployment but all operations activities, including change management, incident management, patch management, and more. It is not just automation, either: by codifying your operations activities into scripts you can automate them, but you could still trigger them manually. With operations as code, you can trigger your automated runbooks in response to events, and you can apply all the benefits of code, like version control, pipelines, code reviews, and automated testing, to your operations activities.
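As one concrete sketch of triggering an automated runbook in response to an event, the fragment below pairs an AWS Config managed rule with an automatic Systems Manager remediation runbook. This is a minimal illustration, not the demo's actual template: the choice of rule (unrestricted SSH) is an assumption, and a real setup would also supply an automation IAM role to the runbook.

```yaml
# Hedged sketch: detect security groups that allow unrestricted inbound SSH
# and remediate them automatically with an AWS-owned Systems Manager runbook.
Resources:
  RestrictedSshRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: restricted-ssh
      Source:
        Owner: AWS
        SourceIdentifier: INCOMING_SSH_DISABLED   # AWS managed rule

  RestrictedSshRemediation:
    Type: AWS::Config::RemediationConfiguration
    Properties:
      ConfigRuleName: !Ref RestrictedSshRule
      TargetType: SSM_DOCUMENT
      TargetId: AWS-DisablePublicAccessForSecurityGroup  # AWS-owned runbook
      Automatic: true
      MaximumAutomaticAttempts: 3
      RetryAttemptSeconds: 60
      Parameters:
        GroupId:
          ResourceValue:
            Value: RESOURCE_ID   # Config substitutes the non-compliant resource ID
```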
Now, with this methodology in the cloud, you can apply the same engineering discipline that you use for your application code to the entire environment: you can define your entire workload, your application, your infrastructure, and your operations runbooks as code, and update them with code. The implication of performing operations as code is that you can script your operations procedures and automate their execution by triggering them in response to events, which ensures that the results are tested and consistent. If you do not perform operations as code, there is a higher risk of human error and inconsistent responses to events, and actions may take longer to perform.

Let's look at an example of how you can perform operations as code on AWS. In this example, you can see how customers can use operations as code to maintain compliance. You can use AWS CloudFormation to deploy your initial workload, which gives you infrastructure as code, and extend it to operations as code using AWS Systems Manager to automate compliance monitoring. You can use Systems Manager OpsCenter to visualize your OpsItems for visibility into your operations, and with AWS Config you can set up the rules needed to monitor your compliance requirements and execute automated runbooks through Systems Manager for automated remediation. Customers can use Amazon Simple Notification Service to invoke an AWS Lambda function that notifies your ChatOps channel when any of your resources are non-compliant and runs the necessary Systems Manager documents for further remediation.

Let's look at the second design principle: making frequent, small, and reversible changes. When possible, you should prefer frequent, small, and reversible changes. You might ask how frequently we should deploy changes to customers: it should be often enough to support successfully delivering business value to your customers, and granular enough to be able to efficiently identify the sources of a
failed change and take action before business value is negatively impacted. So what is a reversible change? It is one that can be rolled back as soon as possible, to prevent or reduce negative impact to business value. You should design your workload to allow components to be updated regularly and to increase the flow of beneficial changes into production. Make changes in small increments that can be reversed if they fail, to aid in the identification and resolution of issues introduced during deployment; this supports your ability to reduce defects and improves the flow of value to customers. Making large changes that impact multiple parts of your workload at the same time creates a large blast radius for issues to affect your business. Some of this you can cover by process, ensuring teams working on different parts of your workload don't all make changes at the same time, because if something goes wrong, troubleshooting and rollback become much more complex. Designs without this principle lead to a higher risk of lost business value and extended downtime as a result of changes that have an undesired outcome, and limits on the rate of change can increase the time it takes to deliver value to your customers.

Let's look at another example of how you can implement frequent, small, reversible changes on AWS. In this example, we are again using AWS CloudFormation to deploy the initial infrastructure, and you can see the traffic is routed to a blue fleet of homogeneous instances through a load balancer. To implement frequent, small, and reversible changes, you need tooling that automates your path to production with your preferred deployment configuration; in this case it is a blue/green deployment. You start with a code repository, here AWS CodeCommit, where you check in the code for your workload, infrastructure, and networking configs, including automated operations runbooks and playbooks. Then you design a continuous-integration system to trigger your builds and tests, and extend it into a CI/CD pipeline to build, test, and deploy the changes to your environments. The pipeline here uses AWS CodeBuild for builds and AWS CodeDeploy for a blue/green deployment of a sample web application running on Amazon EC2 instances. It uses operations as code as well: to create a new golden AMI, it uses AWS Systems Manager Automation and AWS Lambda; it patches a supplied base AMI and produces a golden AMI to deploy into the new fleet of instances as part of the blue/green deployment. An AWS Lambda function updates the Systems Manager parameters with the golden AMI ID, the CodeDeploy deployment group, and your Auto Scaling configuration. This shows how you can implement multiple operations procedures to create a golden AMI, get the required approvals, and update your green fleet of instances with the new golden AMI as well as the new application build. At this point, Padma, do we have any questions on the design principles we have covered so far?

Padma: Thank you, Manas. Yes, we have a couple of questions here. The first is from Tony, thank you for your question. The question is: are there any AWS customers out there who are using Well-Architected reviews?

John: Would you like me to take that one, Manas?

Manas: Yes, please.

John: Great. So yes, as part of the Well-Architected team, we see a lot of customers out there. First, let me talk generally about the common use cases we see. One is simply to learn AWS best practices, as we discussed. Another is to build a technology portfolio, which means customers take all of their workloads, do reviews of those workloads, and see them in a single pane of glass. I apologize, I'm probably not going to get the name right, but there is a very large bank in Brazil that has gone beyond just the technology portfolio: they are actually using it for governance, and
all workloads that are going to be moved to production must go through a full Well-Architected Framework review, in both the design and pre-production phases, to ensure they meet the governance standards. So we do see a lot of customers using it to learn, to build technology portfolios, and to build governance. Are there other questions, Padma?

Padma: Yes, there are. For this question I will post a link in the chat window where you can look up many more customer use cases benefiting from the Well-Architected Framework. I have one more question here; thank you for your question. The question is: do CodeBuild, CodePipeline, and CodeCommit form part of the operations-as-code framework?

Manas: Yes, I can take that, Padma. Thank you for the question, and you are absolutely right: they help you form the operations-as-code paradigm, bringing everything together as part of the pipeline, including your infrastructure and networking changes and the operations activities you need to perform. So the answer is yes. Thank you.

Padma: Everyone, please continue posting your questions here for John and Manas. All right, back to you.

[Music]

Manas: At this point, let's do a demonstration in the AWS console to show you how this actually looks. Give me one minute and we'll get started. We will start with AWS CloudFormation to create the base infrastructure with the blue fleet of instances. You can see we have a few CloudFormation templates; if we open one of them, you can see it creates a series of resources, and on the Outputs tab you can see it has created an Auto Scaling group, your CodeCommit repositories, your CodeDeploy deployment group, your Elastic Load Balancer, and SSM documents. If you go to the website build that is currently deployed, you can see it is pointing to the blue fleet of instances, including these two EC2 instances, and here you can see the two EC2 instances running.

Now, if you go to CodePipeline, you will see the entire pipeline that has been built, starting with the source code in AWS CodeCommit. Then I have a manual approval, for your peer review and a final check before deploying to production; then it builds a golden AMI from a base AMI; then it does the application build; and finally the deployment using AWS CodeDeploy. We have configured this pipeline to be triggered automatically when you make a code change, so we'll make a quick sample code change here: we edit index.jsp to change the background color to green, to show that the new build is going to the green fleet of instances using the blue/green deployment configuration. We do a quick git add and git commit, and we have configured CloudWatch Events so that CodeCommit will trigger the pipeline once we do the git push. It should be triggered in a few seconds; yes, it's triggered, and it has completed updating the source code in the CodeCommit repository. It is now waiting for the next step, the manual approval. In this manual approval you can review the change right here and either reject, to stop the pipeline, or approve, to move forward. Now we transition to the next stage of the pipeline, which builds the golden AMI, and you can see how it runs the necessary Systems Manager automation.

Let's go to Systems Manager Automation. As you know, Systems Manager is the operations hub in AWS, so for operations as code you will use many of its features to gain operational insight and take action. It has application management features; change management features, including maintenance windows and configurations; and a series of node management features, including compliance, inventory, session management, and automation. In this case we are using Automation, so let's get to it. You can see there is an automation document already in progress; if you go to the details,
you'll see the detailed status and all the steps it is going through, including launching a temporary instance, creating the image, and, once that is done, terminating the temporary instance. You can also look at the CloudWatch logs for even more detail on the commands being run; so let's go to CloudWatch, open this log stream, and load some of the logs, and you will see the details of the commands and steps being executed.

While that is running, let's go and check the deployment configuration we have set up. This is the application, and the application is targeted at a deployment group. The deployment group holds all the settings for your deployment type; in this case we are using blue/green. You can see the application name, your permissions, and your deployment configuration, one-at-a-time in this particular case. If you go to Edit, you'll see the details of this configuration, including the two types: in-place and blue/green. In-place updates the existing instances; blue/green replaces the existing instances and creates a new Auto Scaling group, and in this case we are copying the existing Auto Scaling group to create the new one. You can reroute the traffic immediately, or choose a later time to route it. Here are the deployment configurations: you can pick from the available ones or define a custom configuration, and you put your load-balancer details there.

Now, if we go back to CodePipeline, we can see it has just completed creating the golden AMI and transitioned to the next step, the application build. Let's open the details of the build log: CodeBuild has its own logs showing the build phases it goes through, as configured in your buildspec. Once the build stage has completed, it will go to the final step in this pipeline.
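The deployment-group settings being shown in the console can also be expressed as code, in keeping with the operations-as-code theme. The fragment below is an illustrative CloudFormation sketch, not the demo's actual template: SampleApp, CodeDeployRole, BlueFleetAsg, and AppTargetGroup are assumed to be defined elsewhere in the same template.

```yaml
# Hedged sketch of a blue/green deployment group, one instance at a time.
Resources:
  BlueGreenDeploymentGroup:
    Type: AWS::CodeDeploy::DeploymentGroup
    Properties:
      ApplicationName: !Ref SampleApp            # assumed CodeDeploy application
      ServiceRoleArn: !GetAtt CodeDeployRole.Arn # assumed service role
      DeploymentConfigName: CodeDeployDefault.OneAtATime
      DeploymentStyle:
        DeploymentType: BLUE_GREEN
        DeploymentOption: WITH_TRAFFIC_CONTROL
      BlueGreenDeploymentConfiguration:
        GreenFleetProvisioningOption:
          Action: COPY_AUTO_SCALING_GROUP        # clone the blue fleet's ASG
        DeploymentReadyOption:
          ActionOnTimeout: CONTINUE_DEPLOYMENT   # reroute traffic immediately
        TerminateBlueInstancesOnDeploymentSuccess:
          Action: TERMINATE
          TerminationWaitTimeInMinutes: 5
      AutoScalingGroups:
        - !Ref BlueFleetAsg                      # assumed blue fleet
      LoadBalancerInfo:
        TargetGroupInfoList:
          - Name: !GetAtt AppTargetGroup.TargetGroupName
```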
The deployment step has already kicked off, so let's go to it to show you the details of the blue/green deployment it is doing. This is where you see the deployment status and all the steps it goes through. These blue instances are your existing ones, and the new ones will be created; once it installs your application on the replacement instances, it will start rerouting traffic to them. You can see two replacement instances have already been created, as per the Auto Scaling group configuration, and you can see the detailed events it works through while creating the instances and installing onto them. This is where you can see the whole sequence of steps to get your new fleet of instances ready for customers, including validation steps. At this point it has completed installing on both instances and is rerouting traffic. It was still pointing to the blue set of instances, and now it has routed the traffic to at least one of the replacement instances, as set in the deployment configuration. If you reload the application, you should see it is already going to the green fleet of instances, to the new release you have just deployed. This shows that any change you commit to your master branch gives you the capability to do frequent, small deployments, and, using blue/green deployment, you can reverse and roll back as and when needed. It looks like it has completed routing the traffic, and once it terminates your original instances, the deployment is complete, and that completes this pipeline. In this way, you can run your operations code and your application code through the same pipeline and achieve operations as code as well as frequent, small, and reversible changes to production.
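The push-to-deploy wiring in the demo (a CloudWatch Events rule starting the pipeline on a CodeCommit push) could look roughly like the fragment below. Names such as AppRepo, ReleasePipeline, and EventsRole are assumptions for illustration, not the demo's actual identifiers.

```yaml
# Hedged sketch: start the release pipeline whenever the master branch is updated.
Resources:
  PipelineTriggerRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Start the release pipeline on every push to master
      EventPattern:
        source:
          - aws.codecommit
        detail-type:
          - CodeCommit Repository State Change
        resources:
          - !GetAtt AppRepo.Arn                  # assumed CodeCommit repository
        detail:
          event:
            - referenceCreated
            - referenceUpdated
          referenceType:
            - branch
          referenceName:
            - master
      Targets:
        - Arn: !Sub arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:${ReleasePipeline}
          Id: StartReleasePipeline
          RoleArn: !GetAtt EventsRole.Arn        # assumed role with StartPipelineExecution
```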
So, John, could you please explain the rest of the design principles and how they relate to what we have covered so far?

Sure, I think that's a great idea. In fact, I'm going to take a quick pause first; thank you for the demo. If I could, Padma, are there any questions in the chat that we can answer at this point?

No questions in the chat window. Everyone, please feel free to post questions here for John and Manas to answer. Back to you, John.

Great, thank you very much. I believe Padma has posted a link if you're interested in seeing more of how the architecture Manas walked through works; there are a couple of blog posts that should be really helpful, and you can try this in your own environment if you'd like to test it out. So we're going to keep rolling. Thank you, Padma.

All right, as Manas asked, I'm going to continue with the rest of the design principles. We've covered two of the five, so let's get into the next ones.

Our next design principle is that you should refine operations procedures frequently. What we mean is that you need to start by reviewing your operations procedures, and this relates back to the previous design principle, performing operations as code: if you've codified your operations procedures, you can review them along with the other parts of your code, in your application and your infrastructure, with the appropriate level of discipline, applying all the benefits of a code-review methodology. Beyond just reviewing them, to know they're accurate you have to test them, so you should also test your operations procedures regularly to ensure they're accurate and, more importantly, that they have the desired outcomes: not just that they work as expected, but that the outcome is also what you want. And then you really do need to make sure you update from learnings. While keeping operations procedures updated is important, determining how and where to update them is just as important. You should update your procedures as you run regular operational activities; reviews of operations events, and of any operations activity, should feed into your learnings and be used to perform updates. You also need to consider what happens when something goes wrong, and feed the information you get from testing your procedures back in.

The implication is that if you treat operations procedures as code, you can look for opportunities to improve them more easily. As you evolve your workload, you should evolve your procedures along with it. One way to do this is by setting up regular game days. A game day is essentially an opportunity to test your production people, processes, and technologies in a pre-production environment that looks like production. By testing in a game day, using ops as code and infrastructure as code, you can determine whether things work as expected and have the desired outcomes, and identify opportunities to improve your procedures. The challenge is that if you do not update your operations procedures frequently, they end up out of date, and when you need them there is a higher chance they won't work as expected, won't be available, or won't be understood by operators; because of that you're more prone to manual effort, human error, and longer response times when you're trying to understand what is happening during an event.

So let's take a quick look at a way you could do this, going back to the architecture Manas originally presented. If we use operations as code and infrastructure as code, let's imagine we have that production environment, with CloudFormation and Systems Manager set up for infrastructure as code and operations as code, we have our CI/CD environment, we have our EC2 stack, and it's all built out. Because everything in our environment is codified, we can deploy a complete copy of our production environment as pre-production, with all the same resources and configurations, and then run a game day event in that pre-production environment. It is safe, because we won't be impacting customers, to test our procedures and ensure they work as expected. This allows us to identify anything that is missing that we don't know about and haven't documented, as well as things that are out of date, and, most importantly, areas where we can make improvements.

That's our third design principle; let's jump into our fourth, which is closely related: anticipate failure. You've probably heard another famous quote from our CTO, Werner Vogels: we need to design as if everything fails all the time. What he meant is not that you should expect your environment to be unreliable, but that you should embrace failure as a natural part of how systems work. It is important to understand that everything could fail at any time and to embrace that in your designs. But anticipating failure is about more than architecting for failure. Architecture is very important: you architect for failure by understanding where failures could occur. But if you talk to the people who run reliability and operations at AWS, you'll quickly learn that if you want high availability, the architecture only gets you the first part of a highly available environment; the rest comes from your operations activities. You can't just architect, you must also operate for failure. In other words, you need to plan for failure by practicing it. We talked about that idea of game days.
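Returning to the pre-production copy described above: because the whole environment is codified, cloning production for a game day can be as small as reusing the production stack's template and parameters under a new name. A sketch of building such a request; the input shape, stack names, and tag values are assumptions for illustration, though the request keys match CloudFormation's CreateStack parameters.

```python
def preprod_stack_request(prod_stack: dict, suffix: str = "-gameday") -> dict:
    """Clone a production stack definition into a pre-production copy.

    prod_stack is assumed to hold the template body and parameters you
    would fetch with CloudFormation's GetTemplate / DescribeStacks calls.
    """
    return {
        "StackName": prod_stack["StackName"] + suffix,   # distinct stack name
        "TemplateBody": prod_stack["TemplateBody"],      # same resources
        "Parameters": prod_stack.get("Parameters", []),  # same configuration
        # Tag the copy so operators (and cost reports) can tell it apart.
        "Tags": [{"Key": "environment", "Value": "pre-production"}],
    }
```

The resulting dict could be passed to boto3's `cloudformation.create_stack(**request)`; tagging the clone also makes it easy to tear down everything after the game day.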
When it comes to game days: if I'm going to run a game day and test my environment anyway, maybe I can run a game day where I inject failures intentionally to see how things work. We'll call that a completely safe version of chaos engineering. But how do I know what my failures are? You can plan for failure by doing things like pre-mortems, an exercise where you sit down and pretend your environment has already failed, think about why it would have failed, and use that as an opportunity to improve; you can use that to make your game days better. The implication is that if you test your environment and your procedures, you'll know whether they are up to date and have the expected outcomes. If you do not test them, then when failures occur there is a much higher likelihood of higher severity and higher impact, as well as longer recovery time, because you're trying to figure out your procedures on the fly under the duress of an operational event.

So what does it look like to anticipate failure? Here's an example of how that pre-mortem exercise could work. It's a tabletop exercise where you sit down and look at the environment. You probably all know that one person who's always complaining that everything is going to break and nothing is right; here you can turn that attitude into a hero, because with the architecture in front of me I can try to find all the potential points of failure. For example: automation is great, but automation can go awry and break our stuff at scale really quickly. What happens if my ALB swap doesn't work correctly, there's a swap failure, and my traffic doesn't transition? What if my security groups aren't set up correctly, what will the result be? What if I deploy code and it tests fine, but as soon as it's under load everything goes wrong and my green stack goes bad with users on it, how do I recover from that? Or what happens if I have a major networking connectivity issue, I can't reach AWS, I don't have access to the control plane for the services I need, and I can't pass commands, how am I going to deal with that? By doing this kind of tabletop exercise you can identify sources of failure and run more effective game days where you practice those kinds of failures. You can also take these sources of failure and, to make sure they never reach production, remove them by adding them to things like your operational readiness review checklist.

Let's briefly jump into our last design principle: learn from all operational failures. By the way, this is closely connected to the rest of the design principles; together they tell one story about operational excellence. Learning from operational failure means, first, that you need to perform thorough post-incident analysis. You've probably heard the term root cause analysis, and that is important, but thorough post-incident analysis is not just understanding what went wrong, and it is not about blame or shame; it is about understanding the sources of failure across people, process, and technology, and, most importantly, determining how you are actually going to prevent those failures from recurring. You need to learn from your failures to drive improvement, so you don't repeat the same issues. There is more to it than that: learning from operational failure also means sharing those learnings across teams. There is nothing like having a failure happen and then having the team next to you say, oh yes, that happened to us too, why didn't you tell us? Sharing across teams reduces the overhead for those teams to design and create their own responses to failure, and ensures they don't repeat failures other teams in the organization have already had. Learning prevents recurrence, but it also reduces the extra manual effort and the duplicated time and energy spent developing the strategies you use to mitigate or remove operational issues in your architectures. So those are the overall implications of learning from operational failures: do post-incident analysis, learn to drive improvement, and share those learnings across teams.

Now, what would that look like? At AWS we have a post-incident analysis practice that we run on a regular cadence with our teams, called ops metrics. Let's imagine this scenario for a minute. We have the same environment we talked about before, and everything is great: we identified our sources of failure, we created mitigation plans, and we removed those failure sources using operational readiness reviews. But then something we hadn't thought about happens. We had it set up with two stacks of 40 instances each, 80 instances total, no problem. Our stacks are growing over time because we're getting more user traffic, so the next time we deploy we have 60 instances in each stack and need 120 instances. What we forgot to do was look at our service limits: it turns out our quota in this region for the instance type we're using was 100, so when we try to spin up 120 instances and begin scaling our target group, it fails, because we don't have enough instances available within our service limits. We could just go in, fix it, and never really address the problem, but the better thing to do, to learn from that operational failure, is to understand during our operational metrics review why it happened and put procedures in place to keep it from happening again.
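The arithmetic behind that failure is worth encoding as a pre-deployment check: a blue/green deployment briefly runs both fleets side by side, so peak instance demand is double the steady-state fleet size. A minimal sketch; in a real pipeline the quota value would come from the Service Quotas API rather than being passed in, and the numbers below are the ones from the scenario above.

```python
def bluegreen_fits_quota(instances_per_stack: int, regional_quota: int) -> bool:
    """During a blue/green deployment the blue and green fleets run
    side by side, so peak instance demand is twice the fleet size."""
    return 2 * instances_per_stack <= regional_quota

# The scenario from the talk, against a regional quota of 100 instances:
assert bluegreen_fits_quota(40, 100)       # 40 + 40 = 80 instances: fits
assert not bluegreen_fits_quota(60, 100)   # 60 + 60 = 120: deployment fails
```

Running a check like this as an early pipeline stage turns the quota breach from a mid-deployment failure into a fast, explicit error.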
Ensure that we are regularly reviewing our limits, and employ technology to automate increasing those limits before we need them.

I mentioned ops metrics as a way to share learnings. At AWS we run a practice called operational metrics, and I believe Padma can post some links about how we approach it. Operational metrics is a post-incident analysis, including identification of how we will correct errors, that is shared across every single AWS service team. It includes our line-of-business owners, it is cross-functional, and there is agreement across teams on courses of action. Going beyond that, we extend learning from operational failures even to failures that never impacted production; as a service user you may never have seen that there was any impact. As part of ops metrics we also have a regular cadence: every single week, every one of our hundreds of service teams has to prepare its operational metrics and be ready to present them to all of AWS for review. Again, no shame, no blame, no one telling anyone what they did wrong; it is so we can all share learnings. At this point you might be saying: wait, how is that possible, you have hundreds of service teams and once a week every one of them has to be ready to present? Here's how we do it: we created a tool we call the AWS Ops Wheel. If you go to GitHub you can get the Ops Wheel for yourself and deploy it in your environment; there is a CloudFormation template there that deploys all the resources for it. Essentially, all of our services are listed in this digital wheel; in fact it even used to be a physical wheel that was spun at ops metrics meetings. Once a week, each team must prepare its metrics, because if its team name comes up on the wheel, it is selected.
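The selection itself is just a uniform random draw over the registered teams. A toy version of the wheel; the team names in the usage are made up, and the real Ops Wheel on GitHub is a full web application rather than a one-liner.

```python
import random

def spin_ops_wheel(teams, rng=None):
    """Pick which team presents its operational metrics this week.

    rng can be a seeded random.Random for reproducible draws; by default
    the module-level random source is used.
    """
    if not teams:
        raise ValueError("no teams registered on the wheel")
    return (rng or random).choice(teams)
```

Because every team has an equal chance of coming up, every team has to keep its metrics presentation-ready every week, which is the point of the exercise.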
The selected team then gives a presentation on its operational metrics so that everyone can learn from them. If you're interested in running something like that yourself as part of your own ops metrics practice, the tool is publicly available.

That covers all the design principles, hopefully with some really good examples, so I'm going to pass it back to Padma. Padma, do we have more questions we can cover?

We have one question in the window. Thank you for your question, VMG. The question is: if an architecture only covers the infrastructure side of things, or just the DevOps side of the architecture, how could we use that design to complement the operations-as-code principle?

That's a good question; I can take it. When you are architecting your workload, you should not think only about the application architecture, DevOps, or any one particular area; you should design with operations in mind, looking at the whole picture, including how your operations will be done. John, do you want to add anything?

No, that's very good. Again, if you've designed an architecture and you're looking at your technology, think about the practices you applied to design that architecture, to pick the services, and to codify them, and apply that same mentality to all of your operations procedures: consider codifying them as Run Command documents or Automation documents in Systems Manager, to be triggered or manually executed.

Great. Any more questions, Padma?

That's all the questions we have today. Everyone, thank you for joining us. To sum up, today we looked at driving operational excellence using the Well-Architected Framework. If there were any questions that were not answered today, please post them on forums.aws.amazon.com, and email us any feedback: AWS Supports You, at amazon.com. We look forward to hearing from you, and tell us what else you would like to see on this show. Thank you for joining us today at AWS Supports You, and happy cloud computing.

Thank you very much, and again, when it comes to resources, please check everything Padma has posted. We're really happy you could join us today; it was great to have you here.
Info
Channel: Amazon Web Services
Views: 167
Rating: 5 out of 5
Keywords: AWS, Amazon Web Services, Cloud, AWS Cloud, Cloud Computing, Amazon AWS
Id: -rRF8V3zk6A
Length: 50min 49sec (3049 seconds)
Published: Tue Sep 21 2021