AWS Disaster Recovery | RTO | RPO | Business Continuity Plan | Simplified and Visualized

Captions
Step number one is to understand what a disaster is. Suppose you hosted your application in a Region and that Region was hit by a natural calamity like a flood or an earthquake; or a data center lost power, or lost network connectivity due to cable damage; or, the most common disaster of all, human action: someone deploying a bad configuration, or damage from unauthorized access that deletes your data or even erases all your customer information.

That brings us to step number two: high availability is not disaster recovery. Why? You might think the two are the same, because they start from the same operating principles: both monitor for failures, both deploy resources at multiple locations, and both use automated failover, so that if one data store or resource fails, a backup takes over. But think about what each is for. With high availability, you focus on the components and resources of the workload so you can serve customer demand and operate continuously without failing, and you work hard to meet your service level agreement for availability. With disaster recovery, you focus on the time it takes to recover from a disaster, and on ensuring the resources you have provisioned meet your availability objectives. The focus is rightly on deploying discrete systems to multiple locations, and one key objective is multi-site active-active workload distribution. Multi-site active-active is a disaster recovery strategy in which your workload runs so that it can serve requests from two or more distinct data centers or Regions, which keeps it available despite disaster events such as natural disasters, technical failures, or human action. Another key piece is point-in-time backup, which lets you and your customers recover backup data from a specified time within the retention period you have set.

Let's suppose your application is hosted in a Region across multiple Availability Zones with a proper scaling mechanism, and your users are happy. If one of those Availability Zones (that is, data centers) is affected and stops responding, the load balancer is smart enough to route traffic to instances in the other Availability Zones, and your users can still access the application. But what if the whole Region is affected? In simple words, your high availability is now zero availability, isn't it? That is where disaster recovery comes in: it tells us to use a multi-site deployment strategy, deploying to other Regions as well, so that if your application is not responding in one Region, it can respond from another. (And one of the best ways to avoid disaster is subscribing to this channel; that's my disaster recovery mechanism, and it's free.)

Having said that, let's move on to step number three: are you resilient enough? When you think of trust in terms of your product, remember the word resiliency. In its actual sense, resiliency refers to the capacity to recover quickly from difficulties, and it's the same for your infrastructure: resiliency is the ability of a workload to recover from infrastructure or service disruptions, and to dynamically acquire computing resources to meet demand and mitigate disruptions such as misconfigurations or transient network issues. Always remember: things can go wrong at any point in time.
What matters most is how fast you can recover, and that is where a resiliency strategy comes into the picture. Disaster recovery and availability are both very important parts of a resiliency strategy. Disaster recovery focuses on how the workload responds to a disaster and how well it can recover from it; availability focuses on uptime versus downtime of your resources over a period of time, expressed as a mean value.

For disaster recovery, the response depends on the business objectives: how much loss of data you can avoid, known as the Recovery Point Objective (RPO), and how far you can reduce the downtime during which your workload or resources are not available for use, known as the Recovery Time Objective (RTO).

For availability, since it is a mean value over a period of time, we focus on Mean Time Between Failures (MTBF) and Mean Time To Recover (MTTR). To calculate availability, you first need to know how to calculate MTBF and MTTR.

For MTBF, you take the total working time of your application minus the total breakdown time, and divide by the number of breakdowns. Suppose the total working time of your hosted application was 100 hours, out of which there were 5 hours of breakdown time (periods when your users could not access the application), spread across 10 breakdowns in total. Then MTBF = (100 - 5) / 10 = 9.5 hours. Similarly, for MTTR you take the total time spent bringing your services or application back up and divide it by the number of repairs. For example, if you spent around 10 hours in maintenance across 5 repairs, MTTR = 10 / 5 = 2 hours.

Now that you are aware of these terms, let's calculate availability. As the word tells you, availability is the percentage of time that your services or application are available to your users. But remember: it doesn't always take a disaster to impact availability. For example, say you designed the application to support a maximum load of 100 users, and then around 10,000 user requests arrive at the same time. There is no calamity, earthquake, or flood, yet your application is not available to all users, so you have to take steps to improve that, isn't it? (If you have your own definition of availability, please put it in the comment section below.)

Coming back to the calculation: availability is the amount of time your application is available for use divided by the total time it has been hosted. In terms of the metrics above, that translates to availability = MTBF / (MTBF + MTTR). You might also have heard availability described in terms of nines, like three nines or six nines, where 99.9% is three nines; three nines is one of the most commonly targeted availability levels.
Let's suppose MTBF is around 400 hours and MTTR is 10 hours. Putting that into the formula above, availability = 400 / (400 + 10) = 400 / 410, which is about 0.975, or 97.5%. That is not that great, but this is just an example to show how the calculation actually works.

Availability can also be measured by responses rather than time. Here, availability is the number of successful responses divided by the number of valid requests. For example, with 500 successful responses out of 510 valid requests, availability = 500 / 510, which is about 0.98, or roughly 98%. 98% is good, but again, this is just an example.

That brings us to step number four: resiliency of the cloud. AWS tells us that resiliency is a shared responsibility between AWS and you, the customer, so you must understand how disaster recovery and availability work within the shared responsibility model. (We already discussed the shared responsibility model in the security pillar, so I hope you can relate.) The first thing to understand is the phrase "resiliency of the cloud": the hardware, software, networking, and facilities that run AWS Cloud services, and their resiliency, are the responsibility of AWS. So if you are not able to launch an EC2 instance due to a Region failure, it's AWS's responsibility to bring that Region back up.
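The formulas above can be checked with a few lines of Python; the function names here are my own shorthand, not AWS terminology:

```python
def mtbf(total_hours: float, breakdown_hours: float, breakdowns: int) -> float:
    """Mean Time Between Failures = (working time - breakdown time) / breakdowns."""
    return (total_hours - breakdown_hours) / breakdowns

def mttr(repair_hours: float, repairs: int) -> float:
    """Mean Time To Recover = total repair time / number of repairs."""
    return repair_hours / repairs

def availability_from_time(mtbf_h: float, mttr_h: float) -> float:
    """Time-based availability = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

def availability_from_requests(successful: int, valid: int) -> float:
    """Request-based availability = successful responses / valid requests."""
    return successful / valid

# Worked examples from the video:
print(mtbf(100, 5, 10))                               # 9.5 hours
print(mttr(10, 5))                                    # 2.0 hours
print(round(availability_from_time(400, 10), 4))      # 0.9756, i.e. ~97.5%
print(round(availability_from_requests(500, 510), 4)) # 0.9804, i.e. ~98%
```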
But if you lost the data in your data stores and you didn't plan for disaster recovery, is that also AWS's responsibility? No, you're wrong: safeguarding your data and having disaster recovery in place is your responsibility. Your data is your responsibility, and that takes us to step number five: resiliency in the cloud. One statement sums up this step: your responsibility is determined by the AWS Cloud services that you select. For example, AWS gives you a service called Elastic Compute Cloud (EC2), but what you host there and the data you store are your responsibility. If you don't host your application across multiple Availability Zones for high availability, it's not AWS's responsibility that your application went down. If you use ephemeral stores and lose customer data, that is not AWS's responsibility either. The configuration you have is your responsibility; encrypting your data is your responsibility; hosting your application in multiple Regions and having a failover is your responsibility. I hope you got the point here; if you did, let's move on.

Step number six is the Business Continuity Plan, or BCP. AWS tells us that your disaster recovery strategy should be based on business requirements, priorities, and context. So what is a BCP? A business continuity plan is the process involved in creating a system of prevention and recovery from potential threats to the company. You might be thinking that having disaster recovery in place means you'll have no issues, but that's not always the case. Here's a scenario where a BCP or disaster recovery strategy can have very little impact: suppose you are selling a product online and you have a multi-AZ, multi-Region setup for disaster recovery, but the product was launched with a particular flavor that caused a lot of allergies after five days of use, and you had already sold 50,000 units that your business outcome depends on. In that situation, no disaster recovery strategy can save you.

You might be thinking, "I'm just launching an EC2 instance; how will this impact me?" If you're working on a project at your office or startup, for you it might just be an application where you implement features and write code, but for the people who decide to allot budget to your project, it's a product they expect to make money from. It's a business for them; they're not doing it for charity. When you design an application or infrastructure on the cloud, cost and budget are business constraints, approved by the people at the highest levels, and that is why your disaster recovery strategy should be based on business requirements, priorities, and context.

When creating a disaster recovery strategy, we must plan for two recovery objectives: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). Imagine a timeline with the point of disaster marked on it, and listen to this very carefully. The Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point. Between the last recovery point (from which you can recover your data) and the disaster lies the amount of data you can afford to lose, which is why that period on the diagram is labeled "data loss". As an organization, we define that acceptable loss.
We define what is considered an acceptable loss of data between the last recovery point and the interruption of service; that is why it is called the Recovery Point Objective. The other objective is the Recovery Time Objective: the RTO is the maximum acceptable delay between the interruption of a service and the restoration of that service, from the time it went bad until it is back in an acceptable state. So if your organization defines one hour as the RTO, then within that hour you should be able to restore your service from the interruption; that's the acceptable amount of time you have. The time frame on the diagram is the acceptable window for service interruption, which is why it is described as "how quickly you must recover, and what is the cost of the downtime". Now that we have the introduction in place, let's talk about these in detail in steps 7 and 8.
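The RPO definition above can be made concrete with a small sketch: if backups run on a fixed interval, the worst-case data loss is one full interval, so the schedule meets the RPO only if that interval fits inside it. The helper names are mine, invented for illustration:

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """If disaster strikes just before the next backup runs, you lose
    everything written since the last one, i.e. up to one full interval."""
    return backup_interval

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A backup schedule satisfies the RPO if the worst-case loss fits inside it."""
    return worst_case_data_loss(backup_interval) <= rpo

# Hourly backups against a 4-hour RPO are fine; daily backups are not.
print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4)))  # True
print(meets_rpo(timedelta(days=1), rpo=timedelta(hours=4)))   # False
```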
Step number 7 is the Recovery Time Objective, or RTO. What you see here is a graph of cost and complexity versus the length of service interruption. As we know, RTO is the maximum acceptable delay between the interruption of service and the restoration of service, but in giving the user the best possible resolution, the cost to the business remains the biggest differentiator. How? Here is a very small example; listen carefully and decide for yourself what the differentiator is.

Suppose I give you a broken device and ask you to repair it, and you tell me it will cost around $100 and take five hours. That's good, but I want it repaired much faster, so I visit another shop, where the person says they will charge around $500 and repair it in one hour. That's very good in terms of time, but I don't have $500; it's something my business cannot afford. So I visit a third repair shop, which quotes around $250 and about two hours. That sounds reasonable, doesn't it, given the time and cost? So tell me, what was the deciding factor: the cost or the time? It was both, weighed against my business outcome. I can afford to pay $250 and tolerate two hours of service interruption; that is my maximum acceptable delay between the interruption of service and its restoration. On the graph, the time axis is the length of service interruption, one curve is the cost and complexity of bringing the service back to life, and the other is the cost of the business impact. Each disaster recovery strategy falls into one of the categories marked on it.

The first is multi-site active-active, where you create a second active replica of your service. It is one of the best for recovery time, but too much on the cost side, because it exceeds our acceptable recovery cost. The next best considering the time factor is warm standby, where a scaled-down but fully functional version of your environment is always running in the cloud; this is optimal for disaster recovery and reasonable on cost as well. Next comes the pilot light strategy: in the pilot light approach, only the most core components and services of your application, like your most important data and storage replicas, are replicated in another Region, while the other services are turned off and only used during testing; when you face a disaster, you can rapidly launch the whole setup. Pilot light is very good on cost and reasonable on recovery time, as it falls within the criteria. Then comes backup and restore, which is extreme cost saving but provides a very slow recovery time: you back up your data and application from another location or from on-premises to the AWS Cloud, and restoring will surely take a lot of time because you have to set up everything in another location.

Having said that, you should not always choose multi-site active-active just to save time. As I told you, the disaster recovery approach depends on your business outcome, and in most cases a full-scale setup across multiple Regions does not make sense, because you have to pay an enormous amount. Think about what fits best for your business before choosing any of these approaches. But as the graph shows, warm standby and pilot light strike a very good balance between cost and recovery time.
The concepts of active-active and active-passive are very important for the exam. The easiest way to remember them: in active-active, a replica of your application is always running in another location or Region, so the whole setup is active in both places and you can route traffic to either Region at the same time. In active-passive, one location has an active setup running while another location runs a scaled-down or minimized version of your application (or even a full copy), but it is not active at that moment; only one site is active at any given point in time. That is why it is called active-passive.

Now let's come back to the diagram. As we already discussed, backup and restore takes hours to restore, so it suits low-priority use cases and less critical workloads; you provision your workload after the disaster has occurred, and it is the lowest-cost disaster recovery option. Moving up, in pilot light the most core or critical components, like your important data and storage replicas, are replicated in another Region while other services are turned off and used only during testing; you have live data, but because some services are kept idle it might cost more than backup and restore, while surely taking less time to recover. The next active-passive strategy is warm standby, where your services in the other Region are always running, but as a scaled-down version of your actual setup. This is very good for business-critical applications: when a disaster hits, you scale up your resources. It will obviously cost more than the previous two setups, but recovery time is in minutes, so it is a very good option as well. Finally, the best disaster recovery strategy costs a lot more than any of the others, but you get zero downtime and zero data loss: multi-site active-active. My suggestion: create a balance between recovery time and cost as per your business requirements.

Now we have step number eight: the Recovery Point Objective, or RPO, the maximum acceptable amount of time since the last data recovery point. To remember this, imagine that whenever you create a data set, you keep a copy in another location, so that if you lose the original data you have a backup to recover from. Now imagine someone offers to keep a replica of your data every second, so that after a disaster you can recover everything up to one second before it occurred. You would be very happy, isn't it? But then they say they will charge $500 for every replica that gets created, and you are taken aback; it feels too costly. Another person offers zero data loss, but you will have to pay $1,000 for the amount of data being replicated; you might be really excited, but in some cases this will be beyond what your business can afford. Here too, you have to consider your business outcome and base your choice of disaster recovery strategy on it. So we have a second graph: cost and complexity versus data loss before the service interruption. Remember this point very carefully: we are talking about data loss before the service interruption.
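Taken together, the RTO and RPO trade-offs above can be sketched as a simple selection helper that picks the cheapest strategy whose typical objectives fit your targets. The RTO/RPO bands below are my own illustrative ballparks, not official AWS figures:

```python
from datetime import timedelta

# (name, typical RTO, typical RPO), ordered cheapest first.
# Illustrative ballpark figures only, not official AWS numbers.
STRATEGIES = [
    ("backup and restore",         timedelta(hours=24),   timedelta(hours=24)),
    ("pilot light",                timedelta(hours=1),    timedelta(minutes=15)),
    ("warm standby",               timedelta(minutes=30), timedelta(minutes=5)),
    ("multi-site active-active",   timedelta(minutes=1),  timedelta(seconds=5)),
]

def cheapest_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the lowest-cost strategy whose typical RTO and RPO both fit
    the business targets; the tightest targets need the top tier."""
    for name, typical_rto, typical_rpo in STRATEGIES:
        if typical_rto <= rto and typical_rpo <= rpo:
            return name
    return "multi-site active-active"

print(cheapest_strategy(timedelta(hours=48), timedelta(hours=48)))      # backup and restore
print(cheapest_strategy(timedelta(minutes=45), timedelta(minutes=10)))  # warm standby
```

The point of the ordering is exactly the business trade-off from the video: you start from the cheapest option and only pay for a more expensive tier when your RTO or RPO forces you to.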
Having said that, on this graph multi-site active-active gives you near-zero data loss, because you have near-real-time data replication (on the order of seconds). Backup and restore is considered the most efficient on cost, because your backups are created in the same Region as their source and are also copied to another Region, and it still gives you effective protection from disasters: you have the data backed up and can restore it into another setup. The only problem is that it takes more time. The interesting thing is that warm standby and pilot light show similar data loss, but the cost and complexity are lower for pilot light than for warm standby. Why? That's a good question. Both use near-real-time data sync, since both keep live data with periodic backups, but the difference in recovery cost comes from the fact that in pilot light most of the services are turned off or kept idle, whereas warm standby runs a scaled-down version of the actual application. I hope you got the point.

Now we reach step number nine, the most important one: disaster recovery options in the cloud. Let's talk about the options we have already discussed in a bit more detail, ranging from the low cost and low complexity of making backups to more complex strategies using multiple active Regions, one by one. The first is the backup and restore approach. As the name tells you, the approach is to take a backup and restore it, along with your infrastructure, into another Region. A very straightforward approach, isn't it? On the diagram, one side is your active Region, where you take your backups, and the other is your DR (disaster recovery) Region.
The RPO here will depend on how frequently you take backups; that is where point-in-time backup helps. We use Amazon S3, which gives us S3 Cross-Region Replication (CRR): the EBS snapshots and DB snapshots stored on S3 are replicated across Regions using CRR. With AWS Backup we can also copy backups of our configurations across Regions. The importance of backing up configuration is that it lets you launch your infrastructure from that configuration using CloudFormation or any other IaC tool you use. It is also advised to use a different account for your disaster recovery Region, so that if your primary account gets compromised, your other Region is not affected. But even though we use AWS Backup, the restore cannot be done automatically by itself; to automate it, we make use of the AWS APIs. We can create jobs using the APIs that execute periodically or on a trigger event: once a backup job is completed, AWS Backup sends a notification about it, the user gets notified through the Amazon SNS notification service, and SNS in turn triggers a Lambda function that tests the restore functionality and runs cleanup actions, based on which we can perform the restore and cleanup of the EBS block storage.
Now coming to pilot light. As we have already discussed, here the database and storage are always active, but the application servers, which contain the code and configuration, are turned off or kept idle. I have highlighted the important pointers so you can focus on what matters. First, until a disaster occurs we don't route traffic to the disaster recovery Region, because the services and resources there are kept idle or turned off; having said that, you have the option to quickly provision the full scale of your production environment by switching on those resources, which is a very good thing. For data replication we can create read replicas by making use of the Amazon Aurora Global Database, which gives us asynchronous cross-Region replication (we already have a video on the Aurora database; please go through it to understand how the replicas work). For routing we can make use of AWS Global Accelerator, which directs traffic to the appropriate endpoint based on health checks. And as mentioned before, since the application servers are turned off here, we can also make use of CloudEndure Disaster Recovery if we wish to have our data centers or on-premises locations restored onto the cloud.

Now let's come to warm standby. This is similar to pilot light, but the main difference is that the resources and workload in your standby location are scaled down; unlike pilot light, though, the warm standby environment is completely functional, just a scaled-down version. In other words, it is a less powerful copy of the actual environment.
Don't get confused here: it is a copy of the actual environment, just a less powerful one. When a disaster occurs, or if you want to support more users or traffic, you can allow routing to this Region by scaling up the resources you currently have. In the diagram, Route 53 is inactive for production traffic toward the standby, but you can route traffic through it if you wish, and scale up to support more load. On the right-hand side it says "minimal instances running"; we just need to scale up to match the production environment. So I hope you understand the difference between warm standby and pilot light: in pilot light the services are turned off completely, while in warm standby they are running but scaled down.

Now we reach the most expensive disaster recovery strategy (I'm only half kidding; it is very expensive), which actually provides zero downtime and near-zero data loss: multi-site active-active, where you run your application in both Regions simultaneously. There is one more option here: while multi-site active-active serves traffic from multiple Regions at the same time, which is obviously an advantage, you can also serve traffic from a single Region; that is called hot standby (active-passive), where the application is hosted in multiple Regions but you choose one Region that always stays active to serve traffic. You might object that if you have to use hot standby, it's better to go with warm standby and at least save some money, and you are absolutely right: if you can run active-active, go with that; otherwise hot standby doesn't make much sense. And in active-active you might wonder what failover means in this scenario; there is really no failover, just traffic redirection, and that is what we test when testing disaster recovery with multi-site active-active. For replication we can obviously make use of DynamoDB global tables, routing write transactions to the closest Region and also using them for continuous backup.

At last we reach the final step, step number 10: detection, and testing your disaster recovery strategy. For detection you can make use of the AWS Personal Health Dashboard, which provides alerts and remediation guidance when AWS is experiencing events that may impact you. It is a very good service because it gives you a personalized view of the status of the AWS services that are part of your resource list, and you can integrate it with Amazon CloudWatch Events to create custom rules and, based on those, trigger actions using Lambda to apply remediation steps. As I have rightly mentioned, it is important to know as soon as possible when your workloads are not delivering the business outcomes they should be delivering; you must always be aware of whether your environment is functioning properly, and this is one of the most important parts of a disaster recovery strategy. The Personal Health Dashboard also gives you an event log where you can see any issues with the services you are currently using on AWS; in the screenshot you can see operational issues whose status is "closed", meaning they have been resolved. So make sure you use it to your benefit. Now that we have detection covered, let's see how we can test.
the disaster recovery strategy that we have so that we can test and validate the failover to ensure we made the rto and the rpo so you might ask what is the use of testing the strategy so in case you have a resource that you have like a data store or auto scaling group that doesn't provide any added advantage when you have a failover so you can as well reduce it or increase it based on what your requirement is so if in case your disaster recovery is not able to handle the load or it's beyond the capacity level that it needs then you can change it accordingly so that is why we test disaster recovery strategies and to manage your configuration changes you can use aws config so by using aws config we can monitor the aws resource configurations that we are currently using and that is also known as managing configuration drift and you might ask me like what is actually managing configuration drift so if there is a drift so drift in english means like changes isn't it so if you think about your data center there is a lot of changes that can happen over time on your configuration so maybe due to the hardware or software dependencies so it can change over time and if in case there is a configuration drift between your primary environment and the disaster recovery environment we should be able to identify that so that we can act upon the changes in order to keep it identical isn't it so if there is the configuration drift we should be able to identify it and aws config can help us do that so i hope you got the point let's move on so this is it we are done with the 10 steps of disaster recovery and i hope you got the idea of how disaster recovery works for aws or with aws and you can read more about this in the documentation and if you have any suggestions or clarifications please do let me know in the comment section below and these videos really take a lot of time and effort to make so please support this channel by hitting the subscribe button and giving it a like it's 
free so please do that i'll be really happy so that's it for today everyone i hope you are sound safe and healthy i will meet you in the next one until then it's pytholic signing off
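The multi-site active-active idea above, routing each write to the closest DynamoDB global-table replica, can be sketched as a small routing helper. This is a minimal illustration, not AWS's routing mechanism: the replica regions, latency figures, and table name are hypothetical, and in practice you would use measured latencies or Route 53 latency-based routing in front of regional endpoints.

```python
# Hypothetical list of regions that hold replicas of a DynamoDB global table.
REPLICA_REGIONS = ["us-east-1", "eu-west-1", "ap-south-1"]

def closest_region(measured_latency_ms):
    """Return the replica region with the lowest measured latency (ms)."""
    candidates = {region: ms for region, ms in measured_latency_ms.items()
                  if region in REPLICA_REGIONS}
    return min(candidates, key=candidates.get)

# Example: a client near India would typically see ap-south-1 as closest.
region = closest_region({"us-east-1": 210, "eu-west-1": 130, "ap-south-1": 40})
print(region)  # ap-south-1

# With the region chosen, writes go to the local replica and DynamoDB
# replicates them to the other regions automatically, e.g. (not executed here):
#   import boto3
#   table = boto3.resource("dynamodb", region_name=region).Table("orders")
#   table.put_item(Item={"order_id": "123", "status": "NEW"})
```

The point of the sketch is the design choice: every region accepts writes locally, and global tables handle cross-region replication, which is what makes the workload active-active rather than active-passive.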
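The detection flow described above, a CloudWatch Events / EventBridge rule matching Personal Health Dashboard events and triggering a Lambda function for remediation, can be sketched as a handler like the one below. The event shape follows the documented AWS Health event pattern (`source` of `aws.health` with a `detail` block), but the service-to-remediation mapping is entirely hypothetical; a real handler would invoke service APIs or start a failover runbook.

```python
# Hypothetical mapping from affected service to a remediation step.
REMEDIATIONS = {
    "EC2": "shift traffic to the standby region",
    "RDS": "promote the cross-region read replica",
}

def handler(event, context=None):
    """Lambda handler: inspect an AWS Health event and pick a remediation."""
    if event.get("source") != "aws.health":
        return {"action": "ignore"}  # rule misconfiguration guard
    detail = event.get("detail", {})
    service = detail.get("service", "")
    return {
        "service": service,
        "category": detail.get("eventTypeCategory", "unknown"),
        "action": REMEDIATIONS.get(service, "page the on-call engineer"),
    }

# A trimmed-down AWS Health event as EventBridge would deliver it.
sample = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "EC2", "eventTypeCategory": "issue"},
}
print(handler(sample)["action"])  # shift traffic to the standby region
```

This mirrors the step in the transcript: the dashboard surfaces the event, the rule matches it, and Lambda applies the remediation so you learn about impact as soon as possible.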
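The configuration-drift idea behind AWS Config, comparing the primary environment against the disaster recovery environment and flagging differences, can be illustrated with a simple diff over configuration snapshots. The snapshots below are hypothetical; in practice AWS Config records the real resource configurations, and this sketch only shows the comparison logic.

```python
def find_drift(primary, dr):
    """Return {setting: (primary_value, dr_value)} for settings that differ."""
    drift = {}
    for key in primary.keys() | dr.keys():  # union: catch settings missing on either side
        p_val, d_val = primary.get(key), dr.get(key)
        if p_val != d_val:
            drift[key] = (p_val, d_val)
    return drift

# Hypothetical recorded configurations for the two environments.
primary_cfg = {"instance_type": "m5.large", "ami": "ami-0abc", "min_capacity": 4}
dr_cfg      = {"instance_type": "m5.large", "ami": "ami-0def", "min_capacity": 2}

drift = find_drift(primary_cfg, dr_cfg)
print(sorted(drift))  # ['ami', 'min_capacity']
```

Here the DR environment has drifted on the AMI and on capacity, exactly the kind of divergence the transcript warns about: if left unfixed, a failover would land on an environment that no longer matches production.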
Info
Channel: Pythoholic
Views: 6,198
Keywords: aws disaster recovery, Pythoholic, aws disaster recovery best practices, aws disaster recovery pilot light, aws disaster recovery pilot light vs warm standby, aws disaster recovery test, aws backup and disaster recovery, high availability and disaster recovery in cloud computing, disaster recovery plan in cloud computing, aws cloudendure disaster recovery, disaster recovery plan, what is disaster recovery, business continuity, aws disaster recovery demo
Id: WYCbczFIj3E
Length: 36min 19sec (2179 seconds)
Published: Tue Jun 29 2021