Disaster Recovery in Microsoft Azure

Hey everyone. In this video I want to explore thinking about disaster recovery for the services that we deploy in Azure, and maybe even thinking about using Azure for disaster recovery for services in other clouds or on premises. As always, if this is useful, please like, subscribe, comment and share, and hit that bell icon to get notified of new content.

Now, it can be really tempting to just think: hey, we have Azure, and when I deploy things to Azure, as if by magic everything we deploy is just always going to be available. I don't have to think about it. It's in the cloud and the cloud is everywhere, so when I deploy to that service I don't have to worry about anything else. And there's this whole idea that there's no such thing as the cloud, it's just someone else's PC. I think there's a lot more to it than that, but realize: yes, we have cloud services, but fundamentally there are still physical buildings, there are still racks of servers, and there are things that can go wrong. Things can fail at different levels: at a node level, a rack level, maybe even a facility level. So it's really not magic. There are components we have to think about, and so we have to leverage different architectures and technologies to make sure our services are resilient.

Now, when we talk about disaster recovery, we're really thinking about resiliency and failover between regions. So I can think about, well, maybe I've got two regions for this conversation. I've got one region over here, for example maybe that's East US, and there's another region over here, maybe that is West US. Again, just examples, and there are many, many regions. If we go and look at the actual Microsoft documentation, here we can see, if we look in the United States, there's Central US, East US, East US 2, East US 3 is coming soon. There's this whole list of different regions available to us in the US alone, and then there are ones in Brazil, Canada. I can
change and I can go and look at Europe, for example: France, Austria, Norway, you name it. There really are regions all throughout the world.

And a region, remember, how do we define that region? It's kind of that two millisecond latency envelope, so there might be multiple physical facilities that actually make up that region. Within that region they may actually be exposed to me: in certain regions we have things like availability zones, so I would see an AZ1, AZ2, AZ3, where they have independent cooling, power and networking, i.e. communications, and I would use those for high availability. That's separate from disaster recovery. For high availability we think about availability zones, fault domains (i.e. certain racks) and load balancers, and these give us protection from things like a node failure or a rack failure, maybe even a building-level, a facility-level failure. Now, within that region, because it's that very small latency envelope, we can typically do things like synchronous replication, where we don't acknowledge a write, for example, until it's been written to a majority of members, so there's some sort of quorum. I might have active-active configurations, I can have load balancers distributing between them. That's a very simple thing, and if you're curious about availability zones, I did a deep dive video on that where we can go and see more detail.

But if you think about disaster recovery, that's where I want a bigger distance between them. I might want hundreds of miles between where my primary is running and where my disaster recovery would be, because it's all about that blast radius. Within a region, sure, I might have resiliency against a node failure, a rack failure, maybe even a building, but some kind of natural disaster could impact all of those. So for disaster recovery purposes I want to think about a huge distance between them, so that if there was some natural disaster, it's not going
to impact my disaster recovery region. Or if it did, I've probably got bigger problems, because if they're hundreds of miles away and that natural disaster impacted both of them, it's probably a bad day for everyone.

Now, when I think about regions to pick, Microsoft has a certain natural pairing. It actually creates these paired regions, and we can see those in the documentation. Again, all of these are linked in the description below. So you can see here it names the different pairs that make up how it uses certain services. Now, I don't have to use these pairings. For a lot of services I can pick my own. Some services do align to this: things like Azure Storage geo-redundant storage, Azure Key Vault and Azure Backup use these pairings. There are also some benefits to using these pairings. Microsoft has to deploy out updates, and when it deploys out service updates, it's super careful that it won't deploy the same update to paired regions at the same time, because most likely customers are using those pairings for their disaster recovery. So if it rolled out an update, let's say to the storage provider, and it caused a problem, we wouldn't want that same update rolled out to the paired region, because it would take out the DR region as well. So they don't roll out updates at the same time to the paired regions. Using those pairings does give you some additional resiliency from Microsoft rolling out their updates, so there's a benefit to that.

Now, as I think about actually using these, when I want to use disaster recovery I have to understand what my workload is, because my workload is going to impact what exactly I can do. Think about your resource: the workload you're running is using some kind of resource. Now, that resource could be many different things. It could be a virtual machine, it could be some kind of container, it
could be something else. Then on that resource there actually runs an operating system, there may be some kind of middleware, some kind of runtime, and then you actually get to your app, and potentially there's data as well. So there are these different levels to our workload.

Now, what I have access to varies depending on which Azure service I'm using. If, for example, I'm using an IaaS virtual machine, well, then I see all of this; I'm basically getting a VM in the cloud. If I'm using a PaaS service, then I really only see the application. There's still an underlying fundamental resource and OS and runtimes, but I don't really have any access to those things. With PaaS I just focus on the application.

Now, when I think about these different levels I can actually use, there are different options I can have to achieve that disaster recovery, because the whole idea is: look, I'm running here, something bad happens, I now want to be running over there. So how do I suddenly run over there? These are hundreds of miles apart. What are some of the options?

Well, certainly I could just recreate the resources. In the event of a disaster I'll just recreate them: I have some template, for example, I used infrastructure as code, and I can just recreate. Or maybe, and this is horrible, I go to the portal and I click, click, click and recreate my resource. Maybe I can perform a backup and restore: I have my backup data available to me in the other region, maybe the backup's taken there and then replicated over, and I can do a cross-region restore, so I could restore my resource over in this region. Or, and it kind of builds into this, I'm doing some kind of replication. As we just saw, if I wanted to do a cross-region backup and restore, it would have to be replicating the backup data across the regions anyway. Now, when I think about replication, remember all these layers. I can replicate at all those different layers
depending on what my service is. Maybe I'm replicating at the application level: if it was a database, maybe it's SQL Always On, for example, or some extension to another database that lets it replicate. Maybe I'm replicating at the OS level: some agent inside the operating system that captures the changes to disk as they flow through the volume driver, the file system, etc., and sends them somewhere else. Maybe the platform itself has a native replication: for example, if I had an Azure storage account, it has its own option to do the replication. So we have these different potential options for how I get my thing running in that disaster recovery region, and the right one is going to depend on a number of factors. I'm going to go into more detail on this.

But first of all, realize not everything is your problem. Most resources we deploy go into a region: a storage account, a VM, a container, an app service, a database instance, they deploy into a region, and I've got videos on understanding why most things deploy into a region. But realize there are some services that don't deploy into a region. Some of them are just globally redundant natively. For example, Azure AD is a global service; it just has instances replicated over multiple data centers. It's not my job to worry about trying to make Azure AD resilient in a regional failure scenario. The content delivery network is another service where I can host content that's automatically replicated out. Azure Static Web Apps actually use the CDN, so I don't deploy an Azure static web app to a particular region; it just uses that CDN, so it's geographically distributed. There are certain geo services, for example Front Door, which geo-balances HTTP/HTTPS traffic, and Traffic Manager, which is a global DNS resolution system to point to other services. They are just globally resilient, it's built in, so I don't have to worry. That's not my job
to think about those. But most other things deploy to a region, so I have to think about what happens if there's some kind of regional problem. Now, I'm going to focus initially on the idea of failing over, but realize I might also run active-active, where I'm actually running instances in multiple regions. We'll come back to that later on, but that would solve the problem as well, because they're just running in multiple regions, so in a way I'm resilient against some kind of regional failure already.

But let's think for a second: I want to think about disaster recovery for my workload. We need to understand the requirements of the protection. So what are the requirements for the disaster recovery? There's a whole set of things we need to know. I need to understand the application dependencies, but before I get to any of that, there's a key must-know. What I must know before anything else is two things.

The first one is something called the recovery point objective, and you'll hear this called RPO. Recovery point objective is really saying: how much can I lose? That could be five minutes, maybe it's one hour, maybe it's one day. So in the event of some unexpected disaster, how much data can I lose? Now, you might instantly say, 'Nothing, I can lose no data.' Okay, well, that's going to require a million dollar solution. 'Oh, okay, well, maybe I can lose five or ten minutes.' Okay, that's a lot more manageable. So we need to be realistic about what these numbers are, because yes, you can say 'I can't lose anything, I can't be down for any amount of time', well, okay, it's going to cost me millions and millions of dollars. 'Well, actually, I can lose five minutes, I can be down for half an hour.' That would be okay in an unplanned scenario, i.e. unplanned would be 'where did my data center go?', and so I'm going to expect to lose some data because it's unexpected: suddenly I just lose my resources. Then there's an expected
failover: I see a storm coming, I have a few hours' notice, I can cleanly stop services, complete some replication and start them up. There I would not anticipate having any unexpected data loss. So I need to know what amount of data I can survive losing.

And then I need to know what my recovery time objective is. You might hear RTO, and this is really about how long until I'm operational, i.e. maybe that's 30 minutes, maybe that's four hours, maybe it's three days. Different values; there's not a right or wrong here, it depends on the service. If this was a time tracking application for employees or an expense submission site, you know, if it's down for a few days, it's really not the end of the world. That's completely different from 'this is the shopping cart for my online business, I can't be down for five minutes.' So again, I want to be realistic with what these are, because these are completely going to drive what solution I use.

Now, one additional thing to think about with this recovery point objective. Yes, this is how much data I can lose, but realize this does tie in a little bit to my recovery time objective. Let's say this is 10 minutes: I can lose 10 minutes of data. Okay, so I've got a solution in place that maybe replicates every five minutes, so I'm safe. But let's say my recovery time objective was 10 minutes: it takes me 10 minutes to fail over. So I had a backup from five minutes ago, then it takes me 10 minutes to get up and running, so that's 15 minutes. What was happening to transactions in between that time? Are they stored somewhere so they'll get replayed, in which case I'm good, I'm not losing any data? Or is there some component that's just going to dump the transactions and fail? Well, now I might actually be losing more than that expected five minutes from the replication, because I'm actually down for a period of time. So I have to think about that as well. So we understand those things, and these are going to drive a lot of key points about
how we architect our solution.

Now, we drew this idea of layers: the OS, the app, the actual resource itself. When I think about replicating, I can replicate at those different levels. I can imagine I've got these two regions and I'm thinking about replication right here, so I want to expand out this idea of replication. That's what we're going to focus on.

Typically the best performance, the least data loss, the fastest uptime, the fastest failover would be if I replicate at the app level, so there's something built into the service itself that does replication. For example, if it was a database, if it was SQL, it has its own native Always On replication. Now, typically all of these are going to be asynchronous. Remember the point of asynchronous compared to synchronous. With synchronous I don't acknowledge the write until I've written it to multiple places. That doesn't work very well if there's hundreds of miles between them, because it takes time to send that request and get the response back. If I have an app performing transactions and it has to wait for this latency, which may be 30 or 40 milliseconds, it can impact my performance. Asynchronous acknowledges the transaction as soon as it's written to the local region, and then, as quick as it can or maybe on a schedule, it sends it over to the other region. Obviously, because of that asynchronous nature, I could lose data, because if this went down before it sent it to the other side, I've lost that transaction. That's why with asynchronous there's always the risk of loss of data. But we have to balance that, because otherwise we'd really impact the performance of the app if it had to wait for that maybe huge latency every single time. That's something we can decide about. And often for these services I might actually be able to have read access on this side. So if this was a database, I can read/write on this side, but maybe I can get read
access on the other side. I can't write to it, because most databases are a single-write scenario, but maybe I can read from it, and that's going to come into play a little bit later on.

Maybe I replicate at the operating system level. There are solutions like Azure Site Recovery, for example. This runs a mobility service inside the operating system. It captures things as they flow through the file system to the volume driver; the mobility service grabs that and sends it to a service on the other side, where it writes it to disk, and then it can create a VM and attach that disk if I actually perform a failover.

Maybe I replicate at the resource level, so here I have some native replication that depends on the service. Maybe, for example, it's a storage account and I can turn on something like GRS, so it's doing that geo-redundancy.

Maybe I just have a backup at some level, and what I can do is take the backup data and replicate it over, and then I could restore onto the other side. So over here I could do a restore of that data. For example, with Azure Backup I can turn on that geo-redundant option.

And then there's another option where I just recreate it. Now, obviously that's only going to work if there's no state that I care about, i.e. data. But if it was some kind of dumb front end that didn't have any state, that's actually probably a really good option.

So realize there are all these different layers and there are ways I can do this. But what happens is, as I go up from basic levels to app-level replication, my cost goes up. As I move up the layers, an app-level database replication typically will cost more because I have multiple instances of the database actually running. There's more resource running, which costs me more money, compared to where it's just replicating at a disk level and there's no compute resource running on the other end. But the downside of that is my failover speed, I say speed, there's more to it
than that, will actually go down. Again, if it's at the service level, there's a service running, it's receiving the transactions, it's checking them, it's ready to fail over really, really quick. Whereas if it's at the OS level, the database could be dirty, I don't really know its exact state, it may have to do some cleanup. If it's at the disk level, then I have to recreate the VM and attach the disk, so it's going to take longer. So there are these pros and cons depending on exactly what model I'm using, how important that service is, and what that recovery point and recovery time objective are. Realize it's not one solution meets all, and, we'll talk about this, I don't have to pick one for my entire application.

You may also wonder about this recreation: how do I recreate? Well, that's why we were stressing infrastructure as code and DevOps. If all of my resources are defined by templates, if they're pushed by a pipeline, it's actually easy for me to recreate things. I can just replay it, I can redeploy that template. That's again why we don't like the idea of creating things in the portal: I can't recreate from that.

So we have all these things. Now I want to think about what else we need to know. Well, our application has a certain architecture. I might start off saying: okay, my application, I have a web front end. Maybe that web front end is a group of virtual machines, maybe my web front end is containers on an AKS cluster. Whatever it is, there are multiple instances of my service. Now, to actually get to the service (remember, these are all within one region) I have a load balancer. In this case it's offering a web service, so I use Azure App Gateway with web application firewall to distribute. Another option: maybe it's a standard load balancer. And this actually uses a public IP, colored white today. So this is a public IP, it's accessible via the internet, and that's how it is connected to. Okay, so this is one layer of my service. It then
goes and talks to a serverless layer, for example maybe this is Azure Functions. Remember, this just spins up little jobs as the work comes in. So these then communicate, and this is another layer. And actually, now I think about it, that web front end pulls some resources from a storage account, so I also have a storage account that is read from. I've got some artifacts in there, maybe it's images. I probably should use a CDN, but I don't; I use a storage account, which is in a region. And that middle tier goes and talks to a database, so I have some database in here which it does read and write actions against. So I have all these different elements.

Okay, that's fantastic, I understand my application. What else does it depend upon? I need to understand every element. So people get to it from the public IP, these are the layers of my application, what else does it use? Well, actually, people authenticate and I'm using Azure AD for that, so I have a reliance on Azure AD as well. I'm also using managed identity, and that ties into Azure AD. Just understand all of the different requirements you have. For example, if these were virtual machines that were domain joined, then I have a reliance on Active Directory domain controllers, which maybe have a reliance on DNS, which maybe relies on some network communication. So you need to think about all of the different levels that may actually come into play. And this database, maybe it's Postgres. It could be PaaS, it could be IaaS, running on a virtual machine, and again, a virtual machine maybe is domain joined. I'm going to say I'm using PaaS, I'm using the Azure managed PostgreSQL database on this one. But you need to understand all the dependencies.

Now, once I understand that, there's one key question I want to answer, and my question is: where is the state? That's what I care about. Well, there's no state in the App Gateway, there's no state in the web front
end, there's not really any state in the Azure Function. My state is in the database, and I could also potentially argue, well, these artifacts change, so maybe there's some kind of state in there as well. So I have to think about that: that's the stuff I need to protect the most. Because remember our options: I can replicate stuff, or I can recreate stuff, or maybe I can just do backup and restore. If there's no state in a layer, I probably don't bother with replication or backup; I would just recreate it. Now, potentially there might be configuration of a service that's maybe really complex, in which case some services let you back up and restore the config. But in terms of the resources themselves, I would not bother backing them up if I can just recreate them, and we'll come back to what that means in a second. But if it's got state, then I have to do something. I can't just recreate it; I can't recreate data that's in another region. So where's the state? Well, there's state here in this storage account and there's state here in this database, so I care about that.

Now, for the database, remember we have different options, and the one we pick is going to depend on that recovery point objective and recovery time objective. If it was a very small recovery point objective, I can't lose much data, almost certainly I'm going to use some database-native replication. Potentially, if that was not available for some reason and it's running in an IaaS VM, I could maybe do an OS replication. That would only work if it was IaaS; if it's PaaS, I don't have any access to OS replication. Or maybe it's a really long recovery point objective. Again, maybe it's that time logging system. Look, people can resubmit their time, it's not the end of the world. Maybe we have a backup, we back up every 12 hours, and worst case people re-enter their time for the last 12 hours. Because it comes down to cost. Remember, this is going to cost me a lot more
than just having a backup sent somewhere. So I don't just say, 'for all my databases I'm going to use database replication.' What's the right one for a particular service? Again, for that shopping cart, the transactions I maybe really cannot lose; I cannot afford to lose transactions. Now, maybe there's a way I can asynchronously replicate and maybe I can replay transactions in the event of some kind of disaster, but I have to think about that.

Pretty much all of the database solutions in Azure have some kind of database replication. For example, if I look at the Postgres documentation, it has the ability to have a cross-region replica, and it talks about how this is really useful for disaster recovery planning. With Azure SQL Database I can have geo-replication, and I can have that asynchronously, stressing the asynchronous level. I can absolutely do that.

Now, if I was just using a regular backup, if I jump over here for a second and look at my Recovery Services vaults, many solutions today will let you do a cross-region restore, i.e. my backup vaults. If I look at my backup vault, we set up the backup vault in its configuration (when this decides to load) to be geo-redundant. So we can see here it's geo-redundant, and notice here I could set the option to enable cross-region restore. So not only would the data be replicated across regions, so I'm replicating to the other region, let's say I'm doing a backup of a database or a VM, I could actually restore it to the other region. That's a useful capability. So where there's state, I need something to be replicating that state. That's the key point to make sure I'm resilient.

Okay, what about recreation, then? I said anything without state I'll just recreate. I made that sound nice and easy and simple. Well, how do you recreate things? How do I create things in the first place? The point here is, ideally I'm using infrastructure as code. Now, if we use
infrastructure as code, most likely what we actually have is some kind of repository. So I have a repo, and the nice thing is most repositories, like GitHub or Azure DevOps, are already geo-distributed. It's just native; they're not deployed into a region. My repo is not tied to South Central US, it is just geographically distributed. I don't have to worry about a particular region having my repo in it; that's not my problem. And Azure AD is also geo-distributed, so that's not something I have to worry about; it applies across there.

Now, what I probably do is have a pipeline, some kind of CI/CD, my DevOps, my continuous integration, continuous delivery, continuous deployment. My pipeline gets things from the repo, and then it can do that recreation: I can re-run it to recreate my resources. Now, I have to think about all the different parts my pipeline may use. Yes, it may pull things from a repo, my code, certain releases of my code. But imagine I'm deploying containers. If I'm deploying containers, there's probably a container registry which has my images in it. If it's a virtual machine, I might have a shared image gallery with my VM images in it. So I need to make sure they are resilient. The repo is just automatically resilient for me, but I need to make sure I don't have to recreate my images. Yes, as long as I've got the repo I could rebuild the container image, I could do a docker build. Yes, maybe I'm using Packer or Azure VM Image Builder and I could rebuild the VM images. But I don't want to mess with that, so I want to make sure I add replicas to these things, which essentially makes them geo-distributed. There's going to be a copy in the other region, because these get pulled in to the pipeline to do that recreation. And remember, these are regional by default, so I need to add those replicas to make that resilient. So this is a key point. This is how we recreate.
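As a rough illustration of that last point, here is a minimal Python sketch that checks whether everything a pipeline pulls from would still be reachable from the disaster recovery region. It is only a model of the idea, not anything Azure provides: the `PipelineDependency` class, the `missing_in_region` helper, the dependency names and the region names are all hypothetical, chosen to mirror the repo, container registry and image gallery mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineDependency:
    """Something the deployment pipeline pulls from when recreating resources."""
    name: str
    # True for services that are geo-distributed by nature (assumed here for the repo).
    geo_distributed: bool = False
    # Regions that hold a copy, for services that are regional by default.
    replica_regions: frozenset = frozenset()

    def available_in(self, region: str) -> bool:
        return self.geo_distributed or region in self.replica_regions


def missing_in_region(dependencies, dr_region):
    """Return the names of dependencies with no copy in dr_region."""
    return [d.name for d in dependencies if not d.available_in(dr_region)]


# Hypothetical inventory: the registry has no replica in the DR region yet.
dependencies = [
    PipelineDependency("source-repo", geo_distributed=True),
    PipelineDependency("container-registry", replica_regions=frozenset({"eastus"})),
    PipelineDependency("vm-image-gallery", replica_regions=frozenset({"eastus", "westus"})),
]

gaps = missing_in_region(dependencies, dr_region="westus")
print(gaps)  # ['container-registry']
```

Anything that shows up in `gaps` is something the pipeline would have to rebuild during a disaster, which is exactly the work that adding replicas is meant to avoid.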
This is why we stress infrastructure as code and DevOps and all those things. It's not just that it's needed to do this stuff; it's really going to light up a lot of great scenarios in those disaster-type scenarios.

Now, realize this recreation thing does come back to the recovery time objective a little bit, because remember, creating things takes time. So this decision to recreate does depend on that recovery time objective. If it's fairly long, I probably have the time to recreate them. If it's short, and this really depends on your definitions and how many resources you're creating, then maybe I need that warm standby, already up and running, so I can just use them. But again, the cost goes up. Especially if it's DR, it's something we hope we never use, so we really want it to be as cheap as we possibly can. Ideally, in the event of a disaster, we spring into life and trigger something that goes and creates stuff, but until then it's not costing us much money. It's that seat belt we hope we never use. So that's thinking about adding in that RTO to recreation.

Remember, PaaS services are pretty much always deployed to a region, so if I want a PaaS service on warm standby, I'm just going to deploy instances to multiple regions. My AKS clusters, my App Service plans, whatever that might be, they're just going to have instances stood up in multiple regions. There really is no global version of any of those. If I want a warm standby, that thing is going to be up and running. Now, it might be that I can pre-create it and stop it. An AKS cluster, for example, you can now actually stop, to stop most of that payment. Virtual machines I could stop, and then I'm just paying for the disk. So there are ways I can maybe optimize the costs, but realize there's still going to be some element of cost there.

Now, how is this service used? Great, I can make sure my state is protected, I can make sure I can
recreate the resources, but how does someone actually get to it? In this case it's a public IP, so I'm going via a public endpoint. To be of any use, I've got some user that goes to the service. Now, how do they go to that service? Have they been given an IP address? That's a nightmare. You do not want that; you don't want things hard-coded with IP addresses. Maybe they have a DNS name. Now, there are different approaches we could take with a DNS name. One of the issues with DNS: a DNS name resolves to an IP address, but I could update the record. If it failed over to another region, it would have a new public IP. You cannot move public IPs between Azure regions, so we get a new public IP, but I could update my DNS record to point to the new public IP.

But there's this idea of time to live. Remember, a client might be constantly talking to this service, and I don't want it to have to do a DNS lookup, which takes some time and resources, for every transaction. Maybe it's doing 100 transactions a second; that would be a lot of lookups. So there's a time to live, i.e. I'm going to cache that record for this time to live. If it was one hour, I only look this up every hour. If it was one minute, I'm going to re-look it up every minute. So the longer this value is, the less work on my DNS server, but if I change the record, the longer it will take for the client to recognize it. So I have to balance: if I'm going to update the DNS record, what's the time to live?

Or maybe I'll return multiple records to my client. I'll return the record in region one and region two, and then the client's job is to try one and, if it fails, try the next one. Active Directory works this way: when I do an nslookup for a service, it returns all the domain controllers, and if one doesn't respond it will go and try the next one. But I don't really want to do that. What I would rather do, if I know I'm going to want to fail over, is put some kind of geo-balancer in front of the service and
have the users access that instead now if it was very basic maybe it's not http based i could use traffic manager and then traffic manager can point to the different names and instances of the service it does a health probe to see if they're healthy and return the one that's closest to the user or there's other algorithms i can use if it was http https i could use azure front door and once again it can point to multiple different instances of the service based on where you are so now it's going to balance to which ones are healthy and additionally maybe which one is closest to the user as well so this is a key thing i think about in my architecture hey i probably want to use some geo balancing solutions so i can handle people actually connecting to my service now if it's internally facing that isn't a public ip it's a private ip these don't work traffic manager front door the azure global load balancer which is in preview do not work against private ips so i'm probably at that point going to be using like a network virtual appliance that does handle global internal balancing that's kind of a shortfall today also realize when i think about a failover i've drawn all these different options okay so there's hey there's different replicas different databases different services hey some i'm going to recreate some i'm syncing it's all these different moving parts and in the event of an actual disaster i don't want to be messing about manually doing all of this so what i really want is a recovery plan i want kind of one of those easy buttons that you can just kind of push the easy button and it kind of says that was easy and it does the failover for you so that easy button in azure is azure site recovery so azure site recovery lets me actually go and create a recovery plan which could be hey fail over this vm and then that vm and then run this script which could maybe start some other things do this sql failover whatever that might be this is kind of your 
orchestration so this is orchestration for your failover so think of a real disaster there was a real disaster do you think your team is going to have that kind of sound mind to go through a 20 page failover document no if it's a true disaster they may be thinking about their families and other things going on and so we really want to make it a seamless experience that hey we can document in advance that we can test so these sorts of services really help simplify that complete process now when i think about disaster recovery i've focused here on azure going to another region realize that may not be exactly what we're talking about we may absolutely have kind of an on premises environment we might want to use azure for our disaster recovery or maybe we're going to move things and hey we're setting azure up as disaster recovery initially and what we'll do is a failover to azure and then leave it there for the migration so once again on premises we have a bunch of resources that maybe have an operating system then we have kind of the application and maybe that thing has data we can think about exactly the same options that we can hey i can replicate and again a lot of the database solutions will let me replicate from on-prem to azure instances i can run things like azure site recovery from bare metal from vmware from hyper-v to replicate from within the os to azure i can run backup so my backup data could go and back up to recovery services vault in azure then i could restore into azure so there's all these different solutions available so realize it's not just azure to azure it could be like on-premises to azure as well on-premises is likely using active directory so i would also have to think about things like domain controllers a domain controller is really not a good fit to try and restore i really would probably want domain controllers running up in azure iaas as virtual machines azure ad is not a replacement for ad don't let the naming fool you so i probably want some dc's running if 
it's a long-term outage i probably want azure ad connect to be up and running to start synchronizing back with azure ad but again it's just another service it's another application with dependencies that would be part of my planning as well it would kind of filter all the way down now my next option here i've been talking all about failover well why do we really think about failover primarily it's been for most companies historically they have a data center and that data center runs the workload and then the dr location is really not very good we hope we never have to use it may or may not work well in azure that's not the case in azure we have all these regions with all these different capabilities and we can have multiple regions not just for resiliency i might want my workload in multiple regions so it's close to my audience that's using it so they get a better performance because there's lower latency so i can think about having multiple instances running in multiple regions so if we come back over to this kind of picture i should redraw we'll redraw that little architecture so once again i can think about hey if we have that region one i'm gonna do a much quicker version of the picture but hey we have that app gateway we have those instances of the service behind the app gateway remember we have kind of the azure functions with the little instances as they're called we have our database whatever that might be and this is kind of both writing and reading to that database and remember also these front ends well they had an azure storage account that they were kind of reading from so we had that whole set of capabilities right there okay fantastic well i can absolutely also have in region two and region three and region four i might have lots of these running the same thing hey i've got my app gateway got my instance of service got my kind of serverless layer there and then i've got my database copy where i'm doing reads from all that's good fantastic i 
can balance between them so remember my client coming in would talk to something like traffic manager or azure front door which would be able to resolve to one of those to distribute the traffic so far so good what's going to drive my ability to do this depends on my app architecture realize this is not costing me really any more money if i have 12 instances running in one region or three instances running in four regions it's still 12 instances it's not costing me much more now yes there are the app gateways and the read replicas yes there might be a slight cost but it's not a huge incremental cost to actually have kind of an active active scenario up in azure so why wouldn't everyone do this because remember this is giving us high availability now as well in a way because they're actually running in multiple regions and it's giving us disaster recovery and it's improving the end user's performance because it will point them to one closest to them so this all seems really fantastic with no downside i get resiliency from a single region failure and better performance now obviously if a region failed i then want to take advantage of things like auto scale because obviously at that point if i don't draw the arrow that way i don't like that we scale in and out i would add instances because if this region went down well if it's just two regions i've lost half my capacity now these are hopefully all auto scale anyway because i get peaks and troughs of load so it will just auto scale out as more work comes in so i want to make sure i consider hey what if a region went down in my maximum scale settings if i'm deployed to four regions and a region goes down now i've only lost 25 percent so the overall hit on any one region would actually be a lot less but definitely i want to consider auto scale for that but there's a problem why i might not be able to do this and it really comes down to the data tier because remember this is where the state 
is so if we think back again remember this is the state and we talked about most databases are kind of a single write scenario a single main well there's hundreds of miles between these this is replicating so we're using an asynchronous replication technology to have a copy in the other region if this actually went down this could fail over and become the primary database because of that this is a read replica so the local copy of the app can read from it but it can't write to it when it wants to write it has to go across that hundreds of miles maybe it's at 40 50 milliseconds of latency so this will determine a lot of times if i can support an active active architecture now there are things like redis cache there's other options i can maybe do cosmos db has different consistency models i can do a session consistency so hey i get consistent read writes within the region then it will catch up the others as best it can but i have to consider this because if i'm using a regular relational database i'm going to have to support this model that hey yeah i can read from my local one but when i do a write activity it's going to have to go across to the main primary and then make sure i've got code to handle if this goes down and this becomes read write how do i now redirect people to this as the read write instance most workloads are very read heavy if you think about most interactions with a service hey i'm looking at what i've bought i'm looking at my profile i'm looking at blogs i'm not creating updates to my profile i'm not creating new transactions so this may be okay it really does depend on the application if this is something suitable and again if this is a direction i want to go i might re-evaluate my data tier remember that storage account remember we had one over here as well that it kind of read from well i could do grs i'd do ra-grs read access grs so it's replicating that way again asynchronously to keep that up to date if there were 
changes so there are things i can do to handle that but i have to think about my complete model but this is nice because if we think about it from a disaster recovery perspective i need to plan it i have to plan my disaster recovery i need to test it i need to keep it current environments are not static i'm probably constantly changing this it's no good creating a dr environment and a dr plan and then leave it as soon as this changes this becomes invalid so as part of my change control process i need a step to say okay how does it impact my disaster recovery plan update it and test it i want to be doing these failover tests and so the reason active active is so nice is it's always running i'm constantly testing it so i can actually get a really good degree of confidence that hey yeah i know i have this other option now you still need to test because hey what happens when i fail over this database there's still testing you need to do there's still planning you need to do but if you're active active that replica region or regions is far more ingrained in how i think about things because it's always there so the key point here is really make sure dr is a core part of my thinking ideally if you can run this active active now it might cost a little bit more but it's not a huge amount more because again instead of six instances in one region i've maybe got three and three there's just a few extra little bits that i'm kind of tagging on i have to think about where is the state i have to know what is my recovery point objective my recovery time objective so i make the right decisions on how i do the replication to meet the need but not waste money hey if there's no state just recreate it infrastructure as code devops pipelines can do that for me make sure you plan make sure you test make sure you keep it updated so it's ready for when you actually need it hopefully you never will but it's there if you did so that was it i tried to kind of go over kind of those key points when you do 
think about disaster recovery you have options yes i can replicate yes i can back up yes i can recreate the right one depends on where's the state depends on my rpo my rto so until next time good luck and take care
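The capacity point made in the video (12 instances in one region versus three in each of four regions, losing half the capacity with two regions but only 25 percent with four) can be checked with a little arithmetic. The sketch below is only an illustration under simple assumptions: load is spread evenly, every region runs identical instances, and only one region fails at a time. The function name `region_plan` and its output keys are hypothetical and not part of any Azure SDK.

```python
import math


def region_plan(total_instances: int, regions: int) -> dict:
    """Spread a workload evenly across regions and size the per-region
    auto scale maximum so the survivors can absorb one region failure.

    Hypothetical helper for illustration only, not an Azure API.
    """
    if regions < 2:
        raise ValueError("need at least two regions to survive a region failure")
    # Normal running state: the same total, just split between regions.
    baseline = math.ceil(total_instances / regions)
    # After one region fails, the remaining regions carry the full load.
    max_per_region = math.ceil(total_instances / (regions - 1))
    return {
        "baseline_per_region": baseline,
        "capacity_lost_on_failure": 1 / regions,
        "autoscale_max_per_region": max_per_region,
    }


if __name__ == "__main__":
    for n in (2, 4):
        print(n, "regions:", region_plan(12, n))
```

With 12 instances, two regions would each need to be able to scale from 6 up to 12, while four regions would each only need to scale from 3 up to 4, which is the "overall hit on any one region would actually be a lot less" point from the talk.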
Info
Channel: John Savill's Technical Training
Views: 8,768
Keywords: azure, azure cloud, microsoft azure, microsoft, cloud, disaster recovery, dr, high availability, ha, replication, asr
Id: 8fvO3WArG-Y
Length: 55min 48sec (3348 seconds)
Published: Tue Oct 19 2021