Microsoft Azure Master Class Part 4 - Resiliency

Video Statistics and Information

Reddit Comments

Thank you for posting these videos! You’re an incredible trainer and we’re all grateful for your hard work 💪🏽

πŸ‘οΈŽ︎ 3 πŸ‘€οΈŽ︎ u/Eddie_Arcadian πŸ“…οΈŽ︎ Sep 29 2020 πŸ—«︎ replies

Curious if any of the topics described here would have protected from the outage yesterday. I would think not since the problem yesterday appears to have been an issue with the core functionality of Azure AD. Still interesting food for thought. Thanks for all the great videos!

πŸ‘οΈŽ︎ 2 πŸ‘€οΈŽ︎ u/jablome92 πŸ“…οΈŽ︎ Sep 29 2020 πŸ—«︎ replies

Hi John! Big fan of yours. I watched your Az-900 summary vid with tips & tricks and I can genuinely say it was of great help; it really helped me get my certification yesterday! I was wondering if you had any tips or anything that I could use for my upcoming Az-104 at the end of October.

πŸ‘οΈŽ︎ 1 πŸ‘€οΈŽ︎ u/Cronos310 πŸ“…οΈŽ︎ Sep 29 2020 πŸ—«︎ replies
Captions
Hey everyone, welcome to Part 4 of the Azure Master Class. This one is all about resiliency for our services in Azure. I want to think about the types of resiliency, what our options are, the constructs we have in Azure around resiliency, what we have in terms of backups, and then the various replication capabilities, both from on-premises to Azure and from Azure to Azure. Then, thinking about global balancing: using Azure really can change the way we think about resiliency against any kind of regional failure. Instead of always being active-passive, sitting waiting to take over if there's a disaster, we can actually just run active in multiple locations, so what are the options around that? As always, if this is useful, please like, subscribe, comment and share. In the description below you can see the GitHub repository where I have the whiteboard for this session, the slide handouts, and any other materials I might have used, along with the overall playlist for this master class.

So what are we protecting against? We want resiliency for various reasons. I can think about resiliency against some kind of hardware failure: it could be a disk in the node I'm running on, it could be the node, it could be the rack, it could be some big router in the data center, it could be the power supply or the communications, something that takes out the hardware hosting my service. It could be a software failure inside my virtual machine: something crashes, it could be my app, it could be the operating system. Maybe it's even a software failure at the fabric level: Azure rolls out some update that breaks things for me. So how do I have resiliency against that?

Then there's corruption. The corruption could be something I write in my code, it could be that I delete a bunch of things or overwrite things with garbled data, or it could be an attack that encrypts my data so I can't use it. That leads into the whole attack and denial-of-service class of threats we have to worry about. And I also have to think about, maybe not protection against, but regulatory requirements: "you can't host your services on a box with other people", or "you have to keep data for seven years". I have to think about all of those things.

So what can we do? It really breaks down into two categories. The first is copying it somewhere else. If I have my node running my workload, that workload could be a virtual machine, and there's some concept of data, a block of data. One option is: I'll copy it to another location. There's some storage somewhere else and I replicate the data to that other place. The data might be part of a virtual machine disk, it might be part of a storage account or my SQL database, and if need be, on that other side I can spin something up to connect to that data and offer it as a service. So I can think about copying things somewhere else between regions. But copying also protects against just a local disk failure: instead of having one copy of my data in that storage, the storage can keep multiple copies over different physical disks and different back-end storage nodes, so it's resilient against a failure even within that data center.

The second category is previous point-in-time copies of my data. Replication keeps me resilient, but if I change the data in some way, I also want copies of how my data used to look. Think about a corruption: if I have some logical corruption, or an attack that changed my data, replication doesn't protect me from that. If there's some mistake in my logic and I break the data, or there's an attack, it will replicate to all of the copies of the data and they're all corrupt. So I think about point-in-time copies, or backups. A backup could be to a different location entirely: I might have a recovery vault where I take the data and then take the deltas, just the changes, so I have a point in time from a day ago, two days ago. I have these delta-based copies of the changes, but I can put the deltas together and get to any point in time. Now, if there was a corruption of some sort, it gets encrypted or I mangle it, I can go back to two days ago and bring the data back.

So they're not interchangeable. Replication is great if there's a failure: a disk fails, a node fails, hey, I've got other copies close by; something fails at a regional level, well, I've got a copy somewhere else. If there's some kind of corruption, or a regulatory need to go back and show the data from two years ago, that's where these point in times are really powerful and useful. So I can mitigate the various types of problem depending on what I'm trying to protect against. Hardware failure or software failure: probably replication of some kind. Corruption at a logical level: I'm thinking point-in-time backups. Regulatory requirements: probably backups. Regulatory requirements around "you can't be on the same box as someone else": that's where we use things like Azure Dedicated Host, where I get the entire box and no one else can be on it. Often you're going to have both; very rarely is it "should I do replication or backup". It's probably: for resiliency purposes I need replication, maybe within my region and between regions, and I want backups as well so I can go back to previous points in time. I'm going to talk more about these.

So what can we do? It really just breaks down to: we copy it somewhere else, that's replication, or we keep previous point-in-time copies, and that's backing the thing up. Now, when I say backup, that can also mean snapshots. Some services have the ability to snapshot, like blob, for example, or Azure Files: they can snapshot at the same location as the source material and store point-in-time views. Blob, for example, has things like a change feed, soft delete, and snapshots, so I can roll back to previous points in time. The big difference is that a backup is normally to some other location than the primary source of the data, while a snapshot typically lives on the same media, the same medium, as the primary, because a snapshot normally just stores the delta, the changes; it change-tracks the data so I can roll back. So a snapshot wouldn't necessarily protect me if the entire storage medium I'm using failed, whereas with a backup vault that's on a separate set of storage, hey, I'd be able to go and roll that back. And very often I want the backup at a different physical location than the source, so it's very common to geo-replicate: I'll take the storage account hosting my backup and replicate it to a paired region hundreds of miles away, so in the worst case, if something happened, I still have protection.

As I mentioned, they're not really interchangeable; they're attacking different types of problems. Hardware failure: probably replication. Software failure: probably replication. Additionally, a software failure at the fabric level would mean replication too, but typically replicating between regions as well, because Microsoft has these pairings of regions, which we're going to talk about, and for the Azure pairs they will
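The delta-based point-in-time idea can be sketched in a few lines of Python. This is a toy model, not how any Azure service is implemented: each "recovery point" stores only the keys that changed, and restoring replays the deltas up to the chosen point.

```python
# Toy point-in-time store: a full base copy plus per-day deltas.
# Restoring to point N replays deltas 1..N over the base copy.
def restore(base, deltas, point):
    state = dict(base)
    for delta in deltas[:point]:
        state.update(delta)          # apply only the keys that changed
    return state

base = {"orders": 100, "customers": 10}
deltas = [
    {"orders": 120},                 # day 1: orders changed
    {"orders": 0, "customers": 0},   # day 2: a corruption wiped the data
]

today = restore(base, deltas, 2)     # the corrupted latest view
yesterday = restore(base, deltas, 1) # roll back to before the corruption
```

Restoring to point 1 recovers the pre-corruption data even though the latest copy is mangled, which is exactly the property replication alone cannot give you: replication would have faithfully copied the corruption everywhere.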
never roll out the same update to both paired regions. So if something went wrong when it rolled out to one of the regions, I could fail over to the other region, which would not have got that fabric-level update; maybe it was making some core change to some API, but it wouldn't break the other one. Corruption: probably backups and snapshots, the ability to roll back in time. Attacks: there's such a wide range of types of attack. If someone's attacking me and encrypting, like WannaCry, they're pretty sophisticated; they might go and look for backups and try to delete them, so maybe I want some kind of isolated export, one of those immutable copies. Blob can do things like this: I can make it immutable so I can't change it, so I might keep those immutable copies of my data. Also things like distributed denial of service: Azure has protections there, a free basic tier for any public IP, and a standard SKU that gives me more control over the policies of when that protection kicks in. So yes, there's replication, yes, there's backup, but for those types of threats there are other things I want to do to mitigate a denial-of-service attack; if someone's just hammering my site there are maybe scale options, but we want to stop it at the source. Regulatory requirements: probably backup, in terms of keeping data for X amount of time as it was. If it's about how you cannot be on the same box as another person, that's things like Azure Dedicated Host, where I get the entire box to myself and can fill it up with virtual machines that I am using.

Now, before I can think about protecting, I have to know what I have and what it depends on. Understanding the systems that are key to my business is critical, as those are what I have to protect. I might have tons of systems in my company; I don't need to protect them all, I need to protect the ones that are part of my line of business, the ones that enable me to do business and earn money and not go out of business. Many are made up of tiers, and with these tiered applications I really have to understand where the state is. What I mean by that is: sure, I might have lots of compute. Say, for example, I have four web servers, all running maybe IIS, or it could be Tomcat, it doesn't matter. They're the initial entry point into my company; there's some kind of load balancer at the front that has the public IP address and distributes the various requests. That's the front-end tier, as you'll often hear it referred to. They might then talk to some middle tier, a set of processes, maybe containers, something else. And at the back end there's probably some kind of database. This is likely where the state is, and state is the data: the things that change, the things I actually care about. So if that's the back end, the database tier, in terms of what I have to protect the most, it's this. This is the thing I need to be replicating and backing up and putting somewhere else. Because if the other tiers all failed, if there's no state, nothing I really care about in them, I can just regenerate them with infrastructure as code; I can redeploy that stuff. What I have to care about is where my data is. That is what I focus on replicating and backing up, because it's something I can't recreate if everything got lost. I can recreate things that don't have any state: the web servers just receive things and pass requests to the middle tier, the middle tier does some logic that stores in and retrieves from the database; that's the important part. The stateless tiers I can recreate: they're maybe virtual machines or containers, and there's a Shared Image Gallery with the image I use for the front end, or an Azure Container Registry or something else that generates the containers. That's all fine as long as those images are replicated somewhere else, which is super easy. I can regenerate all of that stuff; I can't regenerate data. So I have to understand where the state is in my application and what I really care about.

I also have to understand the services it relies on, because they have to be protected as well. If you went back to that original picture, you can imagine: yes, I'm drawing my system, but maybe there's, I don't know, Active Directory that it uses for auth and various things. If AD wasn't available, none of this would function. So I have to understand the various dependencies, so that when I'm working out what I want to protect, I'm protecting all of the right things. I should understand the nice-to-haves too: sure, there might be a time-tracking app I use in my company, but if there was a disaster and I wasn't tracking what I spent eight hours on today, is that a problem? Probably not. How do I access these things, though? In a disaster, do I need to think about VPN capabilities, maybe remote desktops? COVID has been super interesting: with people working from home, companies very, very rapidly had to work out how to let people work remotely. So how would I support that? Remote desktop farms, Windows Virtual Desktop in an Azure scenario, point-to-site VPN solutions. I have to think about how I'm going to get to my services, so this all comes down to understanding the access required and how I would enable it.

Now, you may look at these things and say, "How do I work out dependencies? That's super hard." There are several things to help you. Microsoft has something called Azure Migrate. The word "migrate" suggests moving to Azure, but Azure Migrate is free, and as part of it I can discover information about my systems. It has something called Service Map; Service Map, through the various ports and protocols used, works out the dependencies between different systems. So I could run Azure Migrate and actually work out "this depends on this and talks to this", and it will help me. Azure Monitor actually uses that as well: Azure Monitor has Service Map. If I was using Application Insights, it has Application Map; Application Map looks at the DLLs the app is using and the calls it's making to work out dependencies. There are many things that can help me work out what's talking to what. So I need to work out what I care about, where my state is, and what it depends upon.

Now, before I go any further, I want to quickly cover asynchronous versus synchronous replication, because this is super important to understand. Asynchronous is essentially: I have my primary copy, and as soon as I write to that primary copy, I acknowledge the transaction. In data systems, when a write is performed, it gets acknowledged. So I have my primary copy, which is writable, and some client app does a transaction, a write operation, that writes some piece of data. There's also a secondary; there might be multiple secondaries, but let's say there's one. In an asynchronous scenario, step one is "hey, I want to do this transaction": it does the write, and then straight away, step two, it acknowledges it, so the app can carry on working. In the background it then copies the write over to the secondary. What this means is there's minimal, essentially zero, impact on the original app: it gets that acknowledgement straight away. But the copy to the other location doesn't happen straight away, so there's a potential that, if there was a problem at the primary location, I might lose that piece of data. Why it's really useful: imagine a large distance between the copies. Over a big distance, with the speed of light, and not even just the speed of light over fiber but the routing as well, you get latency. If the copies are hundreds of miles apart, that distance might take, let's just say, 30 milliseconds, and I can't wait that long to acknowledge the transaction. So if I'm trying to replicate across distance, asynchronous is the good option, because it's not going to impact the performance of my app, and hopefully the data gets there pretty quickly. So that's asynchronous: no impact on the primary's performance, but there is a risk of data loss.

Synchronous is: I don't acknowledge the transaction until it's written to the secondary. This can impact the primary's performance, because now, before the acknowledgement goes back to the application, it copies the data to the secondary, and only after it's been written there and confirmed does it acknowledge back to the application. So with synchronous replication, I write to the copy, and only once it's confirmed at the copy do I tell the app it's actually been committed. This is good if the distance is tiny, so there's no latency impacting performance, and there's no risk of losing data, because I don't acknowledge the write until I've written it to the copy as well. But it really only works if the distance is small, i.e. it's the same location. So what you'll see is: we often use synchronous replication within a location, because the latency is sub-millisecond, and asynchronous between locations, because we can't wait 30 milliseconds on every single transaction we perform. So we have those two types: async between sites, between locations; synchronous within a location. Just important to understand those concepts.
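The acknowledgement timing above can be sketched as a tiny Python model. The latency numbers are illustrative assumptions, not Azure figures; the point is only where the replica round-trip lands relative to the ack.

```python
# Toy model of when a write is acknowledged (all latencies in ms).
LOCAL_WRITE = 0.5          # write to the primary's local storage
REPLICA_RTT = {            # illustrative round-trip times to the secondary
    "same_zone": 0.8,
    "paired_region": 30.0,
}

def ack_latency(mode, link):
    # Async: acknowledge after the local write; the replica copy
    # happens later in the background (hence the data-loss window).
    if mode == "async":
        return LOCAL_WRITE
    # Sync: acknowledge only after the secondary confirms the write too,
    # so the replica round-trip is added to every transaction.
    return LOCAL_WRITE + REPLICA_RTT[link]

print(ack_latency("sync", "same_zone"))      # cheap within a location
print(ack_latency("sync", "paired_region"))  # 30 ms added to every write
print(ack_latency("async", "paired_region")) # no impact on the primary
```

This is why the pattern is synchronous within a location and asynchronous between locations: the sync penalty is negligible at sub-millisecond distances and prohibitive at hundreds of miles.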
So now let's talk about the constructs in Azure that I'm really going to focus on when I think about resiliency. The first one is a fault domain. We talk about this cloud service, but there really is no magic: fundamentally, Azure is made up of racks of servers. I have individual nodes in a particular rack, and I can think of a certain rack as a fault domain. Let's say this is fault domain 0, then another rack, again filled up with servers, is fault domain 1, and another rack is fault domain 2 (there are obviously more than three, but I can't be bothered to draw more than that). These racks have things like a top-of-rack switch and power supply units, so within a fault domain individual nodes could fail, or a whole rack could fail because of one of those shared components, like the power supply unit or the top-of-rack switch. When I deploy, fundamentally my resource gets deployed to a certain fault domain. Thinking about blast radius, the things that can go wrong: if my fault domain had some failure, it would not impact things in different fault domains. So at a minimum I want to distribute my workloads over multiple fault domains, and this is where availability sets come in.

An availability set gives me an SLA of 99.95%. I create an availability set within a certain region, say availability set iis-1 for example, and I then just add resources, virtual machines, into that availability set. What it does is install them in a round-robin manner over typically three fault domains (sometimes there are two, but normally you're going to have three). If I keep installing, it keeps round-robin distributing them over the three different fault domains. I don't tell it which fault domain; I just deploy into the availability set, and it automatically distributes the VMs evenly between the three racks it's been allocated. These three racks are in the same region, generally in the same facility, though that's not guaranteed, because there's some stuff that goes on behind the scenes. So now I've got protection: if a rack failed, two-thirds of my workload would not be impacted, because the instances are in different fault domains. They're in the same facility, though, so if the entire facility, as a blast radius, went down, they're all out; but at a node or rack level, I'm protected.

There's also something within the availability set, which applies to other things too, called update domains, and typically we can have from five to twenty of them. So I might be in three fault domains, but I might have update domains 1 through 5, five different update domains. Update domains are used when the fabric, Azure, has to update the host with a new version of the OS. It doesn't patch; it just reboots to a new version, and that update domain is paused for seconds, it's super fast. It uses something called a VM-preserving host update: it pauses the VM and the host reboots to a new version of the OS. This is the fabric, not your VM; Azure is not patching your VM, this is the host that's running your VM. It boots to a new version of the image and then unfreezes, so for that period of time your VM is essentially paused. It doesn't move between nodes; it just pauses. This is why we need multiple instances of workloads: during maintenance, my service is still available on the other four-fifths, or whatever the percentage would be if I had more update domains. Those update domains are also used if I'm rolling out a new version of my own application living in there. But from a resiliency perspective, an availability set gives me 99.95% because it's distributing my workloads over three racks within that facility.
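The round-robin placement described above can be sketched as a few lines of Python. This is a simplified mental model, an assumption about typical counts, not the fabric's real allocator: instance i lands in fault domain i mod 3 and update domain i mod 5.

```python
# Toy round-robin placement for an availability set: VM i lands in
# fault domain i % FD_COUNT and update domain i % UD_COUNT.
FD_COUNT, UD_COUNT = 3, 5   # typical defaults; real counts vary by region

def place(vm_count):
    return [(i % FD_COUNT, i % UD_COUNT) for i in range(vm_count)]

for i, (fd, ud) in enumerate(place(6)):
    print(f"vm{i}: fault domain {fd}, update domain {ud}")
```

With six VMs, a rack (fault domain) failure takes out at most a third of them, and a host update pauses at most the VMs sharing one update domain, which is the whole point of spreading instances this way.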
Okay, the next construct is an availability zone, which gives us a 99.99% SLA. I drew those racks, but the reality is they have to exist somewhere: fundamentally they live within a certain building. So there's a facility, another facility with another bunch of racks, and another facility with its bunch of racks, and these are exposed as AZ1, AZ2 and AZ3. All of these AZs are within the same region, which we just call, for example, East US. A region is roughly a two-millisecond latency envelope, so all those AZs sit within a two-millisecond latency envelope. Now, an AZ is logical. What I mean by that is there isn't a building called AZ1 or AZ2 or AZ3. What an availability zone gives me is independent cooling, power and communications from any other AZ, so if there was a failure of power or communications at the physical facility behind one AZ, it should not impact AZ2 or AZ3. My blast radius, which with availability sets was a fault domain, a rack, is now a physical facility. An AZ may be made up of two data centers, or one data center, it varies, but a different AZ will be a separate facility with that independent cooling, power and comms. So if I deploy my services to three different AZs, I'm protected from a physical failure of any particular facility. I said they were logical, and the reason is that AZ1, 2 and 3 are per subscription. If I had two subscriptions, AZ1 in my first subscription might be AZ3 in my second subscription, or not mapped at all. There might be 16 different facilities in a particular region; my subscription just gets mapped to three of them, and a different subscription may be mapped to an entirely different three. So there's no correlation between AZs across subscriptions; they're just a construct to guarantee I have resiliency from any particular facility's power, communications or cooling failure.

So now I deploy my resources (they're still racks and nodes, I'm just not drawing them): a VM here, a VM here, and one up here as well, distributed over three different availability zones. The update concepts still apply, Azure is still rolling out changes, but now if one facility went kablooey, a power failure, whatever, two-thirds of my stuff is still running. Now, sets and zones behave slightly differently. With an availability set, all I do is deploy to the set and Azure automatically distributes the VMs between fault domains. With availability zones, I say deploy this to AZ1, this to AZ2, this to AZ3: I'm picking them. That's for plain virtual machines; something like a virtual machine scale set, or anything with zone-redundancy support, will distribute the instances for me. That's the big difference: for a plain virtual machine I actually have to pick the AZ. I can show that if we jump over to the portal: if I add a virtual machine, I can see the availability options. Currently it's "no infrastructure redundancy required", and I can select an availability set or, if the region supports it, an availability zone. Let's change the region to East US 2, and now I can pick an availability zone: 1, 2 or 3, choosing which one I deploy to. And note, not every region supports availability zones today; they're rolling it out to more and more, but not every region, so you'd want to check. The list of Azure regions shows which ones support availability zones. If I picked an availability set instead, I just pick the set; I don't pick which fault domain.
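The zonal deployment picked in the portal above can also be done from the Azure CLI. This is a rough sketch: rg-demo and the VM names are placeholder names I've made up, and you'd want to confirm the chosen region and VM size actually support zones before running it.

```shell
# Create a resource group in a zone-capable region,
# then deploy one VM into each of the three availability zones.
# All resource names here are placeholders.
az group create --name rg-demo --location eastus2

for zone in 1 2 3; do
  az vm create \
    --resource-group rg-demo \
    --name "vm-web-$zone" \
    --image Ubuntu2204 \
    --zone "$zone" \
    --admin-username azureuser \
    --generate-ssh-keys
done
```

With a plain VM you pick the zone yourself, as above; a scale set with zone redundancy would spread the instances across zones for you.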
If I go and look at an availability set, it would show me "this VM is fault domain 0 and update domain 2", etc., but I don't pick those; I just say deploy it to the set. Notice I can't pick both: I can't pick a set and a zone, it's one or the other in my deployment. So those are the constructs; obviously a zone gives me a better SLA than a set, so if I can, I want to use availability zones.

Then we have regions. Remember, I can use multiple regions in my subscription; in the commercial cloud there are loads of regions. So when I think about resiliency: sure, availability zones give me this great 99.99% SLA, but if there's some big natural disaster and the region goes out, well, there are lots of other regions. Maybe this region is East US and that one is West US, and I would use the same constructs over there, AZs or availability sets for that region, and have my service running there as well, maybe in a DR capacity where it would fail over. I'd have to make sure my data, my state, was being replicated; remember, I care about those things. There are other challenges to geographically distributing, which we're going to talk about, but now I've deployed to another region, and that protects against a regional-level failure. The point is that the regions are hundreds of miles apart; if both of them are down because of some natural disaster, the chances are we just don't care anymore anyway. Grim, but that's the reality of it. There are pairings built into Azure, and again I'm going to talk more about them, but these are the key constructs: within a region, make sure I'm using availability sets or, preferably, zones; and make sure I'm also deploying to at least two regions so I can survive any kind of regional-level problem.
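A rough way to see why multiple instances and zones raise availability: if each instance is independently up with probability a, at least one of n survives with probability 1 - (1 - a)^n. This is back-of-envelope math under an independence assumption; real Azure SLAs are contractual figures, not statistical guarantees.

```python
def composite(a, n):
    """Chance that at least one of n independent instances is up."""
    return 1 - (1 - a) ** n

print(composite(0.9995, 2))   # two instances spread by an availability set
print(composite(0.9999, 2))   # two zone-redundant deployments
print(composite(0.9999, 1))   # a single zonal deployment
```

The compounding is why "two of anything" beats one of almost anything: even modest per-instance availability climbs quickly once failures are independent, which is exactly what fault domains, zones and regions are engineered to provide.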
Now, you may have workloads where I can't have two instances. All of my pretty pictures depend on being able to have more than one instance: more than one to spread over fault domains, more than one to deploy over availability zones. If I have some older workload that's single-instance, I can't run two, and if I can't run two, I can't distribute over availability sets or availability zones.

And a really important point I meant to stress: do not mix workloads in an availability set. An availability set just blindly does that round robin, so if I mix different workloads in the same availability set, through sheer bad luck all of the IIS servers may end up on one fault domain, all the domain controllers on another, and all the SQL ones on the third. I create an availability set for every individual workload: IIS app 1 gets its availability set, IIS app 2, a different bunch of servers, gets a different availability set, the SQL servers get a different availability set, the domain controllers a different availability set. Never, ever mix workloads; I want that distribution for the instances of each particular service. That's super important.

But if I do have those single-instance workloads, there are still SLAs. Here I'm looking at the page, and you can see: for a single-instance workload, as long as I'm using a premium SSD, I get an SLA of 99.9%; with a standard SSD, 99.5%; with a standard hard disk drive, a 95% SLA. So there are still SLAs even for a single instance. Realistically you'd use premium or ultra disk, which gets you the 99.9%; at least that's three nines, but ideally you want two or more instances.

There is also something called proximity placement groups. This is not a resiliency construct; it's kind of the opposite. A proximity placement group is designed to keep things close together, and I can use it with availability sets or availability zones. The point of a proximity placement group is to keep things as close together as possible for low latency. So the first thing I do is create a proximity placement group, say ppg-1. The first resource I put in it, say it lands in AZ2, pins the PPG there, and Azure will try to keep everything as physically close as it can, so everything else I put in it will be super close in terms of latency. These could be VMs of different types, from different clusters that support different types of VM. I put my most exotic, biggest virtual machine in first, to make sure my proximity placement group gets pinned to a zone and facility that supports that exotic type of VM: if I'm using a G-series or an NV, that goes in first, and then I put in my Ds and my Es and my other stuff. So the proximity placement group essentially gets pinned to the AZ of the first resource I put in it. And there's a cheat here: if I create a proximity placement group, put a resource in it in a particular AZ, and then create another resource in an availability set in that PPG, it pins that availability set there as well, which means I'm putting an availability set in an availability zone, which is normally not possible. So using proximity placement groups I can essentially create an availability set in a particular availability zone. But the PPG, the proximity placement group, is not a resiliency construct; I use it to put things really close together. I can use it with virtual machines: if I go back to creating a virtual machine, and again switch to a region that supports availability zones and say AZ2, under Advanced we see proximity placement groups. I don't have any, but I can go and create one and put the VM in there. That pins the proximity placement group to that particular AZ, and anything else I put in the proximity placement group, including an availability set, would now be in that AZ as well.
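The proximity placement group flow described above looks roughly like this in the Azure CLI. Everything here is illustrative: the group, VM and set names are made up, and the size and zone are assumptions; the key idea is that the first resource placed pins the PPG.

```shell
# Create the PPG; it isn't pinned anywhere until the first resource lands.
az ppg create --resource-group rg-demo --name ppg-1 --location eastus2

# Put the most "exotic" VM in first, in a specific zone, to pin the PPG
# to a facility that can host that VM family.
az vm create --resource-group rg-demo --name vm-big-1 \
  --image Ubuntu2204 --size Standard_E8s_v5 \
  --zone 2 --ppg ppg-1 \
  --admin-username azureuser --generate-ssh-keys

# An availability set created in the same PPG now effectively lives in
# that pinned zone as well: the "cheat" mentioned above.
az vm availability-set create --resource-group rg-demo \
  --name avset-iis-1 --ppg ppg-1
```

Remember the trade-off: the PPG buys latency, not resiliency; everything in it is deliberately packed into one small blast radius.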
be in that az as well, so cool stuff. those are the key constructs, that's super important to understand, they are the things we care about. okay so what supports these things, what supports availability sets and fault domains and availability zones and regions? some azure services are global and they're resilient against any particular regional failure, things like azure active directory, azure traffic manager, azure front door, the azure global load balancer. those things are global, i'm not worried about protecting them from a regional level failure, that's not my problem, that's azure's problem. most are regional, most when i deploy a vm or virtual machine scale set or an app service plan or a serverless function or a standard load balancer or really any storage account, you name it, normally i'm picking a region, so i'm deploying it to a particular region. now when i deploy it to that region, different services will have different levels of support for those constructs like availability zones that i just mentioned. some of them are regional only, what that means is if i say hey i want to deploy it to this region, i have zero knowledge of what facility it's putting it in, it's not letting me pick an availability zone, it's just deploying it somewhere, but you can be pretty safe betting it's in a particular facility, so if something happens to that facility it will go down. some of them will say zone redundant, now in this model it's deploying multiple instances of itself across different availability zones, so now if a particular facility failed my service would be resilient. some are zonal, zonal means it's only deploying into one availability zone but it's going to let me pick the availability zone so i can then architect accordingly, say okay well i'm going to deploy these services to az1, these to az2, these to az3. like nat gateway, vm scale sets, i could deploy to a particular az or zone redundant, depends on what model i want to use. i can actually look at this and it will give me an idea based on the
current services showing me what they support. so here we can see for these key services it's saying z for zonal, i.e. i can pick the availability zone, or zr for zone redundant, i.e. it's going to automatically distribute among the different availability zones. now one thing i would say, some of this is a little bit out of date, like aks does have a zone redundant option, so just go and check on a per service level, but my point is different services have different options of what they support, so it's important to understand those when we think about our architecture, because what i don't want to do is mix them. if i'm architecting something, i think okay i have as building blocks availability zones, so i've got these let's say three availability zones, so i've got my az1, 2 and 3. i do a zonal deployment, i say okay i'm deploying this aks cluster 1 here, aks cluster 2 here and aks cluster 3 here, so that's a zonal deployment of aks, perfect. and then i'm going to deploy an azure standard load balancer, so that is zone redundant, so i'm good there as well, that's protected against any zone failure, that's going to balance between these different services, all good. then i do something silly and i make these all depend on maybe something that's just in one of the zones, or something that's regional that i don't even know where it is, but they all depend on it. now if that goes down my brilliant highly available architecture will now fail, i've put in some dependency on something that doesn't align with my zone redundancy. so if i was using azure storage, well i would make sure when i create my storage account, my storage account there has a zrs option, it's zone redundant storage, it's protected from any kind of zone failure, you see how i'm mapping these things. imagine they're using nat gateway for external network access, nat gateway does not have a zone redundant option, it's regional, which i would not want to do here because i can't control it, or zonal.
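That zone-alignment rule — a zonal component should only depend on something that is zone redundant or in its own zone — can be sketched as a quick dependency check. This is a conceptual illustration in Python with made-up component names, not an Azure API:

```python
# Conceptual sketch: catching dependencies that undermine zone resiliency.
# Models, loosely following the video: "regional" (unknown placement),
# "zonal" (pinned to one AZ), "zone_redundant" (spans AZs).
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    model: str                  # "regional" | "zonal" | "zone_redundant"
    zone: int | None = None     # only meaningful for zonal components
    depends_on: list[Component] = field(default_factory=list)

def misaligned_dependencies(c: Component) -> list[str]:
    """Flag dependencies on regional things (unknown zone) or on zonal
    things pinned to a *different* zone."""
    problems = []
    for dep in c.depends_on:
        if dep.model == "zone_redundant":
            continue  # fine: survives any single-zone failure
        if dep.model == "regional":
            problems.append(f"{c.name} -> {dep.name}: unknown placement")
        elif c.model == "zonal" and dep.zone != c.zone:
            problems.append(f"{c.name} (az{c.zone}) -> {dep.name} (az{dep.zone})")
    return problems

# hypothetical names for illustration
zrs = Component("storage-zrs", "zone_redundant")
nat2 = Component("nat-az2", "zonal", zone=2)
aks1 = Component("aks-az1", "zonal", zone=1, depends_on=[zrs, nat2])
print(misaligned_dependencies(aks1))  # flags only the cross-zone NAT dependency
```

Depending on zone-redundant storage passes the check; depending on a NAT gateway in a different zone gets flagged, which is exactly the mistake being warned about.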
what i would have to do here, i would deploy nat gateway in each of them, however nat gateway works by assigning it to a subnet, so now in my architecture i'd need to make sure i'm using different subnets. this would be more important if i couldn't do an aks that's zone redundant, so instead of all deployed to the same subnet, i would have to do three separate aks, one per zone, because now this aks can go to subnet one, this aks could go to subnet two, this aks would go to subnet three, making sure the only things in subnet three are in az3, the only things in subnet two are in zone two, and now i could map each to its corresponding zonal nat gateway. so do not mix and match. it's okay for something that's zonal to depend on something that's zone redundant, that's resilient, but don't make something in one zone depend on something in a different zone that's zonal, like don't make all three of these use the same nat gateway in zone two, if zone two goes down these don't function anymore. don't make them dependent on something that's just regional, i have no clue where it's running. so i have to understand what are the options of the components i'm using and architect accordingly to make sure that architecture is actually resilient. so here i'm looking at standard load balancer, hey, i'll deploy in zone redundant mode, hey, the storage account is using zone redundant storage, it's distributed over three zones. as much as possible, if i can use paas services, use paas services, there's just less for me to manage, less for me to worry about. if this was a database, maybe it's azure sql database, i would use the zone redundant option, just announced at ignite, now general purpose supports zone redundant. if it was maybe postgres i could use the flexible server and have a replica in another zone. i can do those things, but make sure i understand every component and don't put in some reliance on something that's in a different zone or not zonal at all, i'm making
myself weaker. i want to keep those components together, so make sure all the components match and make sure we do spread out different instances so i am resilient against any az failure for example. what about multi-region? okay so great, i'm talking about within the region i'm doing these things, but as i said, a natural disaster, big tornado comes through and it wipes out all the data centers, or it's a big electrical storm and power is lost at all of those data centers, what do i do then? so for true resiliency i have to deploy to at least two regions. there is no concept of hey, i'm in availability zones, i'm protected. there is no minimum distance guarantee for availability zones, most likely they can see each other, they're super close together, so there is no distance guarantee, if you've heard there is, there isn't. azure regions are paired, now this pairing is visible in some services. azure storage, if i pick the geo-redundant option, it's using these pairings and i can't change them for how it replicates the data. azure key vault uses these pairings to have a replica of my key vault content. so if we look at these pairings, these are super important for a number of regions, so i can actually go and see which regions are paired, never outside of the geopolitical boundary, well that's a lie, there is one exception, it's brazil. so if i go and look at this list, if we look at brazil, well brazil replicates to south central us, it's because there's only one region in brazil so it has to replicate outside its geopolitical boundary. south central us does not replicate to brazil, if we go and look we'll see down the bottom here north central us replicates with south central us, so things do not replicate outside of america to brazil, it always tries to keep it in the geopolitical boundary, so korea to korea, japan to japan, india to india, germany to germany, france to france, etc. so they have these regional pairs for the built-in replication, and it does more than just that, so these pairs
are designed to, yes, keep the services hundreds of miles away, so we go back to this picture, the microsoft pairing guarantees that kind of distance within the geopolitical boundary, but what it also does is, if you think about when azure rolls out updates, it will never roll out the same kind of fabric update to the paired regions at the same time, it will do it in series, so in this example east us, make sure it's healthy, make sure it's not causing a problem, and then roll it out to west us. so by using the microsoft pairings, it gives me protection from some update that azure makes from impacting my app. additionally, if there's a problem, they're going to focus on getting the paired region up as quick as possible, so i'm guaranteed to at least have some of my workloads. then there is a counter argument to using the pairings, and it's, if there was a real issue, you just have to think of capacity, everyone's failing over to the same region. so in my eyes i definitely want to use this region, but if i can i'd like to be having another region as well, just so i'm kind of hedging my bets, and many services actually let me pick what regions i want to replicate to. so yes, some of them like azure storage have paired regions, key vault has paired regions, but even azure storage now has blob level replication that i can configure, i could actually say from blob storage, hey, i've got my storage account and my storage account has container one and it has container two, there's a bunch of blobs in there, i can have a storage account over here in a different region and another storage account over here in a different region, and container one could replicate over to here, container two could replicate over here, it's asynchronous, it's different regions, but i can do that now, blob has that capability. azure sql database, i can pick the regions to replicate to, postgres, pretty much all of those sort of database services let me pick the regions to replicate. so when i think
about the state, i try and keep state out of the compute layer, i don't want it in my aks environment, i don't want it in my app service, i don't want it in my vms, as much as possible get it into the database and then let the paas database services, if i can, handle that regional replication and use the azs, and even if i can't, if it's in vms, i have that control of how it's replicating. but yes, those paired regions are useful in that they're going to do serial updates to the azure fabric. now this could be active passive or it could be active active, and this is an interesting one when we think about multi-region. if i think historically on premises, if you thought about dr you had your primary facility where you were doing stuff and then you had kind of a dr facility where you hoped you never ever ever had to do stuff. this is where we kind of ran all of the resources, we had some kind of replication, and we had very little confidence this thing would actually work, but this was active, all of the work was going on here, let's zoom back in again, so this was the active one, replicated here, there was some huge disaster, then we would fail over, we'd start things up, a bunch of things wouldn't work very well, but we did active passive because maybe this was like a warm standby or a cold standby, we weren't paying a lot of money for it, so active active wasn't really even an option. now in the cloud there are all these regions, so i've got a region here and there's another region over here, i can just as easily run six instances of something here as i could run three here and three over here, the cost would be the same, paying for three vms or three worker nodes in my node pool or whatever that might be, it hasn't cost me any more money. the benefit here is i've got the same scale, i've still got six instances, i've now got even better protection because instead of having six spread over availability zones or availability sets, now i've got three over three
availability zones here and three in another region. also my users, all my users might be here and here and here, so then not only have i got better resiliency because i've got regions built in, but now they're going to get lower latency, they can go to the ones that are closest to them, now the geo balancing solution, i'm going to talk about that, but this is a nice option, i like this. so why wouldn't i do this? and it really comes down to that state, the data. if you think about this model for a second, and let's assume that there is a database and this is where all of the data is, and let's say i'm going to replicate to a secondary, so i'm going to do some kind of replication, but remember between regions it's generally asynchronous, and in any kind of relational model the way it would work is this one would be writable, this one would only be readable. so if i had my compute running in both of them, the way it would have to work is these here, great, they can read and write, these ones here, well, they could read from this but they would have to go and do all their writes over to that one. so how many writes does the app do, how latency sensitive is it, could it support 30, 50 milliseconds of latency every time you did a write? if it's read heavy maybe this is fine. also it has to guarantee it got the most recent write, so if it wrote over here, if you tried to read it back straight away you'd have to make sure it read it from here as well, because this one's out of date, it doesn't have it yet, that asynchronous replication hasn't put it there. so i have to understand my app, i have to understand its tolerance, and can i even support these types of architectures? so the active active model is great if i can handle the data, because typically only one of them is writable. now there are features in azure sql database and other databases that give me listener pointers, so i know which one is writable, if this went down this would take over and become the primary and now it's writable, so these
would now point to this to go and do writes, but it's really my data that's going to drive can i run active active. there are other options though, these are kind of relational, azure sql database, postgres, but there are solutions that enable multi-write in an azure world, ideally we would get to cosmos db. cosmos db is kind of a nosql, not only sql, it has different consistency models, and as long as it's not strong consistency i can write to any region i've set up a replica of my cosmos db database in. now the consistency is commonly session, so what that means is these would all share a session, they would read and write from their copy and then asynchronously, as quick as it can, it would replicate it over to other regions. that would work if the app only had to guarantee it was getting the latest write from things in its location but could eventually catch up to other locations. so if i can switch to cosmos db i can get to a multi-master level, they'd be reading and writing to their local copy, they'd be reading and writing to their local copy, but that's generally going to mean some app changes, and for most companies they're not there. so the decision to go active active depends on this, how can my app behave, can it support only one writable copy in a different region but maybe just get my reads from here? but it is now an option, it's generally not going to cost me any more money, it just depends on my app and my data model, if i can support those kinds of active active scenarios. so cosmos db would give me multi-write capabilities, but again it's not a relational database, i'm going to change the models of how i'm doing things. i need to make sure all of the core elements are available in all the regions, again dependencies, it's no good replicating my data, my state, and then forgetting about some key dependency, think about what i mentioned earlier. and think about what i said about state, so in this model, let's draw
this again, let's imagine for a second my app is fairly simple, it could be multi-tier, whatever, let's imagine for example i'm running active passive and i have a whole bunch of iis web front ends, this could be like a vm scale set even, and then i've got my database. okay, now in another region i have the copy of my database and it's asynchronously replicating to it, but this is active, this is passive, nothing's going to this other location. if these are stateless, be it vms, virtual machine scale sets, aks, app service plan, why would i bother having them running over here on the passive one? i'm just paying money for no good reason. if these are any kind of stateless and i'm using a declarative technology to create them, infrastructure as code, why bother replicating them over here at this point and having them running and paying for the compute charges of these things? the database is replicating. if they're built off of some image, they're virtual machines, then within this region i'm going to have kind of this shared image gallery, that's where my custom image is, if it's the microsoft provided images they're already replicated everywhere, maybe it's just a repo with my powershell dsc or my puppet or my chef to configure them, maybe my app code is in a container or something, i just have to make sure it's replicated, there's a copy of my gallery or my repo in the other region. things like github automatically replicate my content, if it was a shared image gallery i can set replicas, i need a replica of my gallery over here as well, or there's azure container registry, if they're containers i can have a repo over here. so now if there's a disaster it would go and create them at the time of the disaster and connect to the data that's being replicated. i don't need to pay to have the compute sitting there doing nothing, as long as it's stateless and i have the right declarative configurations i can build them
out super easily. why bother replicating vms that have no state, why have them sitting there, create them at the time of the disaster. it may take a few minutes extra to spin those resources up, chances are in a true disaster you have a few minutes. you're going to have things like recovery point objectives and recovery time objectives, so when you talk about those things, recovery point objective, the recovery point is how much data can i lose, because it's asynchronous i may lose stuff, rto is how long until i'm up and running. so if i have an rto of two hours, in a true disaster i have two hours to get this up and running, recovery point objective, i can lose five minutes, so it's async, it's a true disaster, it was unplanned, i didn't see it coming, i can lose five minutes of data because it's async, it just went kablooey. sure i may lose data, it's unavoidable, in any kind of unplanned disaster failover, if a site just disappears, i'm going to lose data because it's asynchronously replicating. if you have workloads where you cannot lose data, it's that critical, then there's some kind of synchronous replication that has to be going on, so you're going to pay a cost in terms of the impact on the app performance, because you cannot commit until you know it's in the other location as well, so you're willing to take the hit on the app performance to get synchronous replication between the regions. but understand these rpo, rto constructs because they can actually impact how i'm going to do those things. but again, if there's no state, create them at the time of failover, just make sure you have the things required to create them, i have the image that i create them off of, i have the repository that has the app code, that has the declarative configuration, i have the container registry that has the container image, make sure everything is there that you depend on in the event of the disaster. make sure there's the ability for people to connect and use it, make those vpn connections, maybe it's windows virtual desktop, think of everything i need in the
other region to actually be able to use it. some resources cannot move between regions, e.g. public ips, public ips are regional, i cannot move a public ip from one region to another, so if i failed over it's going to have a new public ip. so we would have to hide that, we would have to have something on top of the public ip so it's invisible, maybe it's a dns name, but there's a dns timeout, dns records live for an amount of time, or there's a different ip that's global that can point to different ips that are regional. i have to be able to balance between the regions, so this drives into this, if i want active active, or even active passive, and the ips are going to change, i have to have some way to balance between the regions, and there are a whole bunch of different options, azure traffic manager, azure front door, global load balancer and dns. so let's think about this for a second, i'm going to simplify this picture, i have just two regions, but it would scale, it doesn't matter how many regions i have, the point is in each of these regions i have a public ip, public ip1, public ip2, now they might then point to standard load balancers that point to tons of different resources, but that's the outward-facing service, but they're different regions, so this is kind of region one, this is region two. now there's me, very happy, got no hair, wear a hat, i want to use these. well, take a step back, technically if the app was smart enough, i'm running an app, and i'm writing this as a client app, this could be a server app, maybe the app just knows ip1 and ip2, the app has the logic to go and check can i get to ip1, if i can i'll use it, if i can't i'll try and go to ip2 instead, so i could build that into the app, that it is just aware of all of the possible targets and it will try and use them. so the way to do that could just be dns, i might just have a whole list of records that say for this service, hey, there's ip1 and ip2, it gets returned and it will try
them. active directory works that way, when i find a domain controller it looks for a service record for underscore ldap dot underscore tcp dot domain name, gives a whole list of them, it tries them until it finds one that works. so there is an option there, dns, and i could update this, maybe it only points to ip1, if this goes down i could have an automation that changes it to point to ip2. there's a concept of time to live, machines cache records for an amount of time, if i set it to a low time to live, say five minutes, within five minutes it would see the new address and point to the new record, in dr terms five minutes is probably fine, the smaller you make the time to live the quicker it will realize something's down, but the more often it's pinging your dns server. but let's say we don't want to take that approach, as kind of our worst case yes we could do that. so the next thing we could do is azure traffic manager, and the way azure traffic manager works is it's a dns based solution, and what happens is i'm going to have a dnsname.trafficmanager.net, that name will resolve to the dns names of these services, now it can also point to things running on premises, but i'm focused between azure regions right now. and the way this would work is, me as a client, it's going to look up hey, this dnsname.trafficmanager.net, i would hide it, i would put a vanity name, so i would say hey, www.savtech.net points to this traffic manager dns name, and it would then resolve based on different types of balancing options that i can select. a common one is performance, based on my location it would resolve to the name of whichever one was closest to me, it would point me to region two. so traffic manager is dns based, it doesn't care about the service, it's not giving me any kind of offload or acceleration, i go to a name and it resolves to whichever one is closest to me if i use performance. it can do active primary, it can do probes to make sure they're
healthy, it could round robin, it can base it on geography, so it keeps me in a certain geography going to a certain version, i have a bunch of different options. so traffic manager is dns based, that's the key point, so it's going to really work with anything, but it's not really giving me a whole bunch of extra benefit other than it resolves to something. my next options are all kind of working in similar ways. if i think about the azure network, we have this great big azure network and these regions are connected to this great big azure network, there's also a whole bunch of edge sites, might call them points of presence, all throughout the world, tons of them. so what we have is this thing called azure front door, so what front door does is, for http and https only, it gives me a new ip address that is anycast, so for my service i'm going to get ip3, ip3 is actually available, it's broadcast on every single one of those points of presence, it's anycast, that's the whole point. so now what happens is, no matter where i am, so now i'm over here, when i go to that ip address i'm going to go to whichever point of presence is the closest to me, in this case i hit this one. with front door, this is actually going to terminate my connection, so it's going to do the tcp establishment, it's going to do the http, the ssl, it can do the ssl offload, then over the azure backbone it will go to whichever one is closest, get the content, it'll get big chunks of content rather than tiny little bits at a time, get a big chunk of content, cache it here, and then feed it to me a piece at a time. so not only is it caching, so it's going to speed things up for the second person, it's grabbing big blocks of content so there's fewer trips, it's terminated my connection close to me, so it's going to improve my connection. now because of this termination i don't see the true client ip address for my service that's behind here, but one of the nice things about azure front door is it's just working with services on the azure network, and things like bing
have been using this for a long long time. so with an anycast service i can now have resources in tons of regions and it's going to use whichever one is available, again it's doing probing to make sure it's healthy, and it will use that one and get it to the user, so the user gets the best possible experience, and again all that traffic is going over the azure backbone, which again is going to be super efficient, really focused on lowering those latencies. but it only works for http and https, what about if i've got something that isn't http or https and i don't want to use dns? so what they kind of have just announced is the azure global load balancer, so now i'm back to, again i'll just say i've got two regions for now, and i've got that kind of public ip1 and ip2, and these are actually in front of standard load balancers. what i can now do is have kind of a global load balancer which has a different ip address that points to the regional load balancers, and once again this is anycast, so now when i go and hit it, across all those different points of presence on that azure network, i'll go to whichever one is closest to me, and then azure will direct me to whichever regional load balancer on the back end it's pointing to is closest to me, and i can maintain my client ip, so my ip address here will actually get sent through to those backend load balancers, so they'll see my true ip address, and it works with things outside of http and https as well because it's layer 4.
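The choice between these balancing options can be summarized as a rough decision sketch. This is my paraphrase of the guidance in this section, expressed in Python for illustration, not an official Microsoft decision tree:

```python
# Rough decision sketch for global balancing, per the guidance above:
# - front door: http/https, terminates and accelerates at the edge
# - global load balancer (in preview): other protocols, layer 4, anycast,
#   preserves client ip, but needs public-facing regional load balancers
# - traffic manager: dns based, works with anything, including on-premises
def pick_global_balancer(protocol: str, public_endpoint: bool) -> str:
    if protocol in ("http", "https"):
        return "azure front door"
    if public_endpoint:
        return "azure global load balancer (preview)"
    return "azure traffic manager (dns)"

print(pick_global_balancer("https", True))  # azure front door
print(pick_global_balancer("tcp", True))    # azure global load balancer (preview)
print(pick_global_balancer("tcp", False))   # azure traffic manager (dns)
```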
so if i have http or https, ideally use front door, if it's not http or https then i can use the azure global load balancer, this is in preview though, but these have to be public ips, they have to be public facing, or if not i can do dns based with azure traffic manager. so all of those solutions are basically going to balance between different regions, they're all doing kind of probes to check hey, is it there, is it healthy, so this is how i can have things in multiple regions and balance between them, that's really the key point. so we covered a lot there. okay, use infrastructure as code, don't go and create things from the portal, it's going to ensure consistency, it's immutable, if i'm declaratively defining what i'm deploying it's always going to be the same, i can easily rebuild things, i can use policy to enforce consistency between deployments to make sure it's always going to look the same way. so i really want to focus on not doing things in the portal, use templates, use terraform, use containers, use declarative technologies to build things out, and then when i talked about hey, i don't have to replicate necessarily, just build it as i need it, this is going to enable that. so i have state, okay, how do i replicate it? i can do native application replication, some services might have multi-master replication capabilities, there might be a dr type replication in the application, maybe i can use the hypervisor, hyper-v replica, vmware had kind of the site recovery manager, maybe it's an os level kind of replication, maybe it's a storage based replication, something like storage spaces direct in combination with storage replica, maybe i just restore a backup and recreate the vm, remember this is if there's state, so i can't just recreate from scratch, i need the state back. or i can not have any plan and i just leave the industry in disgrace, we're generally trying to avoid that one, that's not choice one, really trying to stay away from leaving the industry in disgrace, never a good thing.
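The declarative idea behind infrastructure as code can be sketched as a tiny reconcile loop. The resource names are hypothetical and this is a conceptual sketch, not any real deployment engine, but it shows why the same definition can both build the primary region and rebuild an empty DR region, and why it is safe to run repeatedly:

```python
# Minimal sketch of declarative desired-state deployment: describe what you
# want once, and reconciling converges any environment (including an empty
# dr region) to that state. Idempotent: running it twice changes nothing.
def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, dict]:
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = dict(spec)  # create what's missing, repair what drifted
    return actual

# hypothetical resource definitions for illustration
desired = {
    "vmss-web": {"instances": 3, "zone_redundant": True},
    "lb-standard": {"sku": "standard", "zone_redundant": True},
}

dr_region: dict[str, dict] = {}          # dr region starts empty
reconcile(desired, dr_region)            # disaster strikes: build it on demand
assert dr_region == desired
assert reconcile(desired, dr_region) == desired  # second run is a no-op
```

This is exactly why the stateless tiers don't need replicating: as long as the definition (template, terraform, container image) exists in the other region, the compute can be conjured at failover time.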
remember, if there is no state, i keep saying this, it's important, do i need to replicate it, or just have the process to recreate it? so to explain what i'm saying about this, let's take a step back, forget about azure for a second. if i was on premises today, if i was on prem and i had a workload, okay, and then i had my kind of dr location over here and i had another set of workloads. so i have storage at my primary, well i have storage at my dr site, i might replicate using something like srdf, in a cluster i can have storage replica inside windows 2016 and above. then i have the hypervisor, there's a hypervisor over here, so i could replicate at the hypervisor level, like hyper-v replica. then i have some os running inside, well maybe there's something i can replicate inside the os. and then i have my app, maybe i could replicate at the app level, something like sql always on availability groups, or active directory domain controllers. the further up i go, generally the better the experience, if the app is doing the replication and the failover, it's going to know what point the app was at when there was the failover, it's going to fail over the quickest, the cleanest way. as i go further down it gets a little bit murkier, if it's just the storage replicating, well it starts up and this bit's been marked dirty, it has to go and do some consistency checks, it's not as great an experience. so generally there's different levels i can replicate at, and i try and go as far up as i can. now think about azure, i can really apply the same logic. i could create a virtual machine, fundamentally a vm from a state perspective is the storage, the managed disk, that's my storage that's connected to the virtual machine. now that could be up and running and i could have the app running inside that virtual machine and i could do app level replication, i could have sql here, sql there, domain controller here, domain controller there, that's going to give me the absolute best performance, but realize azure is a
little bit different, i'm now paying the compute charge of that vm running all the time. or i could maybe replicate at the os or hypervisor level to storage, the vm doesn't have to be running, now what i'm paying for is an azure site recovery license. so it's not as good, it's not going to fail over as quickly because this thing isn't up and running, but it's going to save me a bunch of money because i'm not paying a compute charge anymore. so we make kind of these trade-offs, it depends on the workload. if this workload is my tier one database that needs to be up as quick as possible, a domain controller, at least one, i'm probably going to do app level replication, it's worth the dollars for the compute, and remember in a true disaster i can probably have a little bit of downtime, this could be a smaller compute instance, if i actually failed over i could stop it, resize it to a much bigger instance and then start it again. so it costs me more compute but it's normally small, just enough to accept the replication, and i just have one domain controller, i can quickly clone them out to get more. but if it's not a tier one service then maybe i'll do it at the os or hypervisor level, save the compute money and just replicate to the storage, and then if there's a disaster, then it goes and creates a virtual machine, then i start paying the compute charge, then it connects to the disk and it's up and running for the duration of the disaster. so there are different options available, but think of the layers, and then in azure, because i pay compute charges based on what i'm consuming, realize there is a difference, so there might be some workloads where even if i could replicate the app it's just not worth it financially, i'll just use azure site recovery and replicate. and again, remember if there's no state i don't need to replicate that vm at all, just make sure i've got the image required or declarative configuration stored up in azure somewhere and it gets updated if i update it here by
I can just recreate the things if I actually need them. So for that web front end with no state, I don't replicate those VMs; I'll just recreate them if there were an actual disaster. So what about virtual machine replication? From on-premises to Azure there are actually two different solutions within Azure Site Recovery. If I'm running Hyper-V, I can use Hyper-V Replica: I've got my on-prem location, I'm running my Hyper-V server, I've got my virtual machines, and what Hyper-V Replica essentially does is, at the hypervisor level, a replication every 30 seconds or 5 minutes of those virtual hard disks. That's again Azure Site Recovery: in the event of an actual disaster, that's when it would create the VM and attach the disks. Or if I'm not running Hyper-V — let's say, for example, I'm running ESX, or it's just a physical box — so now on top of that ESX I have an OS instance, or I just have an OS instance running on bare metal. What happens here is there's this mobility service. If I think about an operating system, fundamentally you have a file system and a volume driver, and the app talks to the file system, so writes go down through those layers. What this does is inject the mobility service, inside the OS, between the file system and the volume driver, and as those writes go through, it splits them: the write still goes through the volume driver to the storage, but it also takes a copy of that write and sends it to this appliance, a kind of joint configuration and process server. That process server essentially takes it and sends it up to the cloud, where there's a hidden master target that we don't see, and it writes those to a VHD. So it does that conversion from, say, VMDK — because it's not seeing a VMDK, it's just seeing writes to a disk, whether ESX or physical. And again, that's ASR, and this is continuous: it's not on some kind of replication interval, it's always just streaming as quickly as it can.
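The write-splitting just described can be sketched conceptually. This is not the real ASR agent — every class and function name here is invented for illustration — it just shows the idea of a filter that lets every write through to local storage while queueing a copy for a process server to ship to the replica:

```python
# Conceptual sketch of write-splitting: writes reach the local volume as
# normal, and an identical copy is queued for replication. Names are
# hypothetical; the real mobility service works at the volume-driver level.

class MobilityFilter:
    def __init__(self):
        self.local_volume = bytearray()      # stands in for the real disk
        self.replication_queue = []          # copies bound for the process server

    def write(self, data: bytes):
        self.local_volume += data            # write still reaches the volume driver
        self.replication_queue.append(data)  # split-off copy for replication

class ProcessServer:
    """Drains queued writes and applies them to the replica disk in the cloud."""
    def __init__(self):
        self.replica_vhd = bytearray()

    def drain(self, filt: MobilityFilter):
        while filt.replication_queue:
            self.replica_vhd += filt.replication_queue.pop(0)

filt, ps = MobilityFilter(), ProcessServer()
filt.write(b"block-1")
filt.write(b"block-2")
ps.drain(filt)
assert filt.local_volume == ps.replica_vhd   # replica converges with the source
```

Because the filter only ever sees raw writes, it genuinely doesn't care whether the disk underneath is a VMDK, a physical disk, or anything else — which is the point the transcript makes.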
So that works from VMware, or frankly any OS; it could even be a different cloud, because it sits inside the OS and doesn't care where it's running. Now, the important point is the failback. For Hyper-V, it does a reverse replication back when I fail back, to repopulate. For the others, it can only fail back to ESX: so if I was replicating from a physical OS, I can't fail back to the physical bare-metal hardware, I have to fail back to an ESX VM. So just realize that — that's why you'll see things like: yes, I can use this from other hypervisors and other clouds, but I can't fail back to them. If I was migrating to Azure, hey, I could use this technology; it's free for 31 days, so for 31 days I can use these replication capabilities. That's how we used to do migration, but there's actually an Azure Migrate service now: Azure Migrate is free for 180 days per instance. And with Azure Migrate — it's not the focus of this topic — because it doesn't have to worry about failing back (it's a migration service), for ESX it can actually use snapshots at the hypervisor level to populate the virtual machine, so I don't have to do the in-guest piece if I'm migrating. So if I use Azure Migrate, a different solution, it can actually use ESX snapshots, and again it's free for 180 days. But in terms of replication, where I want to be able to fail back, there are different solutions for Hyper-V, ESX, and physical. It also supports Azure-to-Azure, and funnily enough, for Azure-to-Azure it uses that same mobility service. You'd think, oh, it's Hyper-V under Azure, it'll use Hyper-V Replica — it actually doesn't. So if I have a VM running in Azure, connected to its VHD, again it's the OS: what it actually does is install an extension that puts the mobility service in. Remember, sitting between that file system and the volume driver, that mobility service is used to replicate to another region where it has the VHD, and again there are these internal process servers and master
targets, but fundamentally now it's taking that and writing it over there, and that's ASR. I can even ASR to a different availability zone in the same region if I want to; that's now an option. So if my VM is deployed to an availability zone, I can replicate to a different availability zone; it doesn't have to be a different region anymore. So that's another option. To replicate between regions, if there's state in this thing, I can use ASR to replicate the VM to the other region. Once again, my other option would be: if the app has its own native replication capability, I could just run a VM over there, run the app connected to its VHD, and replicate at the app level. With app-level replication, realize the difference: I'm paying the compute charge all the time. So again it's that balancing act — it depends on my requirements, the app, what it supports, what I can do, and whether there's state, because once again, if there's no state, just recreate it over there if there's an actual disaster. But it's your job to think about this level of failover. Don't think, oh, it's in the cloud, I'm good, if this region goes down someone will bring it up. No, they won't. You have to think about these things. If I'm deploying something to a region, it's my responsibility — be it a VM, an App Service plan, AKS, a database, or a storage account where I pick GRS to make sure there's a replica in another region — it's on me to make sure I'm thinking about and using these solutions. Also, a nice thing about ASR is that I can have recovery points. What recovery points give me is not just the latest state: I can actually say, hey, I want to fail over to how it looked four hours ago. And I can do these multi-VM consistency groups: these four VMs, I want to fail them over, and they all have to be at the same point in time. Maybe there's some synchronization between the databases; I can't have one from two minutes ago, one from three minutes ago, one from seven minutes ago — they all have to be at this exact moment. So I can define these VM consistency groups.
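What a multi-VM consistency group has to guarantee can be sketched as picking one shared point in time that every member VM has a recovery point for, rather than each VM's own latest point. The names and structure here are hypothetical, purely to illustrate the selection:

```python
# Sketch of consistency-group recovery-point selection: fail all VMs in
# the group over to the latest point in time that exists for every VM.

def common_recovery_point(recovery_points_per_vm):
    """recovery_points_per_vm: {vm_name: set of recovery-point timestamps}.
    Returns the latest timestamp shared by every VM, or None if none exists."""
    shared = set.intersection(*recovery_points_per_vm.values())
    return max(shared) if shared else None

points = {
    "sql-1": {100, 200, 300},
    "sql-2": {200, 300, 420},   # 420 exists only on sql-2, so it can't be used
    "app-1": {100, 200, 300},
}
print(common_recovery_point(points))  # 300: the latest point all three share
```

Note that sql-2's newer 420 point is discarded: consistency across the group wins over recency for any single VM, which is exactly the trade the transcript describes.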
Now I can fail over to a point in time, and all four of the VMs would start at that same point. That works for the on-premises ESX as well, with that in-OS agent; I don't think it does for Hyper-V — if I needed it for Hyper-V, I would use the in-guest agent, so that would be my option. Don't just pick one. What I mean by that is: there are all these different options, so pick the one that makes the most sense for that particular workload. Don't say, well, this app has a replication capability but this other one doesn't, so I'll use the in-OS replication for everything. Use the one that makes sense for the requirements; it's going to give you the best experience. Don't use the lowest common denominator. There may be some extra considerations from using different technologies, but we can use things like recovery plans — that's part of Azure Site Recovery. I create a plan, and that plan can say: fail over this group of VMs, then run this script, then wait for this manual action, then go and fail over this other set of VMs, then run this script. So I can do different things, I can orchestrate: I give myself a big red button that I can push to do the entire failover process, so I don't have to simplify everything down to the lowest common denominator just to get one big lever. I will have to consider the cost, though. This is disaster recovery; this is something I hope I never, ever have to do. So yes, maybe there is an app-level replication with tons of features, but my recovery point objective and recovery time objective say I don't need to fail over that fast, and I really need to minimize my cost — remember, app-level replication requires the compute to be running, whereas Azure Site Recovery doesn't. Or on-prem, for this one: it's not the most important data in the world, I don't have those tight objectives, so I'll just replicate at the storage level.
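The recovery-plan idea above — one big red button that runs an ordered sequence of failovers, scripts, and manual checkpoints — can be sketched like this. The step kinds and structure are invented for illustration; this is not the ASR recovery-plan API:

```python
# Toy sketch of a recovery plan: an ordered list of steps executed in
# sequence, mixing group failovers, scripts, and manual pauses.

def run_recovery_plan(plan):
    """plan: list of (kind, payload) steps. Returns the ordered action log."""
    log = []
    for kind, payload in plan:
        if kind == "failover_group":
            log.extend(f"failover {vm}" for vm in payload)
        elif kind == "script":
            log.append(f"run {payload}")
        elif kind == "manual":
            log.append(f"WAIT: {payload}")   # a human confirms, then we continue
    return log

plan = [
    ("failover_group", ["sql-1", "sql-2"]),   # databases first
    ("script", "update-dns.ps1"),
    ("manual", "verify database came up"),
    ("failover_group", ["web-1", "web-2"]),   # then the front end
]
print(run_recovery_plan(plan))
```

The value is the ordering: the plan encodes the dependency knowledge (databases before front ends, verification before proceeding) so it isn't improvised during an actual disaster.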
Can I cluster workloads? I can cluster in Azure, using things like SQL Always On; in Azure and hybrid scenarios, I have to be able to use Windows failover clustering. My recommendation is to use Windows Server 2019. The challenge is that a lot of clustering likes to use IP addresses that move between OS instances, and we cannot do that in Azure. The solution in the past would be: hey, we'll put in a load balancer, the load balancer owns the IP, and then we'd have a kind of fake IP in the virtual machines. But now we have this concept of a distributed network name. This was introduced in Server 2016 for some resources; in Server 2019, even the core cluster name object can be a distributed network name. What that means is it's not a specific IP address anymore; it registers all of the IP addresses of all the cluster nodes, so I don't have to have some IP address drifting between nodes. It really, really simplifies my clustering experience. I can use a cloud witness — a storage account as the witness — for the cluster I'm going to create in Azure. And there's even shared-disk support. If I do need shared storage — ideally we try to stay away from that, I'd rather replicate the data — but if I have to have shared storage, we now have shared disks: the ability to set a max-shares number on, for example, premium disks. The bigger the disk, the more VMs can connect; I think the smallest is two, on a P15, and I think a P30 is five and a P60 is ten. So now I can have multiple VMs connecting to the same disk, a shared disk within my cluster. So I have that ability. Again, I've gone long, I apologize; nearly done, bear with me. So, disaster recovery: there are really three types, planned, unplanned, and test. Planned: hey, there's a storm coming, I should probably fail over. I know it's coming, so I do a preemptive failover to my disaster recovery location. I should not have any data loss, or at least no unexpected data loss; I can cleanly shut things down, fail over, and start things up on the other side.
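An aside on the shared-disk sizes mentioned a moment ago: a tiny lookup illustrates the idea that bigger premium disks allow more simultaneous attachments. The numbers are the speaker's hedged recollection (roughly 2 on a P15, 5 on a P30, 10 on a P60), not authoritative limits — check current Azure documentation for real values:

```python
# Illustrative maxShares lookup for shared premium disks. Values are the
# speaker's approximate recollection, not official Azure limits.

MAX_SHARES = {"P15": 2, "P30": 5, "P60": 10}

def can_attach(disk_sku: str, current_attachments: int) -> bool:
    """Can one more cluster node attach to this shared disk?"""
    return current_attachments < MAX_SHARES.get(disk_sku, 1)

print(can_attach("P30", 4))  # True: a fifth node can still attach
print(can_attach("P15", 2))  # False: the P15 is already at its limit
```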
It's a planned amount of downtime, and again I really shouldn't lose data: I stop the process here and start it on the other side. Unplanned: hey, our data center's gone, a big storm went through it. Well, you're going to have unplanned downtime, and you're going to have data loss unless you're really, really lucky; with asynchronous replication there's probably some stuff you lost. Then there's test. People don't do this very often, but you really should: I don't want to find out I'm missing a key component when it actually matters. Do tests of the process so I learn, hey, I'm missing that thing, I need to go and fix it; and make sure my change control process for production — when I'm adding a new system — adds it to my DR location as well. Backup. I've been really focusing on the replication architectures for resiliency, but don't forget about backup. At the simplest level, Azure has backup services via a Recovery Services vault. The vault is a portion of storage, and I can use it to store backups from various services; VMs are an obvious one, but there are others as well. These can be used to give me an app-consistent backup. Think about what an app-consistent backup means: ordinarily, I have my OS instance — a virtual machine, whatever — and I have my app, and my app is running, constantly doing reads and writes to the file system. If I just took a backup of the storage at a point in time, the app might be halfway through doing some write, so I would get something called crash-consistent: a frozen copy of the storage, but if I actually tried to use it, it might say, hey, this is in an inconsistent state, I have to do a repair. What app-consistent does: in Windows there's something called VSS, the Volume Shadow Copy Service, and there's an extension we have in Azure. So now, when Azure Backup takes a backup, it actually calls VSS via the extension, and VSS calls the VSS writers that are
created by the apps. What it does is tell the app: hey, flush everything out to disk, then pause and quiesce any future changes. At that point it can take the copy — it takes a backup of this state, because the app has flushed everything out to disk and is paused; that takes probably sub-second, but I think 10 seconds is the max — and then it releases, and the app can start writing again. There's even something for Linux: Linux has pre and post scripts that can get called to flush out Linux apps as well. So this is now app-consistent, and that's what we want — that's the thing we like for our backups, because we know we're getting a good clean backup: it's calling VSS and flushing all that good stuff out, and again, Linux has its own pre/post scripts to make sure we get that. So with Azure Backup we can use extensions to get app consistency, and we can then recover as we need: I can restore the VM, I can restore files from the VM. Now, I can restore files from the VM, but this backup is not app-aware — with the exception of SQL; SQL has a special extension that understands the databases. If I had other apps, like SharePoint, I can't restore a SharePoint item; if I want app-aware restoration, I have to do an app-aware backup, so I'd run the backup agent inside the guest OS, where it can see the apps installed and the application's constructs. It's delta-based storage with many, many recovery points. So if I think about this backup: there's this great big Recovery Services vault — the vault is storage, using Azure Storage, that can be geo-replicated to another region — and it's only storing the deltas. It takes what has changed and stores those deltas at many different recovery points, so I have lots of different recovery points in there, and I can set what those recovery points are.
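The delta-based storage just described can be sketched as a toy model: store one full copy, then only what changed at each recovery point, and rebuild any point in time by replaying deltas. This is purely illustrative — not how the vault is actually implemented:

```python
# Toy sketch of delta-based recovery points: a full base copy plus one
# small "what changed" dict per recovery point.

class DeltaVault:
    def __init__(self, initial: dict):
        self.base = dict(initial)   # the first full backup
        self.deltas = []            # one {key: value} dict per recovery point

    def backup(self, current: dict, previous: dict):
        """Store only keys whose value changed (or are new) since `previous`."""
        changed = {k: v for k, v in current.items() if previous.get(k) != v}
        self.deltas.append(changed)

    def restore(self, point: int) -> dict:
        """Rebuild the data as of recovery point `point` (0 = first delta)."""
        state = dict(self.base)
        for delta in self.deltas[: point + 1]:
            state.update(delta)
        return state

v = DeltaVault({"a": 1, "b": 2})
v.backup({"a": 1, "b": 3}, {"a": 1, "b": 2})   # only b changed: tiny delta
v.backup({"a": 9, "b": 3}, {"a": 1, "b": 3})   # only a changed: tiny delta
print(v.restore(0))  # {'a': 1, 'b': 3}
print(v.restore(1))  # {'a': 9, 'b': 3}
```

The storage-cost point from the transcript falls out directly: each recovery point costs only the size of its delta, not another full copy.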
If I jump over for a second and look at my Recovery Services vaults: I can set policies, and if we look at what I have under my backup policies, I can see I've got an annual retention of five years. What you can see here is super cool: I can do a daily (I think I can have up to two backups a day), and I pick when the daily runs. I can retain instant-recovery snapshots; that keeps a copy on the local storage so I can restore super, super fast — not only is it copying to the Recovery Services vault, it's keeping a snapshot on the local storage to restore instantly, without having to copy it back from the vault. I can say: keep a daily for 14 days, then keep a weekly for four weeks, then keep a monthly for 12 months, and then keep an annual, down at the bottom, for five years. By setting this policy, Azure will take care of keeping those for me. So if I have regulatory requirements to keep things for seven years, great, I can do it, and it's only storing the deltas between those recovery points, so I'm only paying for the storage of what changes between them. So I have all those retention settings, and those vaults can be locally, zone-, or geo-redundant. Additionally, some services have their own backups and don't use Azure Backup: things like Azure SQL Database and Postgres use their own capabilities to store those backups. Some utilize snapshots, like Azure Files snapshots to the local storage. This is interesting: Azure Backup can orchestrate the snapshots, so I can actually back up Azure Files from Azure Backup, but it doesn't copy the content to the vault. All it's doing is creating snapshots on the Azure Files share, which can then be used; it's not actually storing them separately in the Recovery Services vault. I'm using that orchestration feature of Azure Backup to control creating the snapshots, which are in-place, previous-point-in-time views of the Azure file share. Azure Blobs has snapshots, and it has soft delete: if I delete something, I can undelete it.
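The retention schedule described above — 14 dailies, 4 weeklies, 12 monthlies, 5 annuals — is a classic grandfather-father-son policy, and it can be sketched as a small function. The tag choices here (Sunday for the weekly, the 1st for the monthly, January 1st for the annual) are my assumptions for illustration, not what Azure Backup's policy engine actually does:

```python
# Sketch of grandfather-father-son retention: given every daily backup
# date, compute the set the policy would retain.

from datetime import date, timedelta

def backups_to_keep(backup_dates, daily=14, weekly=4, monthly=12, annual=5):
    """backup_dates: date objects, one backup per day, in any order.
    Returns the set of dates the policy retains."""
    days = sorted(backup_dates)
    keep = set(days[-daily:])                          # last N dailies
    sundays = [d for d in days if d.weekday() == 6]
    keep |= set(sundays[-weekly:])                     # last N weeklies
    firsts = [d for d in days if d.day == 1]
    keep |= set(firsts[-monthly:])                     # last N monthlies
    jan_firsts = [d for d in firsts if d.month == 1]
    keep |= set(jan_firsts[-annual:])                  # last N annuals
    return keep

# Six years of daily backups collapse to a few dozen retained points.
all_days = [date(2015, 1, 1) + timedelta(days=i) for i in range(6 * 365)]
kept = backups_to_keep(all_days)
print(len(all_days), "backups ->", len(kept), "retained")
```

This also makes the cost point concrete: thousands of nominal backups reduce to a handful of retained recovery points, and with delta storage you pay only for the changes between them.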
Azure Blob has object replication, which I talked about earlier. Or maybe it's a custom solution: if I have Azure Files and I want it replicated to another region, but I can't use GRS because it's premium, I'm doing something else — I can use tools like AzCopy to do delta-based copies. Managed disks I can't set up as replicas; normally I'd use ASR and replicate that way, but again, I can create a snapshot of a managed disk and then replicate just the changes to that managed disk to another region. So there are different options. We covered a massive amount; again, I apologize, I'm absolutely useless at keeping any kind of time, but there's a lot to cover. Key takeaways: make sure your services are resilient within the region, and use the right constructs. Make sure I'm not depending on something in a different zone — don't introduce a point of failure that's in a different construct than where you are: if I'm zonal, rely on things in the same zone; or be zone-redundant and have independent copies in different zones. Regionally, maybe I'm active-passive; it depends on my data and what I can support in terms of that replication — normally it's single-write, and then I have readable secondaries; can I support the latency of writing, and of making sure I get the latest read, from my app if it's in another region? — plus the geo-balancing solutions. And don't forget backup: replication is not a replacement for backup. If I corrupt my data, that corruption will just replicate. So I really hope this was useful. Please put questions below, I'll keep an eye on those, and until the next one: take care of yourselves.
Info
Channel: John Savill's Technical Training
Views: 12,787
Rating: 4.9915791 out of 5
Keywords: azure, azure cloud, resiliency, availability sets, availability zones, disaster recovery, high availability, replication, global balancing, zonal, zone redundant, azure backup
Id: zLMXu4rtlEk
Length: 102min 22sec (6142 seconds)
Published: Tue Sep 29 2020