AWS Developer Workshop: How to Build Multi-Region Applications in the Cloud

Video Statistics and Information

Captions
All right, good afternoon everyone, it's great to have you here. Hopefully you still have some energy after a long day of walking around; I'm certainly tired from all of that. This afternoon I want to talk about a topic that totally fascinates me: resiliency. Resiliency is the idea that failure always happens, and that we should build applications and systems to handle failure. In large-scale systems you all build microservices now, and you might have 10, 20, 100 different services. It's very common with microservices, especially at large scale, that at any moment some of them will fail. If you have four or five microservices, it's normal to operate with all of them working, which is what we call a fully operational system. But once you get to tens, hundreds or thousands of microservices, that becomes impossible, and we call the result partial failure. We operate in partial failure: we keep operating, but we handle failure, and that is what we call resiliency.

I've been working on resiliency for about 10 years. I joined AWS three years ago, but before that I worked on scaling businesses and startups, especially back-end systems on AWS, and in 2012 something very interesting happened that I'll come back to a little later.

Before we talk about resiliency and multi-region, I want to remind you of a couple of things. The first thing to understand is availability. Availability is the time your service is going to be available for your customers. Availability does not mean reliability: you can be available and still be returning errors. So availability is the time your service is up, and it's very common in the industry to say "our service has three to four nines of availability." Four nines means that over a year your system can only tolerate about 52 minutes of downtime. In one of the big outages of my life, about six years ago, our alerting system failed and our escalation path failed, which meant 26 minutes passed between the system going down and us actually receiving alerts; in fact it was customers telling us "your system is experiencing issues." So with a single outage we had already burned half of our downtime budget for the year. I show you those numbers because this is the most important thing when you build a business and want a resilient system: you need to know how long it is allowed to be down. It also tells you that you have very, very little time, which means you have to automate everything.

Now I want to drill into this one in particular: availability in series. When you have microservices and you build apps, you very often put them in series, and there is an equation which tells us that if part X of the system is 99% available and it is in series with another part which is also 99% available, the overall availability of the system actually goes down. That is availability in series, and it is something we very often forget. There is another equation that is very important, and it's really the topic of this presentation: availability in parallel. This equation is your best friend. If you have a system X which is 99% available and you run it in parallel, without making almost any change to your application you actually make it four nines available.
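Those two equations are simple enough to play with directly; here is a quick Python sketch of the numbers used in the talk (99%-available components combined in series and in parallel).

    # Availability in series vs. in parallel, with the numbers from the talk.
    # A component that is 99% available is down 1% of the time.

    def series(*availabilities):
        # In series, everything must be up at once: multiply the availabilities.
        total = 1.0
        for a in availabilities:
            total *= a
        return total

    def parallel(*availabilities):
        # In parallel, the system is down only when every copy is down at the same time.
        down = 1.0
        for a in availabilities:
            down *= (1.0 - a)
        return 1.0 - down

    print(series(0.99, 0.99))          # 0.9801   -> worse than either part alone
    print(parallel(0.99, 0.99))        # 0.9999   -> four nines from two copies
    print(parallel(0.99, 0.99, 0.99))  # 0.999999 -> six nines from three copies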
Run it three times in parallel and it becomes six nines. You haven't really changed any code; the only thing we've done is duplicate the application and run it in parallel. This is the first thing you need to understand when we talk about availability and resiliency, and it's also the very reason why in the cloud we always architect around regions and Availability Zones.

Let me talk a little bit about how we define a region at AWS. We have about 18 regions worldwide, and a region is a set of Availability Zones. It's a common mistake to think that one Availability Zone is one data center; it's actually one-to-many. Some of the biggest regions have up to five data centers per Availability Zone, and some very big regions have 15 or even more than 20 data centers within one region. We organize the AZs so that they are physically separated: they have a different electrical grid, a different floodplain, a different fire plan, so that if there is a problem in one AZ the region keeps running, because the other AZs are still available. A region usually has three AZs, and up to six. So when you open an AWS account and select a region, you are given a set of Availability Zones in which to deploy your application, and this is why we always, always tell our customers to architect their solutions across multiple AZs. The first thing you should do with an application is really to deploy it so that it works across multiple Availability Zones. When you do this, there are a few requirements on the application. One of the most important is that your application is stateless, so that if one AZ is in trouble, a request can be picked up in another AZ and the job continues. You share state between the different AZs: the application does not store state locally, it stores it within the region. That is the first level of designing a highly available, resilient application in the cloud. How many of you are already doing this today? So, a lot of you.

Now, the next level is multi-region, and that is really the topic of this presentation. You might wonder why we do multi-region; there are a few reasons and we'll go through them later. What matters here is what multi-region involves: you need DNS with a CNAME and a routing policy that distributes traffic between the different regions, and then you take the application you had in one region and duplicate it in a second region. That's what we are going to talk about today. We have about 18 different regions, so you could almost architect your application across all 18 of them; I'm not going to demo that today, I'll demo it across two regions. With 18 regions and 55 Availability Zones, you have a lot of places where you can deploy your application.

So why would you want to build a multi-region back end? The first important thing to understand is latency. Latency is the time it takes for a packet to go from one place to another, and latency in electrical or optical fiber is bounded by the speed of light, which nobody has hacked yet.
So when you take a region and deploy your services in Europe, for example, and you have users in the US, those users will automatically see a latency of about 140 milliseconds in each direction, which means a round trip close to 300 milliseconds. Now, we ran some tests on amazon.com, the retail website, and adding 100 milliseconds of latency to the page has the effect of a 1% drop in sales. How many of you can detect 100 milliseconds? No one. A very fast human reacts in maybe 250 to 300 milliseconds, and the average person is around 400 milliseconds. But your unconscious mind does register that extra 100 milliseconds: you naturally get the feeling that the application is not responsive, that it's not fast, and you get bored and move on. 100 milliseconds is not something you measure by feel; it's something you detect with tests and data, and trust me, at the scale of Amazon a 1% drop in sales is a lot of money. So 100 milliseconds is a lot, and when you deploy your application globally, that is one of the most important reasons.

Ten years ago it was very common for companies to have local applications; it was not common to have global reach. Nowadays, when you launch an app, you want to reach pretty much the global market. When you put your application on iPhone, Android or Windows Phone, there is a marketplace, and users from all over the world can download it, and you don't want your users in Australia to experience 600 milliseconds of latency because you are hosted in Europe. So there is a strong incentive to deploy applications across the world, and this is one of the most common reasons people start building multi-region systems.

Another very important reason is disaster recovery. Disaster recovery means you have a primary region and the other one is passive; we call that setup active-passive. One region takes all the traffic, and if a service fails in that region, you switch the traffic to the other region: the passive region all of a sudden becomes active. Some customers have done this for a long time. The problem with such a setup is that it takes a long time to actually move the traffic from the active region to the passive one, and especially a long time to warm that region up. The problem with passive is very often the caches, the queues, the messaging, all the systems you need to operate at very large scale: they have to warm up, and while they warm up you have the possibility of a very big outage, because this is something that is not tested very often. Going from passive to active has the potential for a big outage simply because it is not practiced often. What we really want is a system where both regions are working at the same time, and we just have to switch traffic from one to the other and scale the other one up to handle it. This is what we call active-active, and back in the day, building an active-active architecture was very, very difficult. Let me explain why.

At Christmas 2012, Netflix experienced an outage, and the Netflix service went down for a few hours. The problem was that the fix was not in Netflix's hands. Netflix runs 100% on AWS for its service, and has been doing so for a long time.
On Christmas Eve 2012, one of the load balancers in the us-east-1 region experienced issues, and it created a cascading effect that took the service down for a few hours, and Netflix couldn't do anything, because the load balancers were Amazon's, AWS's, responsibility. We have since fixed all of those issues, but at the time Netflix decided this couldn't happen again: they are competing with normal broadcast, where people turn on the TV and it just works. Netflix wants the same experience for their users; they don't want to say "it's over the internet, so we can be down for a few hours and that's it." They want that every time you press play on a movie, it works. So in 2013 they decided to build their first multi-region active-active system, and back then it was very, very difficult. There were no managed services, so it took a lot of engineering work: replicating databases and their logs between regions, maintaining connections between regions, and all of that over the internet, where you are at the mercy of changing traffic patterns and latency that is not consistent. It was a big engineering effort, a big team of engineers and many months, to achieve that.

And they haven't stopped there: in 2016 they opened the service in Europe, and because they wanted you to have a low-latency experience, they added a region in Europe. So now they have three active regions: two in the US, east and west, and one in Europe. Netflix has exactly the kind of architecture we are going to talk about today.

To test those architectures, your unit tests, functional tests and integration tests don't work very well anymore. They are very good for a single-region, isolated application, but when you talk about distributed systems you have to take a lot of different problems into account. To be able to test these multi-region architectures, Netflix created tools called Chaos Monkeys. How many of you have heard of Chaos Monkeys? They are not cute little monkeys; they are software systems that you let loose in your application, and they randomly kill things, randomly break things, to see whether the system recovers, self-heals and keeps serving, so that your customers can keep consuming the service. If you are interested in this, I actually have a full talk tomorrow on chaos engineering, so please come back; it's pretty cool.

Now let's talk about how we actually build those systems. There are a few things you have to understand. When you go multi-region, you break the barrier of synchronous replication: we say that to have synchronous replication you need to be under about five milliseconds of separation between components, and you have seen that the latency between regions is usually 100 to 300 milliseconds or even more. So when you go multi-region, you enter the realm of what we call asynchronous systems, and you just don't have a choice; you cannot work around it, you have to handle asynchrony. It doesn't mean the entire system has to be asynchronous, but most of your operations will have to be, because being synchronous usually means having locking systems. And when you go asynchronous, you enter the realm of the CAP theorem. How many of you know the CAP theorem?
If you work in software engineering, it's a problem that comes up very often. The CAP theorem simply says that in the presence of a partition — a place where you store data in a distributed system — you have to make a choice between consistency, which means having the same data in each and every place in your system at the same time, and availability, which means having the data available at any time. When you have a consistent system and you write into one node, you usually have to lock it, then replicate the data, then release it, and that lock makes the nodes not available, because you wanted strong consistency. But in a distributed system, ninety percent of the time or more we want to be highly available: we want to be able to write to a node here and a node there at the same time, and to read at the same time, regardless of consistency. And when you do this, you make another very important decision, called eventual consistency.

Eventual consistency means that at any given time you might have different versions of the same data on different nodes. For example, if I have a value a and I put 2 in it, then after a certain amount of time — maybe 10 seconds, or a bit less, say 100 milliseconds — a equals 2 on all the nodes. Now say I make an update and set a to 5. If I read a from another node right after I wrote the 5, it might actually return the previous value, because maybe the replication hasn't happened yet. We call this eventual consistency: only after a certain amount of time do all the nodes have that value. It also means that when you design a UI and an application, you need to keep this in mind; you cannot expect your whole user experience to be strongly consistent, and this is one of the biggest problems for applications that want to handle multi-region. Embrace eventual consistency.
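As a toy illustration of that read-after-write behaviour (not any AWS API, just two in-memory "replicas" with a delayed copy), the stale read he describes looks like this:

    import threading
    import time

    # Two pretend replicas; a write lands locally right away and is copied to
    # the other replica only after a delay, like asynchronous replication.
    replicas = {"us-west-2": {}, "us-east-1": {}}

    def put(region, key, value, replication_delay=0.1):
        replicas[region][key] = value
        def replicate():
            time.sleep(replication_delay)            # the asynchronous copy
            for other in replicas:
                if other != region:
                    replicas[other][key] = value
        threading.Thread(target=replicate).start()

    put("us-west-2", "a", 5)
    print(replicas["us-east-1"].get("a"))   # read right away: stale (None or an old value)
    time.sleep(0.2)
    print(replicas["us-east-1"].get("a"))   # 5 -- the replicas have converged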
Now, when you want to replicate across regions, there is something very important to understand. You noticed that Netflix took many months to design their architecture; that's because back in the day the traffic from one region to another went across the internet, and this was a problem both for Netflix and for us: when we deploy services, when we deploy regions, when we maintain services, we do a lot of cross-region work. So in 2016, to be able to control the latency between regions, we built a full global network on our own fiber. It's a network that links all the regions together through direct connections, dual 100-gigabit circles around all our regions, so that if you send traffic from one region to another it no longer goes over the internet but over our network. It's encrypted, and it goes through a system where we control the latency and the error rate and can make it as performant as we want.

Having that network makes some very nice things possible. A few years ago, if you wanted to link regions to each other, you had to run what we call VPN appliances between the regions, going over the internet, and very often you had to run two of them, because you want to be resilient in case one goes down. That is a lot of operations. Since we launched the global network, you can link regions through what we call VPC peering: you have a VPC, a network configuration in one region, and you can link it through a peering connection to another region with one click. You no longer have to duplicate appliances and VPN connections, maintain them, and make sure they are up and passing traffic, all of which is very annoying and complicated. That gives you a lot more time to actually work on your business, which is a good thing; we launched that last year.

Another thing that has become possible with the global network is cross-region replication on S3. S3 is our object storage service, where you can put a lot of data: files, videos, JavaScript, anything you want. When you put data into an S3 bucket, traditionally we never, ever move that data out of the region. But now, if you want, and only if you want, you can enable cross-region replication: you take a bucket and say "as soon as data is put into this bucket, asynchronously replicate that same data into this other region." This is very good for disaster recovery, but also for active-active systems. Of course it also means there is a delay between the data being available here and being available there, but that is eventual consistency, and as I said, it's something you have to deal with anyway. The way to avoid it would be to do a put into every region yourself, but that's not very friendly; just embrace eventual consistency and handle the asynchronous replication. This is a very common way to move data from one region to another.

There is another thing that has been made possible. How many of you are using RDS? RDS is our managed relational database service: we give you a managed cluster, you can have a master, and you can have what we call read replicas. Read replicas are really good, because if you want to scale an application you can split the traffic between writes and reads: the master handles only the writes, and the read replicas handle all the reads. RDS supports five read replicas, and if you use the Aurora engine it can handle 15 of them, so you really can separate your reads from your writes and scale a lot. Until last year, the read replicas could only live within one region. Now, if you want to go global with a relational database, you can have cross-region read replicas: the data you put into the master is asynchronously replicated to the replicas you have deployed in other regions, which gives you the capability to scale reads across regions. But there is a little bit of an anti-pattern in there, and I'm sure you all noticed it: there is only one master. A user close to a read replica can read very fast, but when they want to write, they need to do what we call a cross-region write, which is an anti-pattern — but that's the way it is, because this is a transactional database. Our customers said "that's not really cool," so we announced that very soon you will be able to do what we call multi-master, multi-region writes: we are going to release Aurora first with multi-master within one region, and then multi-master across different regions, which means your application will be able to write and read from any region. That capability is coming within 2019, so it's not far away.
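Today's cross-region read replicas can be created with a single API call. A hedged boto3 sketch — the instance identifiers, account number and instance class below are placeholders, and the call is made from the destination region against the source instance's ARN:

    import boto3

    # Create a read replica in us-east-1 of an RDS MySQL master running in us-west-2.
    rds_east = boto3.client("rds", region_name="us-east-1")   # destination region

    rds_east.create_db_instance_read_replica(
        DBInstanceIdentifier="myapp-replica-east",
        # Cross-region sources are referenced by their full ARN.
        SourceDBInstanceIdentifier="arn:aws:rds:us-west-2:123456789012:db:myapp-master",
        DBInstanceClass="db.r4.large",
        # SourceRegion lets boto3 generate the cross-region pre-signed URL for you.
        SourceRegion="us-west-2",
    )

Writes still go to the single master in us-west-2; only the reads scale out to the extra region, which is exactly the anti-pattern the talk points out.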
So keep an eye on what's coming soon; it is a very good capability. Now, if you are a little bit impatient, you might already have heard of DynamoDB. DynamoDB is our NoSQL database. It comes out of a project we started in the mid-2000s: around 2007, Amazon was trying to scale and we had a lot of issues scaling our transactional databases. We were running Oracle databases back in the day and we couldn't scale them up anymore; the traffic was jamming, and it was not a good user experience. When we did an audit of all the database queries, what we realized is that 70% of our queries were non-transactional, and we were serving them from a transactional database. The first lesson here is: audit your queries, because very often we think queries are transactional when in fact they are not. The second thing is that we went and built a system called Dynamo that would allow us to scale, and we progressively moved many of our systems onto it; now most of amazon.com runs on both Amazon DynamoDB and Aurora on RDS.

If you wonder whether DynamoDB scales, let me give you a couple of numbers. You know Prime Day: Prime Day is the day of the year when we open the gates and offer a lot of deals, so a lot of people come and buy, and it's the time when traffic grows dramatically. We use DynamoDB to scale that, and it was handling 13 million requests per second at peak. 13 million requests per second is kind of a big deal; in my whole career I have never seen a system handle that much traffic at peak time.

DynamoDB also released, around 11 or 12 months ago, a feature called DynamoDB Streams. DynamoDB Streams is a way to capture changes within DynamoDB: when you write into your table, DynamoDB outputs a stream of everything that happened in the table — the writes, the updates, the deletes — and you can capture that with a Lambda function, for example, and do some computing on top of it. When we launched this, a lot of folks started using it to replicate data from one DynamoDB table to another, so we thought: let's make this simple for our customers. We launched something called global tables, which is already generally available, and it is a multi-master, multi-region endpoint for DynamoDB. You have a global table that is available in most of the regions, and you can write and read anywhere around the world. This is going to be the main tool I use for the demos, because we are going to build one of these architectures from scratch together. Global tables are very good for distributed applications, but bear in mind that the replication is asynchronous, so global tables are eventually consistent. There is no strong consistency across regions: you cannot lock the table in one region and wait for the data to be replicated everywhere. So when you design your app, it needs to handle eventual consistency; eventually, all the data on all the nodes will converge to the same value. It's used in a lot of scenarios: folks use it for disaster recovery, but actually a lot of them use it for active-active architectures.
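Global tables are built on top of that streams mechanism. The do-it-yourself replication people wired up before global tables existed looked roughly like this: a Lambda function subscribed to the table's stream that re-applies every change to a copy of the table in another region. This is only a sketch with made-up names, and the naive value conversion below only handles string attributes:

    import boto3

    # Replica table in the destination region (name is illustrative).
    replica = boto3.resource("dynamodb", region_name="us-east-1").Table("lisbon-demo")

    def handler(event, context):
        # Each record describes one change on the source table: INSERT, MODIFY or REMOVE.
        for record in event["Records"]:
            if record["eventName"] in ("INSERT", "MODIFY"):
                new_image = record["dynamodb"]["NewImage"]
                # Stream images use DynamoDB's typed JSON, e.g. {"itemId": {"S": "foobar"}};
                # this quick conversion only works for string attributes.
                item = {name: attr["S"] for name, attr in new_image.items()}
                replica.put_item(Item=item)
            elif record["eventName"] == "REMOVE":
                keys = record["dynamodb"]["Keys"]
                replica.delete_item(Key={name: attr["S"] for name, attr in keys.items()})

Global tables do this kind of work for you (plus conflict handling), which is why the demo uses them instead.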
When you go multi-region, there is a very important tool that you need to use, and that is a load balancer — but you don't want to load balance within one region, you want to balance across multiple regions, so you have to balance the traffic at the DNS level. You have a DNS request, and based on it we move the traffic to one region or another, and to do this you need what we call routing policies (sorry for my French accent). With routing policies you have a few scenarios.

One is called latency-based routing, and the idea is to prioritize the lowest possible latency: if I have resources in one region or another, we always send the traffic to the one with the lowest latency for that particular user. If I am a user in the US and there is a region in the US, the traffic goes there; if the other region is in Europe, the latency would be over 130 milliseconds, so we don't send the traffic there. But it is only based on latency, and latency can change, so a resource that is closest now might have a higher latency at a different time. Still, this is a very good policy for getting the best, fastest latency, and it is often used by gaming companies, where latency is very, very important.

Another policy is called geolocation DNS: you route the traffic based on the location of the user. If my user is in the US, I automatically route the traffic to the US region, because they are a US user, and it doesn't necessarily follow the latency; the latency may sometimes be higher, but since my location is the US, the traffic always goes there. This is a very good routing policy if you have strong compliance requirements, for example if you want European users to be stored only in databases in Europe and US users only in databases in the US, so it is used a lot for that kind of operation.

Then there is what we call weighted round robin: like ping-pong, requests go to one side and then the other, and you can weight the traffic between them. And the fourth one is failover: if you are using any of those three policies and all of a sudden one of the resources in one of the regions has issues, it triggers what we call DNS failover, and the entire traffic moves to the other region. You can combine active-active with DNS failover, and that's kind of the perfect scenario for us.

Another feature we launched about eight months ago is support for custom domain names in API Gateway. API Gateway is a service you can put in front of Lambda, EC2 instances or containers, and it gives you an API endpoint, a way to manage your API, staging, throttling, that kind of functionality. But when we launched API Gateway it didn't support custom domains: we gave a default domain name to the API endpoint, and obviously if the domain on your API endpoint is different from the one in your DNS, your CNAMEs are mismatched and you can't do this kind of routing. So we launched support for custom domain names, which means you can now take the same domain name you use in your DNS and use it on your endpoints, and all of a sudden we have the capability to do serverless multi-region active-active architectures. That is what the demo will be.
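The routing policies described a moment ago can be created through the Route 53 API as well as the console. For example, a latency-based pair of records, one per region and each tied to a health check, might look roughly like this — the hosted zone ID, domain name, regional endpoints and health check IDs are all placeholders:

    import boto3

    route53 = boto3.client("route53")

    def regional_record(region, target, health_check_id):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "global.example.com",
                "Type": "CNAME",
                "SetIdentifier": "api-" + region,   # must be unique per record in the set
                "Region": region,                   # this is what makes it latency-based
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
                "HealthCheckId": health_check_id,   # unhealthy targets drop out of rotation
            },
        }

    route53.change_resource_record_sets(
        HostedZoneId="Z1234567890ABC",
        ChangeBatch={"Changes": [
            regional_record("us-west-2", "api-west.example.com", "11111111-aaaa-bbbb-cccc-000000000001"),
            regional_record("us-east-1", "api-east.example.com", "11111111-aaaa-bbbb-cccc-000000000002"),
        ]},
    )

Swapping the Region field for Weight or GeoLocation in the record set gives the weighted or geolocation policies instead, and attaching health checks to any of them gives the DNS failover behaviour.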
The demo will look something like this. We have a domain, a CNAME called global on my personal domain, and I will be a user sending API requests to two regions. My regions are identical: each has one API Gateway with some Lambda functions behind it, supporting a few different APIs — a GET, a POST and a health check — and I'm using a DynamoDB global table to replicate the data asynchronously between the two regions. That means if I make a write in one region and then a read in the other, I should be able to see the value. So let me get off the slides and start building a global table.

This is the console for DynamoDB. When you go into the AWS console and you want to create a table, you do something like this: create table, give it a name — let's call it lisbon-demo — and give it a partition key, which is a string; let's call it itemId. That is all I need to create the table: a table in DynamoDB needs a name and a primary key, and the primary key is the way to address a particular item in the table. So for example here I'm going to store a string, and I'll be able to look the item up by that id. When you create a global table, the table currently has to contain zero data; eventually you will be able to migrate an existing DynamoDB table to a global table, but today, when you start a global table, it needs to be empty.

Now I have my table called lisbon-demo, and you can see there are zero items in it. In the console there is a little tab for global tables, and when you click on it you see a message about streams — the DynamoDB Streams I talked about, which is the service that takes events on the database and replicates them into a stream that is output outside the database. To use global tables you need to enable streams, so let's enable them; the stream will carry the new object and the old object, so you have all the information. Now I can start creating what we call a global table and add regions to it: you just select the regions where you want replicas. We created the table in Oregon, so let's add another region, say Virginia, and proceed. That's it: my table is now being replicated into another region. It takes a couple of seconds, and then I can start working with it. By default you get five read capacity units and five write capacity units, but you can enable auto scaling if you want, and it will automatically scale with your traffic.

Now you can see I have my table lisbon-demo here in Oregon, and a lisbon-demo in Virginia, so my table is global across two regions. If you wanted, you could definitely add more regions — you can add pretty much all of them — but bear in mind that when you add regions you also multiply the writes and reads, so it will be more costly; only add the regions you really need. So let's test it. I have tables in two regions, so let's create an item, and let's call it foobar, because foobar is the most popular id we can have. Now I have an item in the database.
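Everything just clicked through in the console can also be done with a few API calls. A rough boto3 equivalent of the same steps — same table name and regions as the demo, using the original global tables API, which requires the tables to be empty and to have streams enabled with new and old images:

    import boto3

    TABLE = "lisbon-demo"
    REGIONS = ("us-west-2", "us-east-1")   # Oregon and Virginia, as in the demo

    for region in REGIONS:
        ddb = boto3.client("dynamodb", region_name=region)
        ddb.create_table(
            TableName=TABLE,
            KeySchema=[{"AttributeName": "itemId", "KeyType": "HASH"}],
            AttributeDefinitions=[{"AttributeName": "itemId", "AttributeType": "S"}],
            ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
            # Global tables need streams with both the new and the old image.
            StreamSpecification={"StreamEnabled": True,
                                 "StreamViewType": "NEW_AND_OLD_IMAGES"},
        )
        ddb.get_waiter("table_exists").wait(TableName=TABLE)

    # Tie the two empty replicas together into one global table.
    boto3.client("dynamodb", region_name="us-west-2").create_global_table(
        GlobalTableName=TABLE,
        ReplicationGroup=[{"RegionName": r} for r in REGIONS],
    )

    # Write an item in one region, like the foobar item in the demo.
    boto3.resource("dynamodb", region_name="us-west-2").Table(TABLE).put_item(
        Item={"itemId": "foobar"})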
Notice that when I create an item, fields are actually added to it that I didn't add myself; they were added by the global table. There is the origin of the object — us-west-2, which is Oregon — and the time at which the object was written. And if I go into the other region, the item is in that table as well: you can click on the item and see that it now exists in both regions. You can also start from the other side: create barfoo over there, then come straight back here, and you can see it has been replicated; the item shows up in the global table after a short replication delay. So now you are able to read and write from any region. Any questions on global tables before we go on? Good. You can add as many regions as you want.

Now let's create the API endpoint with the Lambda functions. For this I created a Serverless Framework template — how many of you know the Serverless Framework? When you build serverless applications you have a bunch of frameworks you can choose from; I started out using the Serverless Framework, but you can use the SAM CLI or other tools depending on the language you use. With Serverless you define your API in a template written in YAML. At the top you have the name of the service — it's called blog-v2 — then the provider; Serverless supports different clouds, and for this demo it's AWS. Then I can define the security groups I want, the memory, and the environment, and you can see I'm using variables: I take environment variables from a file, and that file has a value called STATUS, so basically I deploy the environment with an environment variable set to 200. Then there is a lot of other stuff — the resources I want, some security and roles, some actions — but the most important part is the functions.

I have three functions, which are going to be deployed as Lambda functions. Lambda is a serverless compute service that allows you to run just a function: you don't have to deploy an entire instance or anything, you give it a function and it runs it. The three functions are called get, post and health. The post one stores data into DynamoDB, the get one retrieves the item that I stored, and the health one is there to check health: when you do routing, and we talked about DNS failover, you need to support what we call health checks so the system can understand whether a resource, or a region, is working. The health check built here is very, very simple: it simply returns the value of STATUS, so by default, when I deploy it, it just returns 200. I want to demo the failover to you, so at some point I will switch that value to 400 so it returns an error and we can watch all the traffic move from one region to the other; that is the purpose of STATUS. My post function takes an input with an item id and a body: it extracts the item id and the session comment from the body of the message and then stores them into DynamoDB, so I put the item, with its item id and session comment, into my global table. Then I have the get function, which takes the id of the item I want to retrieve and reads it back from DynamoDB with a get item call.
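The talk doesn't show the function code on screen (it is in the blog posts mentioned at the end), but a minimal sketch of those three handlers might look like this in Python, assuming an API Gateway proxy integration; the TABLE_NAME and STATUS environment variables, the field names and the path parameter are assumptions for illustration:

    import json
    import os
    import boto3

    table = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "lisbon-demo"))

    def post(event, context):
        # Extract the item id and the comment from the request body and store the item.
        body = json.loads(event["body"])
        table.put_item(Item={"itemId": body["itemId"],
                             "comment": body.get("comment", "")})
        return {"statusCode": 200, "body": json.dumps({"stored": body["itemId"]})}

    def get(event, context):
        # Read an item back by its id (an eventually consistent read by default).
        item_id = event["pathParameters"]["itemId"]
        result = table.get_item(Key={"itemId": item_id})
        return {"statusCode": 200, "body": json.dumps(result.get("Item", {}))}

    def health(event, context):
        # Return whatever STATUS says, so one region can be flipped to "failing" on purpose.
        return {"statusCode": int(os.environ.get("STATUS", "200")), "body": "ok"}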
You can also declare your DynamoDB table as a resource in the template like this. It's a bit of a hack and not necessarily production code; it's just to give you an idea of how this works. So now I have get, post and health, and I can deploy this to two regions — you can see I'm supporting us-east-1 and us-west-2. Let me go into my project... where is my project... here. What I can do is run serverless deploy, and I can define in which region I want my template to be deployed, so just for the sake of it, let's deploy to us-east. What it does is take the template and the Lambda functions and start deploying them in one region, here us-east: it goes through all of that, packages the Lambda functions, and then deploys the packages to Lambda. When this is done, it gives me an endpoint, and I can start making requests with HTTPie — can you see it in the background? no? — so you can do something like this... ah, that was the wrong endpoint... there we go. I create another item id and give it the value bar, and it takes my object and actually stores it into my table. So if I go into my global table, I now have an item called foofoobar, and it was created just now, which means my deployment works. And just to show you, because I changed the name of the table for this demo, here I'm using a table whose name starts with global. Then I can do the same in the other region: serverless deploy into the second region, and then I'll have endpoints in both regions, so let's let it do that.

I want to show you what has been deployed. This is my Lambda console, and you can see a bunch of Lambda functions; I can sort them, and the ones I'm interested in belong to my service called blog-v2 — remember, that was the name of the service in the template — so all my functions are named blog-v2 plus the name of the function. If I open one of the functions, the console shows me that it is linked to API Gateway, which is good, because I gave it an endpoint inside the template, and it also has X-Ray, CloudWatch, DynamoDB and EC2 wired up, so a lot of things have been enabled. But the most important part: remember the STATUS environment variable? You can see it has been deployed, which means this function will return 200, so when I probe it through a health check it will always return 200. I can change that value a bit later, but first I want it to return 200. My Lambda functions have now been deployed, and you can see they sit behind one API id in this region and a different API id in the other one. So this is what we have now, exactly like on the slide: the API Gateway and the Lambda functions deployed in both regions, with the DynamoDB global table underneath.
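At this point a quick end-to-end check is possible: write an item through one region's API and read it back through the other. A small sketch with the requests library — the two endpoint URLs and the /item paths are placeholders for whatever the two API Gateway stages expose:

    import time
    import requests

    west = "https://aaaa1111.execute-api.us-west-2.amazonaws.com/dev"
    east = "https://bbbb2222.execute-api.us-east-1.amazonaws.com/dev"

    # Write through the Oregon API...
    requests.post(west + "/item", json={"itemId": "foobar", "comment": "hello"})

    # ...give the global table a moment, since replication is asynchronous...
    time.sleep(2)

    # ...and read it back through the Virginia API.
    resp = requests.get(east + "/item/foobar")
    print(resp.status_code, resp.json())   # expect the item written in us-west-2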
You can also go and check in the console that your API Gateway is correctly deployed. Here is my gateway: you can see my dev blog-v2 API deployed in Oregon, and I also have another one in Virginia, so if I switch to Virginia you see the same API endpoint there. The API Gateway shows the paths that are defined: create, which is the post function, get, to fetch the item, and health, which is also a GET. So now both systems are deployed globally; this is what we have, but that is not enough. What we need is a routing policy between the two systems.

So let's back up. I have a domain name — this is my own domain — and I created a CNAME called global on it. When you look at it, it defines an endpoint, global dot my domain, and it uses a traffic policy; mine is the one I created for this blog demo. Let's have a look at what that is. The traffic policy is what routes and balances traffic between regions — you saw weighted round robin, latency-based routing and geolocation-based routing. Here, what I've done is create a weighted round robin policy with two weights of 50 — can you see it at the back of the room? good — so I balance the load between the two regions, 50% and 50%, and I'm adding health checks; each health check targets the Lambda function that returns 200. The two branches of the policy target two endpoints, the target domain names, and notice that this is not the endpoint we got from API Gateway: when you deploy API Gateway, you see it has a default domain that is very different from what we have in this target. That is because there you would be using the normal CNAME of the API Gateway, but we want a custom domain. When you are in API Gateway and you want to go global, you need to use what we call custom domains: here I imported the certificate for my domain with the CNAME global and assigned it to my API Gateway. Now my API Gateway answers on my custom domain, and I can route traffic with the right CNAME: the target domain name in one region and in the other are the names you use in the routing policies, and Route 53 will route the traffic to those target names. So now the system is pretty much ready.

Let's have a look at the health checks, because right now my health checks are supposedly returning 200. When you create a health check — let's create one from scratch so you know what it is — you give it a name, for example test, you define what it is, and you can give it a domain name. Under advanced configuration, which is what I want to show you, you can choose to probe your health check every 30 seconds or every 10 seconds, and, most important, there is what we call the failure threshold. The failure threshold is the number of times the health check has to return an error before we believe the endpoint is actually down, and it exists to avoid overreacting to intermittent errors.
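Creating one of those health checks through the API looks roughly like this with boto3 — the endpoint host is a placeholder, and the threshold here is set to a less aggressive value than the one used in the demo:

    import boto3

    route53 = boto3.client("route53")

    route53.create_health_check(
        CallerReference="blog-v2-us-west-2-health",   # any unique string
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "aaaa1111.execute-api.us-west-2.amazonaws.com",
            "ResourcePath": "/dev/health",
            "RequestInterval": 10,     # fast checks: probe every 10 seconds
            "FailureThreshold": 3,     # several consecutive failures before failing over
        },
    )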
In a distributed system, one of the biggest problems is errors that are not real: you have such a big network, and this goes over the internet, so at any given time you can have a network error. It doesn't mean the system is down; it might just mean that your request timed out or didn't get through. You don't necessarily want to overreact, and this is a very important thing when you do DNS failover: do not overreact, because you will have intermittent errors. You need to figure out, for your system, what kind of errors you are willing to accept before believing that the system is actually experiencing a real issue. Here, in the system I have, we are doing fast checks, every 10 seconds, and I set a failure threshold of 1, which means I will overreact at the very first error — which is absolutely what you should not do, so don't do that, but I want the demo to fail fast. I created two health checks, and what they do is target this URL, which is my API Gateway, and call the health path, so if there are errors, a failover is triggered. Both are fast checks, every 10 seconds, with a failure threshold of 1, and you can see that both health checks are currently returning 200; this one is the health check for us-east-1, which is Virginia.

So now let's go and break the whole thing — actually, let's first make some requests against the system to show that the data gets replicated. I love breaking things, that's the problem. Let me open a tab and send some data: what I'm doing here is a for loop from 0 to 70, and for each i between 0 and 70 I create an item called foobar-0, 1, 2, 3, 4, 5 and so on, so I'm going to post a lot of data. I'm also creating a second tab, and I'll explain why: when you make HTTP requests like this, your local machine caches the DNS resolution, so you can get some stickiness on the query; running two parallel loops helps you hit both regions. Now a lot of items are being created in DynamoDB, so I can go into the console, and in my table called global you will see items being created from different regions: west, east, west, east. And this is something very important to realize: when your routing policy is 50/50, it does not mean one request here, one request there, strictly alternating; it's only after about a thousand requests that the load eventually balances across both regions. That's the first thing to realize, and it's why I don't see a perfectly even spread, but you can see, for example, that I do have some east here and some west there; eventually it is 50/50, so the traffic really is distributed between the two regions.

OK, the loops have stopped, so now let's delete all the items in the database... I have a hundred items... and then the next forty... something's off there... let's just delete everything... there, all my data has been deleted in all the regions, the table is empty, and all my health checks return 200. So now let's break stuff: let's put a 400 in the STATUS variable and save it.
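The same switch can be made with one API call instead of editing the function in the console — a hedged sketch with a placeholder function name; note that this call replaces the function's whole environment variable map:

    import boto3

    lambda_west = boto3.client("lambda", region_name="us-west-2")

    # Make the health function in Oregon start answering with an error...
    lambda_west.update_function_configuration(
        FunctionName="blog-v2-dev-health",
        Environment={"Variables": {"STATUS": "400"}},
    )
    # ...and set STATUS back to "200" to restore the region once the experiment is over.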
Very quickly now, let's watch: you can see the health checks are going to start reporting errors... oops, which region am I in? Ah, I'm in Oregon, wrong region... There, you can see my API is starting to show failures on this endpoint, which means the system is going to decide there are too many errors and start to fail over. When Route 53 checks an API, it currently checks it from six different regions and three Availability Zones each, so 18 different checkers, and we wait for those 18 to see the failure, so it takes a moment; the 10-second checks are distributed, so it actually goes quite fast. And you can see the endpoint is already marked unhealthy, which means all my traffic will be pushed to the other region. Let's verify: let's go there, push some more traffic, redo the whole thing, and then I can start checking the DynamoDB table, and eventually all my items should be in us-west-2.

All right... there is a demo issue. What is happening? Is it my traffic? Am I in the right... that's interesting. The unhealthy one is my west, so all the traffic should go to the east, right? We all agree on that. So why do I see west items here? Oh, because I was checking in Virginia, and the data is not there yet, it's not synchronized... all right, let's check the items... what on earth is happening... let's check this out. Let's go into the global table... it's deployed... let's check my health checks... that's interesting, west is deployed, 200, 200... are the right Lambda functions here? Yes... let's go into the east... my API Gateway is there... Has anyone spotted an error that I haven't seen? I'll give $100 of AWS credits to whoever finds the error with me. Let's bring the mistakes up... it's probably an error in one of my environment variables somewhere... sorry, what was that? Let's reload, just to see, because DynamoDB is showing roughly the right stuff... okay, let's delete all the items here and see what happens... actually, let's go into Lambda and check that my code really has the get function, I especially want to see the get... trust me, it works; it's the first time this demo fails... I probably did something wrong to it... did I change the... the global...? A thousand dollars, guys, if you come up with the right answer. Yes, you at the back, you have a microphone? Ah, that's smart: do I have the wrong health checks on my traffic policy? Let's look at the traffic policy... the health check on this endpoint... it points at the west region's health check... yes, you're right. You just earned yourself a lot of credits. Awesome.

So yes, the error, as you very well spotted, was that my traffic policy had the wrong API health check attached — the health check from the wrong region. Very well spotted, and it does work, trust me. Anyway, if you want to read about all of this and see all the code, it's on Medium and on my GitHub account. I built it all from scratch, so you can follow the code and the explanations; it's a series of three blog posts, and it also shows you how to do this within a VPC, if you want to deploy DynamoDB with a VPC endpoint and that kind of thing. Not everything is supported yet: for example, Cognito authentication doesn't yet support multi-region.
So if you want to use Cognito, you need to create the user pool in one region and create the same user pool in the other one; you can do that through a Lambda. Anyway, what I want to finish with is this: a few years ago it took engineering teams of twenty people and a few months to build this kind of thing, and now we can do it in a couple of minutes — when there are good folks in the room who can debug the code. Thanks very much, and have a good day.
Info
Channel: Amazon Web Services
Views: 5,465
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud
Id: k9_9bDZa_EI
Length: 65min 56sec (3956 seconds)
Published: Wed Nov 14 2018