AWS Case Study: McDonald's Home Delivery

Video Statistics and Information

Captions
Good afternoon, everybody — hope you're enjoying re:Invent. My name is [inaudible]; I'm from McDonald's Corporation. We're here to talk about how McDonald's uses ECS to massively scale our home delivery platform. Let's get right to it. We'll talk a bit about McDonald's and the home delivery solution, but most of the time today will be spent on how we achieved things like scalability, speed to market, security, DevOps, and monitoring.

Some interesting facts about McDonald's: we have 37,000 restaurants spread across 120 countries, and we proudly serve about 64 million customers every day. As you know, scalability is a difficult problem, but scalability across a distributed network like this, at this level of volume, is an even tougher problem to solve. We'll talk a bit about that as we go on.

Here are some of the velocity accelerators we at McDonald's use. Number one is our digital transformation: the whole premise is to make the experience more convenient and more personalized for our customers. The second pillar is delivery — again drawing on that convenience theme: how do we deliver our food to you when you want it? The third is the Experience of the Future, which is about elevating and modernizing the restaurant experience for our customers.

So let's get to the home delivery solution — this is where we use ECS to really scale. Think of this as you, a customer, going to something like Uber Eats and ordering McDonald's food for delivery. That's the business problem and use case. We work with multiple delivery partners around the world: here in the States we use Uber Eats, in European countries we have other partners, and in Asian countries still others. We've put together a generic experience flow to walk through the user experience. The flow
starts with you picking a restaurant to order from. Then you browse the menu — I've used our Signature Crafted sandwiches here to illustrate — build your order basket, and complete your order. Once the order is complete and the delivery rider or driver is close to one of our restaurants, the order is released to the restaurant — we believe in making our food as fresh as possible — and handed to the driver to be delivered to you. That's the business problem we're going to talk about.

So what were some of the critical business requirements? First, speed to market: this was a four-month effort for us, going from an idea to a concept to development to massive scale — and that's the new norm we see every day. Second, scalability and reliability: 250 to 500 thousand transactions per hour, and the notion of peak hour happens three times a day, every day, because you've got to eat breakfast, lunch, and dinner. To put that in perspective, 250 to 500 thousand transactions per hour translates to about twenty thousand transactions per second for us — that's the scale we're talking about. Third, multi-country support: different countries have different business requirements and business rules, and different delivery partners, such as Uber Eats, to work with, so the platform had to handle that. And finally, cost sensitivity: we're not talking about selling big-screen TVs here — the average check size is about three to five dollars, as low as that — so cost sensitivity is a pretty big thing for us as well.

Now let's spend some time on our architecture — an under-the-covers look. The experience
that we walked through starts at the third-party delivery platform — think Uber Eats. All our APIs are hosted through an API middleware layer, consistently using the API gateway pattern. These are all REST APIs, wired through load balancers to ECS, and as you can see, ECS is the heart of the solution. Within ECS we have multiple microservices — for illustration purposes we've shown two — but it's important to understand that these microservices have different scale and runtime profiles. Some customer-facing services have tremendous scale, reliability, and performance requirements, because they're front-facing and hitting that 20,000 TPS all day long. Other services are more like complex event processing scenarios, where workload optimization is what matters. As mentioned, you can apply different scale profiles using auto scaling policies, with CloudWatch alarms triggering those policies, and you can use task placement strategies to optimize further — we'll go into more detail about how we did that to achieve this scale.

The order then goes to the restaurant, and behind the scenes, for eventing, we use SQS. Think of this as one service talking to the next through an asynchronous pattern; we use eventing exclusively for that. It's not just about scale, though — to be highly responsive and performant you need to keep a lot of things in memory, so we use Redis as our distributed caching platform, hosted through ElastiCache. That's how we hit those transaction volumes at about 100 milliseconds or lower per call. Obviously it can't all live in memory, so we have RDS backing Redis up, as well as S3 for some of the more unstructured data.
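The read path just described — check Redis first, fall back to the database, then populate the cache — is the classic cache-aside pattern. A minimal Python sketch, with a plain dict standing in for the Redis client and a caller-supplied `load_from_db` function standing in for the RDS lookup (both are illustrative, not McDonald's actual code):

```python
import json

class OrderCache:
    """Cache-aside: serve from the cache when possible; on a miss, load
    from the backing store and populate the cache for the next caller."""

    def __init__(self, load_from_db):
        self._cache = {}              # stands in for a Redis client
        self._load_from_db = load_from_db
        self.db_reads = 0             # instrumentation for the example

    def get_order(self, order_id):
        key = f"order:{order_id}"
        cached = self._cache.get(key)
        if cached is not None:
            return json.loads(cached)         # cache hit: no DB round trip
        self.db_reads += 1
        order = self._load_from_db(order_id)  # cache miss: go to the database
        self._cache[key] = json.dumps(order)  # populate for the next caller
        return order

# Two reads of the same order cost only one database round trip.
db = {"1001": {"order_id": "1001", "status": "RECEIVED"}}
cache = OrderCache(lambda oid: db[oid])
first = cache.get_order("1001")
second = cache.get_order("1001")
```

Serving repeat reads from memory like this is what keeps the hot path near the 100-millisecond target the talk mentions.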
That hopefully gives you an idea of our architecture and how we achieved these big volumes, such as 20,000 TPS. So what principles did we use? This isn't going to be a talk about microservices, but here are some principles. Having clean APIs was number one; having a good service model behind each API was number two. Then it depends on what level of isolation you require: you want the data model to be isolated, and you want deployments to be isolated, so that each microservice can be deployed independently. Getting your microservice strategy right is critical for containerization, and once you get containerization right, orchestrating those containers is what lets you massively scale — this is where a platform like ECS shines, because you get most of it out of the box.

We also made a conscious decision to use AWS platform services wherever possible. Rather than maintaining your own database cluster or your own caching cluster, use these managed services, because they're scalable out of the box.

Finally, for the developers and software engineers in the room: get your programming model right. If you have a highly critical, customer-facing microservice, use a synchronous programming model; if you have a complex event processing scenario, use an async programming model. Pick the model, build your microservice, and then containerize — that will save you a lot of time.

All right, let's go under the covers. We'll talk about speed to market and how we achieved it, then scalability and reliability — what task placement strategies and auto scaling policies we used — and I'm going to get into the meat of that.
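The programming-model guidance above can be sketched as two tiny handlers — a blocking one for a customer-facing call, and an asyncio consumer for event processing. The handler names and event strings here are invented for illustration:

```python
import asyncio

def handle_order_sync(order_id):
    """Synchronous request/response style: the caller blocks until the
    answer is ready. Suits a latency-critical, customer-facing API."""
    return {"order_id": order_id, "status": "ACCEPTED"}

async def process_events(events):
    """Asynchronous consumer style: many in-flight events are awaited
    concurrently. Suits complex event processing behind a queue."""
    async def handle(event):
        await asyncio.sleep(0)        # placeholder for real I/O (SQS, DB, ...)
        return f"processed:{event}"
    return await asyncio.gather(*(handle(e) for e in events))

sync_result = handle_order_sync("1001")
async_results = asyncio.run(process_events(["order-created", "order-released"]))
```

Either style containerizes the same way; the point is to pick the model that matches the service's workload before you containerize.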
Security is all about reducing your blast radius and attack vectors: how did we do that at the container level as well as at the service level? Finally, we'll talk about DevOps — how we integrated our DevOps pipeline, which is based on Jenkins — and monitoring: once you're in production, how do you monitor at this scale?

Let's get right to speed to market. We talked about the four months, but it's not just the four months — it's also about showing progress back to your business. How do we build continuously and show progress? We had two-week dev iterations, and what ECS and containers really help with is that you can take your dev containers to staging and show progress to your business users very rapidly. That was one of the big premises of this.

The second thing is the polyglot tech stack. You're bound to have code written in different languages; in our case we have some code in .NET and some in Java. Some of it might be legacy code you need to port over; sometimes Java is simply better than .NET for certain things. In the old days you'd have to do native integration from .NET to Java, but the beauty of containerization and ECS is that you can host them in two different containers and have the two containers talk to each other through an API. That was also a pretty big benefit for speed.

The simplified ECS deployment model — we've been through that in detail, so I'll skip it here. One more important point: typically, as good developers, we all write code, do a level of testing, do a level of performance testing, and then take on a massive cycle of integration and scaling; hitting these volumes normally takes a long time. The good news is that containers change this.
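The .NET-to-Java integration pattern described earlier — two containers talking over a REST API instead of native interop — can be sketched with two stand-in services in one Python process (the `/menu` endpoint and its payload are invented for illustration):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class MenuHandler(BaseHTTPRequestHandler):
    """Stands in for one containerized service exposing a REST endpoint."""
    def do_GET(self):
        body = json.dumps({"items": ["Signature Crafted Sandwich"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):     # silence request logging for the demo
        pass

server = HTTPServer(("127.0.0.1", 0), MenuHandler)   # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The second "container" calls the first over plain HTTP -- no native interop.
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/menu") as r:
    menu = json.load(r)
server.shutdown()
```

In the real platform each side lives in its own container behind a load balancer; the language on either end of the HTTP call is irrelevant, which is the whole benefit.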
With ECS and the correct auto scaling and task placement strategies, we got this almost out of the box — a pretty significant point for us. And finally, DevOps integration: ECS integrating with our DevOps tool chain very easily really helped us. So that's speed to market.

Next, scalability and reliability. I'm going to introduce one of my solutions architects, who will cover not only scalability and reliability but also security and DevOps, and he'll end with some real-world container problems we had to face and overcome to scale to this level.

[Second speaker] Over the next few slides I'll run through how we used ECS and other AWS features to achieve the non-functional requirements, starting with scalability and reliability. As mentioned, our scale targets were 250 to 500 thousand orders per hour with about 100-millisecond response times. We achieved this using the auto scaling that ECS provides out of the box — you just have to configure the policies and it works. ECS auto scaling operates at two levels: one scales the EC2 layer, your instances; the other scales your tasks. Our approach was to first run performance tests with load that mimicked production, to identify what our production load would look like — this is really important: you need to know how your load will behave in production. From that we derived the thresholds for our auto scaling policies, and we then configured the EC2 auto scaling policies as well as the container-level auto scaling policies.
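The two scaling layers can be sketched as the configuration that drives them. Below is the shape of a step-scaling policy for the task level, as it would be passed to the Application Auto Scaling API, plus illustrative EC2 Auto Scaling group bounds — every name and number here is an assumption, not the team's actual configuration:

```python
# Task level: grow the ECS service's desired count when a CloudWatch alarm
# fires. Shaped like Application Auto Scaling's put_scaling_policy
# parameters; cluster/service names and thresholds are placeholders.
task_scaling_policy = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/delivery-cluster/order-service",
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyName": "order-service-scale-out",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [
            # Above the alarm threshold, add two tasks per evaluation.
            {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 2},
        ],
        "Cooldown": 60,   # the kind of value you derive from load tests
    },
}

# Instance level: bounds for the EC2 Auto Scaling group hosting the cluster.
ec2_asg = {"MinSize": 4, "DesiredCapacity": 8, "MaxSize": 40}
```

Getting both layers' thresholds from real load tests, as the talk stresses, is what prevents flapping when scaling in as well as out.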
It's critical to get these values right, because otherwise you'll have issues when scaling out as well as when scaling in — so it's important to get them right the first time. Once we'd done that, we were able to achieve the target of 250 to 500 thousand orders per hour.

The next task was to fine-tune. We had two more requirements. The first, as mentioned before, was that some containers should run in isolation — for instance, for a certain country we needed to run its containers in isolation from the others. The second was cost sensitivity: we needed to optimize our cost. We achieved both by using task placement strategies and constraints.

To show how we used placement strategies and constraints, I've given three examples — three services with different requirements in terms of scalability and reliability. The first service requires high availability and reliability. The second service runs in a batch processing mode, but requires us to optimize the load on the cluster. The third service, which I mentioned earlier, requires isolation from the other containers.

Before we look at each: in the diagram, at the top you can see the tasks — the containers — being auto scaled using the policies you've configured; at the bottom, the EC2 instances being auto scaled, again using your auto scaling policies; and in the middle, the task placement strategies and constraints being applied, so that
ECS can decide exactly where each task should be placed — into which EC2 instance.

Back to the three examples. Service one, if you remember, requires high availability, which means we need its tasks spread across all our Availability Zones. For that we used the placement strategy called spread, with the attribute specified as availability zone: when ECS places those tasks, it makes sure they're spread across all the Availability Zones in our cluster. As mentioned, you can use different attributes — instance type, instance ID — depending on your use case, but for us it's the Availability Zone.

Service two is the batch processing service, so we want that workload to run as efficiently as possible, which goes to our cost optimization. For that we selected binpack on memory: ECS places the new tasks so as to optimize memory usage on the cluster, and we get a cost benefit out of it.

Service three requires isolation. For that, we create a task group, and when placing those containers we say: place them with this task group. Then, when we configure the other services' rules, we can say not to place their containers on an instance that has a container tagged with that particular task group.

OK, let's move to security. This is again a big topic these days, especially when you're running containers on a cluster. I want to touch on two things: container security, and ECS instance security — the ECS instance being the EC2 host itself; that's the term AWS uses.
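The three placement approaches above map onto ECS placement strategy and constraint objects. A sketch of their JSON shapes — the task-group name `country-x` and the exact constraint expression are illustrative assumptions:

```python
# Service 1 -- high availability: spread tasks across Availability Zones.
spread_across_azs = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
]

# Service 2 -- batch/cost optimization: binpack on memory so tasks fill
# existing instances densely before new ones are used.
binpack_on_memory = [
    {"type": "binpack", "field": "memory"},
]

# Service 3 -- isolation: other services carry a constraint keeping them
# off instances that run tasks from the isolated group ("country-x" is a
# made-up task-group name; expression syntax per the cluster query language).
keep_away_from_isolated_group = [
    {"type": "memberOf", "expression": "not(task:group == 'country-x')"},
]
```

These lists go into the `placementStrategy`/`placementConstraints` fields of a service or run-task request; ECS applies them between the two auto scaling layers, exactly as in the diagram described above.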
Container security we control using IAM policies: we make sure we give only the permissions required by that container. You can control all your AWS resource access through IAM policies, so if a container doesn't require any ElastiCache access, don't specify it in the policy — grant just the minimum access the container needs. The second point is inherited from our architecture: as mentioned, a lot of the inter-service communication happens through the eventing pattern, so we don't have many APIs exposed between services, which reduces our attack footprint.

Then there's ECS instance security. This is important because all these containers run on EC2 instances, and you have to make sure those instances are patched and hardened so there are no vulnerabilities on the cluster. For this we have an automated process: we take the latest AWS ECS-optimized AMI, apply our own hardening, install our own security agents, and it spits out the application-specific AMI. We call this automated process our AMI factory. AWS also maintains an SNS topic: whenever they publish a new AMI, they send a notification to it. We subscribe to that topic, the pipeline runs automatically, we get the latest AMI, and it gets hooked into our DevOps pipeline — which is what I'll talk about next.

DevOps and CI/CD were really critical for us, since we had to deliver within a very short time, so we needed this from day one — you can't do DevOps at the end of development; it has to start when development starts.
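The least-privilege idea described above for a container's task role can be sketched as a policy document granting only the one SQS queue the service consumes. The account ID, queue ARN, and action list are placeholders for illustration:

```python
# Task-role policy for a service that only consumes one SQS queue.
# No ElastiCache, S3, or other permissions are granted -- if the
# container doesn't need it, it isn't in the policy.
task_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",
            ],
            "Resource": "arn:aws:sqs:us-east-1:111122223333:order-events",
        }
    ],
}

granted_actions = task_role_policy["Statement"][0]["Action"]
```

Scoping the role to one resource and three actions is what keeps the blast radius small if that one container is ever compromised.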
The base of our DevOps pipeline is two main components: Jenkins and Terraform. Jenkins does the orchestration of the pipeline — getting everything compiled and pulled together — and Terraform takes care of deploying the containers to the cluster, as well as the other AWS resources. To walk through the pipeline at a very high level: you start, as mentioned, by building the Docker image. We have validation scripts that check the integrity of these images — whether they use the correct base Docker image and whether the configuration is correct. Once that's done, the artifact is uploaded to an S3 bucket, which triggers the Jenkins pipeline. Jenkins takes the Docker images and runs the integration tests; if they all pass, the image is uploaded to ECR, the repository for the images. From there, Terraform takes over: it pulls those images and deploys them to the cluster.

Next is monitoring. We use two components to monitor the platform: New Relic and the ELK stack. New Relic gives us all the telemetry data for the EC2 instances and containers, as well as the AWS managed components — SQS, ElastiCache, and so on — in one dashboard, so we can see everything about the EC2 instances, the containers, and the AWS components in one place. For application logging we use the ELK stack — the typical implementation — but note the log driver: ECS supports many log drivers, and here we use syslog, which streams all the container logs to Logstash; Logstash forwards them to Elasticsearch, and Kibana is what we use for visualization.
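The syslog-to-Logstash routing just described is expressed in the task definition's log configuration. A container-definition fragment showing the shape — the image URI and Logstash address are placeholders:

```python
# Container definition fragment: stdout/stderr is streamed off the host
# via the Docker syslog driver rather than written to local disk.
container_definition = {
    "name": "order-service",
    "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/order-service:latest",
    "logConfiguration": {
        "logDriver": "syslog",
        "options": {
            "syslog-address": "tcp://logstash.internal:5000",  # placeholder host
            "tag": "order-service",      # lets Logstash route per service
        },
    },
}
```

Streaming logs out via the driver is what allows the containers and hosts to stay stateless, as the talk goes on to note.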
Since we don't store any state in our containers — there's no persistence layer on the containers or the EC2 instances — everything gets streamed out to Logstash.

Finally, I want to touch on a few of the challenges we had during this four-month development. The first: we were getting out-of-memory errors from the containers. This was a known issue in the Docker and Linux communities. The problem is that the application runtime running inside the container doesn't see the container's memory limits — it sees the host's memory limits. Because of that, garbage collection doesn't trigger properly, and you get out-of-memory exceptions. The root cause is that cgroups are not, let's say, container-friendly, so the application runtime doesn't see those memory limits. There are a few workarounds. With a Java runtime you can cap the heap size at a fixed limit. Unfortunately we were running .NET Framework 4, which didn't have that feature at the time — I believe the latest .NET Core versions have the same capability as the Java heap-size option — so we decided to use the LXCFS filesystem, which virtualizes the cgroups into the runtime, so the runtime can now see the memory limits the container actually has.
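Another way around the memory-limit problem is to read the cgroup limit directly and size the heap from it. The sketch below illustrates that idea; it is not the LXCFS mechanism the team actually used, and the 75% fraction and 512 MiB default are arbitrary assumptions:

```python
CGROUP_V1_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
UNLIMITED_SENTINEL = 1 << 60      # cgroup v1 reports a huge number when unset
DEFAULT_HEAP = 512 * 1024 * 1024  # fallback when no container limit is visible

def heap_from_limit(limit_bytes, fraction=0.75):
    """Derive a heap size from the container's memory limit, leaving
    headroom for the runtime itself and native allocations."""
    if limit_bytes is None or limit_bytes >= UNLIMITED_SENTINEL:
        return DEFAULT_HEAP
    return int(limit_bytes * fraction)

def read_cgroup_limit(path=CGROUP_V1_LIMIT):
    """Read the cgroup v1 memory limit; None if unavailable (e.g. cgroup v2)."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None
```

Modern runtimes (Java 10+, recent .NET Core) do this container-limit detection themselves, which is why the workaround is mostly historical.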
The second challenge is related to networking. By default, ECS provided two network modes — actually now there are three, but initially there were two: bridge and host. We were using the default bridge network, which means all the containers sit behind a Docker bridge, and the Docker bridge connects to the host's network interface. But we had a requirement from our security team to route all our container traffic through a secondary elastic network interface on the host. The issue was that the Docker bridge wasn't honoring that routing rule — it was always hardwired to the primary interface of the EC2 instance — so we couldn't use that feature as-is. What we did was a custom implementation on the Docker bridge to make sure all the traffic was routed to the secondary interface. That was custom work at the time, but a few weeks ago the ECS team released exactly the solution we wanted: the awsvpc network mode, which lets you bind an elastic network interface directly to a Docker container — a task. Now you can have a one-to-one mapping of tasks to network interfaces, which gives you granular control of your network as well as the ability to implement security on top of it.
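The awsvpc mode just described is selected in the task definition, and the service then names the subnets and security groups that apply to each task's own ENI. A fragment showing the shape — all identifiers are placeholders:

```python
# Task definition fragment: one ENI per task instead of the shared bridge.
task_definition = {
    "family": "order-service",
    "networkMode": "awsvpc",          # vs. the default "bridge" mode
}

# Service fragment: with awsvpc, subnets and security groups apply per task.
service_network_config = {
    "awsvpcConfiguration": {
        "subnets": ["subnet-0abc1234"],          # placeholder IDs
        "securityGroups": ["sg-0def5678"],
    }
}
```

With a one-to-one task-to-ENI mapping, the routing requirement above falls out naturally — no custom bridge work needed.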
OK, to wrap up, I'll hand back over. [First speaker] Thanks. Some final takeaways and thoughts for you. One: you need good microservices to containerize. That's probably stating the obvious at this point, but having a good microservice strategy enables scalability, reliability, and good containerization — and once you have good containerization and a microservice strategy, massive scale is really achievable through ECS. A big kudos to the ECS team for making this available out of the box: the auto scaling policies that were mentioned, as well as the task placement strategies, really helped us achieve this 20,000 TPS within 100 milliseconds per call. We tried to break it, and it didn't break, which was awesome to see. If you haven't moved to containers yet, do it — it will simplify your life so much, all the way from development to production. And ECS's out-of-the-box capabilities, such as ALB integration, really helped us too — again cutting down development time, simplifying things, and reducing complexity. Those were some of the big learnings for us. Big kudos to our development teams and partners as well — the main development team was based out of Hungary, so big kudos to them. Hope you enjoyed our learnings and the talk in general; we'll be around for another 15 minutes or so. Thank you very much. [Applause]
Info
Channel: Amazon Web Services
Views: 35,235
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, McDonalds, containers, microservices, docker
Id: -8FK9p_lLy0
Length: 28min 10sec (1690 seconds)
Published: Wed Jan 17 2018