AWS re:Invent 2018: [REPEAT 1] Run Production Workloads on Spot, Save up to 90% (CMP306-R1)

Video Statistics and Information

Captions
All right, that sounds like it's working. We've got a fair bit of content to get through, so we'll get stuck straight into it. Hopefully you're all in the right place: we're here to learn about how to run production workloads on Spot and save up to 90%. My name is Boyd; I'm a senior product manager on the EC2 team focused on Spot Instances. I'm lucky enough to be joined today by Scott and Gabriel from Citadel and MercadoLibre, who will talk about how they have adopted Spot and how they leverage it in production workloads. So hopefully there's an opportunity to learn not only from me but from people who have already walked the path we're talking about today.

The agenda: we'll start with the fundamentals of Spot and some of the common use cases. I'll take about 10 to 15 minutes to walk through that before I pass it to Scott to take you through Citadel's journey, and then he'll hand it to Gabriel to take you through MercadoLibre's journey of leveraging Spot in production across a couple of different types of workloads.

So let's jump into the fundamentals of Spot. While I'm the Spot product manager, and I hope you come away from this session recognizing that Spot is potentially more broadly applicable than you thought walking in, I'm not going to tell you that it's useful for every possible type of workload. We have multiple purchase options, and we recommend you leverage all three of them wherever possible.

First, On-Demand Instances. I think most people know what On-Demand Instances are; that's how most people get started with EC2. You turn on an instance, you're billed by the second, and when you don't want it anymore you turn it off and we stop charging you a second later. It's ideal when you're building a net-new application or just getting started with EC2, and great for spiky workloads and for workloads that potentially aren't fault tolerant.

Then we have Reserved Instances. The idea is for customers to say, "This workload isn't that spiky, so I don't need all the elasticity On-Demand provides. Amazon, I'm going to commit to keep using this instance for the next one to three years, and in return I'd like a discount." A Reserved Instance offers you up to 70% off the price of an On-Demand Instance based on your commitment to keep using that instance. It's great where you have a base level of steady-state work, or things like databases that you're probably not going to turn off very often, and obviously not something that scales horizontally.

And then finally, the focus of today's presentation: Spot Instances. Spot Instances are ideal for fault-tolerant, stateless, or time-insensitive workloads, and they can save you up to 90%.

So what is Spot? Spot is truly our spare capacity. There's no minimum or maximum we keep available on Spot; however, to deliver the elasticity you expect from On-Demand, we have to hold spare capacity, and when On-Demand customers aren't using that capacity, it's available for Spot customers to leverage. The caveat, and it's how we're able to give you that up-to-90% discount, is that when we do need the capacity back for On-Demand customers, we give you a two-minute warning and then we take that server back.
That's the arrangement you need to be comfortable with when using Spot Instances.

So, some things about Spot. Obviously it's low price, but low price shouldn't only mean cost savings. We see a lot of people adopt Spot Instances because they're happy with what they're spending but recognize they could do five to ten times more while spending the same amount of money. In the big data and analytics space in particular, we see a lot of customers using Spot to do a lot more with the same budget.

Easy access: Spot used to be a system where you needed to learn some unique APIs with different behavior and potentially different semantics. Over the last couple of years we've invested in making Spot available where you launch instances today, and in enabling best practices through those APIs. Spot is now available in RunInstances, and has been for over 12 months; you just specify Spot and we launch Spot Instances. A couple of weeks ago we also integrated the fleet functionality, which Scott will talk a little about, into Auto Scaling groups, so if you're using Auto Scaling groups it's now possible to combine On-Demand, Reserved, and Spot Instances in a single ASG.

And finally, one of the major benefits of Spot: it's not only that we can save you up to 90%, it's that you keep total flexibility. Just like On-Demand, when you don't need it anymore you turn it off and one second later we stop charging you, but with Spot you're launching some of the cheapest compute possible, so you're cost-optimized by default. You can consume whatever you want, whenever you want, give it back when you don't need it anymore, and you're getting the best possible rate for compute.
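As a concrete illustration of the "easy access" point, here is a minimal boto3 sketch of requesting a Spot Instance through the same RunInstances API Boyd describes; the AMI ID, instance type, and region are placeholder values, not anything from the talk.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # arbitrary region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="c5.large",           # placeholder type; stay instance-flexible
    MinCount=1,
    MaxCount=1,
    # The only Spot-specific part: ask for the spot market instead of On-Demand.
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```

Everything else about the lifecycle is unchanged: the instance shows up like any other, and terminating it stops the billing the same way.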
Now, we talked about spare capacity, and there's a point I want to make that is very important: spare capacity for Amazon is massive. One way to talk about it is how much capacity customers are actually able to capture in the Spot market, and there's a story from Clemson University. They did this over 18 months ago, and our capacity has grown dramatically even since then, but Clemson University was able to scale up to over 1.1 million concurrent vCPUs in a single region. In a single region, they turned on over a million concurrent vCPUs. So if you think, "Spot is spare capacity, so it might not have enough for me," well, if you have a workload that requires over a million concurrent vCPUs, please come see me after this; I'd love to talk to you about it. It's spare capacity, and we have a lot of it.

You might say, "Well, Boyd, I tried to launch 50 or 100 Spot Instances once and you only gave me 50, so what do you mean there's this huge amount of spare capacity?" The one big thing we ask you to do for Spot, which is different from On-Demand or Reserved, is to be instance flexible, because while we never run out of Spot capacity overall, a specific instance type in a specific zone might not have enough available for you. Here is the distribution from Clemson: they used 2XL-and-up third- and fourth-generation instance types (C5s and M5s weren't even out at the time), and that's how they achieved this tremendous scale. This is also how customers run consistently in the Spot market for production workloads: by saying, "If you don't have any c4.8xlarges, I can take some c3.8xlarges," and so on. That's the idea behind Spot: being able to take advantage of all of these capacity pools.

Finally, the Spot rules. We call them simple because they're a lot simpler than they used to be. First and foremost, Spot is spare capacity: there is no minimum, there is no maximum, and we let market forces determine the price. However, those market forces look at Spot capacity over a long period of time, so you're not going to see dramatic price swings. In fact, if you look at the last 90 days of pricing history, which is publicly available, you'll see that Spot prices are pretty steady, because the long-term supply-and-demand function doesn't change that dramatically. The other big thing, for those of you who have been thinking about or using Spot for a long time: bidding is not a thing anymore. You do not bid on Spot Instances. You used to, two years ago; not anymore. You just request capacity, and if capacity is available we give it to you. The only reason we'll take that capacity back these days is if we need to give it to an On-Demand customer; one Spot customer cannot kick out another Spot customer. You just request, and we give it to you if it's available, at up to 90% off the On-Demand price. And I keep saying "up to"; the reality is you're going to see savings in the range of 70 to 90 percent pretty consistently with Spot over On-Demand. Again, we give you that two-minute warning when we need to take it back.
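The 90 days of pricing history Boyd mentions is exposed through the DescribeSpotPriceHistory API. A small boto3 sketch of pulling it follows; the instance type is an arbitrary choice (Scott happens to show a c5.2xlarge price chart later), and each price series corresponds to one instance-type-per-Availability-Zone market.

```python
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # arbitrary region

end = datetime.now(timezone.utc)
start = end - timedelta(days=90)   # the publicly available window

paginator = ec2.get_paginator("describe_spot_price_history")
for page in paginator.paginate(
    InstanceTypes=["c5.2xlarge"],           # arbitrary example type
    ProductDescriptions=["Linux/UNIX"],
    StartTime=start,
    EndTime=end,
):
    for point in page["SpotPriceHistory"]:
        # One price series per (instance type, Availability Zone) pair.
        print(point["AvailabilityZone"], point["Timestamp"], point["SpotPrice"])
```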
Okay, another big point. When we talk about Spot we say "save up to 90%," but the caveat is that interruptions can occur, so we naturally spend a lot of time talking about interruptions. You might get it in your head that we're going to take these servers away all the time: "I'm never going to finish my work; I'll turn it on and 20 minutes later Amazon will take it off me." The reality, and that's what this stat is saying, is that over 95% of the time customers finish their work and turn the Spot Instances off themselves. Less than 5% of the time is Amazon actually interrupting customers. So to use Spot Instances you must be prepared and able to handle an interruption, but don't confuse that with the idea that it's going to happen all the time.

Finally, some common use cases before I hand over to Scott. This is not meant to be a comprehensive list, but if you're asking which workloads you should identify as opportunities to leverage Spot, these are the big four.

First and foremost, big data. Big data is an example where you might actually have state on the instance, but if that instance is interrupted you don't lose data, you just lose work. The idea is that your data lake lives in S3 or HDFS or somewhere else, and if an instance that's been doing processing gets taken away, you can just reprocess that work. Remember, over 95% of the time you're probably going to finish your work and turn it off yourself, so if every now and then you need to reprocess a little bit of work, it's up to you to decide whether that's worth saving up to 90 percent. It's also a common place where people take advantage of the price point and just do more with the same spend; with a hundred bucks you might get ten times more capacity using Spot.

CI/CD. Hopefully this is an intuitive one: if you're running test and build servers, you should be able to handle a failure gracefully, because that's the whole point of a testing and deployment system. A Spot interruption is just a different type of failure to handle in a CI/CD pipeline.

Then web services. This one's a little counterintuitive, and I'm excited that we have Gabriel to talk about MercadoLibre's example here, because you might think, "No, my website needs to be up 24/7; I couldn't possibly use Spot; you can't take a server off me." Well, hands up if anybody has waited for a website to load for more than two minutes in the last five years. Okay, two people; I don't know why you're waiting that long. Normally by the time a minute has passed I assume my Wi-Fi is broken and start unplugging things. That two-minute warning can be a lifetime for a stateless web service: if the instance is taken away, the current requests should complete and all future requests get routed gracefully to the other instances behind the load balancer. So if you're designing stateless, scalable web services, you might find you can adopt Spot quite easily.

Then high performance computing. This is a loaded term; most people in the HPC market use Spot for grid-style, high-throughput computing workloads that are loosely coupled, where a single node failing doesn't destroy the entire cluster. I'm happy to say Scott is going to talk about Citadel's experience running these loosely coupled workloads on Spot. Similar to big data, when an interruption occurs it's often just a matter of quickly reprocessing, and ideally the time you lose is unimportant versus the up-to-90% savings.

And one more. This isn't a workload, but I have to talk about it: containers. If you're using containers, and I imagine a lot of you already have a container strategy even if you're not using them in production yet, containers and Spot are a match made in heaven. Containers can run here, they can run there, they can run anywhere, so the instance flexibility I talked about is really easy with containers. The other nice thing is that most of you have probably designed modern software architectures as you've migrated applications into containers: you're probably scalable, probably fault tolerant, you've probably thought about all of these things, and instance flexibility comes out of the box.

So Spot is ideal for fault-tolerant, flexible, loosely coupled, or stateless workloads. As you listen to Scott and Gabriel, you might not have exactly the same workloads, but look for these commonalities, then go and look for them in the workloads you run on EC2 today, or even the ones you run on premises, and ask whether your workload has some of these characteristics. If it does, there's a good chance you can start taking advantage of Spot Instances today. Scott?
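Being "prepared and able to handle an interruption" mostly comes down to watching for the two-minute notice, which surfaces in the instance metadata. Here is a minimal sketch, not from the talk, that polls for the Spot instance-action notice using the IMDSv2 token flow; what you do once it fires (drain a queue, deregister from a load balancer) depends on the workload.

```python
import time
import requests

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2: fetch a short-lived session token first.
    return requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=2,
    ).text

def interruption_notice():
    # Returns {"action": "terminate", "time": "..."} once the two-minute
    # notice has been issued; the path returns 404 otherwise.
    resp = requests.get(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
        timeout=2,
    )
    return resp.json() if resp.status_code == 200 else None

while True:
    notice = interruption_notice()
    if notice:
        print(f"Interruption {notice['action']} scheduled for {notice['time']}")
        # Start draining work here.
        break
    time.sleep(5)
```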
Good afternoon, my name is Scott Donovan, and I'm a senior cloud engineer with Citadel. Citadel is a multi-strategy hedge fund founded in 1990 by Ken Griffin. We currently have over 30 billion dollars in assets under management and are consistently ranked one of the top multi-strategy hedge funds on the planet. We have roughly 1,800 team members; we're headquartered in Chicago but have offices around the globe. I've been at Citadel for over 15 years, and the vast majority of that time has been spent designing and building systems that help our quants and analysts run their analytics at scale, which is what brings me here today. I'm going to share some information about a system we built about two years ago that's powered by AWS Spot Fleets. I'll talk first about the kinds of workloads we run, then give a high-level overview of the system we built, then walk through a specific use case that runs on that system, and finally share some insights we've gained using Spot Fleets over the past couple of years.

So let's get started. As I mentioned, we're a multi-strategy hedge fund, and while each strategy's analytics are significantly different, they typically fall into the same four categories. There are research workloads: an analyst or a quant has an idea for how to increase the profitability of a portfolio, or perhaps reduce the risk associated with it, and wants to run what-if scenarios. Reactionary workloads: a human or an automated system decides there's been enough market volatility to warrant rerunning some of our analytics in the middle of a trading day. Overnight workloads: at the end of the trading day you gather the most recent inputs for your models and run them at large scale to prepare for tomorrow's trading day. And then there's model back-testing: when you make code changes to a model you have to test those changes, and some of those tests involve iterating over 10 to 20 years' worth of historical data. These are extremely resource-intensive workloads, but it's critical we get them done quickly, because the faster we can test our changes, the faster we can move them into production, and the faster the company can react to the changing market landscape.

In 2014 we determined that our computational needs were soon going to outpace our computational capacity, so either our data center teams were going to get really busy racking and stacking new hardware, or we were going to have to change the way we did things. We decided to run a POC in a public cloud: we took one of our proprietary scheduling systems, moved it into a public cloud, and used it to drive a 45,000-core model back-testing workload. For the most part the POC worked well, and we learned a lot doing it, which was the important part. There were two key takeaways. One: Docker is a really useful technology for migrating workloads from on-premises into the cloud. And two: none of the proprietary schedulers we had
built over the years were particularly well suited to orchestrating Docker containers, so we were going to need a new system. We looked at a handful of open-source tools and some enterprise solutions, but at the end of the day we decided to build our own system on top of the Nomad job scheduler from HashiCorp. I'm not going to get into the details of that decision; we spoke at a HashiConf, you can find the video on YouTube, or you can meet me after this talk and I'll be happy to discuss it.

The new system we built is named Orc. Orc was built by our cloud infrastructure team, which I'm a member of; it's a four-person team that's helping drive public cloud adoption across the firm. The cloud infrastructure team is part of a much larger platform engineering team, which helps steer the technical direction of the firm in general.

This is a high-level system diagram of Orc. As you can see, it's multi-region: there are two regions on-premises at the bottom of the diagram and two regions in the cloud at the top. Inside each region you'll see a little green cube with an N in it; that represents a Nomad and Consul server cluster. We use Nomad for job scheduling and Consul for health checks, load balancing, and service discovery. You'll also see a lot of green squares; those represent Nomad and Consul client nodes, the worker nodes your Docker containers run on. The Orc system has two endpoints: a job submission endpoint, labeled J, and a provisioner endpoint, labeled P. Users interact with the job submission endpoint to get their jobs on and off the cluster, and with the provisioner endpoint to spin up compute clusters in AWS. A cluster is a construct we came up with, and it basically equates to one or more AWS Spot Fleets.

Now I'm going to show you a specific use case that runs on Orc. This use case is for a European trading desk, and it typically runs at anywhere from 2,000-core to 15,000-core scale during trading hours, and at about 30,000-core scale overnight. There are a couple of new elements on the diagram. We've got a work queue; this use case is implemented as a producer/consumer work queue because it's simple, it scales, it works. We've added an autoscaler; the autoscaler subscribes to state-change events on the queue, and when it sees specific state changes it submits requests to the system on the user's behalf. And we've added a worker container; the worker container houses the model binaries, the component that actually performs the calculations.

Here's how the system works. A user puts jobs onto the queue. A job basically just contains storage bucket location information for the worker, so it knows where to read its inputs from and where to write its outputs to. The user pushes work onto the queue, the autoscaler sees that state-change event, and it calls the provisioner to spin up a cluster, a collection of AWS Spot Fleets. It also calls the job submission endpoint to start the user's worker containers on the Spot Instances as they come up. When the workers come up, they connect back to the queue, pop a unit of work off, read their inputs from S3, perform a computation for a few minutes, and write the results back to S3. The workers keep iterating over this process until there's no work left on the queue. When the queue is drained, the autoscaler notices, submits a request to the provisioner to shut down all the Spot Fleets, and submits a request to the job submission endpoint to tear down the worker containers. So it's pretty simple.
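The talk doesn't name the queue technology or the exact job format, so the following is only an illustrative sketch of that worker loop, assuming an SQS queue and a hypothetical job payload of bucket and key locations. The property that matters for Scott's later point about terminations is that a job is deleted only after its result lands in S3, so an interrupted worker's job reappears on the queue for another worker to rerun.

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

# Hypothetical queue URL; the real system's queue isn't named in the talk.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"

def run_model(inputs: bytes) -> bytes:
    """Stand-in for invoking the model binaries baked into the worker container."""
    return inputs

while True:
    received = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    if "Messages" not in received:
        break  # queue drained; the autoscaler will tear the cluster down

    message = received["Messages"][0]
    job = json.loads(message["Body"])  # bucket/key locations for inputs and outputs

    body = s3.get_object(Bucket=job["input_bucket"], Key=job["input_key"])["Body"]
    result = run_model(body.read())
    s3.put_object(Bucket=job["output_bucket"], Key=job["output_key"], Body=result)

    # Delete only after the result is written: if this worker is interrupted
    # mid-computation, the message becomes visible again and is rerun elsewhere.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```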
In the next couple of slides I'll show you some graphs of the system metrics for this use case. What we're looking at here is queued work over time: time is on the x-axis and queue depth is on the y-axis. I mentioned this is a European trading desk, and you'll see the first burst of activity comes in at about 3:15; that's the beginning of their trading day. You'll see 2,000 jobs get put on the queue and the queue quickly drains; a little later 8,000 jobs go on and it quickly drains; later still 15,000 jobs go on, and again it quickly drains. This illustrates the bursty nature of the system: work is coming in and out all day long.

This graph shows running workers over time. When the first burst of activity occurred around 3:15, we pretty quickly started getting workers, the clusters came up to scale, they processed all the work, and the workers were shut down again. Very bursty.

This graph gives a little insight into what's going on with our Spot Fleets. We're graphing the requested fleet size, the blue line, and the current fleet size, the purple line. The current fleet size is basically how many cores you have that are currently up and usable, so when the current fleet size converges with the requested fleet size, your cluster is up to scale: all of your Spot Instances are up and running. In the first burst of activity it took just under five minutes to get all of our compute resources; in the second burst it took a little over five minutes. In the last burst there are all kinds of crazy things going on: the user put work on the queue, the autoscaler noticed and started up Spot Fleets, the workers started and computations began, and then about five minutes in the user decided to put even more work on the queue. The autoscaler saw that, determined it needed more compute, and submitted a resize request to the provisioner, which passed it on to the Spot Fleets. That's why you see that step function.

These next charts show the same information but over a very different time window. The window we just looked at was during trading hours; this is overnight, and as I mentioned, overnight workloads are a bit larger in scale. For this particular overnight workload, about 600,000 jobs were put on the queue, and you can see a pretty consistent downward slope in the draining of the queue. That's what we like to see, because it indicates we're not being impacted by any significant termination events. Looking at running workers over time, we pretty quickly get up to about 22,000 workers, and we're spinning up a 30,000-core Spot Fleet.
We only get 22,000 workers because the memory footprint of our workers is such that we can't always use every single core on the VMs. Another thing to notice about this chart is that once we get to 22,000 workers it stays pretty consistent at that scale; it's not bouncing around, so we're not getting hit by termination events. Boyd spoke a little about termination events: they're not ideal, but for use cases like these they're really not a problem. If we do get hit by a termination event, whatever work those instances' workers were processing just shows up back on the work queue, and the next available worker pops it off and reruns the job. So while termination events aren't ideal, they're not a big deal for this kind of use case. And again, the requested-versus-current fleet size graph shows that we went from zero to 30,000-core scale in 15 minutes. That's pretty big, pretty fast.

Since we're here talking about Spot Fleets, I'll give you a little insight into how our provisioner works. Our provisioner is a REST service: you post a JSON request to its cluster endpoint and it starts up Spot Fleets for you. You have to provide some information in the request: the provider (AWS), the size of your cluster, in this case 20,000 cores, and a machine type. Right now we have three machine types: standard, high-mem, and compute; I'll talk about those in another slide, but for now just know there are three. You also tell it whether you want an ephemeral cluster. An ephemeral cluster, to us, means we build it out of AWS Spot Fleets; if a user says ephemeral is false, we build the cluster out of On-Demand Instances. And while you can use our provisioner to spin up clusters of On-Demand Instances, nobody does: the Spot Fleets work so well, and they're so inexpensive, that nobody provisions On-Demand resources through our provisioner. You can also resize a cluster through the provisioner, as you saw in the system metrics, and obviously you can delete a cluster.

I want to briefly touch on the definition of a Spot market, in case any of you are new to Spot Instances or Spot Fleets. What we're looking at here is a Spot Instance price history graph for the c5.2xlarge instance type over the past three months. What I want to point out is that we are not looking at a single Spot market; we are looking at six Spot markets. A Spot market is the unique combination of instance type and Availability Zone, and the supply and demand for that instance type in that Availability Zone is what drives the price.

A couple of things are important to us for our Spot Fleets and our clusters. First and foremost, we want them to come up as quickly as possible; we also want to minimize the size and scope of any termination events we get. By using a diversified allocation strategy we can achieve both goals. With a diversified allocation strategy you configure your fleet to start up across multiple Spot markets in parallel: you build into your launch specification a large number of Spot markets you're willing to accept resources from, and the fleet starts up across all of them. Not only does that get the fleet up quickly, it also has the nice side effect of spreading your instances across a large number of Spot markets, so you have fewer instances in any individual market. That way, if you do get a termination event in a specific Spot market, odds are you don't have much compute in that individual market.
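To make the diversified strategy concrete, here is a hedged boto3 sketch of the kind of Spot Fleet request a provisioner like Orc's might build: many instance-type by Availability Zone pools in one request, with AllocationStrategy set to diversified. The role ARN, launch template, subnets, instance mix, and weights are placeholders, not Citadel's actual launch specification; capacity is counted in cores via WeightedCapacity.

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder pools: each (instance type, subnet/AZ) pair is one Spot market.
INSTANCE_TYPES = ["c5.2xlarge", "c4.2xlarge", "m5.2xlarge", "m4.2xlarge", "r4.2xlarge"]
SUBNETS = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]

overrides = [
    {"InstanceType": itype, "SubnetId": subnet, "WeightedCapacity": 8}  # 8 vCPUs each
    for itype in INSTANCE_TYPES
    for subnet in SUBNETS
]

response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",  # placeholder
        "AllocationStrategy": "diversified",   # spread across all listed pools
        "TargetCapacity": 30000,               # cores, given the weights above
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": "lt-0abc123def4567890",  # placeholder
                    "Version": "$Latest",
                },
                "Overrides": overrides,
            }
        ],
    }
)
print(response["SpotFleetRequestId"])
```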
Machine type classifications: we took a look at all the different instance types Amazon supports. Some are optimized for memory, so we grouped those into a high-mem category; some are optimized for compute, so we grouped those into a compute category; and everything else we left in standard. What we're doing here is abstracting away from our users the need to know anything about AWS instance types. They don't have to know how much memory or how many cores each instance type has; they just tell us whether their model is more sensitive to memory or to compute, and we do the work of building launch specifications for them.

One other thing we do: we were an early adopter of Spot Fleets, and when they first came out you could get a fleet up more quickly by submitting a few in parallel. So instead of requesting a single 30,000-core Spot Fleet, if you requested four 7,500-core fleets you'd ultimately get your 30,000 cores quicker. Boyd tells me you don't have to do that anymore, so at some point we'll remove that code, but that's what we're doing today.

So what did we learn over the past couple of years owning and operating this system? First, make things as simple as possible. Using Spot Fleets is not that hard, but I don't want my users to have to worry about allocation strategies, Spot prices, or launch specifications. Abstract those away if you can.

A quick poll: how many people here have ever misplaced an EC2 instance? Come on, get your hands up. What I mean is you get to the end of the month, you're doing a billing exercise, trying to determine which teams get charged for which cloud resources, and you find two EC2 instances up and running that should have been shut down a week ago. Mistakes happen, and when a mistake like that happens you end up paying a few hundred dollars extra for resources you didn't actually use. Misplace a 30,000-core Spot Fleet; I dare you. So we found you want to be really tight in how you control these things. If a user has no need to scale beyond 2,000 cores, don't let them: build in a limit, just like Amazon puts limits on accounts, so they can't get bigger than 2,000 cores. If a user isn't going to need a cluster for more than, say, three hours, automatically shut it down for them at three hours and five minutes. Don't let the user make the mistake of forgetting to shut it down; shut it down for them.

Monitoring and alerting are obviously important in any system, but with bursty, large-scale systems like these you could have three or four 30,000-core Spot Fleets up and running, and if your system isn't working you're spending an awful lot of money to do nothing. So it's really critical that when something breaks, you get a human involved as quickly as possible to minimize the cost of the outage.
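The "shut it down at three hours and five minutes" guardrail can be enforced in many ways; purely as an illustration (this is not Citadel's implementation), here is a small reaper sketch that cancels any active Spot Fleet request older than a configured TTL using the standard EC2 APIs.

```python
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
MAX_AGE = timedelta(hours=3, minutes=5)  # illustrative TTL from the talk

now = datetime.now(timezone.utc)
fleets = ec2.describe_spot_fleet_requests()["SpotFleetRequestConfigs"]

expired = [
    fleet["SpotFleetRequestId"]
    for fleet in fleets
    if fleet["SpotFleetRequestState"] == "active"
    and now - fleet["CreateTime"] > MAX_AGE
]

if expired:
    # Cancel the requests and terminate the instances they launched.
    ec2.cancel_spot_fleet_requests(
        SpotFleetRequestIds=expired, TerminateInstances=True
    )
```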
A diversified allocation strategy is the key to getting your fleets up quickly and to minimizing the impact of termination events. To illustrate that: when we first built the system we did not understand this, and we only had 32 instance pools in our launch specification. The time it took our 30,000-core clusters to get up to scale, for all of our Spot Instances to start, was about 50 minutes, almost an hour. The only change we made was to the launch specification: we added more Spot Instance pools. That's all, and that simple change reduced the startup time of our fleets to about 15 minutes, a 70% speed-up. And finally, I don't think any talk about Spot Instances or Spot Fleets would be complete without touching on cost savings: we're seeing about a 70 to 75 percent reduction in cost versus On-Demand pricing. Thank you very much for your time, and if you have any questions, we'll be hanging out in the hallway after the talk. Thank you. [Applause]

Okay, hi everyone, I'm Gabriel, a senior engineer at MercadoLibre, and in this part of the session I want to show you our journey in terms of cost: how we use Spot to save tons of money. MercadoLibre is a very different world; these are mission-critical workloads. MercadoLibre is the biggest e-commerce and payments platform in Latin America. We have more than 200 million users and we process an enormous number of transactions a day, and to support that we have a large infrastructure and a large IT organization. We have almost 3,000 applications today, running on top of 20,000 machines on AWS, supporting 50 million requests per minute. To manage that we have 2,000 engineers focused on the MercadoLibre product itself, making around 2,000 deploys a day. As you can imagine, here at re:Invent 20,000 machines may not sound like much, but it's a big number for us, and I want to show you how we manage it in terms of cost. To be honest, in terms of cost there's a before and after in our story.

Why? Well, we're a growing company. We had just moved to the cloud, and we started to see a bill every month, and every month the bill grew a little. Since the company is growing fast, at first you say, okay, that's normal: the business is growing, so costs are growing. Then after a couple of months we saw the bill was growing a little faster than the business, and after that, much faster than the business. Our finance team came to us and said, "Hey guys, you need to do something, this can't continue." So we did something.

We said, fine, we'll reduce the bill, no problem. However, we're a growing company: we grew almost 100% in the number of applications in the last year, and almost as much in deployments. So reducing the bill is important, but keeping the business growing is important too. How do you manage that trade-off between growing and reducing cost? Well, we started to look at the different options; we had a lot of meetings with Amazon, also with Boyd, and we started to see
that we had a lot of options: Reserved Instances, Spot, and others. We also explored a few other things, but we ran into some regulatory issues there. As you can imagine, since you're sitting in a Spot presentation, we decided to explore Spot more deeply.

The first thing we found is that Spot is really great: as Boyd was saying, you get up to 90% off, almost every instance flavor is supported at a very good price, and you don't have commitments. You don't have to say "I will keep using this" or prepay anything; you can use it today and maybe not tomorrow, and that's okay. But nothing is magic: Spot Instances can be taken away at any time. So how do you use something that can be taken away at any time in a mission-critical deployment? That's what we had to figure out.

To understand how we did it, the first layer is MercadoLibre's infrastructure. As in Scott's case, all of our infrastructure is managed through an internal platform; in some ways it's pretty similar to Spinnaker. Every machine is created through the platform, not directly by the developers. Almost all of our infrastructure is web servers, so the instances can be taken away at any time; they're Java applications or Go applications or whatever, but they're mostly web servers. And because every application runs on our platform, we know how critical each application is, how much it affects a business metric, and that's very important for the next slide.

So in terms of infrastructure we have a large platform and tons of machines, but conceptually it's very simple: we have load balancers, around six thousand of them, and behind each one an Auto Scaling group and a set of instances. So we said, okay, if we have a pool of instances, we just add more instances that are cheaper, and nothing is cheaper than Spot, and then shrink the On-Demand pool. That's it. In fact we did it in a couple of days; it's very simple to do, and it only took a couple of engineers. And today, I think Amazon launched it in the last few weeks, you can use mixed instance types in Auto Scaling, where you can say, "I'd like 50 percent Spot and 50 percent On-Demand," and Amazon solves that problem for you; when we started, we had to implement it ourselves. In fact, we have moved almost 60 percent of our infrastructure to Spot Instances, and believe me, we saved tons of money.
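The mixed-instances Auto Scaling feature Gabriel refers to lets a single group blend On-Demand and Spot by percentage. Here is a hedged boto3 sketch of the 50/50 split he describes, with a small On-Demand floor; the group name, launch template, subnets, and instance types are placeholders, not MercadoLibre's configuration.

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-pool",                       # placeholder name
    MinSize=10,
    MaxSize=100,
    DesiredCapacity=20,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",   # placeholder subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0abc123def4567890",  # placeholder template
                "Version": "$Latest",
            },
            # Stay instance-flexible: several interchangeable types.
            "Overrides": [
                {"InstanceType": "c5.large"},
                {"InstanceType": "c4.large"},
                {"InstanceType": "m5.large"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,                  # guaranteed On-Demand floor
            "OnDemandPercentageAboveBaseCapacity": 50,  # 50% Spot / 50% On-Demand beyond it
            "SpotAllocationStrategy": "lowest-price",
            "SpotInstancePools": 3,
        },
    },
)
```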
However, once we started doing this we had to ask ourselves some new questions: when to run which jobs; how many Spot Instances to use per pool, because at first you say 100 percent, it's cheaper, but then what happens when instances die, and how do you manage that; how to deal with the auto scaling processes working on top of the same pools; and how to handle our daily load pattern. Our load follows a daily cycle, so during the night we scale down and during the day we also run some tests; that part is easy, it's just adding or removing machines. But the most important question for us was how many Spot Instances to use when a workload is mission-critical and can't be down: how do you prevent downtime when, in principle, all of the Spot Instances could be taken away at the same time?

The solution was a kind of risk accounting: how critical is an application for the business, and what is the probability of a Spot Instance dying? We estimate that, and based on it we decide how many instances we want in a pool, and based on the criticality of the application we decide the mix of Spot and On-Demand. For example, we're an e-commerce company, so our items API is very critical for us; there we put only 30 percent of the instances on Spot. Why 30 percent? Because we estimate that if we lose that 30 percent, the application will stay up, maybe with a small hiccup, but it will stay up without any real problems. On the other hand, we have the pictures pool, which sits behind a CDN, so if that pool goes down there's no problem for a couple of minutes. There we put 100 percent on Spot, because losing the entire pool for five, ten, or twenty minutes is not a big deal for us, and it is a big deal to save tons of money by running 100 percent Spot there. That was very important for us, and I recommend you estimate how much Spot you need in each pool and how much risk you want to take, because, believe me, the Spot Instances will die.

So what happens then? I highly recommend listening for the interruption event, the two-minute notice, because it gives you the chance to remove the traffic from those instances, which makes for a very graceful shutdown. Believe me, at least for us, where requests are very short, two minutes is a lot; even 30 seconds would be plenty. So when we hear the event, we just go to the load balancer and remove those instances from the pool, and that's it; then we let Amazon collect the instance, and nothing bad happens. It's very important that you do this only for stateless workloads; you shouldn't put a database here. It's also very important that everything is automated. Why? Because you should expect that these instances will die and that new ones have to be created; everything should be automatic there.
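The "remove the instance from the load balancer and let Amazon collect it" step can be sketched as follows, assuming an Application Load Balancer target group and that the two-minute notice has already been detected (for example with the metadata-polling sketch earlier). The target group ARN and instance ID are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/web-pool/0123456789abcdef"          # placeholder
)
INSTANCE_ID = "i-0abc123def4567890"                  # placeholder: this instance

def drain_on_interruption() -> None:
    """Call once the Spot interruption notice has been detected."""
    # Stop sending new requests to this instance.
    elbv2.deregister_targets(
        TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": INSTANCE_ID}]
    )
    # Wait for in-flight requests to finish (the deregistration delay),
    # then simply let EC2 reclaim the instance.
    waiter = elbv2.get_waiter("target_deregistered")
    waiter.wait(TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": INSTANCE_ID}])
```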
The last thing, in our case, is how this interacts with auto scaling, which is also adding and removing instances. If you have an Auto Scaling group and you add Spot Instances on top of it, there's no problem. The problem is when the group starts to scale in, say because CPU is dropping: it can start destroying your Spot Instances, and then you end up paying more, which is not what you want. So we do a kind of predictive scaling, where we set the minimum number of instances we want for a pool, and with that we prevent the Spot Instances from being removed. If we get spikes, in traffic or CPU, we can add On-Demand to prevent downtime. Today, with the new Auto Scaling, you can handle that and also add more Spot; we're in the process of exploring the new feature so we can move to it, because when you're scaling you want to save money too.

So, the results. This was impressive for us: we almost doubled our consumption in the first few months, yet the ratio between cost and usage kept going down, and that was awesome. If you look at the total cost, and this is very important, the bill shows we almost doubled our consumption while the total cost grew only a little. That was awesome, because it meant we could keep growing: we can keep creating new machines and keep using more CPU without the bill increasing at the same rate. That gives us the possibility to keep growing without growing our bill a lot.

Some lessons we learned. First, it's very important to do the math: the Amazon bill is pretty complex, there are a lot of line items, and there are many costs beyond your machines; in our case networking, for example. So do the math for your own case and analyze what's best for you. As Boyd said, you have On-Demand, Reserved Instances, and also Spot; there's a lot to consider. Second, Spot Instances are really great, we love them, but use them responsibly, because they can be destroyed at any time. It's very important to choose exactly where you put Spot, and you should design your applications assuming those instances will be destroyed; in fact, design every application with that in mind, because the only sane way to deal with instances that can die is to assume they will die frequently. Third, cost management needs to be performed continually; it's not a one-time effort, you need to work on it every day, because the bill keeps growing. And finally, there's no silver bullet: if you explore the different options, Reserved Instances, Spot Instances, different instance types and sizes, there are multiple options, and we use all of them to build a good cost approach at MercadoLibre. Thank you very much.

Thank you both for coming and sharing your journeys; hopefully you found some insights from both Scott and Gabriel on their experience. Please do complete the survey if you enjoyed the presentation, otherwise don't worry about it, and, as Scott said, we're going to hang out in the hallway and answer any questions you've got, so please join us. Thank you again; I know it's late on a Thursday, so thank you for coming. [Applause]
Info
Channel: Amazon Web Services
Views: 4,080
Keywords: re:Invent 2018, Amazon, AWS re:Invent, Compute, CMP306-R1
Id: JInfs0Ntx3Y
Length: 49min 31sec (2971 seconds)
Published: Fri Nov 30 2018