AWS re:Invent 2019: Running lean architectures: How to be cost-effective on AWS (ARC209-R1)

Captions
Hello and welcome to re:Invent, and thank you for choosing this session — the room is packed, so thanks for being here. My name is Constantin Gonzalez, I'm a principal solutions architect at AWS, and this is Jason Fuller, head of cloud management and operations at HERE Technologies. Together we're going to walk you through running lean architectures and how to be cost-effective on AWS. So if you happened to lose some money at the tables yesterday, this is the session to recoup it. I've been giving this session seven years in a row now, and it's always fully packed. Every year I try to bring in something new, so even if you were here last year you should pick up something fresh. And every year my goal is to save you so much money that you can easily pay for next year's re:Invent, including hotel and travel.

So what are you going to get out of this, specifically? First, best practices for lowering your AWS bill, because that's what you're here for. But you'll also see how the same best practices help you build a more scalable, more robust, and more dynamic architecture — it's not a trade-off, it's a win-win: by saving money on AWS you automatically become more efficient. You'll also learn how to save time to innovate, because time is more valuable than money: you can always make more money, but you cannot make more time once it's gone. Then we'll look at real-world customer examples — all of the theory here has been put into practice and is proven to work. And everything is easy to implement, so when you walk out of this session, think about what you can put into practice today, tomorrow, or next week, so that you're on track to paying for next year's re:Invent.

Now, even if you do nothing — and I'm a lazy person myself — we're working hard to save you money, because the AWS business model is based on reducing prices. When we reduce prices, we see more customers and more AWS usage; we get to build more infrastructure, which gives us better economies of scale; that lowers our infrastructure costs, and we give that back to you as customers by reducing prices yet again. There are two accelerators in this model. First, we invest a lot of money in infrastructure innovation: you saw in Andy's keynote how we introduced the Graviton2 servers with a lower price per transaction, and special chips for machine learning. The goal is not only to be faster — the goal is to be more efficient and to save money through infrastructure innovation, which lowers infrastructure costs further. The other accelerator is that the more users we have, the bigger the community becomes: more partners — you can see them in the partner expo here — and more people using our features and giving us feedback, so we can build new features and services, which makes the platform more attractive over time. Since 2006 we have reduced prices 77 times, and we are nowhere near done.

Now, this room is full of builders — you're all builders.
And that's great, because as builders you take your destiny into your own hands: you build it, you run it, you optimize it. What's the difference between being a builder on AWS and classical IT? In the cloud, your biggest strength as an architect and builder is architectural flexibility. Nothing is set in stone: you can always change your architecture, you can always optimize it, and you're never really done — which gives you the agility people expect from the cloud. So when you build something in the cloud, your main goal is to avoid unnecessary work and take waste out of your system. It's fine to start with a minimum viable architecture, see if it works, and then iterate: get rid of unnecessary resources, idling resources, and repetitive work you could do more efficiently.

As you go about improving your architecture on AWS, there's a small process that will help you along the way — think of it as your own little cost-optimization flywheel. It always starts with measuring your current cost on AWS. Once you have insight and transparency into your cost, you can come up with ideas for a better architecture, and once those ideas are on paper or in your favorite IDE, you build them into your architecture for real.

The first step is always to measure your existing cost, so you know where you stand and can plot a course to where you want to be. The best place to start is the AWS Billing and Cost Management dashboard. If you haven't seen it before, go look at it as soon as you leave this session: it gives you a complete picture of all of your cost, broken down in any conceivable dimension, including a budgeting function. You can set up a budget for the coming weeks and months — probably not years — and get a better planning baseline, and the budgeting function sends automated notifications: it emails you when you reach certain milestones in your budget, so you have full cost transparency and always know what's going on. The console also comes with Cost Explorer, which lets you drill down into your bill and identify the big chunks, so you know where to start saving. And there's an API you can integrate with your monitoring systems to get a near-real-time feed of your AWS cost.
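As a minimal sketch of what that Cost Explorer API call can look like from code — the date range and the grouping by service here are just illustrative:

```python
# Pull one month of cost, grouped by service, via the Cost Explorer API —
# the same data the billing console shows.
import boto3

ce = boto3.client("ce")  # Cost Explorer is a global API

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2019-11-01", "End": "2019-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 0:
        print(f"{service}: ${amount:,.2f}")
```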
Now, the most popular way to save money on AWS is EC2 Reserved Instances. Reserved Instances mean you commit to a certain usage of EC2 and get a discounted hourly rate in exchange — the classic discounting scheme retail businesses have used for centuries. In the very beginning, Reserved Instances were also tied to guaranteed capacity: if you bought four Reserved Instances, you got a capacity reservation for four instances. Today the two are decoupled — capacity reservations are optional — so you're free to decide whether you just want the savings or the guaranteed capacity as well. Reserved Instances come in one-year and three-year terms, and you can save up to 75% compared to On-Demand in exchange for a little planning. The only downside is that you commit to a certain amount for a certain period, one or three years, so you need to plan a bit more. If you see usage that you're pretty sure will continue over the next year or so, Reserved Instances are probably a good choice. There are three types — Standard, Convertible, and Scheduled RIs — and if you're unsure which one is right for you, go to AWS Cost Explorer: it gives you personalized recommendations based on your historical usage. It will tell you: if you do this, you'll save this much; if you do this other variant, you'll save that much. Use those recommendations to understand how Reserved Instances can help you.

Going back to our flywheel, the next step is to architect, so let's get more technical. In architecture, the easiest way to save money is simply to turn off unused instances. That sounds obvious, but if you're coming from an on-premises IT background, it's something you have to learn. On premises, you bought hardware — and I cut my fingers on it, and believed in it — and then it sat there for three to five years, running 24/7 and consuming power and cooling. In the cloud, that's not necessary. So where are those unused instances? Think about developer instances, test instances, training instances: none of them have to run 24/7. Switch them off during the off-hours of the day and over the weekend, and they stop costing you money. You can stop and start those instances without losing the instance data, because the data lives on EBS volumes, which persist — but you stop paying for the instances while they're stopped. And you can apply this concept to whole architectural setups: even a wonderfully complex, scaled-out architecture with web servers, app servers, and whole farms can be shut down and brought back through automation, using tools like CloudFormation or Terraform. Think of instances in the cloud as disposable — something you use only when you have to.

And this is how much you can save. Here's one customer's EC2 usage: on the y-axis the number of instances they're running, on the x-axis time, and you can see the weekends and the end of the vacation season right there in the graph. In this particular case, the customer saves 35% just by being smart about stopping instances they don't need.
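Here's a minimal sketch of what such an off-hours scheduler could look like: one Lambda function, triggered by two scheduled CloudWatch Events rules (one passing action=stop in the evening, one action=start in the morning). The Schedule tag is a naming convention invented for this example, not an AWS feature:

```python
# Stop or start every instance tagged Schedule=office-hours.
# Each cron rule passes its action as constant input, e.g. {"action": "stop"}.
import boto3

ec2 = boto3.client("ec2")

def handler(event, context):
    action = event.get("action", "stop")
    instances = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},
            {"Name": "instance-state-name",
             "Values": ["running" if action == "stop" else "stopped"]},
        ]
    )
    ids = [i["InstanceId"]
           for r in instances["Reservations"] for i in r["Instances"]]
    if ids:
        if action == "stop":
            ec2.stop_instances(InstanceIds=ids)  # EBS data survives a stop
        else:
            ec2.start_instances(InstanceIds=ids)
    return {"action": action, "instances": ids}
```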
I briefly touched on automation, and my recommendation is to automate everything. You should be running one hundred percent fully automated in the cloud, and you can do that with many tools — the command line, or more sophisticated approaches. The tool doesn't matter; automating does. The easiest way to get started is EC2 Auto Scaling. Auto Scaling automatically adjusts the number of instances you run based on demand, using our monitoring system, CloudWatch, to measure that demand — for instance as latency at the load balancer — and it starts new instances and terminates unused ones to follow the demand curve automatically. You get three wins here. First, automatic capacity management: always the right capacity, whether demand is high or low. Second, automatic mitigation of failures: if an instance goes bad — after all, they run on hardware — it's automatically replaced with a new one; the system heals itself. And third, automatic cost optimization.

To summarize this first part: the old on-premises thinking treated servers like pets. We all love pets — I have four mice at home for the kids — but the cloud is not the place to treat servers as pets. Treating servers as pets means individual servers: back then I used to give my servers names, and they were manually administered. Manual administration creates the potential for configuration drift; then you're running a configuration that isn't clearly documented, which creates the potential for error — and a lot of work when things go wrong. In the cloud, think of your instances and resources as cattle: highly standardized, all the same, highly automated — less error potential and less work. That lets you build a more efficient architecture, and that gives you lower cost.

Another way to save money is EC2 Spot Instances. Who's using Spot Instances? A couple of people — if you didn't raise your arm, find the people who did and ask them about Spot. Spot Instances give you access to spare capacity on AWS, priced based on supply and demand, and you can save up to 90%. So what's the downside? A small one: if we need those instances back, because we found a paying customer for our spare capacity, we terminate them — but you get a two-minute warning, and you can react to that warning automatically, for instance through Auto Scaling. You can use Auto Scaling to combine Spot Instances with On-Demand and Reserved Instances and get the best of both worlds. In the very beginning, Spot was complicated — there's an ancient graph of Spot prices from many years ago that shows how chaotic they could be. Today Spot is well understood: the pricing graphs are smooth and predictable, always cheaper than On-Demand, and very easy to manage, especially with tools like Spot Fleet or Auto Scaling, which now has its own little checkbox for mixing Spot with On-Demand. That can save you a lot of money.
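A minimal sketch of that combination: an Auto Scaling group that keeps a small On-Demand base and fills everything above it with Spot, spread across several instance types. The launch template and subnet names are placeholders:

```python
# Mixed-instances Auto Scaling group: On-Demand floor, Spot on top.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",
    MinSize=2, MaxSize=20, DesiredCapacity=4,
    VPCZoneIdentifier="subnet-aaaa,subnet-bbbb",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-template",
                "Version": "$Latest",
            },
            # Several similar instance types widen the Spot pools to draw from.
            "Overrides": [{"InstanceType": t}
                          for t in ["m5.large", "m5a.large", "m4.large"]],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,                 # always-on floor
            "OnDemandPercentageAboveBaseCapacity": 0,  # everything else is Spot
            "SpotAllocationStrategy": "lowest-price",
        },
    },
)
```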
OK, that was a bit of theory. Let me bring Jason on stage — he'll tell you how to put all of this into practice and what HERE is doing with it.

[Jason] Thank you, Konstantin. My name is Jason Fuller, and I'm the head of cloud management and operations for a company called HERE Technologies. Some of you might not have heard of HERE, so I'll put up my obligatory marketing slide and spend as little time on it as we can. HERE Technologies is a very large platform company that runs location services globally. Our biggest competitor is on the west coast of the United States — we probably all use them for many, many things, and I won't say the name — but ultimately we're the number one platform in this space, and we love competing with those guys. You can see some of these numbers, and they're huge, and the reason we can achieve them is that we use AWS. When Konstantin called me a couple of months back, he said: you're a huge user of AWS, and you're also very efficient in how you use it. We've achieved over 50% run-rate reduction through cost-avoidance exercises, exactly the way he's laid out for you today. So I'm going to take you on a brief journey through how we got there, what challenges we faced, and what lessons we learned — and hopefully you take away enough to maybe pay for re:Invent.

So, your traditional RI strategy. If you're in an enterprise that went to the cloud a while ago — not yesterday, but a few years back — speed of innovation was one of the decisions you made: either you create a bottleneck and force people through a sourcing or financial process, or you let the builders build. We let the builders build. When I arrived at HERE Technologies in 2016, we had over 500 accounts, over 10 million unique instance IDs per year, well over 200,000 instances running concurrently at any time — and very little RI coverage. Teams were very much allowed to do what they wanted with their architecture: good architectural practices within each team, but these were 500 islands, all interacting with each other, yet in an independent way. We had limited reporting at that point: we were using the AWS Detailed Billing Report, the DBR file, and later converted over to the Cost and Usage Report, the CUR file. If you haven't looked at this billing file, it's a bit like looking at your teenager's cell phone bill, so prepare yourself — lots of good data, but very big. Ours is over a terabyte of CSV, which means I have to pay AWS to open it, because I have to load it into Redshift. Always good.

So we said: listen, we need a new proposal here — we need a way, as a company, to centralize this process. Teams are not going to make one-year or three-year decisions; they're working in AWS on highly technical, very large-scale big-data problems, and they don't want to debate whether to buy an RI for one year or three. That's part of the reason, when we interviewed teams about why they didn't use RIs: they just didn't feel they could commit; they wanted the freedom to change. So what did we do? Anybody from the finance department of their company in the room? Show of hands — one, two — perfect, we'll talk later. We went to finance first. Before going down the road of architectural debate and decisioning, builder versus builder — all very positive, nobody was angry yet — let's go talk to finance: what exactly do you see that's wrong? And of course finance's answer, as always: AWS is too expensive. OK, let's talk about RIs. We went to our management team and said: at fleet level, across all the islands, we have a huge amount of RI potential. So we showed the potential — and then, how are you going to achieve it? You add technology to the mix: how much can we do from a reporting perspective, who in finance receives reports, who in management gets the KPIs, and who's held responsible if a team changes its architecture?
One of the challenges with RIs is that they're restricted: to your OS, to your region, to standard versus convertible. So we made some decisions. We said: one year only, Linux only, and only in regions where we see greater than 5% usage. And we used a vacancy solver — which we later replaced with a commercial tool — that basically said: if no team is using an RI, that's a bad investment. Because an RI floats: if you buy an RI in US East and you have 50 accounts using US East, when one EC2 instance goes away in an auto-scaling event, another instance will receive the same benefit. Now, if you're an individual team and you just invested a million dollars in RIs, and twenty percent of the time the benefit lands in Konstantin's account — who invested zero dollars in RIs — he just got a free lunch. That was not a positive side of the RI program. So we took the ARNs and mapped them: we know when we bought each RI and in which account, and we make sure that only the team that used the RI paid for it — one hundred percent managed RI coverage.

So we figured all this out on paper and started putting it into action. We incorporated a concept of green zone/red zone and green family/red family — your simple one-sigma-from-the-center type of analysis. What does our model show? If your instance family is more than five percent utilized in the fleet, and your region is more than five percent utilized in capacity, then the likelihood that you drop an instance and somebody else picks it up is high. That confidence gives us a matrix we can buy against. We created an implementation where we control it all centrally: teams could not buy their own RIs. Yes, we created a bottleneck at that point, but since team-initiated buying was so low anyway, we said: fine, we'll handle it; you can't buy anymore, because you'll mess up my reports if you do. We actually do this through AWS Organizations — if you're not the enterprise team that owns Organizations, get close with them if you're going to implement this program, because they can write a policy that removes RI buying from the console and blocks the API for the teams. What that does is let us control buying; sourcing and finance are happy, and teams are happy because their prices go down. But teams need communication, so develop a good rapport and know who's in charge — which island owner to contact. Once a month we say: these are your renewals; if you don't tell us to stop, we'll renew. We took an opt-out approach.

So what did this open up? After the success of this program — and I'll get to the lessons learned and the numbers in a minute — we saw other opportunities. If teams aren't buying RIs themselves, is anyone paying attention to Trusted Advisor? Are they looking at instances left on all the time? If you haven't used this API, I highly recommend it. This is a screen capture from one of my Splunk boards showing my teams and how much they're wasting in the cost pillar. Trusted Advisor grades you against the best practices of the Well-Architected program, which AWS uses as its foundational bedrock for how you should architect on AWS. What we do with this data is find the people who are going to waste over ten thousand dollars that month, and we come knocking on their door: excuse me, did you know you're wasting over ten thousand dollars this month? That's not good for anybody in the company — what can we do to help you change the architecture? So pull the API into a dashboard, make it visible, make it the leading thing you talk about with teams: let's go to the dashboard and work on it together. There are commercial companies out there that do this, if you're more of a buy shop than a build shop — but I hope you're building.
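As a minimal sketch of the kind of pull that can sit behind such a dashboard — the Support API requires a Business or Enterprise support plan and is served from us-east-1:

```python
# Pull Trusted Advisor's cost-optimization checks and their
# estimated monthly savings.
import boto3

support = boto3.client("support", region_name="us-east-1")

checks = support.describe_trusted_advisor_checks(language="en")["checks"]
for check in checks:
    if check["category"] != "cost_optimizing":
        continue
    result = support.describe_trusted_advisor_check_result(
        checkId=check["id"], language="en"
    )["result"]
    # Cost checks report their estimated savings in this summary block.
    summary = result.get("categorySpecificSummary", {}).get("costOptimizing", {})
    print(check["name"], summary.get("estimatedMonthlySavings", 0))
```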
And make sure you bring in other tools. One of the things we started using is the Personal Health Dashboard: when am I going to see failures because AWS changed something? It's like status.aws.amazon.com, but for you. My EC2 failed — why? Because of AWS. My RDS failed — why? Because of AWS. But it's mine, it's in my accounts: an account-specific way to look at change management.

So we drive all these things, and what happens? Builders build, everyone gets excited, and we get to this year's project: right-size automation based on heuristics of your EC2 usage, so teams don't look at us as the bad guys anymore. Fast-forward three years: we've saved you a lot of money, we've shown you where you might have holes, we've taken those worries off your back — you can build confidently, because there's a central team acting as a safety net. The team built this this year — a really fun project using Lambda and Aurora Serverless. What we do is pull a heuristics report of your EC2 usage: we look at your CPU, your memory, your I/O, your bandwidth, and we compare it statistically against your usage while you're actually using it. A lot of teams and companies in the market look at a 24-hour clock — but we all go to bed, instances sit idle, and zeros screw up averages. So we only look at the hours when you're using it, and from there we recommend a different EC2 instance: this family would be better, this size would be better. Teams now have an architectural guide that helps them. And to push them along, we drop a ticket into their JIRA queue — engineers love it when you put tickets in their queue — and we take action two weeks later if they don't. So if you ignore me and don't talk to me through the JIRA ticket, I go out and stop your instance. But I don't do it in 30 seconds, which some call best practice — I do it in two weeks; we give a lot of time at HERE Technologies. Some companies will tell you to give them three seconds, because then builders won't put anything on an untagged instance — if it's not tagged, kill it. We can get into that offline.
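HERE's production implementation is their own Lambda-and-Aurora-Serverless pipeline; purely as a toy sketch of the only-look-at-busy-hours idea, with made-up 5%/40% thresholds rather than HERE's actual heuristics:

```python
# Flag an instance as a downsize candidate: pull two weeks of hourly CPU,
# drop the idle hours, and check whether even the busy hours stay low.
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def downsize_candidate(instance_id):
    datapoints = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=14),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Average"],
    )["Datapoints"]
    busy = [p["Average"] for p in datapoints if p["Average"] > 5.0]
    if not busy:
        return True          # never busy at all
    return max(busy) < 40.0  # even the busy hours never break 40% CPU
```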
So, the results: greater than 80% coverage in year one, and we've held over 85% for three years. That means my EC2 — which is 65% of my monthly spend — is 85% covered with RIs, gaining us about 50% off the On-Demand rate. Teams became cost-aware. And we expanded: RDS, done; ElastiCache, done; we kept moving — DynamoDB, done. Every time AWS releases a new RI program, we have a wash-rinse-repeat process in place to build the next RI program for us. In three and a half years this has given us 50 million dollars back in the bank, and it has avoided 150 million dollars of spend we never had to worry about trying to save. From here we keep enhancing and automating — AWS keeps announcing things, it's a great flywheel: they announce, I have to go react. We keep educating our finance folks — so the two of you, we'll talk after — and ultimately we're building custom metrics.

What should you watch out for? We instituted this in Q4 2016, which means that every fourth quarter for the last three years I've had a huge RI bill to pay, because the renewals all come due once a year. We've tried to spread this out, and I'd recommend spreading it month by month or week by week across the year, so your finance people don't look at the fourth quarter the way they do with me and ask: do we really need to put these millions of dollars to work? And don't let people tell you they need an exception: "oh, I have an x1.32xlarge in Sydney, it's not part of the green program, we'd really love an RI, I know nobody else is using it, nobody else is in Sydney." Please don't come to me. We'll help you with buying it — because I blocked you — but I'm not putting it into the program that I measure, monitor, and carry KPIs for. So protect yourselves, and look at some program designs. Blended rates: if you're not using blended rates, use them. We give the teams the savings so they see it in their invoices; other companies I've advised and talked to hold the savings in a central account and distribute it as necessary — there are some tricks in how the money actually gets handed down. And for us, FinOps is the goal: the goal is that finance can run these programs and be a partner to the engineering teams, so my team doesn't have to constantly sit between finance and operations. I didn't mention Savings Plans, because Konstantin is about to get into them, but we are using Savings Plans extensively — it's a really great program too. Thank you.

[Konstantin] Thank you, Jason. If you heard that — one hundred and fifty million in cost avoided and another fifty million saved — he could have paid for half of all the attendees to come to re:Invent. So I challenge you, Jason: next year I want to see you paying for all of the attendees. Now, Jason mentioned Savings Plans. AWS Savings Plans are a very new way to save money. What are Savings Plans? A new, flexible pricing model that can save you up to seventy-two percent on your Amazon EC2 usage, including AWS Fargate — our serverless container service, where you simply start containers and forget about the servers; we just introduced Fargate for EKS, for Kubernetes, as well. The only thing you need to do is commit to a consistent usage in dollars per hour — no instance type, no other complicated dimension. You simply say: I think I'm going to spend this much per hour on AWS; figure out how to save me money. You do this over one or three years, and in exchange you get a discounted price for everything you use in EC2 and Fargate. Very easy to use, very good savings, and very flexible.
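Those recommendations are also available programmatically; a minimal sketch using the Cost Explorer API's Savings Plans recommendation call (the two summary fields printed here are the most interesting of several the API returns):

```python
# Ask Cost Explorer for a Compute Savings Plans recommendation
# based on the last 30 days of usage.
import boto3

ce = boto3.client("ce")

rec = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

summary = rec["SavingsPlansPurchaseRecommendation"][
    "SavingsPlansPurchaseRecommendationSummary"]
print("Hourly commitment:", summary["HourlyCommitmentToPurchase"])
print("Estimated monthly savings:", summary["EstimatedMonthlySavingsAmount"])
```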
Flexibility is the key behind Savings Plans, and there are two types. The first is called Compute Savings Plans, the more generic one: Compute Savings Plans are flexible across the instance family, across the region and Availability Zone — you don't have to say "I'm going to save money in the Frankfurt region" (I'm from Munich, so that would be my region) — and they're flexible across operating system, tenancy, and whether you use Amazon EC2 or Fargate for containers. The other type is called EC2 Instance Savings Plans, tailored more toward EC2 instances: these are flexible across instance size, but not family — you commit to a specific family, Availability Zone, operating system, and tenancy. In a nutshell: with Compute Savings Plans you get the greatest flexibility and save up to 66% off the regular price; with EC2 Instance Savings Plans you save up to 72% in exchange for a little less flexibility — still a lot more flexible than traditional Reserved Instances. Again, go to the AWS console: it will automatically recommend which Savings Plan is right for you, and you can run some simulations and see how it would play out for your own bill.

Now let's move on to more sophisticated ways of saving money, because there's still so much you can do. Let's start with serverless computing — who's using Lambda, AWS Lambda? OK, quite a few people. Serverless computing, in a nutshell, means you never pay for idle time: if nobody triggers your Lambda function, you pay nothing; you only pay for the execution time of the function. What that means is that, to be really efficient, you have to avoid wait cycles inside the function. What do I mean by this? Take this Lambda function: at the beginning, it makes an API call — a RESTful HTTP request — and then it has to wait for that call to travel across the internet and return a result; then it makes a second call, and a third, maybe gathering data from three different sources. The total execution time of the function is dominated by waiting for those calls to complete — and that wait time is time you pay for. The remedy is to use non-blocking code instead: be smart about how you write it, and group all the initiating requests at the very beginning, for instance by using multi-threading. You fire off all the requests very quickly in separate threads, each thread waits in parallel, so the wait time shrinks down to all the threads waiting at once, and then you collect the results — lowering the overall execution time. Strategies: event-driven code, for instance with Node.js, or multi-threading with Java or Python. Here's an example from my own personal setup: a Lambda function that aggregates RSS feeds from multiple sources. In the beginning, of course, I did the lazy thing and fetched them one after the other, and the function's total execution time kept growing. Then I implemented the multi-threaded version, and boom — 60% less Lambda usage time, 60% less Lambda cost.
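A minimal sketch of that multi-threaded pattern — the feed URLs are placeholders:

```python
# Fetch several feeds in parallel so the function pays for one
# overlapping round trip instead of three sequential ones.
import urllib.request
from concurrent.futures import ThreadPoolExecutor

FEEDS = [
    "https://example.com/feed1.rss",
    "https://example.com/feed2.rss",
    "https://example.com/feed3.rss",
]

def fetch(url):
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read()

def handler(event, context):
    # All requests start immediately; the threads wait on I/O in parallel.
    with ThreadPoolExecutor(max_workers=len(FEEDS)) as pool:
        bodies = list(pool.map(fetch, FEEDS))
    return {"fetched": len(bodies), "bytes": sum(len(b) for b in bodies)}
```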
The other thing: sometimes we have to wait inside our code. You might kick off something like a complicated database request, or start a batch job, and want to wait until it completes — so you write the loop: is it done? No. Wait ten seconds. Is it done? Another ten seconds go by — and you pay for every one of those wait cycles. Whenever you see an explicit wait like time.sleep() in your code, be alert, because there's a better alternative: AWS Step Functions, a very simple workflow execution service. You can model the waiting and the checking for results in Step Functions, and the waiting inside Step Functions is free. You might say: Konstantin is nitpicking here, this is very specific — how much money can you really save? Here's an example. Our customer Coca-Cola has a system that updates their loyalty database whenever you buy a bottle from one of their automated vending machines. They did the simple thing first: they wrote a Lambda function, and that function waited 90 seconds per bottle. Then they refactored the code to use Step Functions, and now they save 90 seconds of Lambda execution for every single bottle sold through their machines. That can add up to a lot of money.
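A minimal sketch of what that polling loop looks like once it's moved into a Step Functions state machine — the Wait state costs nothing while it sleeps, and the two Lambda tasks only pay for their own brief executions; the ARNs are placeholders:

```python
# Amazon States Language definition: start a job, then wait/check in a
# loop inside Step Functions instead of sleeping inside Lambda.
import json

state_machine = {
    "StartAt": "StartJob",
    "States": {
        "StartJob": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:start-job",
            "Next": "WaitTenSeconds",
        },
        "WaitTenSeconds": {
            "Type": "Wait",
            "Seconds": 10,  # free: no Lambda is running while we wait
            "Next": "CheckJob",
        },
        "CheckJob": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-job",
            "Next": "IsDone",
        },
        "IsDone": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.status", "StringEquals": "DONE",
                         "Next": "Succeed"}],
            "Default": "WaitTenSeconds",  # loop back instead of time.sleep()
        },
        "Succeed": {"Type": "Succeed"},
    },
}
print(json.dumps(state_machine, indent=2))
```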
The other thing you can do in your architecture — just a little bit sophisticated, but pretty easy to set up — is caching. You may know caching as a way to make things run faster. Yes — but fundamentally, caching means you do the expensive work only once and then reuse the result as many times as possible, and that's a great way to save money, because memory tends to be cheaper and faster than CPU execution cycles. The great thing about caching is that you can cache everywhere — at every layer of your application. You do need to keep track of how old each cached item is and whether it's still current, which is a little application logic on top. But with that in place, you can cache at the database level, at the application server, at the web server, at the edge, even inside your users' browsers. The simplest way to get started with caching for a web application is Amazon CloudFront, our content delivery network: it includes caching in 210 edge locations worldwide and does all the caching for you. The result is that you can get away with a much smaller back end — you can scale it down, because most of the work is now done by those proxy caches all over the world. And the pricing is pretty simple: using CloudFront tends to cost the same as or less than the data transfer you'd have to pay for anyway, so it pays for itself. How much can caching save? Our customer Team Internet in Munich added caching on top of one of their most popular DynamoDB tables, and boom — they saved 3,000 reads per second on that particular table. They thought: great, let's do it for all tables. At the end of refactoring all their tables to use caching, they were saving 20,000 reads per second, which translates into a lot of money every month. And now Markus from Team Internet can pay for his whole team to go to re:Invent if he wants to.
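The idea in miniature, at the application layer — do the expensive call once, remember the result, and track its age with a TTL. (For DynamoDB specifically, DAX provides this as a managed caching layer; the 60-second TTL below is just an example value.)

```python
# Tiny TTL cache: pay for the expensive call once per TTL window.
import time

CACHE = {}          # key -> (expires_at, value)
TTL_SECONDS = 60

def cached(key, expensive_fn):
    now = time.time()
    entry = CACHE.get(key)
    if entry and entry[0] > now:   # fresh hit: no expensive work
        return entry[1]
    value = expensive_fn()         # miss or stale: do the work once
    CACHE[key] = (now + TTL_SECONDS, value)
    return value

# usage: cached("popular-item", lambda: table.get_item(Key={"id": "popular-item"}))
```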
Let's go back to the cost-optimization flywheel. We've discussed a couple of architecture pieces; now we want to build. When you build something on AWS and want to save money and be efficient, the golden rule is to avoid unnecessary heavy lifting. By that I mean: we now have over 175 services on AWS — and that number keeps going up — and if you find a service that already does what you want, there's no need to re-implement it. For instance, instead of running your own database and managing and patching it yourself on EC2, use one of our 15 different managed database services — and we just added Cassandra to the mix, so if you were previously in the business of running Cassandra on your own on EC2, you now have a lot more free time to spend on other things. If you're running application integration yourself — message queues, workflow systems, that kind of thing — use one of the AWS application integration services: they make it so much easier, you save a lot of time and effort, and you lower your risk, because there's less potential for introducing bugs or running into other issues. And if you're doing analytics of some kind, check out our analytics portfolio — lots of services there, including ElastiCache and Elastic MapReduce. You get the picture: don't reinvent the wheel. Here's an example from that same company, Team Internet. They used to run batch jobs on Amazon EMR for analysis, and then they tried out Amazon Athena, a serverless query service. Their cost went down by more than 50%, just by replacing EMR clusters — which are already pretty nicely automated — with Athena, a higher-level service. The reason: they were spending a lot of money waiting for those clusters to start before their batch jobs could even run, and with Athena that waiting phase disappeared. In exchange, they also got a simpler architecture that was easier to manage — lots of wins down the road.
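A minimal sketch of that kind of Athena job — database, query, and output bucket are placeholders; there's no cluster to start or babysit, you pay per query:

```python
# Run a SQL query against data already sitting in S3 and print the rows.
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM logs GROUP BY page",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

while True:  # a short poll is fine in a batch script
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```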
OK, let's spend the final couple of minutes on saving money in AI and machine learning. Who in this room is using machine learning in one way or another? A couple of people. We gave this talk on Monday, and somebody left a comment — please do leave comments, we really live by your feedback — saying: Konstantin asked, only five people raised their hands, and then he spent all this time talking about machine learning; everybody else could have gone home. Please don't go home. Even if you're not using machine learning now, think about using it in the future, because machine learning really is like having a weather forecast for your business. Think about all of your business processes and how machine learning could make you smarter about them: when are these machines going to break, so I can have the right replacement parts nearby? How many people will come to my store next week to buy this particular item? How many items do I need to order to fulfill demand? And if you can't find any use for machine learning in your business, use it to automate auto scaling and predict how many machines a specific application will need. There are real savings there: a customer of mine used auto scaling and saved 50 percent of their EC2 bill just by using it; then they applied machine learning on top to predict how much EC2 capacity they would need in the future, and they ended up saving 70 percent of their EC2 cost, because auto scaling got smarter thanks to machine learning. And even if you don't use machine learning at all, let's cut a deal: follow along and map these best practices onto your own existing workloads, like batch computing, because machine learning training really is high-performance computing for neural nets, so the same principles apply. And even if none of this applies to you, at the end of this part I'll tell you how to find the perfect Christmas present for your loved one — without machine learning.

First: in AI and ML, the same rules apply as ever. The turn-off-unused-instances rule applies to Amazon SageMaker notebooks, which data scientists use to build things — they don't have to run all the time, so be smart and stop them over the weekend. Second, automate everything also applies to machine learning: there's a whole suite of services around Amazon SageMaker that makes it easier to run machine learning in the cloud, so you don't have to work as much and can concentrate on your machine learning implementation. The use-auto-scaling rule is already built into SageMaker — that's the great thing about it, SageMaker takes care of everything, including auto scaling — and SageMaker also understands how to use Spot Instances, so the use-Spot rule is easy to apply as well. It all comes down to the avoid-unnecessary-heavy-lifting rule, which really is why we built SageMaker in the first place. This may look like a SageMaker commercial — maybe it is — but the key point is that the team built SageMaker to make your lives easier and free up time you can spend on building the next big machine learning thing, while it works to be efficient along all the steps of machine learning: building your ML app, training your models, and deploying them to production.

In the build phase, using those managed AWS services is again a great way to start — something like Athena, AWS Glue, or Lake Formation — because machine learning needs a lot of data to work on, and that data needs to be gathered, pre-processed, and managed. The other thing about machine learning is that you work with labels: labels are the answers you're looking for, the ones you give the model to learn from — this picture has a cat, this picture has a dog. You have to attach "cat" and "dog" to those pictures, and that can cost a lot of money. Amazon SageMaker Ground Truth comes with a feature called active learning: it looks over the shoulder of your manual labelers, figures out how to do the labeling on its own, and can save you up to 70% in labeling costs. When you're building your machine learning application, you might experiment on your laptop, which is nice and easy, but as soon as you move closer to production, please use Amazon SageMaker notebooks — they've become very easy to use, and with SageMaker Studio you get a full IDE for development. Do take advantage of these developer tools: they make your life easier and save you time, and time is more valuable than money. With SageMaker you can automate the starting and stopping of those notebooks through a Lambda function — a simple function that stops your notebooks over the weekend will save you a bit more money. And talking about notebooks: they come in different sizes, and you should always try to get away with the smallest size, because pricing is based on size. We announced yesterday that SageMaker notebooks are elastic, so you can adjust them on the go — start with the smallest instance type, scale up only when you really need it, and then scale back down. That will help you save more money.
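A minimal sketch of such a weekend notebook sweeper — a Lambda function on, say, a Friday-evening cron rule:

```python
# Stop every SageMaker notebook instance that is currently running.
import boto3

sagemaker = boto3.client("sagemaker")

def handler(event, context):
    stopped = []
    paginator = sagemaker.get_paginator("list_notebook_instances")
    for page in paginator.paginate(StatusEquals="InService"):
        for notebook in page["NotebookInstances"]:
            name = notebook["NotebookInstanceName"]
            sagemaker.stop_notebook_instance(NotebookInstanceName=name)
            stopped.append(name)
    return {"stopped": stopped}
```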
You can run some small training experiments on your notebook — that's OK — but as soon as you move into something more significant, go use the training features of SageMaker, because they make your life so much easier. So, how can you save money during the training phase? Again, don't train in those notebooks. The golden rule about scaling — and it applies to any other auto-scaling setup too — is to start with the smallest instance you can get away with and try to scale horizontally, adding more instances of that type, because that lets you scale at a very fine granularity. Only move to a bigger instance once you have to — say, one particular piece of your training job is running out of memory; then it's OK to move up. Otherwise, stick with horizontal scaling as much as possible. When you use Spot for training, the thing to remember is that Spot Instances can be shut down with two minutes' notice. In the machine learning context, you handle that by implementing checkpointing in your code. By the way, if you're not doing machine learning but batch computing for HPC — computational fluid dynamics, like the Formula One people — you can implement the same thing with AWS Batch: add checkpointing logic, and then you can use AWS Batch with Spot Instances too. The other thing: we provide pre-compiled, pre-optimized versions of your favorite machine learning frameworks, such as TensorFlow, MXNet, and PyTorch. Use those AWS-provided versions, because they've been optimized for the specific platform they run on — running non-optimized code is going to cost you. Those frameworks come with specialized I/O libraries, such as TFRecord for TensorFlow and RecordIO for MXNet, that help you push data faster through your training cycle. And this, too, parallels other HPC applications: the bulk of the time spent during training is actually spent on I/O, pushing all those data points through the learning algorithm, so using those specialized formats makes you a lot more efficient. There's also something called Pipe Mode — go check it out: Pipe Mode essentially streams data through the training process, with no loading and copying and other complicated steps. And SageMaker comes with automatic model tuning: a tuned model runs much more efficiently in production, and that's where you'll save money next.
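A minimal sketch of managed Spot training with checkpointing, using the parameter names from the v1 SageMaker Python SDK that was current at the time of this talk — the role, image, and bucket are placeholders:

```python
# If Spot capacity is reclaimed mid-run, training resumes from the
# last checkpoint in S3 instead of starting over.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",
    train_use_spot_instances=True,    # run on spare capacity
    train_max_run=3600,               # cap on actual training seconds
    train_max_wait=7200,              # how long to wait for Spot capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume point
)
estimator.fit({"train": "s3://my-bucket/training-data/"})
```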
Now to the production side of machine learning — and actually, the biggest cost driver in machine learning is inference. About 90 percent of machine learning cost is spent in the inference phase, because people train once and then deploy to mobile or web applications used by millions of users; that's where the money goes. So how do you save there? SageMaker comes with a feature called Neo, which optimizes your model by stripping out the unnecessary bits and aligning everything optimally to the hardware of your choice, giving you optimized executables that need less CPU and GPU. Also, there are two ways to run machine learning in production: a 24/7 real-time endpoint that you can query — but which also runs, and costs money, 24/7 — or batching up requests and sending them in batches. When you send batches, SageMaker starts up a cluster, runs the batch, and shuts the cluster down; you only pay for what you use. So if you can get away with the batch model, it's a lot more efficient than a 24/7 endpoint. If you do use 24/7 endpoints, think about deleting unnecessary ones — training, testing, and other endpoints you no longer need — and you can configure a single endpoint to serve a full sequence of multiple models, or different models running in parallel, saving on endpoint costs. And again, right-sizing is always a good thing: do some benchmarking and figure out the smallest instance you can get away with. You can use Amazon Elastic Inference to attach just a slice of a GPU instead of a full GPU — inference doesn't need that much GPU capacity — or the new AWS Inferentia chip in the new EC2 Inf1 instance type. And auto scaling is always a great thing here, too.

That was the machine learning part — and I think you can apply those principles to any other application. Now, as promised, how to find the best Christmas present for your loved one. Here's the algorithm. Step one: tell your loved one that you already got the perfect present for him or her — you're done, and you're so much looking forward to his or her face at Christmas. Step two: let him or her guess. Step three: take notes. You're lucky — in my case, my wife told me about this one, so I can't use it anymore.

OK, putting it all together. First: use EC2 Spot Instances — 90% savings. Use Reserved Instances if you're conservative, or use Savings Plans — very easy to use, very flexible, up to 72 percent savings, and something you can hand off to your finance department; everybody should have a Jason in their company thinking about these things and how to leverage them. Then: turn off unused instances — and if you don't want to think about turning things off all the time, automate everything. Automate one hundred percent of what you do on AWS; it will pay off hugely. If you're using serverless, try to understand where your code is spending time and why, get rid of the synchronous patterns where everything sits waiting for responses, and be smarter by using multi-threading or event-driven code. And try to cache everything, everywhere — there's always an opportunity to add another cache, and caches pay for themselves. Leverage managed services as much as possible: we have thousands of people in teams implementing these things on AWS as optimally as possible — they build it and they run it, so they know how to optimize databases, they know how to optimize message queues. They can do it for you; you simply have to use those services. And finally: if you're not using machine learning, start using it now — as a weather-forecasting tool for your business, or as a forecasting tool for auto scaling. If you are using machine learning, please do check out SageMaker. Yes, this was a SageMaker commercial, but it will save you so much time on top of saving money.

Let's go back to the cost-optimization flywheel one last time. Customers ask: what do I do with the saved money? I mean, Jason can now easily bring his whole company to re:Invent multiple times and still have money left over. Saving money is really the beginning of another cycle: you can take that money and invest it in new people — new developers who help you build new things and be more innovative — and as you become more innovative, you save even more money. If you're missing the developers who can help you understand machine learning, save some money with these techniques, then hire those developers, hire some data scientists, and let them figure out better ways to save money, be innovative, and run your business. If you want to learn more about architecting best practices, check out the Well-Architected program — just search for "AWS Well-Architected" and you'll find the website — including a full whitepaper on cost optimization, the one Jason's teams have been using to create better architectures and avoid the costs that went into those one hundred and fifty million dollars. And if you don't want to read — though I prefer reading all this stuff — there's much more: YouTube videos of our previous talks going all the way back to 2014, with other great ways to save money. You can learn about negative caching, you can learn about DynamoDB query optimization patterns — every video is different, and every video adds another angle on saving money on AWS. There are also some related breakouts; unfortunately the first one has already happened, but you can check out the Thursday breakout on optimizing AWS cost and utilization with AWS management tools. And with that: thank you very much for coming, please do check out our offerings from Training and Certification, and enjoy the rest of the show. [Applause]
Info
Channel: AWS Events
Views: 2,481
Rating: 5 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, ARC209-R1, Architecture, HERE Technologies, AWS Lambda, Amazon EC2, Amazon CloudFront
Id: FAowmVOweO4
Length: 53min 51sec (3231 seconds)
Published: Thu Dec 05 2019