Accelerating modern application development with Amazon ECS

Captions
Maish: [Applause] [Music] Here we go, here we go! Welcome back to another episode of Containers on the Couch. We're back with you again with everything about containers, streaming live here on Twitch and on YouTube, and today I have two amazing guests, neither of whom, I think, has been a guest before, so I wanted to welcome both Nidhi and Ashok to the show. I'm going to let them introduce themselves first, because I'll mess it up anyway. So here we go: Nidhi, tell us who you are, what you do at AWS, and what your favorite color is.

Nidhi: Okay, I will do so in that order. Hi everyone, super excited to be here. My name is Nidhi; I've been at AWS for two years, and I am a container specialist here. My favorite part of my job is talking to customers and seeing them build their production-grade workloads on Amazon ECS and AWS Fargate. I'm super excited to be here and talk about ECS today, and I'm joined by Ashok. Oh, by the way, my favorite color is pink.

Ashok: Thanks, Nidhi. Hello everyone, this is Ashok Srirama. I'm one of the senior container specialist SAs at AWS; my primary goal is working with startups and enterprise customers in the US to help them navigate their containerization journey. I'm a longtime listener and watcher of this show and a first-time presenter, really excited to be here, and my favorite color is green.

Maish: Awesome, thank you both. My name is Maish; I'm a developer advocate with the ECS service team. I've been at AWS for almost five years, in this position for more than two. Some of my favorite things to do are, of course, talking to customers, playing with technology, and doing things like this: spreading the knowledge to everybody who'd like to listen. My favorite color is navy blue, just as a matter of interest. So everybody, thank you very much for joining us; say hello in the chat, don't be shy, we like to see where everybody's joining from, so give us a small wave. Today we're going to be discussing how you can accelerate your modern application development with Amazon ECS. Nidhi, do you want to kick us off and tell us what's on the agenda today?

Nidhi: Yeah, absolutely. Ashok and I presented this at re:Invent in the form of a chalk talk; it was our first chalk talk and it was super interactive, so we're going to try to keep that flavor going. It's a really broad topic. What excites me about it is teasing lessons out of customers who do this day in and day out: folks like Instacart, folks like Goldman Sachs, folks who are building production-grade ECS clusters. We thought we could get together today and give you an overview of what it would look like if you were onboarding your first workload today. The way we've framed this talk (again, it's super broad, and we're hoping to cover a lot of content) is to ground it in four pillars. If you've ever attended a Well-Architected review with an AWS specialist, or gone through that framework, you're probably familiar with this way of grounding application best practices: we're going to talk about security, reliability, and performance efficiency, and we'll round it off with operational excellence. Where this talk is focused, though, is how you can achieve those best practices with Amazon ECS. So that's a little bit of background: we'll talk about why customers are building applications on ECS, then go down these pillars, and Ashok and I will give you all the nuggets, all the best practices, that we've collected from talking to a whole lot of customers over, collectively, more than ten years. Hopefully this resonates.

Very quickly, if you were just getting started and did not know what Elastic Container Service is: the way I would think about it is as a fully managed container orchestration service. It lets you deploy, manage, and scale your containerized applications. You can do it entirely serverless if you use the Fargate launch type, which a lot of our customers gravitate to, or, if you want a little more control over your instances, you can do that with the EC2 launch type as well. Amazon ECS is sort of like an orchestra conductor: it lets you launch, monitor, and scale your applications, it gives you flexible compute types (EC2 or Fargate, as I mentioned), and it does that in a secure manner while allowing a lot of integrations for monitoring and observing your application, making sure your containers stay secure throughout their lifecycle. One thing I wanted to call out: these are terms we'll be using throughout the conversation. A very common pattern is running your ECS service behind a load balancer, whether an Application Load Balancer or an NLB, and we're going to talk about how ECS integrates with Secrets Manager and some best practices there. I hope this level-sets: if you're coming from the world of EKS, your smallest unit of deployment there is a pod; for ECS, that's a task. A bunch of tasks running together becomes a service, and a bunch of services running together is a cluster. That's ECS in a nutshell.

All right, without going into more detail, let's get right into it. The very first thing you want to think about is: how do I build a secure application? Security is not something you treat as an afterthought; you lead with it. So with that I want to hand it over to Ashok, and we'll talk about what it means to build a secure application and what the AWS shared responsibility model looks like.

Ashok: Thanks, Nidhi. As Nidhi was alluding to, we're going to discuss best practices grounded in the Well-Architected Framework, starting with security. If you have interacted with any AWS personnel, you cannot get away from the term "shared responsibility model." To decode this whole eye chart and simplify it: security of the cloud is AWS's responsibility, and security in the cloud is your responsibility as a customer. Customers often want to know exactly where that boundary lies, and it differs depending on which service you're adopting and which feature of that service you're using. When it comes to ECS, there are mainly two components. One is the control plane, which is completely managed by AWS; it's entirely AWS's responsibility to provide that secure control plane for you. For the data plane, there are primarily two options: running your workloads on traditional VM-based instances like EC2, or using newer serverless containers with AWS Fargate. If you are using EC2, there is more responsibility on the customer side, starting with the virtual machine's operating system, every operational add-on you run on it, your application software, and everything that goes into it. As Uncle Ben from the Spider-Man universe always says, with great power comes great responsibility, so you have to be cautious about how much power you want to wield. On the other side, for customers who want a more simplified operational experience, we see (as Nidhi was alluding to) more than 50% of customers start their ECS journey with AWS Fargate, because the day-2 operational overheads, managing the add-ons that run on the Fargate infrastructure, hardening it, patching, upgrading, and scaling it, are all AWS's responsibility. The customer only has to worry about their application code and everything that goes inside the container they run on AWS Fargate.

The other important question we come across: our applications handle a lot of sensitive data, and secrets are one such kind. So the next topic is how to secure the sensitive data, the secrets, that our applications use when running on ECS. For sensitive data like database credentials and API keys, it's always recommended to use a secrets management solution. We often come across customers who want to use managed services like Secrets Manager, which provides automatic encryption and automatic rotation of secrets by default and can also replicate those secrets to other AWS Regions. Once you adopt a solution like AWS Secrets Manager, you can supply those secrets to your ECS containers by referencing, in the task definition, which Secrets Manager secret you want injected into your container. In this example, we want to inject a database user and password as environment variables into our ECS task. When the ECS agent tries to launch the task, it reads this, reaches out to the Secrets Manager service, and injects those environment variables into your application container; your application code just reads the values from the environment variables and talks to the corresponding backend system, in this case an Aurora database. This works well for the majority of customers, who have a predefined way of accessing their secrets through environment variables. One gotcha with this approach: if you rotate your secrets, there's no way to update those environment variables in place, so you end up recycling or redeploying your ECS tasks to pick up the new values. It works, but there are better ways, and we're going to look at another approach, one that's common especially if you don't have access to the source code, or if you're using third-party add-ons that read their secrets from a local file system.
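As an editor's aside: the environment-variable injection pattern just described can be sketched as a fragment of an ECS task definition, written here as a Python dict. The ARNs, names, and image below are hypothetical placeholders, not values from the talk.

```python
# Sketch of the "secrets" section of an ECS container definition that asks the
# ECS agent to inject two Secrets Manager values as environment variables.
# All ARNs, names, and the image are hypothetical placeholders.
container_definition = {
    "name": "api",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest",
    "secrets": [
        # "name" is the environment variable the application will see;
        # "valueFrom" points at the secret (here, a JSON key within it).
        {"name": "DB_USER",
         "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf:username::"},
        {"name": "DB_PASSWORD",
         "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf:password::"},
    ],
}

# At launch, the agent resolves each valueFrom reference, and the application
# simply reads os.environ["DB_USER"] / os.environ["DB_PASSWORD"]. Note the
# caveat from the talk: rotated values are only picked up on task replacement.
```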
Ashok: In this example, we run a sidecar container that is responsible for talking to the secrets management solution, fetching the secrets, and writing them to a shared volume mounted on both containers, your application container and the sidecar. You can declare a dependency so that the sidecar always starts before your application container; the sidecar finishes its job, the application container reads the secrets, and then uses them to talk to the backend database. One better thing about this approach is that the sidecar can...

Maish: Sorry, I muted myself for some silly reason. I want to ask a question to make sure I understand: what would the customer benefit be here? It seems more complicated than having Secrets Manager inject directly into the task. What's the actual added benefit of this method?

Ashok: That's a great question, and it's where secret rotation comes into the picture. If you're using Secrets Manager and rotating secrets on a regular basis, your application may not have access to the latest value, whereas the sidecar can periodically retrieve the secret and update the value in the shared file system.

Maish: Because with the other method, the secret is retrieved only once, when the task comes up; it doesn't periodically refresh unless that's written into the code. By default it doesn't refresh automatically. Is that correct?

Ashok: That's right, and you bring up a great point: the third approach, which is the best approach if you can implement it, is using the Secrets Manager SDK, or the SDK of whichever secrets management solution you're using. They provide their own SDKs that you can integrate into your application code, with the capability to automatically re-fetch the secret if, for example, the connection starts failing.

Maish: I can attest, from multiple customers who were just getting started: no one ever said that rotating secrets by hand is fun. Absolutely makes sense.

Ashok: So these are the three different ways our customers explore securing their secrets. One common follow-up is: how can we securely give our application access to only that one specific secret? That's the next thing customers usually solve, and before we jump into it...

Maish: Sorry, a question came in from Jel C: do we provide an example of this kind of sidecar? Is it available in a public repo or a blog post somewhere? I don't remember seeing one.

Ashok: I don't have it handy, but we can find it.

Maish: No worries if you can't pull it out of your head this specific second. Worst case, Jel, send me a message on Twitter afterwards and I'll run it down for you and give you an answer on exactly where to find it. We'll also put it in the show notes of this episode after it's recorded. So, you were saying: how do you allow specific tasks to access only what they're supposed to?

Ashok: Yeah, that's our next challenge to solve. Secrets management is one such example, but we often see customers using multiple AWS services to accomplish their business goals. In this case we have a simple front end and a backend API: the front end uses an S3 bucket to fetch static assets, the back end stores customer information in a DynamoDB table, and the two may also be talking to a secrets management solution to do some other job.
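Circling back to that third, SDK-based approach: here is a minimal sketch of the idea, cache the secret and re-fetch it when a connection failure suggests it was rotated. The class and the fake client are illustrative stand-ins, not the real AWS SDK; only the `get_secret_value(SecretId=...)` call shape mirrors boto3.

```python
class SecretCache:
    """Toy version of SDK-side secret handling: cache the secret, and
    refresh it when the consumer reports a failure (e.g. after rotation)."""

    def __init__(self, client, secret_id):
        self.client = client        # anything with get_secret_value(SecretId=...)
        self.secret_id = secret_id
        self._value = None

    def get(self, refresh=False):
        if self._value is None or refresh:
            resp = self.client.get_secret_value(SecretId=self.secret_id)
            self._value = resp["SecretString"]
        return self._value


# --- demo with a fake client whose secret rotates between fetches ---
class FakeClient:
    def __init__(self):
        self.versions = iter(["pw-v1", "pw-v2"])

    def get_secret_value(self, SecretId):
        return {"SecretString": next(self.versions)}

cache = SecretCache(FakeClient(), "prod/db")
password = cache.get()             # first fetch: "pw-v1"
assert cache.get() == password     # served from cache, no new fetch
rotated = cache.get(refresh=True)  # connection failed? force refresh: "pw-v2"
```

The point of the sketch is the control flow: the application keeps working across rotations without a redeploy, which neither the environment-variable nor the one-shot sidecar pattern gives you for free.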
Ashok: And that's where task IAM roles come into the picture. A task IAM role is a capability that lets us assign unique IAM permissions to each task running on your ECS infrastructure. Instead of your task inheriting IAM permissions from the EC2 instance profile, we create a unique IAM role for each of your tasks with much finer-grained permissions, so that the task can do only its own job: if the task needs to access secret one, you write a permission policy allowing access to secret one alone. Nidhi, how often do you come across customers using the instance profile versus task IAM roles?

Nidhi: Quite a bit, quite a bit. It's typically prevalent when a customer is coming from a traditional EC2 world, where they run multiple applications on an EC2 instance and every application uses the same EC2 instance profile to get temporary credentials. They're often simply not aware of this capability to assign unique IAM permissions to individual ECS tasks.

Maish: We got a question: there are two different kinds of roles (actually three today, but that's a different story, so we'll talk about the two for the moment). There's a task execution role and a task role; what's the difference between them?

Ashok: The task role is what I just explained: it's used by your application to talk to other AWS services. The task execution role is used by the ECS agent running on EC2 or Fargate to talk to AWS services on your behalf. For example, to launch an ECS task the agent has to download a container image, or talk to the secrets management solution to inject secrets into environment variables; for all those capabilities you create a task execution role and assign those permissions to it, and the agent running on the EC2 instance or Fargate uses it to act on your behalf.
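To make "only that job" concrete, a task-role policy scoped to a single secret might look like the following, expressed here as a Python dict; the ARN is a made-up placeholder.

```python
# Least-privilege policy for a task IAM role: this task may read exactly one
# secret and nothing else. Contrast with an EC2 instance profile, which every
# task on the instance would share. The ARN is a hypothetical placeholder.
task_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": [
                "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf"
            ],
        }
    ],
}
```

The task execution role, by contrast, would carry the agent-side permissions: pulling the image from ECR and reading the secret for environment-variable injection.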
Ashok: Okay. The other common thing we hear on the security side is implementing network security; all of these things help our customers implement a defense-in-depth strategy. In the network security segment, we advise customers to use the construct called security groups. As with task roles, the recommendation is to use unique security groups for your individual ECS tasks or services. For example, in this three-tier app you have a load balancer, a front-end service, a backend service, and a database; you create a unique security group at each of these layers and define the ingress and egress rules at each stage, so that you allow port 443 traffic from the public internet only to your load balancer, allow only the application port from your load balancer to your front end, and so on and so forth. That's one aspect to think about for network security; the second aspect is to always look at encrypting the traffic, both encryption in transit and encryption at rest. Maish, what else comes to mind when you hear about network security?

Maish: You can't expect the unexpected, but you can always fear it: expose as little as possible, in other words, lock down the security groups as much as you can. One other thing I wanted to point out about security groups: this works only when you're using awsvpc network mode on ECS. That means on Fargate, where it's the only mode there is, and on EC2 when it's using awsvpc mode. That's what gives you the added benefit of per-task security groups, which let you control the traffic, decide who's allowed to communicate with each service, and lock things down a lot more.

Ashok: Yeah, that's a good point.
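The per-tier ingress chain described above can be summarized in a small sketch; the group names and ports are hypothetical placeholders.

```python
# Sketch of a three-tier ingress chain: each layer's security group admits
# traffic only from the layer directly in front of it. Names and ports are
# hypothetical placeholders.
ingress_rules = {
    "alb-sg":      [{"port": 443,  "source": "0.0.0.0/0"}],   # public HTTPS only
    "frontend-sg": [{"port": 8080, "source": "alb-sg"}],      # only from the ALB
    "backend-sg":  [{"port": 9090, "source": "frontend-sg"}], # only from front end
    "database-sg": [{"port": 5432, "source": "backend-sg"}],  # only from back end
}

# Referencing a security group (rather than a CIDR) as the source is what
# pairs naturally with awsvpc networking: each task gets its own ENI and can
# therefore carry its own security group.
```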
Ashok: The other important thing we come across is enabling VPC Flow Logs, which provide visibility into the network traffic flowing through the VPC. That's another tool in the toolkit to help you visualize which components are talking, what connections are going through, what connections are getting denied, and so on, to get some observability into the network traffic.

Nidhi: One other question we do get: when should we think about using PrivateLink endpoints? Could you comment on that?

Ashok: Yeah, absolutely. PrivateLink helps our customers talk to other AWS services over the AWS backbone network. Given that all the service APIs are publicly available, the default way of connecting to the APIs or service endpoints is over the public internet, but we have customers in regulated industries who don't want to route traffic over the public internet; they want to keep the traffic within the VPC, on the AWS backbone, and PrivateLink helps them do that. That's another common thing we hear from customers. We could talk about security all day long; there are so many different things to venture into, but we've simplified it to these four high-level themes we hear most often. The next pillar we're going to talk about is reliability. Maish, when you think about reliability, what's the one thing that pops into your mind?

Maish: Put shortly, when we're talking about availability it's the capacity of a customer to actually get the functionality they're looking for from the service. In other words: is the API available, am I able to access the website, am I able to perform a specific command or function with the application? From the customer's perspective, I care less about what's happening under the hood with all the nuts and bolts and the bits and bytes; I just want to know that my application is working correctly, and if it's not, how long it will be down for and how soon I can get it to recover.

Ashok: Yeah, it's all about making sure the application is doing its job at all times, that it's reliable and highly available. That's a common theme when we design distributed applications in the cloud, and a common ask is: how can we maintain high availability when running on Amazon ECS?

Nidhi: I'll just add: Maish raised an important concern, "I don't care about what's going on under the hood; I want to maintain uptime and make sure my application is resilient and available." I will say, over the years of talking to customers who use a more managed service like Fargate, things get a little more nuanced: they want to know what would happen if an AZ is impaired, what the scheduler is doing behind the scenes. All this to say, they trust us at AWS to design the most highly available control plane under the hood and to make sure that when tasks fail, another one comes up and the application keeps running. There is a certain notion that Fargate is a black box, but there's one talk I would highly recommend if you want to really deep dive into all the availability improvements we've done for Fargate; it was given at re:Invent. Maish, I don't know if you were on it, but definitely go check it out to see exactly how the scheduler builds the service in a resilient way.

Maish: I'm going to paste the link over here, and it can of course be in the show notes.

Ashok: Yeah, thanks. When I think about high availability, I always remember our CTO's quote: everything fails, all the time. It's all about how we architect to withstand those failures, and that's the theme behind these designs. One of the best tools when designing for high availability is utilizing multiple Availability Zones; these AZs provide the fault tolerance needed to implement high availability. It's always recommended to run multiple copies of your application, multiple tasks of your ECS service, across multiple Availability Zones, as Nidhi was alluding to. If you're using EC2, you can use placement strategies to spread tasks across AZs; if you're using Fargate, the scheduler does it for you by default, so you don't have to worry about it. Those are the things customers should look at when designing for high availability. The other common question we get is: how many tasks should I run for a service? It's a million-dollar question; I wish I had a bunch of those in my bank account. The magic formula here is to first find the base desired count of your application, the number of ECS tasks required to meet your steady-state traffic. In this example calculation, six tasks are required to meet the baseline demand. To design for high availability, you apply the formula: base desired count times target Availability Zones, divided by target Availability Zones minus one. That gives you the magic number of tasks to run to achieve high availability. Maish, Nidhi, from your experience, how many customers do you see using multiple Availability Zones, and is two AZs the right number, or three? What is that magic number you come across?
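The formula just quoted, base desired count times target AZs divided by (target AZs minus one), can be written as a small helper to see what it implies (an illustrative sketch, not an AWS utility):

```python
import math

def ha_task_count(base_desired_count: int, target_azs: int) -> int:
    """Tasks to run so that losing one Availability Zone still leaves at
    least base_desired_count tasks serving traffic: base * N / (N - 1),
    rounded up to a whole task."""
    if target_azs < 2:
        raise ValueError("AZ-level HA needs at least two Availability Zones")
    return math.ceil(base_desired_count * target_azs / (target_azs - 1))

# The example from the talk: 6 tasks of steady-state demand across 3 AZs.
print(ha_task_count(6, 3))  # -> 9: lose one AZ (3 tasks) and 6 remain
print(ha_task_count(6, 2))  # -> 12: with only 2 AZs you must double up
```

Notice how the over-provisioning overhead shrinks as the AZ count grows (100% extra at 2 AZs, 50% at 3), which is exactly the "more AZs is more bang for your buck" point made in the answers that follow.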
there is no magic number the the I the best thing is you for the the most bang for your bu your for your buck is to um use as many availability Zs as possible the reason for that being is if for example you um one of the availability zones fail if we taking this example of three availability zones then you lose 30% of your available capacity which will have to start up somewhere else in the meantime in the talk that we were talking before on which I put the link for we discuss how you actually should build your applications to be provisioned for an availability of 150% that's what the way we do it in other words we over provision our things so that even in the event of an availability Zone failure we still are able to manage and accommodate 100% of the traffic and only the the remaining availability zones but the more availability zones the less amount of capacity that might fail will have less of an impact which means usually you need less instances in more availability zones than in smaller smaller number of availability zes yeah absolutely I think this formula is kind of alluding to that to make sure sure it can withstand uh loss of one availability Zone and depending upon how much uh like redundancy you have you just have to take this formula and then uh run that many number of tasks I I will add right when I when I talk to some customers are super early in their Journey uh the important part is to start with the multi- a posture what I see customers going is I need to build multiple regions uh and then when the reality of the implications of the cost and even the RTO and RPO requirements when you kind of Deep dive you realize the probability of an entire region being nuked it's it's low but at the very minimum you should at least start with a multi- A posture before and there are very good reasons to go multi- region absolutely there are uh but as you're kind of beginning to think about your you know resilient architecture and the very first step would at 
least be well if you have you know there's no magic number but if you have you know more than one a which is most of our regions you should at least be building uh deploying your application in yeah absolutely uh start with multi a and look for multi- region based on the business needs yeah exactly I mean the other theme to this High availability is it's good I mean it's great to have this availability but how can we make sure it is working as it intended to and that's where uh chaos engineering comes into the picture uh I mean that's where before you go to chaos engineering I want to stop you for one second sorry um and welcome everybody who might have joined us after after we started the session today we're talking about the how we are going how you can deploy your applications on ECS and invate according to the well architected pillars and um we're here with nidi and today and if you missed the beginning of the session please go back and catch the beginning from the recording it's very very interesting and again thank you very much for all the people which are participating so amazingly in the in the chat it's a lot of fun and I'm trying to answer as much of the questions as I can so back over to you as yeah thanks M for the context yeah uh this chaos engineering or fault injection uh simulator service is one tool which can help us to kind of test our resiliency or reliability aspect of our application so we can inject uh it helps us to inject various uh type of failures into our application and and kind of run the testing to see how our applications are kind of Behaving uh to these failures and it is always uh important to kind of embed this as part of your application life cycle so that whenever you are releasing a a a new code change or a new major release you know how reliable are uh that uh code version would be so I'd like to ask like nidi and mahish right so how often do we see this our customers uh using this and have you come across any other the tools 
to implement this C Engineering in ECS World need you want to take this one or should I can also but I'll let you let you go first okay thanks yeah I would say uh we released this quite recently it's something that folks are very used to when they come from an ec2 world uh when I talk to a lot of fargate customers this is something that they would like to implement right uh and and and more and more I hear about folks doing it but like proactively are folks doing this no not a not a whole lot well it just Tak time and resources um but I think with the integration of f uh and and being able to have these simple ECS actions I definitely hear good things about it from customers and and to roll of need said is yeah it it takes a lot of maturity from the customer to be able to run these kind of experiments I would say the first time when you're starting to get your application up and running and you're building it and you don't have that much traffic and you have a small team it's not the top of mind you're more the fact of delivering features delivering value and when your first outage hits you then you say oh maybe we should have been testing and and seeing these things before and then it becomes a priority so it it it takes a mature organization and a mature team to be able to implement these kind of things my personal thing is the sooner you do it the better it will be for you in the long run especially especially when you growing at a huge scale because when those things happen at scale a lot of things which you never ever anticipated in your life would happen will occur as we said with C with verer vogle said everything breaks all the time in the most Unthinkable places that you never thought about and it takes it it take it takes investment definitely it takes time but it makes you your team and your applications a lot better yeah plus one on it right so especially uh when customers are running this very very high SLA SLO applications uh security critical 
In critical monitoring, or the medical monitoring world, where the SLA — or rather, high availability — is paramount, I think implementing these techniques would really help them go far. And Maish, I'm not sure if you have come across the Netflix Simian Army — the set of tools which Netflix used to simulate various kinds of failures in the system. Correct — Chaos Monkey, Chaos Gorilla; I think that's what they used to be called a long time ago, and I believe the names have changed since, but the idea is, of course, letting something go. And please do not do this on production systems without testing first — but once you're at that mature state, it's pretty much the same comparison as, as they say, letting a monkey loose in a data center to start pulling out cables. Hopefully, if your system is redundant enough, it will be able to recover, and if not, you'll find the weak spots where you have to invest more time, so that if something like this were to happen again, you'd be able to easily jump over that hurdle without it affecting your end customers viewing your service — in this case Netflix, or whatever it may be. I just want to quickly add — and Maish, maybe you can drop a link — that you don't have to start from scratch. Outside of a blog that we obviously have, we've codified this in CDK and Terraform using ECS Blueprints, so take a look; that should serve as a good starting point. Yeah, I'll drop a link to that shortly, thank you. Last but not least in the reliability pillar is having a way to troubleshoot these failures when they do occur, and that's where features like ECS Exec will help: you get an interactive shell to a running container or running task, and then you're able to do all the debugging which you would like to do.
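The ECS Exec flow just described boils down to one API call. Here is a sketch of the request you would pass to `aws ecs execute-command` (or boto3's `ecs.execute_command`), assuming the service was created or updated with `enableExecuteCommand=True` and the task role has SSM permissions; the cluster, task, and container names are placeholders.

```python
# Sketch: request parameters for ECS Exec (interactive shell into a container).
# Cluster/task/container names are illustrative placeholders.

def exec_command_params(cluster: str, task: str, container: str,
                        command: str = "/bin/sh") -> dict:
    """Build an execute-command request for an interactive debugging session.

    Prerequisites (not shown): the service must have enableExecuteCommand=True,
    and the task role needs the SSM Session Manager permissions.
    """
    return {
        "cluster": cluster,
        "task": task,
        "container": container,
        "command": command,
        "interactive": True,  # attach a TTY so you get a live shell
    }

params = exec_command_params("dev-cluster",
                             "arn:aws:ecs:us-east-1:111122223333:task/abc123",
                             "app")
```

As noted in the discussion, this is best enabled in development and non-prod environments; in production, prefer leaving it off and relying on your logs and metrics.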
It's just traditional shell access to the running process — you can look at the file system or the processes, all of those things, to debug further. Again, it's nice to have, it's good to have, and you should enable it in your development or non-prod phase, not in production. This is especially useful for services like Fargate, where the compute instances are managed and run inside Fargate-managed infrastructure. I think that wraps up the reliability pillar. Next we want to go to the performance efficiency pillar. Again, I'd like to ask the same question: what do you think of when you think about performance efficiency, for AWS or for ECS? So: what would be the most optimal amount of resources for me to use, with the least amount of money — that would be it, in other words. And there's another whole aspect: not only the least amount of money, but also the most green and sustainable way — not only saving cash, but being good to the environment and using resources responsibly, so that you don't just spin up more compute and use more electricity and more resources, which is not good for any of us. Absolutely right. It's all about using the right number of resources to meet our application requirements. In order to do that, probably the first thing we should know is what our application actually needs. Often customers ask: how much memory, how much CPU, how much GPU should I assign to my workload? Again, another million-dollar question — there is no right or wrong answer. Every application is unique; it's all about understanding the nature of the application you are running. The key is load testing, or profiling your application, to figure out what resources it needs — CPU, memory, GPU, and so on — and once you have found those requests or limits, customers can assign them in the
task definition, in the form of reservations and limits in ECS, so that when ECS launches your tasks, it makes sure they get the right amount of capacity — resources meeting those requirements. Some of these are optional when you're running on EC2 infrastructure, but it's always a best practice to define them for every application you run, and it is mandatory when you are running on the Fargate launch type. And how often — Nidhi, how often do you see customers implementing this load testing in the real world? I would say there are customers who do this once, when it should really be a continuous exercise. Very often the common thing I get from a customer is: just tell me how many vCPUs and how many gigabytes I need for my application — and they've not done that load testing. So it's one of those things — it's a sweet sorrow, but you have to do it, because you know your application best. Yeah, I had one question, though: for tasks that run on Fargate, can you talk about some other tools that we have to right-size services, and talk about over-provisioned and under-provisioned services? Maybe you're just getting to it, but I'm going to ask anyway. Yeah, absolutely — I think you're alluding to the next question. It's all about how we can make sure we are assigning the right amount of resources, and that the resources we've assigned are what our applications are actually using. That's where tools like Compute Optimizer come into the picture: it looks at your resource usage versus how much you allocated, and proactively generates recommendations for you to right-size — whether that's upsizing or downsizing those Fargate ECS tasks. So it's always a best practice to look at those recommendations and act on them.
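The reservations-and-limits point above can be sketched as a task definition fragment. On Fargate, task-level `cpu` and `memory` are mandatory and must be one of the supported combinations; on EC2 they are optional but recommended. The family and image names below are placeholders, and only a few of the valid Fargate combinations are listed for illustration — check the ECS documentation for the full table.

```python
# Sketch: sizing fields of an ECS task definition, with a small validity check
# for a handful of the documented Fargate CPU (units) -> memory (MiB) combos.

FARGATE_COMBOS = {
    256:  {512, 1024, 2048},                 # .25 vCPU: 0.5, 1, or 2 GB
    512:  set(range(1024, 4097, 1024)),      # .5  vCPU: 1-4 GB in 1 GB steps
    1024: set(range(2048, 8193, 1024)),      # 1   vCPU: 2-8 GB in 1 GB steps
}

def task_definition_sizing(cpu: int, memory: int) -> dict:
    """Return a (partial) register-task-definition body; reject invalid combos."""
    if cpu in FARGATE_COMBOS and memory not in FARGATE_COMBOS[cpu]:
        raise ValueError(f"{memory} MiB is not valid for {cpu} CPU units on Fargate")
    return {
        "family": "my-service",                  # placeholder family name
        "requiresCompatibilities": ["FARGATE"],
        "cpu": str(cpu),                         # task-level reservation
        "memory": str(memory),                   # task-level hard limit
        "containerDefinitions": [{
            "name": "app",
            "image": "my-repo/app:latest",       # placeholder image
            "memoryReservation": memory // 2,    # container-level soft limit
        }],
    }
```

The values you plug in should come from the load testing and profiling the speakers emphasize, not from guesswork.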
Yeah — and I will say Compute Optimizer is completely available to you free of charge. You go in and enable it; it's dead simple — go to the console and enable Compute Optimizer. There are some caveats, and we have an active roadmap even for Compute Optimizer, but if you're using target tracking on vCPU, you should be able to get recommendations for your service — it delivers recommendations at the service level, as Ashok said — though it doesn't proactively apply them for you. And it's not just about resource management: it does have a pulse on how performant your application needs to be. For instance, it's not that your service runs for a day and it tells you two vCPUs is more than you need — no, it looks at a history of 14 days and uses that average utilization to give you recommendations. So, in my mind, it's a slam dunk: if you haven't seen Compute Optimizer, go enable it. The worst thing you can do is reject those recommendations and not act on them, because you may actually need a larger task size. For ECS on EC2, that's unfortunately not available — that's one common customer ask — so we have other ways you can right-size ECS tasks running on EC2. But in general, I will say that when customers think they can manage their EC2 cluster themselves, their utilization levels are really low, and that's where Fargate, by definition, is able to get clusters better utilized, with that one task per VM. And at the service level, when you aggregate all your tasks and see that the service is over-provisioned, you can look at Compute Optimizer and take those recommendations to right-size — given that Fargate task sizes are very discrete.
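A toy version of the right-sizing logic just described: take the average utilization over the 14-day lookback window, add headroom, and snap to the nearest valid Fargate CPU size. The 14-day window and the discrete size ladder (0.25 to 16 vCPU) are real; the 20% headroom policy here is an illustrative assumption, not Compute Optimizer's actual algorithm.

```python
# Illustrative right-sizing sketch (NOT Compute Optimizer's real algorithm):
# average utilization + headroom, snapped up to a valid Fargate CPU size.

FARGATE_CPU_SIZES = [256, 512, 1024, 2048, 4096, 8192, 16384]  # .25 to 16 vCPU

def recommend_cpu(current_cpu: int, avg_util_pct: float,
                  headroom: float = 0.2) -> int:
    """Smallest Fargate CPU size covering observed demand plus headroom."""
    demand = current_cpu * (avg_util_pct / 100.0) * (1 + headroom)
    for size in FARGATE_CPU_SIZES:
        if size >= demand:
            return size
    return FARGATE_CPU_SIZES[-1]

# A task provisioned at 4 vCPU but averaging 20% busy fits in 1 vCPU:
# 4096 * 0.20 * 1.2 = ~983 CPU units, which snaps up to 1024.
one_vcpu = recommend_cpu(4096, 20.0)
```

Because the ladder is discrete, the recommendation can only land on one of these steps — which is exactly the "it isn't as granular" caveat mentioned above.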
You go from 0.25 vCPU all the way to 16 vCPUs, so it isn't as granular — but that's still not to say there isn't meaningful cost optimization that can be done. Yeah, all great points. Bottom line: make sure you are profiling your application and figuring out the right size — that's the most important thing, for sure. Yeah, that became a very aggressive recommendation to use Compute Optimizer, but there are a lot of customers who are simply not aware of it. Absolutely. Last but not least in this pillar — and there is overlap here with the cost optimization pillar — is how a customer can reduce the overall cost of a running cluster. There are primarily four dimensions to it. One is understanding what it costs today, so you know what to optimize for. The important thing is getting cost visibility: there are tools like Cost Explorer or the Cost and Usage Report, so make sure you tag your AWS resources so that you get the right visibility into cost. The same applies for ECS tasks: tag them with the appropriate tags so you can visualize those costs in those tools. The second aspect is running the right amount of resources — the right number of instances of your application — and that's where autoscaling comes into the picture, whether it is service autoscaling or compute autoscaling. Service autoscaling is all about running the number of ECS tasks meeting your demand at that particular point in time, so implementing service autoscaling policies will make sure you're running the right number of ECS tasks at all times. And I think, Maish, you were alluding to this at the beginning: using the right compute capacity, which is sustainable in nature — and that's where resources like Graviton instances come in, which provide up to 40% price-performance benefits compared to x86.
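The service-autoscaling point above maps to a single Application Auto Scaling call. Here is a sketch of a `put-scaling-policy` request that target-tracks average service CPU; the `ecs` namespace, `ecs:service:DesiredCount` dimension, and `ECSServiceAverageCPUUtilization` metric are the real identifiers, while the 60% target, cooldowns, and cluster/service names are illustrative choices.

```python
# Sketch: a target-tracking scaling policy for an ECS service, shaped like an
# Application Auto Scaling put-scaling-policy request. Names/values are
# placeholders except for the documented namespace, dimension, and metric type.

def target_tracking_policy(cluster: str, service: str,
                           target_pct: float = 60.0) -> dict:
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_pct,  # hold average CPU near this percentage
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
            "ScaleOutCooldown": 60,    # react quickly to rising load
            "ScaleInCooldown": 300,    # scale in conservatively
        },
    }
```

The asymmetric cooldowns reflect a common design choice: scale out fast to protect latency, scale in slowly to avoid flapping.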
This has become especially obvious in the last few years, where modern application languages all support the arm64-based CPU architecture — there is really no reason nowadays not to use Graviton instances. Again: pick the right instance and run it in a more cost-efficient and sustainable manner. And last but not least in this pillar is using the right purchasing option, or rather the right compute capacity option. There are On-Demand and Spot, so if the workload is interruptible, you can always utilize Spot capacity to get deep discounts. And once you reach a steady state for your applications, you can look at Compute Savings Plans or Reserved Instances to get discounts for a long-term commitment. Maish, do you think of any other aspects of reducing cost in ECS clusters, or how often do you see customers using Graviton or these other sustainable compute instances? So, I'm seeing more and more customers move to Graviton because, as you said, it's cheaper and performance is better — in other words, you use fewer compute resources. Instead of running ten T- or M-family extra-large EC2 instances to run a certain number of tasks, they'll see a reduction of somewhere up to about 30%; they get more bang for the buck. But the most important thing — another point which ties back to right-sizing — is: whatever you're not using, shut it down. For example, test environments which don't have to run on weekends or holidays: you don't have to delete your cluster, you can just scale them down to zero and bring the resources down to a minimum. With Fargate, you then don't pay for anything; with EC2, you'll pay only for the minimum number of EC2 instances left running. Shutting things off when you don't need them will also save you some money. Yeah, Maish, you literally just stole my line of thought.
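"Scale test environments to zero outside working hours" can be expressed as two Application Auto Scaling scheduled actions. The request shape below follows `put-scheduled-action`; the service names, cron expressions (UTC), and capacity numbers are illustrative assumptions for a weekday-only dev environment.

```python
# Sketch: scheduled actions that scale a non-prod ECS service to zero at night
# and back up on weekday mornings. Names, times, and capacities are placeholders.

def office_hours_schedule(cluster: str, service: str) -> list:
    common = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
    }
    return [
        # Weekday mornings 08:00 UTC: bring the environment back up.
        {**common,
         "ScheduledActionName": "scale-up-morning",
         "Schedule": "cron(0 8 ? * MON-FRI *)",
         "ScalableTargetAction": {"MinCapacity": 2, "MaxCapacity": 10}},
        # Every evening 20:00 UTC: scale to zero -- on Fargate this costs nothing.
        {**common,
         "ScheduledActionName": "scale-to-zero-evening",
         "Schedule": "cron(0 20 ? * * *)",
         "ScalableTargetAction": {"MinCapacity": 0, "MaxCapacity": 0}},
    ]
```

Note that weekends are handled implicitly: the evening action runs daily, and the morning action only fires Monday through Friday, so the service stays at zero all weekend.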
These are four levers, but they're not to be used independently — I think that's the one point I want to drive home. You can perform scheduled scale-in at night when your applications aren't receiving traffic; at the same time you could run that exact application on Graviton; you can layer it with Spot for those lower environments where your workloads can be designed to be interruptible; and you can leverage a Compute Savings Plan on Graviton, or on x86. So we have multiple levers, and they can be layered — it's not "I'm doing this one thing and this is how I'm reducing costs." And I think ECS makes it easy to layer all of these hardware options, along with purchase options, along with flexible compute — which in this case is Spot — to really get to "I want to reduce cost, but at the same time I don't want to be any less performant." Like one customer I was talking to: when you see Graviton adoption, it's not just "I was able to reduce my cost" — they went down from 120 tasks to 96 tasks running the exact same application on Graviton. So again, your mileage may vary, but if you're building a net-new workload, I would say start with Graviton. Currently Fargate supports Graviton2, which offers that 40% better price performance. Yeah, all great points. I think that wraps up the performance efficiency pillar, and that takes us to the last pillar, which is operational excellence. So, Nidhi, I'd like to ask: what are the tools that our customers use to get operational visibility — or rather, observability — into their ECS workloads? What do you see in the field? Yeah, absolutely. So, you know, it's important to
understand that when customers think of observability — at least when they're starting out — they think of it as just a giant log: being able to figure out what's going on with your applications. But as you dive deeper, you want to build your entire application to be observable from the very beginning. So the way we think about observability is not just the ability to instrument your app to collect metrics — it's metrics, logs, and traces. The common questions a fully observable app would be able to answer are: what is the CPU utilization of this service — why is it at 85%, or whatever it is? How many methods in this service complete under X milliseconds? How many HTTP requests per second is this API handling? Has my deployment transitioned from pending to a steady state? These are the kinds of questions you should be able to answer if you've built an application that is fully observable. In the AWS Cloud, specifically for ECS, these three capabilities largely map to CloudWatch metrics, CloudWatch Logs, and X-Ray for traces. Those are our native integrations — which is not to say we don't support open-source managed services: we have integrations with Amazon Managed Service for Prometheus (AMP), and for tracing we can talk a little more about the ADOT collector — it's a little more advanced, but we can definitely talk about it. One thing I did want to hone in on is that for a lot of customers beginning their observability journey on ECS, the first stop would be Container Insights. So what is Container Insights? It gives you dashboards — if you've enabled Container Insights, you can collect and summarize your metrics
These are instance-level metrics — CPU, memory, disk, network usage. You go to the Container Insights console and you can see these metrics from your ECS tasks, collected automatically by Container Insights. Much less talked about, but you absolutely want to start framing your thinking in terms of events — how do I capture granular performance events? — and then we'll talk about EventBridge as well. And then application logs: your standard out, standard error, health checks, and custom logs from your containerized application and microservices. So Container Insights gives you that first stop where you're able to collect all these metrics. For logs, use the Amazon CloudWatch Logs service — how many of you are familiar with it? Probably quite a few. It allows you, in near real time, to collect and store logs from your resources and your apps. And think of CloudWatch Logs Insights: it's backed by a powerful query language, so you can slice and dice terabytes of logs whichever way you want. One tip — or rather a question I get asked — is: which log driver is recommended? ECS has built-in support for the awslogs driver. It sends logs automatically to CloudWatch Logs, where you can store them for long-term retention and analytics; there's no agent needed, no configuration — it's built in. But then the question I get asked is: okay, great, I have the awslogs driver, but what if I want to perform filtering? What if I want to do metadata enrichment? What if I want to send logs to a different log aggregator? And this is where we came up with FireLens — I love the name. FireLens is another option for collecting logs on ECS: a log-routing solution that lets you send your logs to a number of different AWS services or partner services, and it works for ECS on EC2 as well as Fargate.
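The two log-driver options just contrasted differ only in the `logConfiguration` block of a container definition. Here is a sketch of both, as Python dicts: the `awslogs` option keys and the `awsfirelens` driver with the Fluent Bit `firehose` output are the documented shapes, while the log group, region, and delivery stream names are placeholders.

```python
# Sketch: containerDefinitions logConfiguration fragments for the two options
# discussed above. Log group / region / stream names are placeholders.

def awslogs_config(log_group: str, region: str = "us-east-1") -> dict:
    """Built-in driver: ship stdout/stderr straight to CloudWatch Logs."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": region,
            "awslogs-stream-prefix": "app",
        },
    }

def firelens_config(delivery_stream: str, region: str = "us-east-1") -> dict:
    """FireLens routing: a Fluent Bit sidecar forwards logs, here to Firehose.

    Requires a second container in the task with
    firelensConfiguration: {"type": "fluentbit"} (e.g. the AWS Fluent Bit image).
    """
    return {
        "logDriver": "awsfirelens",
        "options": {
            "Name": "firehose",              # Fluent Bit output plugin
            "region": region,
            "delivery_stream": delivery_stream,
        },
    }
```

Rule of thumb from the discussion: start with `awslogs` (zero setup), and reach for FireLens only when you need filtering, enrichment, or a non-CloudWatch destination.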
You can actually use Fluentd or Fluent Bit — and we provide an AWS Fluent Bit image; you're certainly welcome to bring your own, but we provide one. So that's logs, that's metrics. I did want to talk about events and EventBridge. I love EventBridge — I feel like a lot of folks don't leverage it as much as they should. Here's a classic use case: you have a deployment — you're using CodeDeploy, for better or worse — and that deployment fails. That event goes to the event bus, and you define an event rule for that event to take an action. So it's really about instrumenting your app to get more granular information, because you have these actions as targets in an event rule. It's like an alarm action: you can define an SNS topic as the event target and have the rule send a message to the topic to get notified; you can take automated responses; you can integrate with Lambda. So, in case you haven't checked out EventBridge: it's a serverless event bus service, and it lets you connect your applications with all this data, delivered pretty much in near real time. Let's assume you are using an ECS service — what kinds of events could you send to EventBridge? It could be a container instance state change; a task state change, from pending to running, or running to stopped; a deployment state change; or an API call via CloudTrail. So there are lots of popular rules to get started with — definitely take a look at EventBridge as well. What did I miss? Traces — that doesn't come up in a lot of conversations. Maish, you're on mute. You're on mute, Maish. Sorry — we have another four minutes to go.
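The ECS events just listed arrive on the bus with a fixed `source` and `detail-type`, so a rule is just a pattern over those fields. Below is the pattern for "a task stopped" — the `aws.ecs` source and `ECS Task State Change` detail-type are the strings ECS actually emits — plus a tiny matcher that mimics a minimal subset of EventBridge's matching semantics for illustration.

```python
# Sketch: an EventBridge rule pattern for stopped ECS tasks, plus a toy matcher
# (a minimal subset of real EventBridge matching, for illustration only).

TASK_STOPPED_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}

def matches(pattern: dict, event: dict) -> bool:
    """Every pattern key must exist in the event; a list means 'any of these
    values', a dict recurses into the nested structure."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

# A (simplified) event of the kind ECS emits when a task stops.
stopped_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {"lastStatus": "STOPPED",
               "stoppedReason": "Essential container in task exited"},
}
```

Wiring `TASK_STOPPED_PATTERN` to an SNS topic or Lambda target gives you exactly the "deployment failed, notify me" loop described above.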
So I just wanted to make sure we're aware of the time. I am so sorry — I went off right there. Okay, there's a whole lot of ground to cover; we can do a whole observability deep dive, as Maish mentioned. I did want to leave you with this: when you think about everything you need for fast and frequent deployments — which is basically the operational excellence pillar — there are a bunch of tips and tricks you can apply. Container image pull time is just one aspect of it. It gets talked about a lot when you have these ML applications with 20-gigabyte or larger images, where you really want to optimize the image size and reduce that pull time. But looking across the components of the entire deployment, there is a common one: the load balancer health check. The default — I believe it was five minutes; it's actually 2 minutes 30 seconds, my bad — is an awfully long time. You can change the health check interval to 5 seconds and set the healthy threshold count to two, and now it takes 10 seconds before the load balancer marks the target healthy and ECS considers the container healthy. Same thing with the deregistration delay: the default is 300 seconds. If your service is a REST API where the average response time is under a second, just reduce that delay — to, say, 30 or even 5 seconds. There, you've saved some time. Then there's your image pull behavior: by default the image is pulled fresh; you want to set it to prefer the cached version of your image. And I have three minutes left, so I would be doing a disservice if I didn't mention Seekable OCI.
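The health-check numbers above are worth checking with simple arithmetic: time-to-healthy is the interval times the healthy-threshold count, and the ALB defaults (30-second interval, 5 consecutive successes) give exactly the 2 minutes 30 seconds mentioned.

```python
# The health-check timing math from the tip above.

def seconds_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Time before a load balancer target is marked healthy."""
    return interval_s * healthy_threshold

# ALB defaults: 30s interval x 5 consecutive successes = 150s (2 min 30 s).
default_wait = seconds_to_healthy(30, 5)

# Tuned for fast deployments: 5s interval x threshold of 2 = 10s.
tuned_wait = seconds_to_healthy(5, 2)

# Deregistration delay: default is 300s; for a sub-second REST API, something
# in the 5-30s range (as suggested in the talk) drains connections just as well.
DEFAULT_DEREGISTRATION_DELAY_S = 300
```

Together, the tuned health check and a shortened deregistration delay can remove several minutes from every rolling deployment, before you even touch image size.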
We have a ton of resources on how to build an image with it. It's an open-source technology, and what it really does is shave off a huge amount of time from pulling an image from the registry, by lazily loading only the data layers that are actually needed. This currently works on ECS Fargate — and Maish, if you could drop a link on how a customer would get started with SOCI, that would be awesome. That's another lever you can pull to reduce image pull time: now you're talking about going from over two minutes to maybe 60 seconds, which is a significant chunk shaved off the deployment. All right, I sort of rushed through all of this — Maish, do you want to add something? I just want to close up, because we have one more minute. First: Ashok, this resource you have on the screen is amazing — is there any way our customers can use it after the show? Can I drop a link in the comments so that everybody can get it? I'm going to put it over here. If you want to download it, it may take a bit of time because it's a big graphic, so don't worry if the file takes a while. But this is a very good resource you can use yourself. The one last thing: I wanted to thank everybody, of course — all the people who participated in the show today. I'm not going to go through all the names because I only have a few seconds left before the end of the stream, but in any event, thank you very much for participating today. Thank you, Nidhi, and thank you, Ashok, for an amazing session. Don't forget to join us again on Containers from the Couch next week, same time — it will be an EKS session — and don't forget to subscribe to the YouTube and Twitch channels for more amazing content, anything related to AWS and specifically containers, on Containers from the Couch. Thank you so much for joining us today, and everybody have a great day. Thank you. [Music]
Info
Channel: Containers from the Couch
Views: 566
Id: s0JwBkellik
Length: 60min 5sec (3605 seconds)
Published: Fri Mar 22 2024