AWS re:Invent 2019: [REPEAT 1] AWS Fargate under the hood (CON423-R1)

Captions
Hi everybody, my name is Archana Srikanta and I'm the founding engineer of Fargate. I've been with AWS for eight years; I've been through EC2, ECS and now Fargate. I'm joined by Onur Filiz, who's our awesome principal engineer on the Fargate team, and together today we're going to walk you through the Fargate architecture. We'll tell you all about how it works under the hood and how it's continuing to evolve as we speak. We hope that at the end of this talk, after you see and understand some of the inner workings, you're able to better reason about the service for your own application use cases, and also that some of the more general design principles we've used in building Fargate, and at AWS in general, strike a chord with you so you can take those principles back to your own applications. We really do believe that the services we build here at AWS are probably not that different from the types of services and applications that you build on AWS, so hopefully we can share some lessons and learnings there. With that, let's get started.

The goal of AWS Fargate was very simple and straightforward: we wanted to give our customers a way to run containerized applications on AWS without ever having to provision, scale, manage or interact in any way with virtual machines or the underlying EC2 instances. A completely container-native experience on AWS.

A quick recap of what that container-native experience looks like on Fargate. First you register a task definition. A task definition is just a specification document that tells us about your application containers: the image URI, the amount of CPU and memory for your application, and so on. Notice that within a task definition you can have multiple containers, and what this means under the hood is that we treat all the containers within a task definition as a single unit and deploy them onto the same virtual machine, so they're co-located. Then you create a cluster, and note that this isn't a group of EC2 instances or VMs; it's rather just a way for you to group your applications, so some customers use it as a namespacing construct or a permissions boundary, for example. Then you call the RunTask API with the task definition and a cluster, and boom, you have a serverless instantiation of your containerized application running on Fargate. As you can see, there were no EC2 instances or VMs involved in this experience.
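To make that flow concrete, here is a minimal sketch of the same experience using the public ECS APIs through boto3. The family name, cluster name, image, region and subnet IDs are placeholders for illustration, not values from the talk.

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Register a task definition: the spec document describing the containers.
ecs.register_task_definition(
    family="demo-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",          # required network mode for Fargate tasks
    cpu="256",                     # task-level CPU (0.25 vCPU)
    memory="512",                  # task-level memory (MiB)
    containerDefinitions=[
        {
            "name": "web",
            "image": "public.ecr.aws/nginx/nginx:latest",
            "essential": True,
        }
    ],
)

# A cluster is just a grouping/namespacing construct, not a set of instances.
ecs.create_cluster(clusterName="demo-cluster")

# Launch the serverless task; subnets are placeholders.
ecs.run_task(
    cluster="demo-cluster",
    taskDefinition="demo-app",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
            "assignPublicIp": "ENABLED",
        }
    },
)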
As we were cooking up this service, once we settled on the overall goal and the customer experience, the ball gets handed down from the product team to the engineering team, and one of the first pieces of documentation we put down as the Fargate engineering team was a list of design tenets. These are broad, general principles we wrote down to help guide the engineering team through the design process and the technical decisions we had to make along the way, especially when there are hard trade-offs to make. These are the tenets we came up with for Fargate, and they're in priority order, which is important.

Priority number one is security, always, period. If we're ever faced with a design choice where one design is so much better than the other along all these dimensions and axes, but we just have to make one small security compromise, we will not do it; we will not even consider that option. Security in the context of Fargate means that the entire hardware and software stack underneath your task is always patched, secured and up to date, and that your task, your application, is protected from any kind of unauthorized access, whether that's over the network, access to the credentials and permissions your task may be using, or access to the actual data your task is crunching on.

Our second tenet is availability and scalability. We really strive to maintain the uptime of the tasks running on Fargate, and not just running tasks: we also strive to make sure the service itself is up and available to accept new task launch requests, or stops for that matter, so you have a reliable way to scale your applications up and down as the need of the hour changes.

Then, once we have a rock-solid security and availability story, we focus on operational efficiency. With a service like Fargate we see tens of millions of launches a week, and we're running a huge server fleet underneath all of this to host your tasks. At that scale it becomes important for us to constantly identify and eliminate sources of waste in that fleet and constantly improve our resource utilization. Oftentimes this gets passed back to you, our customers, as price drops; in fact we had a major price reduction for Fargate earlier this year, and we announced Savings Plans for Fargate. This is the engineering tenet that makes those kinds of improvements possible.

So the agenda for today: we'll walk through an introduction of the Fargate architecture, and then we'll walk through each of these tenets to see how they've actually influenced the design. You'll see that these are not just things we write down on a piece of paper and never look at again; they're very much living, breathing principles that really are embodied in our designs. And finally, some of you might have heard all the buzz around Firecracker. It's a new virtualization technology developed at Amazon, and Fargate is actually in the process of adopting Firecracker under the hood. The reason we're adopting Firecracker is that we believe it's really going to help us along these tenets, so as we go through each tenet we'll stop to see how Firecracker is helping on that front.

All right, the Fargate architecture. For the purposes of this presentation we're going to focus on the RunTask API, because that's where all the magic happens. Broadly speaking, the Fargate architecture consists of a control plane and a data plane. The control plane is the set of services running the Fargate business logic that orchestrates your task launches, so it's the brains of the system, and the data plane is the huge fleet of servers that run your containers, so it's the brawn of the system.

This is the high-level RunTask call flow. When you call RunTask, it first hits our control plane. We then reach out to the data plane, acquire and reserve server capacity for your task, and persist intent in the control plane to launch that task onto that capacity. Then the call returns back to you with the pending task. Asynchronously, the control plane reaches down to the server capacity that was reserved and issues the instructions to launch the containers in your task definition. We bring up the containers in the data plane on the server that was chosen, and then the data plane reports back to the control plane saying the containers are up and running.
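That pending-to-running transition is visible from the client side as well. Here is an illustrative sketch, again with boto3 and placeholder names, of launching a task and waiting for the control plane to report it as RUNNING.

import boto3

ecs = boto3.client("ecs")

# RunTask returns almost immediately, with the task still pending.
resp = ecs.run_task(
    cluster="demo-cluster",
    taskDefinition="demo-app",
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {"subnets": ["subnet-aaaa1111"]}
    },
)
task_arn = resp["tasks"][0]["taskArn"]

# Block until the data plane has reported the containers up and the
# control plane has flipped the task to RUNNING (or the waiter times out).
ecs.get_waiter("tasks_running").wait(cluster="demo-cluster", tasks=[task_arn])

task = ecs.describe_tasks(cluster="demo-cluster", tasks=[task_arn])["tasks"][0]
print(task["lastStatus"])  # "RUNNING"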
Now let's take a look at each of those boxes, the control plane and the data plane, in a little more depth, starting with the data plane. Remember that the data plane we're going to talk about here is the pre-Firecracker data plane; we'll cover the Firecracker-based data plane in a little bit.

So, the data plane: where do your containers run? Surprise, surprise, they run on EC2 instances. Now, these are not EC2 instances that live in your account, so you don't see them, but they are EC2 instances that live in a Fargate account, within a Fargate VPC. One other thing about these instances is that we don't actually launch them in the RunTask workflow itself, because it would take too long: VMs typically take minutes to boot, and that's in most cases unacceptable for container startup times. So we keep a pool of EC2 instances that are already up and running, and as your task launches come in, we pick an already-running instance and put your task on it.

Now let's zoom into one of these EC2 instances and take a look at what the full stack looks like, starting with the physical server. This is a physical server installed in an EC2 data center somewhere, and on that physical server there's a hypervisor that manages the VM virtualization for EC2. This portion of the stack, the server and the hypervisor, is all managed by the EC2 team. On top of that lives the EC2 instance; this is the Fargate instance that we launched using RunInstances, just like you would launch EC2 instances. Within this instance we run Amazon Linux 2 as our guest OS, and we install a Go agent, the Fargate agent, whose responsibility is to communicate with our control plane. Then we have the container runtime, which is Docker in this case, to actually spin up the containers. This entire portion of the stack within the EC2 instance is managed by the Fargate team, so we patch it, we upgrade it, you don't have to worry about that, and generally we maintain the full life cycle of this EC2 instance. Above that is our protected asset, which is your task and your containers.

Networking-wise, the primary network interface for this EC2 instance, eth0, is actually in the Fargate VPC, because the instance was launched in the Fargate VPC. But when this instance gets selected to host a task, we create an elastic network interface, or ENI, in your VPC, and we attach it to this instance as a secondary network interface. Any network traffic coming from the Fargate agent or any of the system components, anything between the Fargate task line and the EC2 instance line in the stack, goes out through the Fargate ENI, through our VPC; but any network traffic coming from your applications, including your image pull and log pushing, happens through your ENI, through your VPC. So that's an introduction to our data plane.
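That dual-ENI setup maps onto ordinary EC2 networking primitives. As a rough illustration only (this is not Fargate's internal code, and all IDs are placeholders), creating an ENI in a customer subnet and attaching it as a secondary interface looks something like this with boto3:

import boto3

ec2 = boto3.client("ec2")

# Create an ENI in the customer's subnet, with the customer's security group.
eni_id = ec2.create_network_interface(
    SubnetId="subnet-customer-1a",        # placeholder
    Groups=["sg-0123456789abcdef0"],      # placeholder
    Description="task ENI in the customer VPC",
)["NetworkInterface"]["NetworkInterfaceId"]

# Attach it to the already-running, Fargate-owned instance as the secondary
# interface (device index 1); eth0 stays in the Fargate VPC.
ec2.attach_network_interface(
    NetworkInterfaceId=eni_id,
    InstanceId="i-0123456789abcdef0",     # placeholder
    DeviceIndex=1,
)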
Now let's take a look at our control plane. In the spirit of microservices, and of not building a single giant monolithic service, we actually have a few different control plane services that are each responsible for a distinct piece of the system.

Starting with the front-end service: this is the entry-point service, so the public endpoint you hit when you call RunTask is hosted by this service, and it performs some pretty standard front-door gatekeeping activities; it enforces IAM authentication and authorization, it enforces all our limits, and so on. Then we have the cluster manager service. This is our big back-end service, the one that keeps track of your clusters and the tasks that are running in them, and it's also the control-plane-side service that communicates with the data plane, so it's the other end of that communication channel. And then we have a capacity manager service, which is responsible for instances: it keeps track of both the unused, available instances in our warm pool and the instances that have been checked out and are actually hosting tasks. In the RunTask workflow this is the service responsible for picking a specific instance for your task; we have to pick an instance of the right instance type and size based on your task requirements, with the latest software installed, and this is the service that does that. And then, as we use up instances from our warm pool, this is also the service responsible for replenishing that capacity.

Now let's put all of this together and take a look at that call flow one level deeper. RunTask comes in and hits the front-end service; we perform auth and we enforce limits. The front end forwards the call to the cluster manager service, which keeps state about clusters and tasks, so it adds a record in its database for your new task and marks it as pending. The cluster manager service then calls out to the capacity manager service to actually acquire capacity for this task. The capacity manager service inspects the pool of available instances, picks a specific instance, records that the instance is now reserved for this task, so that other requests don't end up on that same instance, and returns the capacity ID back to the cluster manager service, where it gets recorded in the database against the task, and the call returns back to you with the pending task. Asynchronously, the capacity manager service reaches down to the instance that was selected, through that Fargate ENI, and activates the Fargate agent; we keep the Fargate agent in a kind of suspended state while it's in the warm pool, because it's not in use, and it's at this point that the agent actually gets activated. The Fargate agent then registers with the cluster manager service. The cluster manager service is expecting this instance to register, so as soon as it sees that, it sends down the task definition and any other information the agent needs to launch the containers. The agent makes the requisite Docker calls to actually spin up your containers and then reports the task as running back up to the control plane, at which point we flip the state in the control plane. And that's basically it; that's all the magic that happens under the hood when you make a RunTask API call.
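Here is a purely illustrative sketch of that "pick an instance and reserve it" step. This is not AWS's actual implementation, just a way to picture the bookkeeping the capacity manager was described as doing; all names and fields are invented for the example.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WarmInstance:
    instance_id: str
    vcpu: float        # vCPU the instance type provides
    memory_mib: int    # memory the instance type provides
    software_rev: int  # agent/OS revision baked into the instance
    reserved: bool = False

def reserve_capacity(pool: List[WarmInstance], task_vcpu: float,
                     task_memory_mib: int, latest_rev: int) -> Optional[str]:
    # Only unused instances on the latest software that fit the task qualify.
    candidates = [i for i in pool
                  if not i.reserved
                  and i.software_rev == latest_rev
                  and i.vcpu >= task_vcpu
                  and i.memory_mib >= task_memory_mib]
    if not candidates:
        return None  # would trigger warm-pool replenishment and a retry
    # Prefer the tightest fit to limit waste on the chosen instance.
    chosen = min(candidates, key=lambda i: (i.vcpu, i.memory_mib))
    chosen.reserved = True  # record the reservation so no other request lands here
    return chosen.instance_id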
Now, in the next section we have a little surprise for you: we're going to talk about EKS on Fargate. How many of you heard the announcement from Andy Jassy on Tuesday? That's right, most of you, I think. We thought this talk wouldn't be complete if we didn't at least touch on how we built EKS on Fargate and how EKS fits into this architecture picture. For those of you who missed the announcement, the gist of it is that so far you could use the ECS RunTask API to run serverless tasks on Fargate, and that will continue to be the AWS-native API to access Fargate, but now, with this launch, you can also use the Kubernetes APIs on EKS to run serverless pods on Fargate.

So what does the EKS on Fargate architecture look like? This is just a sneak peek. We still have a control plane and a data plane, except in this case the control plane is actually your EKS cluster running the Kubernetes control plane under the hood, and the data plane, as before, is just a bunch of EC2 instances running in a Fargate VPC, with the requisite Fargate and Kubernetes agents installed on them.

When you create a pod, it hits the Kubernetes API server, which is a standard component in Kubernetes. We've introduced the concept of an EKS Fargate profile, and the profile gives us some additional information that we need to launch your pod on Fargate. One of the things you can specify in the EKS Fargate profile is a set of rules, based on namespaces and labels, that tell us whether we should route a pod to Fargate or route it to EC2 instances running in your account, as before. The reason we did this is that we didn't want to introduce a new field on the pod itself saying "run this pod on Fargate", because that would mean you'd have to edit all your existing pod specs out there and maintain separate pod specs for EC2 versus Fargate. We wanted a more dynamic way to figure out whether a pod should run on Fargate or EC2, so what we do is look at all the profiles you've registered with EKS and look at your incoming pod spec, and if your pod is being launched in a namespace and has the labels that match a Fargate profile, it gets routed to Fargate; otherwise it goes to EC2 instances in your account.

This matching logic is carried out in a new webhook that we wrote and now install in your EKS clusters. The webhook is automatically invoked by the Kubernetes framework every time a pod is created, and the outcome is that we set the scheduler name on the pod to indicate our decision. This mutated pod spec then gets persisted in etcd as normal, and asynchronously, based on the scheduler name that was set on the pod, it either gets picked up by the default scheduler, in which case it will be scheduled onto your EC2 instances, or it gets picked up by a new Fargate scheduler that we've written and that now also runs in your EKS cluster. The Fargate scheduler owns all the interactions with the Fargate data plane, so it calls out to our data plane to acquire capacity and sends down the pod spec; the agent on the instance brings up the containers in your pod and reports it back as running, up to the control plane. So that was a high-level sneak peek at the EKS on Fargate architecture; this topic could probably use its own under-the-hood talk someday.
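The namespace-and-label matching described above can be pictured with a small sketch. This is not the actual webhook code; the profile shape and the scheduler names are simplifications for illustration.

def matches(selector, namespace, labels):
    # A selector matches when the namespace is equal and every selector label
    # is present on the pod with the same value.
    return (selector["namespace"] == namespace and
            all(labels.get(k) == v for k, v in selector.get("labels", {}).items()))

def choose_scheduler(pod, fargate_profiles):
    ns = pod["metadata"]["namespace"]
    labels = pod["metadata"].get("labels", {})
    for profile in fargate_profiles:
        if any(matches(s, ns, labels) for s in profile["selectors"]):
            return "fargate-scheduler"   # route the pod to Fargate
    return "default-scheduler"           # leave it for EC2 worker nodes

pod = {"metadata": {"namespace": "prod", "labels": {"app": "web"}}}
profiles = [{"selectors": [{"namespace": "prod", "labels": {"app": "web"}}]}]
print(choose_scheduler(pod, profiles))   # fargate-scheduler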
But for now, let's go back to the ECS Fargate architecture and revisit those tenets to see how they've influenced the design. We've switched the order of the tenets a little bit just for presentation purposes, so I'm going to start with availability, and Onur will cover security and resource utilization.

Availability. As most of you know, AWS services are offered in a bunch of different geographic regions, and Fargate is no different. When it comes to availability, the first order of business is making sure that these different regions function as independent failure domains, minimizing any chance of correlated failures between regions. So how do we do that? We run a separate stack, end to end across the data plane and control plane, for every single region, and these stacks don't talk to each other; they don't know about each other, to the point where even the physical servers underneath the services and the data plane fleet live in data centers local to that geographic region. So if there is some kind of infrastructure failure in one region, we can be pretty sure it won't affect the stacks in the other regions. Software-deployment-wise, we're also very careful and deliberate about the cadence with which we push a change through the regions. Typically we'll start with a single region, deploy it out there, give it a healthy amount of bake time, watch our services and metrics and make sure everything's healthy before we move on to subsequent regions, and as we deploy to more and more regions with no problems, we gain confidence in the change and can start to speed up the deployment.

Next up is Availability Zones. Availability Zones are meant to be sub-failure-domains within a region, and again they're designed from the ground up, from the data centers, to have independent fault characteristics. We strongly recommend that customers take advantage of this by spreading applications across Availability Zones, and the recommendation is no different for Fargate: we recommend that you create subnets in the different Availability Zones within a region and pass those subnets into the RunTask API. The default behavior of Fargate under the hood is that we will spread all your tasks for a given task definition or service evenly across the Availability Zones for you, and the idea is that if one Availability Zone is experiencing problems, some percentage of your application is still up and taking traffic.
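As a purely illustrative sketch of that default zone-spread behavior (not the actual placement logic, and with invented zone and subnet names), the idea is simply to place each new task in the least-loaded of the zones you passed subnets for:

from collections import Counter

def pick_zone(subnets_by_zone, existing_task_zones):
    # Candidate zones are only those the caller passed subnets for; choose the
    # one currently hosting the fewest tasks of this service.
    counts = Counter(existing_task_zones)
    return min(subnets_by_zone, key=lambda az: counts.get(az, 0))

subnets_by_zone = {"us-east-1a": "subnet-aaaa1111", "us-east-1b": "subnet-bbbb2222"}
running = ["us-east-1a", "us-east-1a", "us-east-1b"]
print(pick_zone(subnets_by_zone, running))  # us-east-1b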
So that's what we ask you to do to design for high availability in your application. But what do we do to make sure these zones fail independently? So far we just have a single stack for the entire region, so let's go through each component in our stack to see how we can make it more resilient to single-zone failures.

The front-end service, like I said, hosts the public endpoint that you hit, and that public endpoint is a regional endpoint, so we can't really split the front-end service into zonal services. The cluster manager service keeps state about your clusters, and a cluster is a regional construct, because you can launch tasks from different Availability Zones into the same cluster; also, the zone-spread logic we just talked about lives in the cluster manager service, so we can't really split this service into zonal services either. They're logically regional services, but the physical servers underneath them are still striped across the different Availability Zones.

Now let's look at the back half of the architecture, the VPC with the EC2 instances. That has a slightly different story, because with EC2 instances there's nothing cross-zonal about them: an instance is launched in a zone, and everything about that instance, its network interface, its volumes, everything, lives within a zone. The capacity manager is a service that deals with instances, so it's also entirely zonal and doesn't perform any cross-zone functionality. So what we can do here is split this part of the stack into separate zonal services: we run a separate, dedicated VPC for every zone, with a dedicated capacity manager service maintaining state about those instances. The cluster manager service runs the zone-spread logic, picks the zone, and calls out to the capacity manager service in that zone. Now, if there is a problem in a single zone, the regional services may be operating at slightly lower capacity because they've lost a percentage of their fleet, but we make sure those services are scaled up enough to function without any disruption as far as customers are concerned, even in the face of failure. On the data plane side, we obviously may not be able to launch your tasks into the zone that's having problems, but the other zones keep functioning at 100% of their functionality and capacity.

Now let's zoom in one more level and take a look at the scaling story within one of these zonal VPCs. The thing about zones is that we don't actually control how large a zone can get, and by large I mean how many tasks, and thus instances, we need to run in a single VPC. That's because it depends entirely on the distribution of subnets we're getting in the RunTask API calls and the zones those subnets live in. Because we can't control how large a zone can get, we can't really rely on a single VPC being able to scale and handle that load. So what do we do about that? What if, instead of having a single VPC for the entire zone, we split it into multiple smaller VPCs? The thing about these VPCs is that we make them a fixed maximum size, a size that we've tried and tested and know a VPC can handle load-wise, and if we ever fill up the fixed-size VPCs we have, we just scale out and add more VPCs. This is the concept of cellular architecture, where you want fixed-size units whose scaling characteristics you understand and you scale them out horizontally, rather than having a single large unit that you try to scale up vertically forever. If you have zones that are smaller, you just have fewer cells in that zone, and in fact this happens quite a lot; our zones are pretty imbalanced, especially in regions where we've launched new zones later on. The added benefit of cellular architecture is that we now have a sub-zonal failure domain: if something goes wrong at the VPC level, only the instances running within that VPC are affected.

Putting all of that together, this is what the architecture ends up looking like, with multiple cells for the Fargate VPC. I also want to take a minute to go back to that cluster manager service. That's still a pretty big, important service in our stack, it's keeping state, it has a database, and it's a little bit of an eyesore that we still have a single box there for the entire region. So we applied cellular architecture to the service and its database as well: we run multiple copies of the cluster manager service, and each cellular cluster manager service is responsible for a subset of the clusters. Again, this was done to reduce blast radius. So that is basically what the entire Fargate stack looks like, and this is just for one region. Fargate is now available in all commercial regions, I think that's 22 regions, so we run this entire thing times 22 regions, times 60-something zones. I just want to make the point that we go through a lot of trouble, and take on a lot of extra work, just to make sure the service is operating reliably for all of you.
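One way to picture "each cell owns a subset of the clusters" is a simple, deterministic mapping from cluster to cell; the sketch below is illustrative only, with invented cell names, and is not how AWS actually assigns cells.

import hashlib

CELLS = ["cell-1", "cell-2", "cell-3"]  # a zone or region scales out by adding cells

def cell_for_cluster(cluster_arn: str) -> str:
    # Consistently map a cluster to one cell so its state lives in exactly
    # one cell's database, keeping the blast radius of a cell failure small.
    digest = hashlib.sha256(cluster_arn.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

print(cell_for_cluster("arn:aws:ecs:us-east-1:123456789012:cluster/demo-cluster"))

A real system would likely use consistent hashing or an explicit mapping table so that adding a new cell doesn't reshuffle existing clusters; plain modulo hashing here is just to show the idea.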
Now I want to introduce Firecracker and show how it can help with scalability. Like I said, Firecracker is a new virtualization technology that was developed and open sourced by Amazon, and it's custom built for containers and functions. So how is Firecracker different from other virtualization technologies out there? If you think of virtualization as a spectrum, with virtual machines on one end and containers on the other, there's always been a difficult trade-off between security and isolation properties on one side and startup time on the other. With traditional virtual machines you get a rock-solid, hardened isolation boundary between neighboring VMs, but you are booting a full virtual machine, and that can take a long time; with containers you get lightning-fast startup times, because they're basically just processes, but the isolation boundary between neighboring containers is not quite as trustworthy. With Firecracker we think we've hit the right sweet spot on that spectrum: we get strong isolation properties akin to traditional virtual machines, but we also get fast startup times like containers. Onur will tell you more about how that's possible, but for now let's see how this works. With Firecracker, you take a bare-metal server, you install Firecracker as the hypervisor, and then you can launch your container workloads inside what we call a microVM. As I mentioned, these microVMs have strong isolation properties, like regular virtual machines, so we can co-locate multiple microVMs running different tasks, across customers, even on a single bare-metal machine.

So how does that property help with scalability? In our previous EC2 instance model, we run these EC2 instances in a single-tenant manner, meaning we only ever put one task on an EC2 instance, and like I said earlier, we see tens of millions of task launches a week. What happens in the single-tenant EC2 model is that those tens of millions of task launches translate one-to-one into tens of millions of EC2 instance launches, and VMs are heavy; they're not designed to handle this kind of churn, so it puts a lot of pressure there. With Firecracker, we run EC2 bare-metal instances and run your task inside a Firecracker microVM, and as we said, we can put multiple microVMs, and multiple tasks, on a single bare-metal instance. Now, if we bombard this model with tens of millions of Fargate task launches, it turns into tens of millions of microVM launches, and microVMs are designed from the ground up to handle exactly this kind of container-like usage pattern, which is often short-lived and high-churn. We only have to launch an EC2 bare-metal instance when our existing fleet of bare-metal instances is full, so it's a much lower rate of calls to the EC2 API to launch bare-metal instances. We're basically funneling down our calls, and that has really helped reduce pressure; it makes for a much more tenable scalability story for us. Even if you remove all the tasks from this picture and just compare the instance density and footprint of the two models, single-tenant versus multi-tenant, you can see how many more instances the single-tenant model needs. And if we think of our fixed-size cell again, where we can only put a fixed number of EC2 instances, we get a lot more task mileage out of a single fixed-size cell with the bare-metal model than with the single-tenant EC2 model.
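A back-of-the-envelope calculation of that density argument, using assumed numbers (a bare-metal host with 96 vCPUs and 384 GiB of memory, roughly an m5.metal, and the smallest Fargate task size of 0.25 vCPU and 0.5 GB) and ignoring agent and guest OS overhead:

HOST_VCPU, HOST_MEM_GIB = 96, 384

def tasks_per_bare_metal_host(task_vcpu: float, task_mem_gib: float) -> int:
    # Whichever resource runs out first bounds the number of microVMs per host.
    return int(min(HOST_VCPU / task_vcpu, HOST_MEM_GIB / task_mem_gib))

small_tasks = tasks_per_bare_metal_host(0.25, 0.5)
print(small_tasks)  # hundreds of tasks on one host, versus one instance per task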
So that's all I had availability-wise, and now I will hand it over to Onur to cover the rest. [Applause]

Hello, my name is Onur Filiz and I'm a principal software engineer working on AWS Fargate. At AWS, security is always our first priority, so let's talk about how we achieve task security and isolation in Fargate. As Archana explained a few minutes ago, today your Fargate tasks run on our fleet of EC2 instances. On this slide I'd like to walk you through our data plane stack to see how we achieve task security and isolation. At the bottom of the stack we have the physical server and the hypervisor; the EC2 hypervisor ensures that instances running on the same physical server are isolated from each other using trusted hardware virtualization. Moving up, within our EC2 instance we run our guest OS, which is Amazon Linux 2, our Fargate agent and the container runtime. Your Fargate task, composed of one or more containers, runs on top of the stack. Now, the container isolation boundary is composed of abstractions like cgroups, namespaces and seccomp policies. Although they provide some level of isolation, at AWS we do not trust these to be secure enough for multi-tenancy, and therefore Fargate never co-locates two tasks on the same EC2 instance, even if they come from the same customer. Each instance runs one and only one task, and once the task completes, the instance is thrown away; it's never used again. What that means is that every time you call RunTask you get a fresh, trusted EC2 instance. This is how we provide our task-level isolation guarantee, and many of our customers, including financial institutions, have built their own multi-tenant-safe applications and platforms on top of Fargate.

Okay, so we've said that the EC2 instance boundary is trusted because of hardware virtualization, but the container boundary is not. As a defense-in-depth exercise, let's go through what would happen if a container were to break out of a task and try to reach other resources on the system. We said we will never co-locate two tasks on the same EC2 instance, so a task can't locally reach into another task, but what else can it reach? What other components do we have to secure in order to give you our task isolation guarantee? First, there is the guest OS, the Fargate agent and the container runtime. Although all of these components are reachable from that container, none of them contain any state other than the state of the locally running task, so even if those components are compromised, there is no information there for an attacker to use. But notice also that we have the Fargate ENI attached to the EC2 instance. Through that ENI, the task can reach into the Fargate VPC, which hosts our other EC2 instances and tasks, and it can also reach our control plane, which stores state about other tasks. That means we have to secure the Fargate VPC and the control plane and make sure they are multi-tenant safe.

So how do we achieve that? We use the tools and principles that are often recommended to you, our customers, as security best practices. For instance-to-instance traffic inside the Fargate VPC, we make sure all Fargate ENIs are associated with a security group that disallows any communication between instances. We also use VPC Flow Logs to monitor and make sure there is no suspicious traffic going on in the Fargate VPC. And for the instance-to-control-plane path, the permissions available to the Fargate agent are scoped down to only describe and mutate the state of the locally running task, so even if, through a combination of other attack vectors, an attacker gets access to a task ID, it still doesn't have any permission to make API calls to our control plane. It's all locked down.
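Expressed with the public EC2 APIs, the kinds of controls described above might look roughly like the sketch below: a security group with no inbound rules (security groups are default-deny for ingress, so instances in the group cannot accept connections from each other) plus VPC Flow Logs for monitoring. This is an illustration of the pattern, not Fargate's configuration; the VPC ID, group name, log group and role ARN are placeholders.

import boto3

ec2 = boto3.client("ec2")

# A security group with no inbound rules; it would then be associated with
# every Fargate ENI so instances cannot talk to one another.
sg_id = ec2.create_security_group(
    GroupName="fargate-instance-sg",
    Description="no instance-to-instance traffic",
    VpcId="vpc-0123456789abcdef0",          # placeholder
)["GroupId"]

# Capture metadata about all traffic in the VPC so it can be monitored
# for anything suspicious.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],  # placeholder
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="cloud-watch-logs",
    LogGroupName="fargate-vpc-flow-logs",
    DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs",  # placeholder
)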
Now here comes the exciting part: we'd like to show you the new Firecracker-based data plane. This is not in production yet, but it's coming soon. Before we do that, though, let's revisit Firecracker to see how it improves our security posture. We said that with Firecracker we can put multiple tasks on the same EC2 instance. Why is that? Firecracker is a virtual machine monitor based on KVM, and KVM is the same hypervisor that the EC2 Nitro platform is using today. With Firecracker we can run containers as microVMs and get the best of both worlds: just like traditional containers, Firecracker microVMs give you minimal overhead and fast startup times; unlike traditional containers, however, Firecracker microVMs also provide an additional layer of trusted security isolation through hardware virtualization.

So this is the new Firecracker-based Fargate data plane; let's take a closer look. Just like in our EC2-based data plane, at the very bottom we have the physical server, which is managed by EC2. Fargate tasks run on our fleet of EC2 bare-metal instances, and within the bare-metal instance we run our host OS, again Amazon Linux 2, and our multi-tenant Fargate agent. We also run our container runtime, containerd, and our plugin for it, firecracker-containerd; the firecracker-containerd plugin allows us to run containers as microVMs. We install the Firecracker VMM, which spins up the Firecracker microVMs, and within the Firecracker microVM is the guest OS. That guest OS is dedicated to your task; it's not shared with other tasks. The Fargate team owns and manages all of these layers, from the host OS on the bare-metal instance to the guest OS inside the Firecracker microVM, which means you don't have to care about the undifferentiated heavy lifting and you can focus on your applications. Your containers run within a safe and isolated environment in user space. As with our current EC2-based data plane, your tasks still get their own dedicated ENI, and that ENI is only accessible from your task; the Fargate ENI this time remains on the bare-metal instance.

Okay, now that we understand the layout of the Firecracker-based data plane, let's go through the same exercise we did a few minutes ago with the EC2 data plane and compare them. As we discussed earlier, the Firecracker microVM is hardware virtualized and offers a trusted isolation boundary. Firecracker not only isolates microVMs from each other, it also isolates microVMs from the underlying bare-metal instance. However, the task and container boundaries are still considered insecure. So, repeating the same exercise as before, let's examine the extent of the reach in the case of a container breakout. As you can see, the local guest environment is still within reach, but much of the Fargate data plane is now below the trusted line, protected again by hardened hardware virtualization. So with Firecracker we were able to reduce the attack surface exposed to tasks, and thus further simplify and strengthen our isolation story.
Now let's look at our third and final tenet: improving operational efficiency. Fargate offers a wide variety, about 50 different CPU and memory configurations, for your tasks; on this slide we're considering only three of them for simplicity. In our EC2-based data plane we run a variety of different EC2 instance types, in an effort to best match the task configurations we offer. However, EC2 instance configurations are a bit more coarse-grained than task configurations, which means there's always some waste inside the EC2 instance, which decreases our utilization. Also, as Archana mentioned earlier, behind the scenes we run a warm pool of EC2 instances; that's because most container workloads expect fast startup times, and having those instances ready and booted up helps with that. However, it also introduces a new type of inefficiency: some of those EC2 instances are now sitting idle before you call RunTask. Now consider the case where a burst of small-sized tasks fills up our warm pool buffers. With the next request, we can either choose to put that task on a larger instance type, which creates even more waste and therefore reduces our utilization, or we can choose to reject it, which would reduce our availability. Either way, it's not a good choice to have to make.

Let's revisit Firecracker to see how it helps with the utilization problem. In addition to the isolation benefits we've discussed, Firecracker provides two more attributes that are really attractive for serverless platforms like Fargate and Lambda. First, the microVMs spun up by Firecracker are not traditional VMs, because we made a deliberate choice to emulate a very minimal device model. Most cloud-native workloads do not care about traditional devices in the VM, so stripping them out gives us both very fast startup times, Firecracker microVMs can start running in a matter of milliseconds, and a reduced attack surface. This is super relevant for a service like Fargate, because it means we can launch a Firecracker microVM on demand, when you call RunTask. The second property that is very useful for Fargate is the flexibility and configurability of resources for Firecracker microVMs. With Firecracker we no longer have to be bound to static EC2 instance types and sizes; we can create a microVM with any CPU and memory size. So when you call RunTask and give us the exact CPU and memory you want for your containers, we can create a new Firecracker microVM just in time. We call this right-sizing. Right-sizing lets us eliminate on-instance waste, and it means that in our Firecracker data plane we no longer run a heterogeneous fleet of different EC2 instance types; rather, we run a homogeneous fleet of bare-metal instances, which reduces our operational complexity and increases utilization. As task launches flow in, we're able to configure and launch a right-sized microVM in milliseconds. This has allowed us to reduce on-instance and warm-pool waste, which in turn allowed us to drop prices earlier this year. Of course the packing isn't always perfect, so there is still some on-instance waste, but we're doing way, way better.
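To give a flavor of what right-sizing looks like at the Firecracker level, here is a minimal sketch against Firecracker's public HTTP API, which is served over a Unix socket; the /machine-config endpoint takes the exact vCPU count and memory size before the microVM boots. The socket path and the sizes are assumptions for the example, and Fargate's actual integration goes through firecracker-containerd and its own agents rather than raw API calls like this.

import http.client
import json
import socket

class FirecrackerAPI(http.client.HTTPConnection):
    """Minimal HTTP client for Firecracker's API served over a Unix socket."""
    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)

# Assumes a firecracker process is already listening on this socket.
api = FirecrackerAPI("/tmp/firecracker.sock")

# Right-size the microVM to the requested task dimensions before boot:
# no instance-type catalogue, just an exact vCPU count and memory size.
api.request(
    "PUT", "/machine-config",
    body=json.dumps({"vcpu_count": 2, "mem_size_mib": 4096}),
    headers={"Content-Type": "application/json"},
)
print(api.getresponse().status)  # 204 on success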
Another area of innovation enabled by Firecracker is the reduction of data plane overhead. With single-tenant EC2 instances, we were forced to run a copy of the Fargate agent next to each task. The Fargate agent is just another process that consumes CPU cycles and some memory; it's not much, but when you run a fleet as large as we do, it adds up. Also, because the Fargate agent needs to talk to the Fargate control plane, we had to attach a Fargate ENI to each instance, which creates some pressure on our VPCs. Comparing that to the multi-tenant bare-metal fleet, we now run only a single copy of the multi-tenant Fargate agent, on the bare-metal instance itself, which greatly reduces that waste. And now that we're down to one copy of the agent, we only need one Fargate ENI, and that ENI remains on the bare-metal instance.

So let's recap what we've discussed today. We've taken a look at what's under the hood of Fargate. Security of your workloads is always our primary tenet; it's the basis of everything else we do. Next comes availability and reliability: the Fargate team goes to great lengths to ensure that our services remain reliable and secure so that your tasks run safely. And finally, we will keep innovating to improve our efficiency; efficiency lets us provide you better performance using fewer resources, which reduces our costs, which means you get to enjoy more price reductions.

The AWS container services team is here at re:Invent. We have many related breakouts and sessions; this is a list of some of them. You can choose to attend them here or watch the videos later on YouTube. We also have a public roadmap on GitHub. That's where you can learn about upcoming features, file issues, create feature requests and ask questions to the actual team members who are building them; we want to hear your opinions. We hope you found the session useful. On behalf of the Fargate team, we'd like to thank you for your time, and now I think we'll have some time for questions. Thank you. [Applause]
Info
Channel: AWS Events
Views: 14,154
Rating: 4.9402986 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, CON423-R1, Containers, AWS Fargate
Id: Hr-zOaBGyEA
Length: 48min 50sec (2930 seconds)
Published: Mon Dec 09 2019