Deep Dive into firecracker-containerd - Mitch Beaumont (AWS)

Captions
That's right, can everybody hear me okay? Yes? Good, excellent. So, my name is Mitch Beaumont, I work for AWS; I'm part of the solutions architecture team and I'm based here in Sydney. I had the pleasure of talking at Container Camp last year down in Melbourne, where I talked about one of our open source projects, the VPC CNI plugin: a CNI plugin we'd written to help integrate Kubernetes more natively with VPCs. That's come a long way since I presented last year, and I'd encourage you to take a look at that project; I think we're pushing towards version two now, so there are some interesting enhancements there. But I thought I'd go one step better today and talk to you about two open source projects that we've been working on since the last time I came to Container Camp: Firecracker, and a project which has spun off of Firecracker called firecracker-containerd.

Let's start with some basics. A show of hands: who here is using containers in production? Okay, more than I thought, that's good, excellent. Some of this might seem like 101 stuff, but I do want to make sure that everyone's on a level playing field. So what is a container? Simply put, it's a mechanism for deploying and running software. It allows us to encapsulate a piece of software and all of its associated dependencies, which brings benefits in the realms of isolation and repeatability; by encapsulating software together with its dependencies, we address some of the challenges associated with dependency drift and configuration drift as well. Modern containers, the kind we use today through tools like Docker, are image-based, and through the use of layers we gain a lot of optimizations: minimal storage usage and minimal network I/O.
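The image-layer idea can be made a little more concrete with a toy model. This is purely an illustrative sketch (real engines use union filesystems such as overlayfs, not Python dicts): lower layers are read-only and shared, and each container's writes land in its own upper layer.

```python
# Toy model of union-filesystem layer lookup. Illustrative only, not
# a real overlayfs implementation: lower layers are read-only image
# layers; the "upper" layer receives copy-on-write modifications.

class LayeredFS:
    def __init__(self, lower_layers):
        # lower_layers: list of dicts (path -> contents), base image first
        self.lowers = lower_layers
        self.upper = {}           # per-container writable layer
        self.whiteouts = set()    # deletions recorded in the upper layer

    def read(self, path):
        if path in self.whiteouts:
            raise FileNotFoundError(path)
        if path in self.upper:
            return self.upper[path]
        for layer in reversed(self.lowers):   # topmost lower layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.whiteouts.discard(path)
        self.upper[path] = data   # copy-on-write: lowers stay untouched

    def delete(self, path):
        self.upper.pop(path, None)
        self.whiteouts.add(path)  # whiteout marker hides lower copies

base = {"/bin/sh": "shell-v1", "/etc/os-release": "distro"}
app = {"/app/run.py": "print('hi')"}
fs = LayeredFS([base, app])
fs.write("/etc/os-release", "patched")   # only the upper layer changes
```

Because the base layers are never mutated, any number of containers can share them, which is where the storage and network savings come from.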
Containers are also very flexible in how we use them: we can deploy them as a single container, or as multiple containers composed together to make a more complex distributed application. We've got services born out of that use case, things like Kubernetes and Amazon's Elastic Container Service, all designed to help solve the problem of managing and orchestrating these composed sets of containerized applications.

So how does one go about making a container? The legendary Liz Rice demoed this at Container Camp a few years ago, where she built a container from scratch, so I'm not even going to try to fill her shoes and give you a demonstration of how that actually happens. But I think it's important that we take a few steps into what a container is. Obviously, for everyone in the room here, it's magic, these things just happen, but there are some fundamental concepts sitting under the covers when we deploy a container. These are all Linux primitives, by the way, so if we're deploying a standard Linux-based container, we're using these primitives to construct it. We're using namespaces to give us visibility controls, isolating what the containerized process can actually see and touch: things like file systems and network stacks. We've got control groups, which help us limit the types of resources that our process can get access to: CPU, memory, disk I/O. Then we've got a concept called capabilities, again built into the Linux kernel, which allows us to provide more fine-grained access control to containers, restricting the types of calls that users within those containers can make; as opposed to just saying root or non-root, we can be a little more deliberate about that. We've got tools like seccomp
that allow us to limit the scope of the system calls that can be made to the underlying kernel, and then we've got Linux security modules, things like AppArmor and SELinux. And underneath all of that, as I alluded to in my previous slide, we have these union file systems that allow us to do things like deduplication of storage, and really let us optimize the usage of our infrastructure to deliver these containers.

So, containers versus VMs. It's an age-old battle: what is the difference? I'm sure everyone has some opinions around this; I thought I'd share a few of mine with you, and hopefully this will help set the scene for the rest of the session today. If we look at the container to start with: as those of you using containers will know, containers share a kernel. All of the components I've just talked about for running a container are actually built into the kernel. This makes containers very quick to start: I boot an operating system, my kernel's already ready, and I can start creating containers that leverage those existing kernel components to run on top of my operating system. Very lightweight, very low overhead, very quick to start. The flip side is virtual machines, and they have a slightly different proposition. They have a more defined isolation boundary, and I'll talk more about that very shortly. Virtual machines emulate, or virtualize, hardware; in order to use a virtual machine we have to actually install an operating system, so I need to put a kernel inside the virtual machine guest to run it. Each of those virtual machines is responsible for managing the initialization of that virtual hardware: the boot process, hardware initialization. Generally speaking, from experience, virtual machines carry a little bit more overhead because of these additional initialization steps.
So why would I use a VM over a container? That's a question I'm sure many of us spend many nights staring out into the sky trying to ponder and answer; I know I do. Well, it all comes down to kernels. I thought that was funny. Virtual machines have a single kernel per VM. A standard Linux container, and I'm not excluding Windows, but I'm talking specifically about Linux today, shares a Linux kernel with the other containers running on that same kernel. What this means is that those containers are sharing the entirety of the Linux kernel interface, and that's quite big; there's a lot going on there. It can be very hard to reason about the system calls being made and the data being exchanged and exposed through the proc and sys filesystems, which can make it quite challenging to reason about the security model we need to wrap around that. Virtual machines, as I mentioned, virtualize or emulate hardware, so they look and smell like hardware, and you can define these simple hardware interfaces. After many, many years of deploying operating systems on hardware, we understand what that looks like, so it's much easier for us to reason about the security posture of a virtual machine than about a container running on a shared kernel.

So what do I mean when I talk about isolation? I've mentioned isolation a couple of times now, so let's have a quick look at that. I took that photo, by the way; budding photographer. First and foremost, what I mean by isolation, when I think about it from a cloud provider's perspective, is that the most important thing for us is making sure we protect our customers, that we protect our tenants from other
tenants that are running in the same environment. That can mean protecting them from intentional malicious behaviour, or it could simply be protecting them from noisy neighbours and over-usage of resources; we don't want one neighbour to interfere with or impact the operations of another customer's application. We believe in this concept called defence in depth, and that is how we tackle all of our services. When we're building a service or a product, when we're building a project like Firecracker and the firecracker-containerd project, we analyze each of those services using a framework called STRIDE. It's pretty well known in security circles; it's an acronym that stands for spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege. We look at all of our applications and services using that framework to make sure we thoroughly understand the potential attack vectors and security risks that may exist.

We learn over time, and what we understand to be the case is that the Linux kernel is old. The primitives that exist within the Linux kernel aren't brand new, and they can be tweaked to improve isolation of the processes running there; a lot of the constructs I've just talked about help us solve those problems. But it still remains a single point of failure, if that's the right word to use, and slight misconfigurations in the way those features work, operate, and interact can lead to potential threats in the space of privilege escalation or information disclosure.

All of us in this room, I'm sure, have been using virtualization for some time now. Who's been using virtualization, by the way? Virtual machines, VMware? Yep, as I thought, most people have been using it. So we
understand how that works, and we can reason about the security models around virtualization. At AWS we've been using virtualization for a really long time, and we're generally happier not sharing a kernel between our workloads, because we think it gives us a much stronger security posture when our tenants and customers start to deploy their applications. On the flip side of that, or should I say along the same lines: we can build the most secure system in the world, and obviously we try very hard to do that, but we also need to build a system that is usable. I'm sure many people in this room have used cloud providers, and when we build these systems we need to make sure that you can use them, and we want to make sure that you can bring your own tools to the party as well. Docker, obviously, is paramount to most of those conversations; most people are familiar with Docker. As Docker and general container utilization evolved and matured over time, what we started to see was standards emerging. Organizations like the CNCF, which Scott was talking about earlier, play a big part in that, as does the Linux Foundation, and we've got standards like the OCI, the Open Container Initiative. The OCI defines a number of very interesting specifications in the container space. There are image standards, which define how the images I referred to earlier should be represented and transferred, so that you can actually instantiate a container from them. And it also defines a standard for runtimes: how do we actually run this container, how do we construct the environment the container runs in, and how do we do that in a standard, open way, so that we aren't necessarily locking anyone into a particular container runtime or
orchestrator. So things like the OCI are very important in that space, and when we're building any type of service, especially the ones I'm talking about today, we want to make sure that we have open interfaces, or compatibility with those open specifications, so that we can support other use cases.

Okay, so what is a container runtime? I'm going to frame most of my slides as questions, by the way, so I'm just questioning myself here. The standard Linux container runtime is the part that's responsible for actually assembling those Linux primitives. When I create a container, when I use kubectl or I use docker run, I'm using a container runtime of sorts to construct that Linux container for me, to assemble all of the pieces that give me the environment in which my process can be isolated and executed. One of the standards I mentioned earlier was the OCI standard for container runtimes, and that is largely based on work done by Docker, a lot of which was contributed to the Open Container Initiative by Docker. Those of you that have dived deep into Docker will be familiar with runc. runc is a binary that is essentially responsible for constructing the container as it runs on an operating system, and it represents the reference implementation of the OCI specification for building a container runtime. Other runtimes do exist, and that's really what I'm here to talk about today: one of the runtime implementations we've developed for using a micro virtual machine service like Firecracker. I'll talk a bit more about that shortly.
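To give a feel for what an OCI runtime bundle looks like, here's a sketch that builds a heavily trimmed config.json of the kind runc consumes. It covers only a tiny subset of the runtime spec, and the field values here are illustrative placeholders, not a complete or production configuration.

```python
import json

def make_oci_config(args, rootfs_path="rootfs", hostname="container"):
    """Build a minimal subset of an OCI runtime-spec config.json.
    Illustrative: a real config carries many more fields
    (mounts, capabilities, seccomp, cgroup settings, and so on)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),          # the containerized process
            "cwd": "/",
        },
        "root": {"path": rootfs_path, "readonly": False},
        "hostname": hostname,
        "linux": {
            # namespaces the runtime should create for the container
            "namespaces": [{"type": t} for t in
                           ("pid", "mount", "network", "uts", "ipc")],
        },
    }

config = make_oci_config(["/bin/sh", "-c", "echo hello"])
print(json.dumps(config, indent=2))
```

Given a directory containing this config.json plus a rootfs, any OCI-compliant runtime can start the container, which is exactly the interchangeability point being made above.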
So, containerd. Containerd, which Scott had up on a slide earlier as well, is pretty much the standard reference implementation for a container runtime, or something slightly higher-order than the container runtime itself: it's responsible for managing the lifecycle of the containers as they run within your environment. It was, again, contributed to the CNCF by Docker, so it's a project you can see by hopping along to the CNCF website; I can't remember whether it's graduated yet or not, but it's definitely part of the CNCF. One of the really interesting things for us at AWS, when we were thinking about how to build this integration between Firecracker and containers and which runtime to use, was that containerd is written in a very modular way. It's written so that you can integrate different types of capabilities: you can bring your own low-level runtimes, you can bring additional capabilities like snapshotting, which I'll talk about very shortly. It's very modular, so we can extend it, manipulate it, and change it in ways that work for us without modifying the core functionality it offers.

So let's take a quick look at what containerd looks like under the covers. Sorry if anyone's already seen this diagram today or yesterday in other sessions, but I'll quickly walk you through it. Each of these blocks represents one of the different modules I mentioned earlier. At the top level we've got the gRPC API, which is basically the standard way we interact with containerd. From an AWS and Firecracker perspective, what we're really interested in is the creation of the containers themselves and the bits that go into creating a container, so we're thinking about the file systems required to create the root filesystem for the container, and how we mount those and attach them to a virtual machine
within the context of Firecracker. So let's think about the content store. The content store is the part of containerd that's essentially responsible for storing all of the raw information that makes up your container's root filesystem. I have a container image, and when it's downloaded there's a tarball that contains a representation of a root filesystem; that tarball exists in its raw format within the content store. Then we've got the snapshotter component. This was one of the areas where we really had to start thinking creatively, and we'll see why very shortly, but essentially the snapshotter is responsible for extracting and converting the contents of those tarballs stored in the content store into a filesystem that is compliant with the OCI standards, a filesystem the container can actually use in order to run. It's this layer, inside the snapshotter module, where we implement things like the copy-on-write capability that gives us the optimized storage functionality; from an API perspective, each of the snapshots represents a full filesystem to containerd. Now, the snapshotter itself, going back to the modularization of containerd: the logic that sits behind it can be anything, we can implement whatever we like underneath there, and that's where we really wanted to inject our own capabilities and experience to solve a few problems that we saw. containerd does ship with some standard snapshotter functionality, but each option comes with different trade-offs, like storage and space optimization, and I'll talk a bit more about how we solved the snapshotting problem for our particular project. Then we've got the runtime piece at the bottom. This can use any OCI-compliant runtime that exists out there today; it defaults to runc, the binary I mentioned earlier, but it's plug-in based again, so we can bring our own runtime and connect it in there.
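What the snapshotter does at its core, turning content-store tarballs into a usable filesystem, can be sketched like this. This is a stripped-down illustration (no whiteout handling, no copy-on-write, no layer ordering), and the file contents are made up for the demo.

```python
import io
import pathlib
import tarfile
import tempfile

def unpack_layer(layer_tar_bytes, dest):
    """Extract one image-layer tarball into dest: roughly what a
    containerd snapshotter does when converting content-store blobs
    into a root filesystem (minus whiteouts, CoW, and mounts)."""
    with tarfile.open(fileobj=io.BytesIO(layer_tar_bytes)) as tf:
        tf.extractall(dest)  # real code must sanitize member paths first

# Build a fake single-file layer in memory for the demo.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    data = b"#!/bin/sh\necho hi\n"
    info = tarfile.TarInfo("bin/run.sh")
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

dest = tempfile.mkdtemp()
unpack_layer(buf.getvalue(), dest)
```

The interesting engineering, as described below, is in where that extracted filesystem lives and how it's presented to the container.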
And this is where we wanted to look at bringing Firecracker into the equation. So, I've talked about Firecracker a few times now, and I'm sure you're all dying to know, if you don't already: what is this Firecracker thing? I'll take a few seconds to quickly explain what Firecracker is, what its proposition is, and why it helps solve some of the challenges we were talking about a little earlier. It is a KVM-based virtual machine monitor, designed to instantiate and run micro virtual machines. Each of those micro virtual machines has a very limited device model and a very limited set of capabilities and functionality, and the reason we wanted to keep the device model and functionality so limited is that it makes it much easier for us to secure that surface area, that attack vector. It's based on KVM, as I already alluded to, and the diagram gives you a good indication of where it sits in the stack: we have our infrastructure, our hardware; we've got a Linux kernel sitting there with the kernel virtual machine module inside it; and then Firecracker sits on top of that and uses those KVM constructs to build a micro virtual machine, within which we can then essentially run any kind of user-level construct. But we need to be able to bring things like an operating system to the micro virtual machine, and we do that using image files: we can bring uncompressed kernel images, as long as they're in an ELF-compliant format, attach them to these micro virtual machines, and actually boot the virtual machine and have it running and operational. We have a simple API sitting at the front, a RESTful API, and this is where we configure the micro virtual machine itself.
So before we start the VM, we need to give it some instructions. We need to tell it what type of kernel to use, and we can essentially bring any kernel, with a few restrictions: it can't be a compressed kernel image, there are a few other little things, and we can't use initial RAM file systems. I can share some more details later if you want. We define the characteristics of the virtual machine through the API, we then start the virtual machine using the API, and we get this very small, lightweight virtual machine, inside which we can then run anything that's supported on the operating system running in there. I ran a session at the AWS Summit in Sydney a couple of months ago which went a bit deeper into Firecracker. I'm not going to do that today, but if you'd like to learn a bit more about Firecracker, hop along to the URL. It was a really cool session, because I got to put some assembly code up on the screen and talk through how assembly code plays into virtualization and how we solved that challenge; it's a kind of interesting video, if I do say so myself.

Okay, right. So now: how do we create micro VMs and then run containers inside of them? That's what we're really looking to solve today. We've got our micro virtual machine, and hopefully by now we're starting to understand why this hard shell around our container provides some additional benefits from a security and isolation perspective. But how do we actually get those containers to run inside these micro virtual machines? That's the question we asked ourselves when we set about building the containerd-Firecracker integration. This is one of the two main big diagrams I have for you today, just to walk you through what that process looks like.
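To make the API conversation from a moment ago concrete, here's a sketch of the sequence of REST calls a client sends over Firecracker's Unix-socket API. The endpoint paths and field names follow the published Firecracker API; the socket path, kernel, and rootfs filenames are placeholders, and transport and error handling are omitted.

```python
import json

API_SOCKET = "/tmp/firecracker.socket"   # placeholder socket path

def api_calls(kernel, rootfs, vcpus=1, mem_mib=128):
    """Return the (method, path, body) tuples a client would send
    to configure and then boot one micro VM."""
    return [
        ("PUT", "/machine-config",
         {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source",              # uncompressed ELF kernel image
         {"kernel_image_path": kernel,
          "boot_args": "console=ttyS0 reboot=k panic=1"}),
        ("PUT", "/drives/rootfs",
         {"drive_id": "rootfs", "path_on_host": rootfs,
          "is_root_device": True, "is_read_only": False}),
        ("PUT", "/actions",                  # boot the configured VM
         {"action_type": "InstanceStart"}),
    ]

for method, path, body in api_calls("vmlinux", "rootfs.ext4"):
    print(method, path, json.dumps(body))
```

In practice these requests go over HTTP on the Unix domain socket (for example with curl's --unix-socket flag); the point is that the whole VM definition is just a handful of small JSON documents.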
Some of the bits on here will look familiar from the containerd diagram I showed earlier. We had to build about four new components for this project in order to fulfil the purpose we were aiming for. We built a snapshotter component, which I'll talk more about shortly; we built something we're calling the control plugin; we built a runtime; and then we built an agent component. Now, the snapshotter component is a block-device-based snapshotter, and this will all become clear very shortly. It's essentially an out-of-process gRPC proxy plugin, which means it doesn't have to be compiled into containerd; it will run alongside the existing containerd binary without too much issue. Then we've got the control plugin. The control plugin essentially presents an API, and we use that API to manage the lifecycle of the virtual machine: we interact with it and say, create me a virtual machine, configure this, attach some volumes over here, and we configure properties such as the disk images we want to use as well. Now, at the minute, and this is going to change, that control plugin is actually compiled into containerd, so if you wanted to test this out today you would have to compile it in; it would be a special version of the containerd binary, as opposed to just a plugin-based version. We're working on that. Then we've got the proxy shim runtime component, which is essentially responsible for connecting containerd, which sits outside, into the virtual machine and the components running inside of it. And inside the virtual machine we have an agent. The agent that runs inside the virtual machine, and I'm going to use Scott's presentation from earlier to help visualize this, acts in much the same way that the kubelet does: it lives inside the virtual machine and essentially
instructs the operating system how to construct the container inside of that virtual machine. The agent launches runc inside of that virtual machine, and it's responsible for attaching the snapshots that exist, that we created using the snapshotter process earlier on. What we end up with inside that virtual machine is a container; it looks and smells and feels like a container, there's nothing different about it, except it's a container running inside a micro virtual machine, running on top of another virtual machine. Turtles all the way down.

So what is the block device snapshotter? The reason we have snapshotters, as I mentioned earlier, is that we use them to create the root filesystems from the container images that exist today. Now, in the world of the Firecracker and containerd projects, when we download these container images we're storing them on the host; we aren't storing them on the guest. So we need a way to expose those container images into the guest operating system, because that's where our container runs. We had to think that through, because obviously for AWS, and for every other cloud provider as far as I'm aware, security is job zero, the most important thing we do, and we couldn't figure out a secure way to share files between the guest and the host. So what we ended up doing is implementing a block device snapshotter. The way it works, essentially, is that we take the image from the content store, we unpack the tarball, we create a block device, and we write the entire contents of that tarball to the block device. That block device exists on the host, and we then mount that block device to the Firecracker virtual machine, and once the block device is presented within the Firecracker
virtual machine itself, the agent is able to pick up that block device and mount it into the container running inside, and then we get the root filesystem for our container in there. We actually wrote two different types of block device snapshotter. One of them we call the naive snapshotter, and it was really written as more of a proof of concept: it did full pre-copying, so every time we created an image we created a copy, and if we added anything to that image we needed to create another copy. Very inefficient in terms of the storage space it used, but it helped us prove a few points. We then evolved over to device-mapper-based snapshotting functionality. Device mapper is a Linux framework for essentially mapping physical devices to virtual devices, so we use device mapper to build these block devices and present them to our virtual machines. Because we're using the device mapper functionality, we were able to gain some benefits from a storage utilization perspective: device mapper uses thin-provisioned pools of storage, represents them as devices, and we can share blocks across multiple different containers, so we get the benefits of deduplication in that particular scenario. We've recently merged this into upstream containerd, and I've dropped a URL at the bottom if you want to go and look at the pull request and read through some of the banter back and forth around it; it's quite interesting. So that was the snapshotter. Then we've got the control plugin. The control plugin is essentially responsible for the VM-specific settings, so we're using the control plugin to direct what the virtual machine should look like: create a virtual machine with these properties, I want these devices connected to it, I want it to have this much CPU and this much memory.
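The storage win from thin provisioning can be illustrated with a toy model. Note the hedge: real device mapper shares blocks through snapshot metadata and ancestry, not content hashing; hashing is used here purely to make the deduplication effect visible in a few lines.

```python
import hashlib

class ThinPool:
    """Toy model of why thin-provisioned pools help: blocks shared
    between devices are stored once. Not real device mapper, just a
    content-addressed store keyed by block hash."""

    def __init__(self):
        self.blocks = {}          # hash -> block bytes, stored once

    def create_device(self, data, block_size=4):
        device = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            key = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(key, block)   # dedup happens here
            device.append(key)
        return device             # a device is a list of block references

pool = ThinPool()
base = pool.create_device(b"AAAABBBBCCCC")    # base image: 3 blocks
child = pool.create_device(b"AAAABBBBDDDD")   # differs in one block
```

Two devices, six logical blocks, but only four blocks actually stored, because the unchanged blocks are shared.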
Now, as I mentioned, this is compiled into containerd at the minute. The reason it's compiled in, and not a remote or dynamic plugin, is that we want it to be able to expose its API over the existing containerd API socket, as opposed to a different socket, which would cause a few challenges downstream for clients that need to connect to and interact with that service. In the future we may look to move this out, and we'll tackle that problem as we go through the evolution of the project. Then we've got the runtime. The runtime sits outside of the virtual machine, interfacing with the control plugin, and it's responsible for proxying instructions into the virtual machine itself. It talks to the agent within the virtual machine using a construct called vsock: each of the virtual machines has a vsock socket, and we pass instructions into the virtual machine via that socket. We're implementing this using the containerd runtime v2 API, not the OCI specification; there are a few reasons for that, and I won't dive too deeply into them because I'm very conscious of time, but essentially it provided a little bit more flexibility from a process invocation perspective. What about the agent? I mentioned the agent in my diagram a little earlier as well. The agent is the bit that sits inside my virtual machine, akin to the kubelet that Scott was talking about earlier, and it's responsible for invoking runc for me. It's going to call runc and tell runc to create my containerized environment using all the primitives I talked about at the beginning of the session, so it triggers the construction of things like namespaces and cgroups, within which the container itself will actually run.
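The proxying just described can be sketched with a toy channel. The real implementation talks over vsock and the containerd shim protocols; this sketch substitutes a local socketpair and a made-up length-prefixed message format just to show the shape of the shim-to-agent exchange.

```python
import socket
import struct

# Toy stand-in for the shim <-> agent channel. The real project uses
# vsock (AF_VSOCK) between host and guest; a socketpair lets the
# sketch run anywhere. Messages are length-prefixed byte strings.

def send_msg(sock, payload: bytes):
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(sock) -> bytes:
    header = sock.recv(4)
    (length,) = struct.unpack(">I", header)
    data = b""
    while len(data) < length:
        data += sock.recv(length - len(data))
    return data

shim, agent = socket.socketpair()
send_msg(shim, b'{"op": "create", "id": "task-1"}')
request = recv_msg(agent)        # agent side would now invoke runc
send_msg(agent, b'{"ok": true}')
reply = recv_msg(shim)
```

The length prefix matters because both vsock and TCP-like transports are byte streams with no built-in message boundaries.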
The reason we want to continue to use runc, and we haven't done anything too differently yet, is that we want these to look and feel and smell like the containers we all know and love today; see my earlier slide about building a seamless experience for customers who want to use the same types of tools and workflows they're already using. So when you create the container inside of the virtual machine, it will look and feel essentially like a container you're used to running on your own laptop, if you have Docker for Mac installed.

I've tried to diagram this a little to help make it clearer, and I understand this may not necessarily be the clearest of diagrams, so sorry about the eye chart, but it gives you an idea of the workflow we're going through when we actually instantiate a Firecracker virtual machine and run a container inside of it. We have an orchestrator; it invokes containerd. containerd downloads an image and passes it to the snapshotter; the snapshotter is unpacking that image and writing it to a block device. At this point the control plugin is also talking to Firecracker and saying, go and create me a virtual machine that looks like X, and whilst all of this is happening, Firecracker is building my virtual machine for me. The runtime then takes the block devices that have been created by the snapshotter and presents them to the Firecracker virtual machine that's now been created. Then, once the Firecracker virtual machine has started up and the kernel is loaded, the agent looks for those devices attached to the virtual machine, and it uses them, alongside the runc binary it invokes, to actually build the container, mounting the block devices that represent the root filesystem.
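The sequence just walked through can be summarized as a rough ordered sketch. The component responsibilities are paraphrased from the talk; the step descriptions are not the project's actual API names.

```python
# End-to-end launch flow for a container in a Firecracker micro VM,
# as an ordered list of (component, action) pairs. Descriptive only.

def launch_flow():
    return [
        ("orchestrator",   "asks containerd to run a container"),
        ("containerd",     "pulls the image into the content store"),
        ("snapshotter",    "unpacks layers onto a host block device"),
        ("control plugin", "tells Firecracker to create a micro VM"),
        ("runtime shim",   "attaches the block device to the micro VM"),
        ("agent (in VM)",  "finds the device and mounts the rootfs"),
        ("agent (in VM)",  "invokes runc to start the container"),
    ]

for who, what in launch_flow():
    print(f"{who}: {what}")
```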
And lo and behold, see the previous slide with the magic: we have a container running inside of a virtual machine, in much the same way your container would run in any other environment, with the exception of having a nice hard virtual-machine shell sitting around the outside of it.

So what is the current status of the project? So far everything is prototype, but it does work. I don't have a video to show you today, but it does work, and if you want to have a play with this, please go along and have a look at the project's GitHub page, which I'll show you very shortly. We're working on a few optimizations with regard to the communication between the runtime and the agent, because the vsock implementation we're currently using is experimental at the minute, so we have a few things to tweak, change, and manipulate around that as vsock itself matures. We've also got a few challenges we're trying to address around the mapping of block devices to Firecracker virtual machines. At the minute, because of the reduced device model and functionality we wanted on Firecracker virtual machines, we don't really have a way to dynamically attach storage volumes or block devices to those virtual machines, so we have to pre-allocate block devices to the virtual machines, so that they already exist and we can then redirect them to the block devices we create through the snapshotter process. All in all, though, these enhancements are leading us to a point where we'll start to be able to run multiple containers inside a single virtual machine, and that obviously opens the door to constructs like a pod, or a task if you're thinking about Amazon's Elastic
Container Service. So, multiple containers; and we also want to afford users of this the ability to run commands like docker exec, so that we can execute and run components inside these micro virtual machines in the same way you would on your own virtual machine.

The project is open source, and I'd encourage you to go and have a look at the URL there if you're interested in learning a bit more about it. The top level of that URL is the core project for Firecracker itself, and the sub-project is the firecracker-containerd integration we've built. We'd love to get your feedback. I'm conscious that I'm chewing into lunchtime now, so I won't take any questions, but I'm going to be here for the rest of the day, so if you'd like to come and talk to me about this, or any of the other work we're doing at AWS with regard to containers, I'd be happy to chat with you. I've also got my colleague Jason over there as well, so he's here to chat too if you'd like to talk about anything. Thank you very much for your time; sorry I overran by a few minutes.

[Applause]
Info
Channel: Container Camp
Keywords: container camp, container technology, containercamp, Containerd, firecracker, AWS, Firecracker Virtual Machine Monitor, Linux
Id: 2rwYZdVPN4g
Length: 32min 19sec (1939 seconds)
Published: Sun Apr 12 2020