Deep Dive into firecracker-containerd

Captions
Let me introduce myself: my name is Ajeet Singh Raina, I am a Docker Captain as well as a Docker community leader in Bengaluru, and I will be your host for the tracks today, which are a mix of Docker for developers, open source, transform, and the other tracks, plus the panels and discussions we are going to cover. We will start with the first track, which is the open source track, and we have our first speaker, who works as a senior software development engineer at AWS and is here to talk about the firecracker-containerd project. If you are new to Firecracker, it is an open source project announced by Amazon at the last re:Invent, which happened in 2018. The speaker helped build Amazon ECS and Fargate, and he is a contributor to the Docker and containerd projects. In this session he is going to deep dive into firecracker-containerd, so without any further delay let us welcome Samuel Karp onto the stage.

Hi everyone, my name is Sam. As I was introduced, I'm an engineer working at AWS, and I work on the team that is helping to build the firecracker-containerd project, an open source project aiming to make it easier to run containers with hypervisor-mediated isolation provided by the Firecracker VMM, a new virtual machine monitor announced by AWS at re:Invent.

We have a bit of an agenda for today. We're going to do a really brief overview of containers, just to set some context for what's interesting. Then we're going to dive into what containerd is and what is interesting from a container runtime perspective. We'll talk a little bit about the Firecracker virtual machine monitor itself, then about what it takes to turn that into something usable with containerd and usable for containers. We'll have a really short demo, then we'll talk about where the project is, and we'll have some time for Q&A.

I think most of you here at DockerCon are probably familiar with this already, but I do want to talk about containers just to set some context. "Containers" is a really broad term, but we're talking about two aspects: the usability of containers, what makes them interesting and why you want to use them, and also the underlying technologies that implement containers. From a usability perspective, containers are primarily a mechanism for distributing and running software with varying degrees of isolation and repeatability. Containers have grown really popular because of the kinds of things they've enabled, like repeatable deployments. Some of the problems you might have with deployments are drift across your different systems in terms of dependencies or configuration; containers, being an image-based deployment system with an isolated view of the system and of the filesystem, help get around that issue by letting you package all of your dependencies and configuration inside the container image itself, so you know the same versions of your dependencies are always going to be present. Containers are also efficient in storage and network transfer: Docker helped build this thing called layers, which give you a way to share some contents between different container images and only transfer deltas instead of the whole image every time you want to download it. There's also copy-on-write, so when you modify or launch a new container you can launch it really quickly without having to copy the whole filesystem.
Containers are also pretty flexible. You can have single-purpose containers and then compose them together to build applications out of a combination of containers instead of packaging everything into a single container: you can have a separate filesystem but share things like the network, so your different containers can communicate over localhost or things like the Unix IPC subsystem. Containers make it easy to automate your deployments; because of this repeatability you can rely on containers being deployed correctly. The separation of purposes lets you have an application composed out of multiple containers, so you can have a web server in one container, your application code in another, and some monitoring infrastructure in a third. This composability is both driven by and has led to some of the popularity of container orchestration systems like Kubernetes or Amazon ECS, which make it easier to deploy and manage containers in production systems with large numbers of containers and large numbers of hosts. These multi-container workloads end up being the basic unit of deployment, in the form of Kubernetes pods and ECS tasks.

Containers on Linux are made out of a number of different primitives in the Linux kernel. There's enough here for a talk on its own, but I do want to touch on them briefly because they make up a lot of the flexibility of containers. Namespaces are a mechanism to control visibility and provide separation of things like networks, process IDs, and filesystems. Control groups, as they're commonly used in containers, help you limit the quantity of resources that are used, like the amount of memory or CPU a given container can access. Capabilities provide somewhat finer-grained control over permissions than just a single root or non-root user. seccomp is a mechanism for limiting the allowable syscalls that a process or a container can make. Linux security modules like SELinux and AppArmor provide ways to restrict access to resources like files. And finally, union filesystems are the mechanism that image layers are built on.
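To make those primitives concrete, here is a minimal sketch (mine, not from the talk) of how they show up in an OCI runtime spec, using the opencontainers runtime-spec Go types. A runtime such as runc consumes a document like this to decide which namespaces to create and which cgroup limits and capabilities to apply; the memory limit and capability shown are arbitrary examples.

```go
package main

import (
	"encoding/json"
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	memLimit := int64(256 * 1024 * 1024) // 256 MiB cgroup memory limit

	spec := specs.Spec{
		Version: specs.Version,
		Process: &specs.Process{
			Args: []string{"/bin/sh"},
			Cwd:  "/",
			// Capabilities are a finer-grained alternative to full root.
			Capabilities: &specs.LinuxCapabilities{
				Bounding: []string{"CAP_NET_BIND_SERVICE"},
			},
		},
		Root: &specs.Root{Path: "rootfs"}, // the unpacked image filesystem
		Linux: &specs.Linux{
			// Namespaces control visibility of PIDs, mounts, network, etc.
			Namespaces: []specs.LinuxNamespace{
				{Type: specs.PIDNamespace},
				{Type: specs.MountNamespace},
				{Type: specs.NetworkNamespace},
				{Type: specs.UTSNamespace},
				{Type: specs.IPCNamespace},
			},
			// Control groups bound the resources the container may use.
			Resources: &specs.LinuxResources{
				Memory: &specs.LinuxMemory{Limit: &memLimit},
			},
		},
	}

	// Roughly what ends up as a config.json for runc.
	json.NewEncoder(os.Stdout).Encode(&spec)
}
```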
Containers share a Linux kernel. The technologies that make up a container are really flexible, and the fact that these technologies are built into Linux makes containers run really quickly with very little overhead. Virtual machines are a little bit more isolated: they virtualize or emulate hardware components, and you need to run a full operating system on top of the VM in order to use it. They also need to take care of the boot process, including things like initializing the hardware components, so VMs typically have a bit more overhead than containers.

So if VMs have more overhead, why would you want them? Generally, I think they're easier to reason about. Virtual machines have a single Linux kernel per virtual machine, and they only have access to that individual machine. Containers, on the other hand, share a kernel, and that means you're sharing the entirety of the Linux kernel interface, which is pretty large: you have all sorts of different syscalls, you have data exposed in the proc filesystem and the sys filesystem, and it can be challenging to reason about all of the interactions between those syscalls and the data that's exposed. VMs, on the other hand, look like hardware, and when you're creating a virtual machine you have the option of creating something with very simple hardware, with very straightforward interactions and well-known hardware interfaces. That makes it easier to reason about from a security and isolation perspective, and a great place to establish trust and resource boundaries, which makes it good for multi-tenant workloads like what we do in the cloud, where we run workloads belonging to different customers on the same physical hardware. It also makes it easier to run workloads where you don't really trust some of the software, for example use cases where you're transcoding user-submitted content like videos or audio or images with tools that have had a bit of a spottier vulnerability history.

So what do I mean by isolation? At AWS we believe that our first responsibility is security of the infrastructure, protecting against infrastructure-level attacks. We have what we call a shared security model, where we take responsibility for the underlying infrastructure and customers are responsible for the security of their own applications. That means we need to protect customers from other tenants running on the same hardware, whether those are customers that are intentionally malicious and trying to attack other customers, or things like the noisy-neighbor problem, where we don't want someone who's using a lot of resources to affect you. AWS believes pretty strongly in defense in depth, which means we have different mechanisms to protect against multiple different kinds of threats. We use a classification system called STRIDE for thinking about different kinds of threats; STRIDE stands for spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege. When we build an AWS service we construct a threat model that covers threats in all of these categories, and we come up with mitigations to try to deal with those threats.

The Linux primitives that make up containers are not very new, and you can use them to increase isolation or enforce separation of processes: you can use systems like seccomp to limit syscalls, and you can use Linux security modules. But because containers share a kernel, the kernel is a single point of failure; it wasn't necessarily designed with these use cases in mind, and the flexibility of configuration that containers have makes it easy to accidentally expose threats like tampering, information disclosure, and escalation of privilege. In the cloud we've relied on hypervisors and virtualization for a pretty long time. We believe that hypervisors protect well against the kinds of threats that containers can expose, and we're generally happier relying on a mechanism that doesn't involve sharing a Linux kernel between different customers.

But isolation isn't really enough on its own. We want to meet our customers where they are and enable them to run all of the different kinds of workloads they want to run, and over the past few years Linux containers have grown increasingly popular, due in large part to the Docker toolchain and the easy user experience it's provided; I think the fact that you're here at DockerCon probably means you agree with me on that. Docker's immutable images, its easy way to save and pass around those images, and its easy command-line interface have really helped drive this popularity.
Container orchestrators like Amazon ECS, Kubernetes, and Mesos have also grown in popularity as a means to run and manage large numbers of containers over lots of machines, and orchestrators like these have made it easy to deploy multiple containers and compose them together as a single unit, like an ECS task or a Kubernetes pod.

As Docker has evolved and the container landscape has matured, some standards have emerged. One of the bodies responsible for these standards is the Open Containers Initiative (OCI), part of the Linux Foundation, whose members include the maintainers of Docker and various other parties interested in improving the container ecosystem. OCI has established a few standards that are relevant to container users and particularly relevant to the firecracker-containerd project we're talking about today. These include the image standard, which defines how a container filesystem should be represented and transferred, and the runtime standard, which is meant to make it easier to have alternative ways to run containers. Kubernetes has also established an interface for runtimes called the CRI, the Container Runtime Interface; runtimes compliant with CRI can be easily used with Kubernetes.

That brings us to container runtimes, the software that hopefully makes it easy to run your workloads inside containers. When you run containers you probably interact with a few different parts of the stack, and I've tried to break that down a little bit up here. You might interact with a cluster orchestrator like Amazon ECS or Kubernetes to coordinate containers across different hosts and handle application semantics like exposing services through load balancers, monitoring, scaling, all those things. You may interact with a local orchestrator like Docker for development purposes, and many of the container orchestrators you use end up using Docker to do local orchestration on the host as well; Docker handles things like local restart policies, health checks, local network traffic, I/O, and collecting logs. Below the orchestrator might be a management component that deals specifically with the lifecycle of containers, like containerd. containerd was a component that was broken out of Docker and factored into a separate project, and now Docker uses containerd for local lifecycle management of containers; most people might not interact at this layer. Below that, finally, is the container runtime, like runc, which is responsible for making the container actually work. Today we're mostly going to talk about these bottom two layers: local management with containerd and the runtime.

For standard Linux containers, runtimes are responsible for setting up the Linux primitives that make up containers, like cgroups and namespaces, and the filesystem from the container image. The Open Containers Initiative has a standard for container runtimes, largely based on the work that Docker did; runc was actually a contribution from Docker to the OCI, and it ends up being the reference or standard implementation of the OCI runtime. But as a standard, the intent is to make runtimes swappable, so that other implementations can exist, like the one we're writing for Firecracker.

containerd is a slightly higher-level tool that builds management capability for containers: it helps you control the lifecycle, manage container images, handle I/O, and so forth. containerd was donated by Docker to the Cloud Native Computing Foundation.
It's written in a very modular way, and that makes it attractive as a platform to build Firecracker support for containers. So let's talk a little bit about containerd's modularity. containerd is broken up into a bunch of pieces which are intended to be independently usable but also to complement each other well when used together. As a user of containerd, or an implementer of an orchestrator like Kubernetes, the primary mechanism for interacting with it is containerd's gRPC API, or an adapter layer like containerd's Go client. This diagram shows some of the components we're interested in.

First, the content store. This is where containerd stores raw content, which is usually compressed image layers. When an image is pulled with containerd it ends up in the content store; the layers of the image are typically stored as compressed tarballs that are not extracted into a filesystem in this component.

Next we have snapshotters. This is the component responsible for materializing the raw image content, those compressed tarballs in the content store, into a filesystem you'd see in a container. The copy-on-write functionality you'll see in systems like Docker is implemented at this layer; if you've ever used Docker, you've seen its storage or graph drivers, and snapshotters are basically equivalent functionality with some slightly different implementation choices. From an API perspective, each snapshot in a snapshotter represents a full filesystem: when you extract an image into a snapshotter you end up with a snapshot containing the full image contents, but the implementation inside the snapshotter can do whatever it wants, as long as what's exposed at the end is a filesystem that is the combination of its layers. containerd has a few built-in snapshotter implementations that offer different trade-offs in things like space or storage efficiency.

The next thing I want to talk about is the runtime. This is where containerd dispatches to a lower-level runtime like runc. The standard one that you'll use is runc, the reference implementation of the OCI standard; runc implements containers using those Linux primitives we already talked about. But containerd can use any runtime that either adheres to the OCI standard or adheres to a separate v2 runtime interface that the containerd project has created, which is intended to be a little more flexible at the expense of some additional complexity. containerd also has a plugin system, which means different implementations of these parts of containerd can be loaded as plugins; for example, there are multiple snapshotters, and those can either be compiled in or loaded at runtime.
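As a rough illustration of how a client exercises those pieces, here is a hedged sketch using containerd's Go client (the socket path, namespace, and image reference are just examples): Pull writes compressed layers into the content store, unpacking goes through a snapshotter, and the container created at the end is what later gets dispatched to a runtime.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	// Connect to the containerd gRPC API over its Unix socket.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Pull writes compressed layers into the content store; WithPullUnpack
	// also asks a snapshotter to materialize them into a filesystem.
	image, err := client.Pull(ctx, "docker.io/library/alpine:latest", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}

	// NewContainer records metadata and a writable snapshot; the task created
	// from it later is what actually dispatches to a runtime such as runc.
	container, err := client.NewContainer(ctx, "demo",
		containerd.WithNewSnapshot("demo-rootfs", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	log.Printf("created container %s", container.ID())
}
```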
Next I want to talk a little bit about Firecracker itself. Firecracker is a new virtual machine monitor that was recently open sourced by AWS. It was built specifically for some of our internal infrastructure, like AWS Lambda and AWS Fargate, as a very small, fast, purpose-built VMM for running function- and container-like workloads. Having a really small target use case means the implementation has the freedom to cut out a lot of things you'd see in a more general-purpose VMM, things like a floppy drive, a full keyboard, or even a PCI bus. Cutting out unused components makes it easier to optimize both for speed and for security, and those are the goals of Firecracker: it is designed to be secure and efficient above pretty much all other concerns.

Firecracker attempts to optimize for security by limiting the scope of what Firecracker can do, in terms of the features it has, the device model it exposes to a guest virtual machine, and even the permissions the VMM itself has to make calls into Linux. It focuses on reducing the ability of the guest VM to interact with the host kernel by implementing some of the functions that could otherwise be provided by the host kernel's KVM subsystem in the VMM itself. It also limits the ability of one guest to interact with another guest by sticking to a single virtual machine per VMM, a single-process model. And Firecracker is written in a memory-safe programming language, Rust, to protect against common programming errors like use-after-free and buffer overflows.

Firecracker also aims for efficiency in terms of guest boot time and the overhead of the VM itself. The developers of Firecracker measure this as part of their continuous integration system, so they have targets for the boot time, the amount of time it takes to go from launching the Firecracker VMM to the guest VM having user space executing, as well as how much memory and how much CPU is consumed in the process. And Firecracker exposes an API to drive all of its interactions, making it pretty programmable.
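To give a feel for how programmable that is, here is a sketch of driving a Firecracker process through its API socket, as I understand the API; the file paths, sizes, and socket location are placeholders, not anything from the talk.

```go
package main

import (
	"bytes"
	"context"
	"log"
	"net"
	"net/http"
)

// put sends one JSON document to the Firecracker API server listening on a
// Unix socket; the host part of the URL is ignored by the custom dialer.
func put(client *http.Client, path, body string) {
	req, err := http.NewRequest(http.MethodPut, "http://localhost"+path, bytes.NewBufferString(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	log.Printf("PUT %s -> %s", path, resp.Status)
}

func main() {
	socket := "/tmp/firecracker.sock" // passed to the firecracker process with --api-sock
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}

	// Size the machine, point it at an uncompressed ELF kernel and a root
	// block device, then ask the VMM to boot the guest.
	put(client, "/machine-config", `{"vcpu_count": 1, "mem_size_mib": 128}`)
	put(client, "/boot-source", `{"kernel_image_path": "/var/lib/fc/vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1"}`)
	put(client, "/drives/rootfs", `{"drive_id": "rootfs", "path_on_host": "/var/lib/fc/rootfs.ext4", "is_root_device": true, "is_read_only": false}`)
	put(client, "/actions", `{"action_type": "InstanceStart"}`)
}
```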
So for firecracker-containerd we're trying to run containers with Firecracker, and we do have some goals for the project. The first is that we want to be compatible with the things people are already doing with containers and enable them to use Firecracker without major changes to either their software or their workflows. This means we want to support people's existing Docker or OCI container images. We also want the resulting containers to look a lot like what you'd expect when you weren't using Firecracker, in terms of the kernel surfaces exposed inside the container; by that I mean we still want to use cgroups and namespaces and mount the proc filesystem and the sys filesystem, with the goal that existing monitoring and debugging tools can still be used. We want to support composable applications, containers that share namespaces or volumes like you'd see in a Kubernetes pod or an ECS task, and we'll do that by running multiple containers inside the same virtual machine. And since I mentioned orchestrators, we'd like to integrate with orchestrators like Kubernetes and ECS. We also want to minimize the additional overhead as much as we can, so there's not much extra latency in starting a container or extra CPU and memory consumed. From a security perspective, the main idea is to leverage all the work Firecracker is doing for us: this means hypervisor-based isolation, and it also means there are limits to accessing the underlying host. So we do have some competing goals, compatibility but also limited access to the underlying host, and we're still figuring out whether we believe there are safe ways to do things like exposing volumes from the host into the guest VM and subsequently into a container; we'll talk more about that later.

So how do we make Firecracker look like a container runtime? We need to adapt to the limited feature set supported by the VMM; those limited features are in the name of making it easier to reason about from a security perspective. We have container images that we store on the host in containerd's content store, and we need to expose those into the VM guest so we can actually run containers, but guest VMs can't share any filesystem with the host. That means we need a different approach for exposing the root filesystems, and we'll do so as block devices that we attach to the VM. Firecracker doesn't support hot device attachment, so we need to pre-allocate the number and kinds of devices we need, including the devices we use for presenting those filesystems. Containers are usually configured for networking with a virtual Ethernet, or veth, pair that takes traffic from one Linux network namespace and exposes it into another; Firecracker supports networking through a Linux tap device, so we plan to make that tap device easily usable with a networking plugin like CNI through a mirroring technique using tc, the Linux traffic-control subsystem. And our preferred way to communicate from the host to the guest is a mechanism called vsock; we need this in order to control the containers running inside the VM from outside the VM.

So what we ended up building is a block device snapshotter, a plugin to manage the VMM, a runtime to manage containers from the host, and an agent inside the VM guest to actually run the containers. This diagram tries to show how containerd and the Firecracker runtime create Firecracker microVMs and run containers inside them. The architecture consists of four main components: the snapshotter, the control plugin, the runtime, and the agent. The first is the snapshotter, which is responsible for exposing the filesystem to the container inside the microVM; we wrote a block-device-based snapshotter so that we're able to attach block devices to the microVM. The snapshotter currently runs as an out-of-process gRPC proxy plugin, an independent process that can be used with containerd without having to recompile containerd. The next component is a control plugin that we're writing to provide an API for managing the lifecycle of the virtual machine itself and linking containerd to the Firecracker virtual machine monitor. It's responsible for configuring the parameters for the virtual machine, and it uses a disk image that contains the VM root filesystem as well as the kernel for the guest. The control plugin is currently implemented as an in-process, compiled-in plugin, which means we do have to compile a slightly different version of containerd in order to use it. We also built a runtime that's responsible for linking containerd outside the VM to the components running inside the VM. The runtime primarily acts as a proxy and is responsible for passing data through; it's implemented as an out-of-process shim runtime communicating over a vsock device into the virtual machine. Last, we have an agent running inside the microVM which is responsible for mounting the snapshot images and invoking the containers, and it does that via runc in order to create standard Linux containers inside the microVM.
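As a rough sketch of what that host-to-guest channel looks like from the guest side (an illustration only, not the project's actual agent code; the port number is invented), a process inside the VM can listen on an AF_VSOCK socket much like a TCP socket:

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an AF_VSOCK stream socket inside the guest.
	fd, err := unix.Socket(unix.AF_VSOCK, unix.SOCK_STREAM, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	// VMADDR_CID_ANY accepts connections addressed to this guest's CID;
	// the port (10789 here) is an arbitrary example, not the real agent port.
	addr := &unix.SockaddrVM{CID: unix.VMADDR_CID_ANY, Port: 10789}
	if err := unix.Bind(fd, addr); err != nil {
		log.Fatal(err)
	}
	if err := unix.Listen(fd, 1); err != nil {
		log.Fatal(err)
	}

	for {
		conn, _, err := unix.Accept(fd)
		if err != nil {
			log.Fatal(err)
		}
		// A real agent would run an RPC server over this connection; this
		// sketch just writes a greeting and closes.
		unix.Write(conn, []byte("hello from the guest\n"))
		unix.Close(conn)
	}
}
```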
So let's talk about each one of these in a little more depth. First, the block device snapshotter. A snapshotter is containerd's way of creating a usable root filesystem from a container image. For firecracker-containerd, as I said before, we store the container image on the host, not inside the guest, so we need a way to expose the root filesystem into the guest VM. Firecracker does not have a mechanism to share filesystems between the guest VM and the host, pretty much for security purposes; we don't know of a way we're comfortable doing that safely. So we needed a different way to expose the container filesystem, based on an image, into the VM, and we do this by creating block devices that contain the filesystem and then attaching them to the VM with the Firecracker API. Inside the VM we then need to mount those block devices to make them usable to the containers running there.

There are a bunch of different ways to implement snapshotters and a bunch of different ways to do block devices, so we ended up writing two different ones. The first one we wrote is called the naive snapshotter, and it was really implemented as a proof of concept, a way for us to continue developing the rest of the system. It works very simply: it creates a flat file, we create a filesystem within that flat file, and then every time we want to make a new layer or a new container we have to copy the whole file and write the new content into it. That means we're doing copy-ahead instead of copy-on-write, and it means we have overhead when starting containers and when pulling images, in terms of both the amount of disk used and the time it takes to do the copy. The second one, and the one we're more interested in actually running in production, is a devicemapper-based snapshotter. It's somewhat similar to Docker's devicemapper storage (graph) driver: both rely on a thinly provisioned pool of storage and then expose that storage as devices. The thin pool allows us to share blocks between multiple devices to deduplicate storage, which means we actually get copy-on-write behavior instead of copy-ahead behavior, and it makes it a lot faster for us both to pull images and to start containers based on those images. We ultimately contributed the devicemapper snapshotter upstream to the containerd project, and we expect it to be released in the next version of containerd.

After that happened, two more implementations of block device snapshotters started to emerge from others in the container community. The first is an LVM-based snapshotter. It's somewhat similar to devicemapper but uses a different implementation, and it does look interesting because it has many of the same advantages the devicemapper snapshotter has, so we're going to continue to evaluate whether it's more interesting to us than the one we wrote. The last one up here is called raw block, and it's a little different: instead of using devices managed through LVM or devicemapper, it uses a filesystem feature called reflink. Reflink is a feature in some filesystems, like XFS and Btrfs, that allows copy-on-write of files where the blocks are shared between the copies: if you make a reflink to a file but nothing is modified, you don't duplicate any of the storage, and as soon as you modify things, only the modified blocks are duplicated. I don't think we're going to look too hard at the reflink snapshotter, because reflink isn't available everywhere and we want firecracker-containerd to be usable with filesystems like ext4.
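For context on what any of these snapshotters implements, here is a hedged sketch of the containerd snapshotter flow from a client's point of view (the snapshotter name and keys are examples, and the layer extraction itself is omitted): Prepare returns mounts for an active snapshot, a layer is unpacked into it, and Commit freezes it so it can serve as the parent of the next layer or as a container's root filesystem.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Ask for a specific snapshotter plugin by name, e.g. the devmapper one.
	sn := client.SnapshotService("devmapper")

	// Prepare an active snapshot; an empty parent means this is a base layer.
	mounts, err := sn.Prepare(ctx, "layer-1-active", "")
	if err != nil {
		log.Fatal(err)
	}

	// Mount it somewhere, unpack the layer tarball into it (omitted), unmount.
	target := "/tmp/unpack"
	if err := mount.All(mounts, target); err != nil {
		log.Fatal(err)
	}
	// ... extract the layer's tarball from the content store into target ...
	if err := mount.UnmountAll(target, 0); err != nil {
		log.Fatal(err)
	}

	// Commit turns the active snapshot into an immutable, named snapshot that
	// can be the parent for the next layer or a container's rootfs.
	if err := sn.Commit(ctx, "layer-1", "layer-1-active"); err != nil {
		log.Fatal(err)
	}
}
```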
Next I want to talk a little bit about the control plugin. We're starting to build this new Firecracker control plugin for containerd to help manage the VM-specific settings, like which kernel to use and what resources to allocate in terms of memory, CPU, devices, and so forth, along with holding information about how the VM itself should be configured. This is intended for the use case of running multiple containers inside the same VM: we can't really embed the lifecycle of the VM in containerd's existing container or task APIs, so this is a new API we're adding in order to have that as a separate object. We're making it specific to Firecracker right now, just so we can get the workflow working and proven, but we're also interested in talking with the containerd maintainers about what it could look like to have a more generic group-of-containers or sandbox API that would be native to containerd, and if that's something the containerd project is open to, we're interested in helping to contribute it. The control plugin is a compiled-in plugin right now, meaning it doesn't work with the standard containerd binaries, and only because that was the mechanism that made it easiest for us to add the API over containerd's existing socket without requiring clients to have a different way of connecting. We're looking at how we can change that in the future, either running it as an out-of-process proxy plugin or as a dynamically linked Go plugin, if that's something we can do.

Next I want to talk about the runtime. This is the component that runs outside the VM and is responsible for proxying instructions to the component that runs inside the VM, the agent. It's responsible for taking commands, and also the input and output streams for every container that runs, proxying events that happen, like a container starting or stopping, and proxying things like metrics from within the VM out to containerd. We implemented this with containerd's v2 runtime API instead of the OCI standard, as we felt it gave us a little more flexibility in how we can implement the runtime. The OCI standard, which is what runc implements, is essentially tied to a single-process-per-container model because it specifies a command-line interface for operating on containers. containerd's v2 API instead gives us the ability to choose how many processes we need, and that lets us make choices that simplify our architecture: we're implementing a model where we have a single runtime process per virtual machine, and it's responsible for all of the containers within that same VM.

Finally, I want to talk about the agent. The firecracker-containerd agent is responsible for managing the lifecycle of the containers that run inside the VM. It's the other half of the runtime: it receives commands from the runtime we wrote outside the VM over the vsock device, it handles the other half of proxying the I/O streams for each of the containers, and it actually sends the events and metrics. We use runc inside the VM to manage all of that, setting up the things containers would normally expect, like cgroups and namespaces, so that it still looks like a normal container when you're running inside. The agent is also responsible for mapping a given block device to a given container. Firecracker doesn't currently have a mechanism to expose which block device is which, so we're working on approaches that map each block device to the appropriate container and its filesystem; we need to do that because each block device contains the filesystem for a particular container.
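Putting the client-facing pieces together, creating a container that is meant to run inside a microVM is intended to look almost identical to the standard path. This is a sketch under my assumptions: the runtime name "aws.firecracker" and the snapshotter name "devmapper" are how I understand the project registers them, and the socket path is just an example; an actual firecracker-containerd setup may differ.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	// Example socket path; a firecracker-containerd setup may use its own
	// daemon socket and configuration.
	client, err := containerd.New("/run/firecracker-containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Unpack through the block-device (devicemapper) snapshotter so the
	// rootfs can be handed to the microVM as a block device.
	image, err := client.Pull(ctx, "docker.io/library/alpine:latest",
		containerd.WithPullUnpack,
		containerd.WithPullSnapshotter("devmapper"),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Selecting the Firecracker runtime is the only other visible change
	// from the standard runc path.
	container, err := client.NewContainer(ctx, "fc-demo",
		containerd.WithSnapshotter("devmapper"),
		containerd.WithNewSnapshot("fc-demo-rootfs", image),
		containerd.WithRuntime("aws.firecracker", nil),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created container %s with the firecracker runtime", container.ID())
}
```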
Now that we've talked about each of the components, let's take a look at an example of how they work together to run a set of containers inside a microVM. We'll start off with storage management, or what's involved in exposing a root filesystem for the container. The orchestrator or client program, which can be something like the Kubernetes CRI or ECS, starts off by making a request to containerd to create a snapshot from an image. containerd then passes this request along to a snapshotter implementation; the snapshotter allocates a snapshot and responds back with an identifier for that snapshot. containerd acts as a pass-through here, and you'll see that containerd acts as a pass-through for many of the API actions we're interested in.

Once we have all the snapshots created, which we need as block devices, we can launch a VM. The orchestrator makes a request to the control plugin that we've written, asking it to start the VM. The Firecracker VMM is designed to run a single virtual machine, so we need to start a new Firecracker process for every VM we want to run. Because Firecracker needs to know what devices to use up front, we need to include that information in this request. We pass in placeholder devices at this time, but an equal number of devices to the number of containers we want to run, because each of those devices is going to hold the root filesystem for one of the containers. Once the plugin has started a new Firecracker process and asked it to run a VM, it returns an identifier back to the orchestrator.

Now we can start to run some containers. The orchestrator makes a request to containerd's container and task services to run a container, and containerd passes those through to the runtime we've written for Firecracker. In the request we include both the VM ID we want to use and the snapshot ID that should be used as the container's root filesystem. The runtime will then invoke the Firecracker API to attach the device and then ask the agent to mount the device. Once the device is mounted, we ask the agent to start runc and launch the main process of the container. When the container starts running, all of that information is returned back out to the orchestrator. The orchestrator will probably want to know what happens to the container, so it can subscribe to events. At some point the container process might exit, because it's finished with its job, or it has an error, or it's been told to exit; exits are observed by the agent and propagated out to containerd, and because the orchestrator has subscribed, it also gets a copy of the exit event. When all of the containers have exited, the orchestrator can make a request to the plugin to stop the VM, and the plugin can then terminate the Firecracker process. And that's it.
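The start-and-observe-the-exit part of that flow looks roughly like this with the containerd Go client; this sketch assumes the container "demo" from the earlier example already exists, and the same pattern applies whether the task is backed by runc or by a shim like the Firecracker one.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "example")

	// Load the container created in the earlier sketch.
	container, err := client.LoadContainer(ctx, "demo")
	if err != nil {
		log.Fatal(err)
	}

	// A task is the running instance of the container; creating it is what
	// dispatches to the configured runtime.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	defer task.Delete(ctx)

	// Set up Wait before Start so the exit notification cannot be missed.
	exitCh, err := task.Wait(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}

	status := <-exitCh // delivered when the container's main process exits
	code, _, err := status.Result()
	log.Printf("container exited with status %d (err: %v)", code, err)
}
```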
So I have a really brief demo. It's more that we captured some output and put it in slides, so it's not quite a live demo, but this is what it would look like. This shows how you can run a container using ctr, the development tool for containerd, and the standard runc runtime, and what I want to show you is the difference, from the host's perspective, between running a normal container and running a Firecracker microVM with the containers inside it.

What we're doing here is running a container that is the stress program, a fairly common Linux program, and we've configured it to use a total of six processes: two that are consuming CPU and four that are just churning on I/O. Once we have stress running, we can examine the running state of the system as viewed from the default process namespace, and we can see the stress program and its constituent threads as normal first-class process objects. Beyond seeing them just with pgrep, we can look at the inheritance, what is responsible for starting which process: all the way on the right are the individual threads that are part of the stress program, started by an entrypoint, which was started by the default containerd runc shim, which was started by containerd. So this is what it looks like to run a normal Linux container and look at the processes from the host's process namespace.

Now we'll see the same stress container, but inside a Firecracker VM with firecracker-containerd. There are a couple of differences in this command line; it's mostly the same, but we've got two things that are specific to Firecracker. The first is that we've specified the runtime as the Firecracker runtime we've written, and the second is that we've specified one of the block-device-based snapshotters, so that we can expose the filesystem to the container. Other than that it's pretty much the same thing: it's also running through ctr, with pretty much the same command line. If you inspect the running state of the system as viewed from the default namespaces, there is no stress program visible; instead you see a firecracker process running. If we look at the inheritance, you see the same sort of thing: all the way on the right are the threads that correspond to the Firecracker vCPUs exposed in the VMM itself, then there's the firecracker process that was started by the containerd shim we wrote, and that's ultimately started, in this case, by the init system. And even though we can't see the stress process, we can see that it's still consuming resources; the resources at this point are just attributed to the VMM itself. So the host doesn't have any visibility into the processes running inside the VM, and similarly, the processes running inside the VM won't have any visibility into the host, so we've created a bit of a harder security boundary than there is with normal Linux containers.

So I'll talk a little bit about the current status of the project. It is an open source project. It all works, but we're still in the prototype or proof-of-concept stage for a lot of parts of it. We do know where we want to take it: we want to make it broadly useful for running containers with a VM boundary and a separate kernel instead of standard Linux containers, and we also want to make it easy to do the workloads people are running with things like Kubernetes, where you're running a bunch of containers together and you want them to share some resources. We do have a few tactical things we're working on right now. Firecracker's vsock support is currently experimental and is not the final implementation they'll end up with, so we'll need to move away from that.
We also need to work on the block device identification issue I talked about earlier, where a block device that contains a filesystem has to correspond to an individual container. All of these are supporting the goal of helping us run multiple containers in the same virtual machine, workflows like docker exec, where you can launch a new process inside the same container, and things like health checks and metrics.

The biggest thing we see that containerd is missing right now is an API for groups of containers, and I talked about this a little bit earlier. Groups of containers that run together are required for running things like Kubernetes pods or ECS tasks. Mostly that's been implemented outside of the container runtime: there's a plugin for Kubernetes in containerd called cri-containerd, and it implements the grouping logic itself instead of delegating some of that down into containerd. But groups of containers are also useful when modeling security boundaries: we want the containers that are grouped together to share the same boundary and to run together, so for firecracker-containerd we need an API to manage what that group is and which containers belong to it. That's the thing we've built as the control plugin right now; it's specific to Firecracker, but it's something we're interested in helping to contribute upstream to containerd, a way of grouping containers together and having that as an API object modeled in containerd.

Longer term, there are still things we haven't decided how to implement, either because we're not sure or because there are competing proposals. Some of these are challenges because of our desire to keep Firecracker's very minimal device model and feature set, in order to help Firecracker and firecracker-containerd make the arguments about security boundaries. A big one we're trying to figure out is CRI conformance. The CRI, the Container Runtime Interface part of Kubernetes, is a bit more flexible and doesn't fully specify all of the containers that are going to be launched together in the same group, the same pod or sandbox, and that makes it hard for us to properly size the virtual machine at launch: we don't know how much CPU or memory is ultimately going to be needed by the application that runs there. We also don't know how many containers are going to run, which is challenging from the perspective of needing to allocate the block devices ahead of time, and Firecracker doesn't support dynamically attaching new block devices or dynamically changing the size of a VM once it's been launched. So these are things we're going to need to work out for CRI. CRI also requires some filesystem sharing for pods, for things like the Kubernetes downward API, ConfigMaps, and Secrets, and because we don't have filesystem sharing from the host to the guest, that's another thing we'll need to work out to figure out what we're going to do about conformance there. A typical way this is done with virtual machines in other systems is the 9p filesystem protocol; you'll see that if you use QEMU or the Kata Containers project, but Firecracker doesn't plan to support 9p, so we're looking at alternative approaches. One that's interesting to us, and that we haven't finished evaluating yet, is virtio-fs, a FUSE-based filesystem that communicates over a virtio device, but we're also looking at other approaches, like perhaps doing local NFS if we have to, or other ideas.
So firecracker-containerd is an open source project. It's also something we're developing ourselves, so we're happy if you want to get involved, but I don't want to come up on a stage like this and be a big company asking for contributors. If you're interested, we're more than happy to have you help, and that can be as simple as using it or reporting bugs, or as much as coming to work with us and joining our team. Our GitHub repository has all of our code, and we're trying to put our design documents up there as well. We also hang out in the Firecracker Slack, in the containerd channel; the second link up there gets you an invite to the Firecracker Slack, and you can reach us there. And if you're interested in working at AWS, come talk to me afterward: the containers group is hiring across all of the services we have, like ECS, EKS, ECR, and Fargate, as well as the lower-level infrastructure, like our team. So I think now it's time for questions, if you have any.

How would you compare gVisor versus Firecracker? gVisor is a bit of a different technology. I definitely can't claim to be an expert in gVisor, but my understanding is that it's attempting to re-implement the Linux kernel interface in Go, in a different system. We took a different approach: re-implementing the Linux interface means you are responsible for ensuring that your system is compatible and that the behaviors look like Linux. From our perspective, we think it's a little easier to just let Linux be Linux, and to focus on the things that are interesting to us in terms of containment and isolation instead of re-implementing it.

Are there any plans to make it so that Firecracker can do dynamic block device mounting and that kind of thing? I would have to defer to the Firecracker team for that; I don't think it's on their current roadmap. Having it be configured only once at startup makes things a lot easier: the code paths are very simple, and they don't have to implement things like a full PCI bus to do hotplug. Adding that means more code complexity and more that has to be thought about from a security perspective, so I don't know whether they're ever going to do it, but it's not in the current plans.

Great presentation, thank you. A quick question: how do you control the kernel version of this microVM? You can choose that when you're launching the microVM. Firecracker just takes an ELF-formatted binary as the kernel, so if you have an uncompressed kernel image that's an ELF-formatted binary, you can use it. There are some special requirements Firecracker has around the kernel: it doesn't support compressed kernel images, which is what you'd normally see in a Linux system, and it also doesn't support things like an initial RAM filesystem, so there are some different configuration choices you'll have to make, but you can use arbitrary kernels with it. You can actually also run non-Linux things with Firecracker; we've seen contributions from unikernel projects that have been able to get their systems running with Firecracker as well, and that's pretty interesting too.

There are a couple of CRI plugins which manage or launch VMs; do you see any room for standardization on managing the lifecycle of VMs, or is it too implementation-specific?
I think there could be some room for standardization. I haven't looked enough at the other CRI plugins yet to know exactly what's different between those and what we're doing. We do know there are specific things we want to do from a Firecracker perspective, so we want to make that the thing we do first, but we'll look at those as well.

Thanks. You mentioned near the beginning that you didn't want to implement a PCI bus for the guest OS; how do you enumerate the I/O devices without a PCI bus? I don't know enough about Firecracker itself to answer that question, so that's a question to ask in the Firecracker Slack or in the Firecracker GitHub repository, and the maintainers of Firecracker will be able to answer it.

We can take one more question. Just extending on the unikernel question: there is a fundamental difference between what Firecracker does and how unikernels build out their VM and deploy; as you probably know, the attack surface of the kernel itself is much smaller. Why did you go with a full kernel as opposed to a more optimized kernel for this sort of architecture? For Firecracker, our goal is to enable the kinds of workloads people are already running, and most people running workloads are running them on Linux today, so we're trying to enable people to run their existing containers. Firecracker is used in AWS Lambda today, and we want people to run their existing Lambda functions there, so we chose that because we're interested in compatibility with what people are already doing. Unikernels are interesting, and they've got a lot of cool technology behind them, but they are pretty different for someone to use, and they require you to go through changes to adapt your application in order to use the unikernel itself.

If you have any further questions, Sam will be here and you can take it offline. We have the next presentation at around two o'clock, which is "[inaudible] to Kubernetes and Back Again", so see you there.

Cool. One final thing before we go: I think there's a session survey; I would love it if you filled it out and gave me feedback, things that you liked, things that you hated, but please keep in mind, for me and for all of the other talks you're going to go to today, that all of the presenters are humans. I'm going to be available just right outside, and I'm here the whole week, so if you didn't get a chance to ask me questions right now, please come up and find me. If you don't have a chance to find me, my email address is up there, and also my Twitter handle, if you want to reach out that way. But yeah, thank you.
Info
Channel: Docker
Views: 4,830
Rating: 4.8620691 out of 5
Keywords: Open Source
Id: 0wEiizErKZw
Length: 49min 4sec (2944 seconds)
Published: Tue May 14 2019