Deep Dive: containerd - Derek McGowan, Docker & Phil Estes, IBM Cloud

Captions
All right, I think it's time to get started, so welcome. This is going to be a containerd deep dive, a bit of a continuation from the intro session we had yesterday afternoon, with a slightly different cast of characters. Mike Brown and I did yesterday's intro; I'm going to give you a quick start, and then Derek, one of our maintainers, is going to do the bulk of the deep-dive presentation into the architecture and finish up with a demo.

Just to get us going, one of the things we didn't cover as much yesterday was where containerd is on that maturity path: who's using it, and where it is within the CNCF. So I thought we'd give you a few minutes of that before we dig a little deeper. Down the left side of the slide is a list of projects; as Derek goes over the architecture, these are great examples you can actually go look at. You can look at the code on GitHub and see how each of these projects uses the Go API library to drive containerd to do the work they're doing.

If you came to the intro session, Mike covered the CRI, the Container Runtime Interface from Kubernetes: how Kubernetes drives containerd when, for example, the kubelet wants to start a pod. You can look at the CRI project and see how it uses the containerd services, so that's one integration that's clear and understandable. Docker is on the path to use more of containerd: the runtime aspect of containerd has been in use by the Docker engine since 17.12, so all through 2018, all those releases, and there's work underway to refactor Docker to use the image side of containerd instead of the code that's in the Docker engine today, so you can expect to see a lot of that work happen in 2019. If you've heard of BuildKit, the open source project that can now be used to drive docker build, its back end uses containerd's client library, and it can run standalone without the rest of Docker, or even drive runc directly; so BuildKit is another great place to look at a containerd use case. Alibaba's PouchContainer is another open source project, from Alibaba Cloud in China; they're using containerd as a runtime, using both the image and runtime aspects of containerd. This isn't an exhaustive list, but a couple of our own maintainers and reviewers have interesting projects that are also great places to look: Michael Crosby's boss project uses the containerd Go library, and Evan Hazlett's Stellar project is another great example of using containerd as a runtime. So that gives you a flavor, as you see the architecture come together, of projects that are actually using it today.

We just completed a security review that the CNCF provides to all member projects; we're going to be publishing the PDF report from the security company online, and it's very positive, so we're excited to share it. In that same vein, we've also proposed to graduate within the CNCF. Kubernetes has graduated, as well as Fluentd I believe, and you can look at the PR in the CNCF TOC repository; it expresses where we are as a project and how we meet all the criteria for CNCF graduation. We presented that just last month to the TOC, and I think we can expect a graduation vote early in the year. So that gives you a picture of who's integrating and using containerd today, and some of the maturity of where we are as a project. With that, I'm going to let Derek take over and do the deep dive.
All right, so we're going to take the same architecture diagram that we went over in the intro yesterday, and we're going to go into it at a slightly different depth; I'm going to focus more on the components that are actually inside the daemon. We talked about how containerd is built around loosely coupled services inside the daemon, with strongly defined interfaces between those services. I'm going to go through a few of those and highlight how you as an integrator can both integrate with these services and build plugins that are used within containerd.

containerd has a smart client model. What this means is that the actual Go containerd client implements much of the higher-level functionality: push and pull, for example, are done inside the containerd client. We provide a very usable interface for integrating with our client library, and that's what you'll see Moby, PouchContainer, our own tool ctr, and the boss tool that Phil mentioned doing; they all directly use the containerd client. The containerd client itself communicates with the containerd API. The API is very low-level; it mirrors roughly the services we have underneath it, so the snapshot service, the content service, the container service are all exposed directly through the containerd API. We consider this API very stable, and it's intended to be backwards compatible through all 1.x releases.

Inside the containerd daemon we have the service level. This is what exposes all the services we have to the rest of containerd, so if you're implementing a plugin you can access any of the containerd service interfaces. That makes it really easy to build something like the CRI plugin, which is able to use these internal services directly without going through the API.

We also have a metadata store that sits right underneath the service interfaces. It provides namespacing and labeling on top of our even lower-level components such as snapshotters. For the things that actually touch files on disk, we try to make those as simple as possible, so that we can provide layers on top of them that add additional functionality, as well as guarantees such as atomicity at the metadata level, so that every single plugin doesn't have to worry about that. The same goes for namespacing: the metadata service is what actually does the namespacing, and it namespaces all your images, all your containers, all your snapshots, all the content, even down to the labels that are put on any of these objects. Basically any object that you're using within containerd is namespaced. This is primarily to support multiple clients: containerd is designed to be used within Docker, within Kubernetes, within any platform that wants to run containers, but it's designed in such a way that they don't step on each other. If one of them is pulling images or managing containers, you don't have to worry about some other platform or tool interfering with those containers accidentally.
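To make the smart-client model concrete, here is a minimal sketch (not from the talk) of using the containerd Go client; the socket path is the common default, and the "demo" namespace and image reference are illustrative:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect to the daemon's gRPC socket; the client implements the
	// higher-level logic (push, pull, run) on top of the low-level API.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Every object (image, container, content, snapshot) is namespaced,
	// so different platforms sharing this daemon don't step on each other.
	ctx := namespaces.WithNamespace(context.Background(), "demo")

	// Pull is driven by the client: it resolves the name, fetches content
	// into the content store, and unpacks the layers into a snapshotter.
	image, err := client.Pull(ctx, "docker.io/library/redis:alpine",
		containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("pulled %s", image.Name())
}
```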
Let's take a deeper look at what the metadata service is actually doing. The metadata service is implemented on BoltDB, so it's completely atomic. It holds references to the different objects, most of them implemented through labels. For example, when you pull an image, it pulls the content down, puts it in the content store, and also references it in the metadata store, along with the relationships between all the content that was pulled. Likewise with images: the image itself lives only in the metadata store — there's no separate back end for images, the metadata store is the actual back end — and it links to the content that got pulled. Likewise for snapshots: snapshots tend to be very heavy on disk, and managing them can be complicated, so the metadata store is what takes care of tracking which snapshots exist, what namespace they belong to, the relationships between snapshots, and the content they're related to.

This really helps us when we want to do something like deleting an image. containerd has garbage collection that takes care of cleaning up this data when something gets deleted. In this example, if we were to delete the Redis image, you can see that the Redis image was based off the Alpine image. In the diagram, the green stars represent the OCI manifests; the yellow stars represent the OCI image configs, which specify the layers, the runtime parameters, everything for that image; and the red stars represent the actual compressed layers. As you can see, the Redis image points to a manifest which references two layers, whereas the Alpine image has only a single layer, and you can see how those translate to the snapshots. When you want to delete the Redis image, we just delete it from the API; nothing needs to happen right away. When the garbage collection runs, it goes through and sees that there's content that's no longer referenced. In this case we actually have a Redis container running, which points at a read/write snapshot, and that read/write snapshot references the Redis layer that was owned by the image that just got removed. The garbage collector sees that the snapshot is still owned and won't clean it up, but it will clean up any of the content that was associated with the pull — the artifacts that came from the registry, the same artifacts you could push to another registry, for example. After the garbage collection, the metadata store looks like this. Now we go back through and delete the container; normally we delete the read/write snapshot at the same time we delete the container, and you can guess what happens here: when the garbage collection runs, it sees that there's now a snapshot that is no longer referenced, and it goes ahead and deletes that snapshot. So our final state in this case is that we just have the Alpine image. The way we've implemented garbage collection in containerd, the metadata store is in BoltDB and we try to avoid locking it as much as possible, so we do this garbage collection very quickly.
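As a sketch of how a client triggers that cleanup, deleting an image is a single metadata operation; the synchronous option, which the talk returns to at the end, forces the collection to finish before the call returns. The function name here is hypothetical:

```go
package main

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/images"
)

// removeImage deletes an image record from the metadata store. By default,
// unreferenced content and snapshots are cleaned up by a later garbage
// collection pass; with images.SynchronousDelete() the call does not
// return until the referenced data is actually gone.
func removeImage(ctx context.Context, client *containerd.Client, ref string, sync bool) error {
	var opts []images.DeleteOpt
	if sync {
		opts = append(opts, images.SynchronousDelete())
	}
	return client.ImageService().Delete(ctx, ref, opts...)
}
```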
But as you all know, that content and those snapshots can represent a lot of data on disk, and deleting data on disk can be very slow. So what I showed is representative of what we do inside the metadata store for deletion, but we actually have a multi-stage garbage collector: for each of the individual content stores or snapshotters, it does a separate collection pass when there's data that's been removed.

As mentioned, you can implement your own snapshotter, and we've tried to make an interface that's a mix of powerful and simple. One of the ways we tried to achieve that was by removing the operations that make snapshotters difficult to implement. You'll see that there are absolutely no data operations in our snapshotter interface; you're not going to see any data being streamed into or out of the snapshotter. If you're familiar with Docker graph drivers and how they work, they actually handle tar streams in and out, which can make them fairly complicated to implement, since you're then responsible for understanding tar and how to compress, decompress, and unpack those streams. There's also no mounting, which lets you implement a snapshotter that's fairly stateless: as soon as you have to deal with mounts, you have to deal with whether or not you can unmount, which brings in a lot of extra reference counting and tracking that goes along with owning the mounts you create. So in containerd we return an array of mounts, and that array is just a description of how the snapshot can be mounted.

For the snapshots themselves, we have a prepare operation which creates a mutable snapshot; when you're done with it, you commit it, and the snapshot can no longer be altered in any way. For example, when we're doing a pull, we're preparing each layer, unpacking into that layer, and then committing it, and it's up to the actual pull operation to determine how it's going to mount and do that unpacking. There is label support in the interface, so the stat and info calls on snapshots can carry labels; this is mostly used at the metadata store level, but snapshotters themselves can support those labels. I put enumeration on the slide because that's something that was missing from the Docker implementation of graph drivers. Where it really helps is cleanup: when you have the ability to enumerate, you know what's actually there and can make decisions about what can be deleted. It also helps you do better tracking of what exists, as well as giving APIs to the client where you can actually see all of the snapshots in a snapshotter.
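For reference, this is roughly the shape of the snapshotter interface being described, abridged from github.com/containerd/containerd/snapshots as of the 1.2 era; the comments are annotations added here, not part of the original source:

```go
type Snapshotter interface {
	// Metadata, usage, and label support; no tar streams anywhere.
	Stat(ctx context.Context, key string) (Info, error)
	Update(ctx context.Context, info Info, fieldpaths ...string) (Info, error)
	Usage(ctx context.Context, key string) (Usage, error)

	// Mounts returns descriptions of how a snapshot can be mounted;
	// the snapshotter itself never performs the mount.
	Mounts(ctx context.Context, key string) ([]mount.Mount, error)

	// Prepare creates a mutable snapshot on top of parent; Commit then
	// freezes it under a new name so it can never be altered again.
	Prepare(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)
	View(ctx context.Context, key, parent string, opts ...Opt) ([]mount.Mount, error)
	Commit(ctx context.Context, name, key string, opts ...Opt) error
	Remove(ctx context.Context, key string) error

	// Walk provides the enumeration that graph drivers lacked, which is
	// what makes cleanup decisions like garbage collection possible.
	Walk(ctx context.Context, fn func(context.Context, Info) error) error
	Close() error
}
```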
In the 1.2 release we added a feature called proxy snapshotters. This basically gives you the ability to run a snapshotter external to the containerd process, so that you don't have to recompile containerd to use your snapshotter. If you've been following Firecracker at all, they're using this to implement a snapshotter. For configuring containerd, we have a proxy plugins section in the config: you just specify a name — in this case I think it would be called my-snapshotter — you specify a type, and the address is always a Unix socket. As you can see in the example, the plugin just listens on that Unix socket and containerd connects to it. And as I mentioned about these interfaces and how we use them within containerd, that snapshotter interface is used at every level of containerd — the client, the API, the back end all use the same snapshotter interface — so to actually implement a proxy plugin, you just implement the snapshotter interface, and it can be used on either side of gRPC.

In 1.2 we also released a new plugin feature for runtimes. In v1 of our runtimes we had the ability to make the runtime pluggable, but it wasn't super easy to implement: there was a gRPC interface that could be implemented, but it had some limitations. For example, we didn't have a stats endpoint, so if you were trying to implement a runtime that ran inside a VM and you needed to get stats, it was somewhat difficult. Now we've added a stats function inside the task service, so if you want to implement a runtime plugin you can do everything, including returning your own stats. The biggest feature in 1.2 wasn't necessarily that we introduced this new runtime interface — we did add a few endpoints to it — but mainly that we stabilized it, so that people can feel comfortable implementing against this API knowing we'll continue to support it. That should make it much easier to implement these plugins, and we've already seen interest in, and actual implementations of, this from Firecracker, Kata Containers, and gVisor, to name a few; and obviously we have a runc implementation.

Another advantage of this approach — I wrote it in a confusing way on the slide; it says "at most one shim per container," which I had to stop and think about for a second. What it means is that previously there would always be one shim for one container, but now it's at most one shim per container. So if you have ten containers running, you can have ten shims, but you could also have all ten containers sharing the same shim. This is useful in the VM scenario, where you may have one VM and multiple containers inside that VM — say they're all in the same pod — and you wouldn't have to have a bunch of extra processes to manage all of those containers. Another minor change we made to this API was passing in the ID for each of the tasks, so that the API can be used across multiple running tasks.

As somebody who's using containerd, you're most likely going to be starting with our Go client. The Go client has gotten fairly positive feedback; it's fairly simple to use, and we provide a lot of "With" functions — that's probably the best way to describe them — so at almost any point of the API you can change whatever you want. In the services section we define which services the client uses: by default it does everything over gRPC, but you can override any one service. You could run containerd completely embedded; you could implement all of these interfaces yourself if you wanted to and still use the client and its higher-level functionality. An example of this: say you wanted to build a tool which pulls something from a registry, but you don't want a running containerd. All you have to do is implement the content store interface, and then you can make use of that higher-level functionality, and it will store the content wherever you define it to be stored. I also tried to highlight the resolver here. The resolver is used whenever you do a pull operation, and it's another interface that you can completely override. I don't have a slide that shows what the interface looks like, but it's fairly simple: you resolve a name to a digest, and then you can fetch individual blobs using just the hash. By default we have an implementation that uses the Docker registry API — which is now the OCI distribution specification — and our implementation is fairly unopinionated.
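A sketch of those two override points together, assuming the client's WithServices/WithContentStore options and the directory-backed store from content/local (these are the 1.2-era names; treat them as an assumption to verify against your client version):

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/content/local"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/remotes/docker"
)

func main() {
	// A content store backed by a plain directory; any implementation of
	// the content.Store interface could be plugged in instead.
	cs, err := local.NewStore("/tmp/content") // illustrative path
	if err != nil {
		log.Fatal(err)
	}

	// Override just the content service on the client; all other services
	// still go to the daemon over gRPC.
	client, err := containerd.New("/run/containerd/containerd.sock",
		containerd.WithServices(containerd.WithContentStore(cs)))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// The resolver is the other override point: it resolves a name to a
	// digest and fetches blobs. The default speaks the Docker registry /
	// OCI distribution API; any remotes.Resolver could be swapped in.
	resolver := docker.NewResolver(docker.ResolverOptions{})

	ctx := namespaces.WithNamespace(context.Background(), "demo")
	// Fetch pulls content without unpacking; blobs land in /tmp/content.
	if _, err := client.Fetch(ctx, "docker.io/library/alpine:latest",
		containerd.WithResolver(resolver)); err != nil {
		log.Fatal(err)
	}
}
```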
I'm going to go through a few of the flows that we have in containerd today. This flow is the same regardless of what resolver you have, but the default remote will be some sort of registry. In our ctr tool we actually have top-level commands for pull, fetch, and unpack; normally you're just familiar with pull. The first thing pull does is fetch the content: fetch goes to the registry — to this remote — pulls down the content, and puts it in the content store. It's not really doing anything else besides that; the only tricky part is that it has to understand OCI manifests so that it knows how to walk the tree of objects associated with an image. At the end of the fetch, it takes the image you tried to pull and creates a record in the metadata store saying that this image refers to this OCI manifest. So when the pull comes along, it can see what that image represents, and to do an unpack on that manifest it takes the content — now that it knows what the layers are — reads those layers from the content store, and unpacks them into the snapshotter. That's all a pull is. You could do the unpack yourself if you have your own content; there's nothing here that isn't configurable and exposed in our client.

The push flow in containerd is very simple. containerd doesn't build images, and our ctr tool doesn't build images; it pushes them. If you have an image, you have the content, and it takes that image and that content and pushes it to a registry. That's it. If you want to build images, you can use something like BuildKit or other tooling that's being developed to create new content. containerd runs images; it doesn't create them.

Running a container is one of the most important things we want to do in containerd, and it uses many of the same underlying services. When you go to run, the first thing that's always done is an initialization step: it reads the image you want to run, looks at the configuration, creates the OCI specification, and creates a new read/write layer in the snapshotter, and with those it sets up a new container for you. That new container now has a defined root filesystem — most likely represented by a snapshot — and the OCI configuration that was created. Once you have a container created, you can start it: the start actually mounts your snapshot and starts the task that you specified.
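A minimal sketch of that create-then-start flow with the Go client; the container ID, snapshot ID, and image reference are illustrative, and the image is assumed to have been pulled already:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "demo")

	image, err := client.GetImage(ctx, "docker.io/library/redis:alpine")
	if err != nil {
		log.Fatal(err)
	}

	// Initialization: a new read/write snapshot for the root filesystem
	// plus an OCI runtime spec generated from the image configuration.
	container, err := client.NewContainer(ctx, "redis-demo",
		containerd.WithNewSnapshot("redis-demo-rootfs", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer container.Delete(ctx, containerd.WithSnapshotCleanup)

	// Start: mount the snapshot and hand the bundle to a runtime shim.
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		log.Fatal(err)
	}
	defer task.Delete(ctx)

	if err := task.Start(ctx); err != nil {
		log.Fatal(err)
	}
	log.Printf("started task with pid %d", task.Pid())
}
```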
I wanted to drill down into what has changed with runtime v2. At the lowest level of run we have the shim runtime manager, which is what manages all the individual shims that own the running processes. When we actually get to the point where we're starting these containers, it takes the OCI specification and creates a bundle directory. If you're familiar with runc, you know what these bundle directories look like: there's the specification, there's the rootfs. Then this is the main change we made in 1.2 for the v2 runtime: now we have a shim start. The shim start takes that bundle directory and calls into the shim binary that's defined for that runtime. The plugin is just a binary that implements a few functions, like start. When that binary is started, it returns a path to a Unix socket, and that Unix socket implements the interface I showed earlier, the runtime service's task interface. This start can do one of two things: it can create a new shim that's going to manage that container, or it can use an existing shim and return the path to another running shim. Then that Unix socket gets connected to, the task create function is called, and now you actually have a task: you can exec it, you can start it, you can do anything you need to do with a task, including getting the stats.

So I'm going to go ahead and give a demo — I see the time here, okay, perfect. I'm going to start off by demoing the snapshotter proxy plugin, so I'll show what an example of one of these proxy plugins actually looks like when it's running, as well as some of the commands in ctr for looking at snapshots. This is just a proxy plugin — I don't know if this particular one is in the repo — but it uses the contrib snapshot service. The snapshot service basically takes the snapshotter API and creates a gRPC service for it, so an example plugin can be very simple: create a new gRPC server; create a new snapshotter — it could be your own custom snapshotter, but in this case I'm just using one of the built-in snapshotters, the native snapshotter, which is the simplest possible implementation and just copies up every layer — then use the snapshot service from contrib to create the gRPC service, register it, and listen. That's pretty much it.

Let me show that running, and let me also show you what the config looks like — I showed it earlier, but here you can see the proxy plugins section: this "test-ss" is going to be the test snapshotter. That's actually the name of the snapshotter that's used — you'll see that in a second — and this is the actual path it will connect to. So let me go ahead and run this snapshotter. Okay, this is just running that binary; you can see that same path, and it's going to use this directory for snapshots. It should be running, so let's go ahead and pull an image from a local registry I have. You can see in this command I'm using a demo namespace — this is my normal running containerd, with images and stuff I use for development or just running containers, so I'm creating a demo namespace for this. This is the same snapshotter name you saw inside the config, and then we just do a ctr images pull from a local registry. It's really fast because it didn't have to go over the network. Now let's look inside /tmp/test-plugin-root at what the native snapshotter actually did.
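A minimal sketch of a proxy snapshotter process like the one in the demo, using the contrib snapshot service; the socket path, plugin name, and root directory are illustrative:

```go
package main

import (
	"log"
	"net"
	"os"

	snapshotsapi "github.com/containerd/containerd/api/services/snapshots/v1"
	"github.com/containerd/containerd/contrib/snapshotservice"
	"github.com/containerd/containerd/snapshots/native"
	"google.golang.org/grpc"
)

func main() {
	// Any snapshots.Snapshotter works here; the built-in native snapshotter
	// (plain directories, full copy-up per layer) is the simplest choice.
	sn, err := native.NewSnapshotter("/tmp/test-plugin-root")
	if err != nil {
		log.Fatal(err)
	}

	// contrib/snapshotservice wraps the snapshotter interface in the same
	// gRPC service containerd uses internally.
	rpc := grpc.NewServer()
	snapshotsapi.RegisterSnapshotsServer(rpc, snapshotservice.FromSnapshotter(sn))

	// Listen on the Unix socket that containerd's config points at, e.g.:
	//
	//   [proxy_plugins]
	//     [proxy_plugins.test-ss]
	//       type = "snapshot"
	//       address = "/var/run/test-snapshotter.sock"
	//
	const socket = "/var/run/test-snapshotter.sock"
	os.Remove(socket) // clear any stale socket from a previous run
	l, err := net.Listen("unix", socket)
	if err != nil {
		log.Fatal(err)
	}
	if err := rpc.Serve(l); err != nil {
		log.Fatal(err)
	}
}
```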
It has a database file as well as a snapshots directory, and these are the actual snapshots. As I said, this is the simplest snapshotter, so if I look inside 6 — that should be the last one created — it's just going to be a normal root filesystem. When it actually runs, it's basically just going to be a bind mount; there's nothing too tricky there.

Let me show you some of the snapshot commands we have. We can use the same arguments — you can also set these as environment variables. ctr snapshots ls shows you all your snapshots. It's just a lot of hashes, but you can see there's a key, a parent, and a kind; the kind basically says whether the snapshot is active, meaning you can make changes to it, or committed, meaning it's immutable. Let's call tree — tree has more interesting output, because it actually shows you the relationships between the snapshots. You can still see the parentage, and you can also see this one at the bottom: it's the last one that was pulled, so it would be the uppermost layer as most of us think about layers, even though it's displayed at the bottom here.

Now let me create a new snapshot. I'm going to do basically what you do when you're initializing a snapshot for a container, which is call snapshots prepare. I'm going to call this one "upper", using the same ID you saw as the top of the tree, and it's just going to create a new snapshot. If I run tree again, you'll see there's a new snapshot, and it now sits above the other one — its parent is the previous snapshot we mentioned. Let's mount it: I'm going to make a directory — I already mounted it earlier, so let me make sure it's empty; yeah, that's empty. In ctr we also have a snapshots mounts command. Snapshots themselves aren't mounted, but this gives you a mount that you can run — remember what I said before, it's really simple, it's just a bind mount — so this is just a convenience tool for helping you debug; it gives you a mount command that you can run. I'm not running as root, so let me sudo that. Now when I go back to /tmp/m you can actually see there's a root filesystem there, and I could chroot into it and pretend I'm inside the container. It's the same thing the runtime does when it goes to run the container, but it's a good way to give you some visibility into the snapshots.
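The Go equivalent of those ctr steps is roughly this sketch; the snapshotter name matches the illustrative config above, and the parent key is a placeholder for a committed snapshot you would read out of ctr snapshots tree:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/mount"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "demo")

	// "test-ss" is the proxy snapshotter name from the config sketch above.
	sn := client.SnapshotService("test-ss")

	// Placeholder: the key of the uppermost committed layer from the tree.
	const parentKey = "sha256:..."

	// Prepare returns mount descriptions; nothing is mounted yet.
	mounts, err := sn.Prepare(ctx, "upper", parentKey)
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent to running the command that `ctr snapshots mounts` prints;
	// for the native snapshotter this is just a bind mount (requires root).
	if err := mount.All(mounts, "/tmp/m"); err != nil {
		log.Fatal(err)
	}
}
```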
Let me demonstrate some cool stuff around garbage collection real quick, and then we're going to be out of time. Let me unmount that, and then I'm going to go ahead and remove the redis:alpine image. This is similar to what I showed earlier: when I actually delete it, you'll see that all the content gets garbage collected. You can see the garbage collection ran and took eleven milliseconds, and a subset of that is how long the database was actually locked during that time; it deleted all of the content that's no longer used. If I go back and try to look at the content — I didn't show this earlier, but you would have seen all of it before — there's actually nothing there now, but you can see that the snapshot is still there. So now let's remove that upper snapshot; I already unmounted it earlier. Now that the upper is removed, nothing else was referencing any of those snapshots, so if I go back through here you can see that all the snapshots got removed: as soon as I deleted the snapshot I was referring to, nothing else referenced any of that content.

That's really how our garbage collector works. You can rely on it from the client perspective: it deletes things quickly and it deletes things reliably, so you know that after your command returns, the thing really is deleted — the metadata store no longer has any reference to it. The garbage collection runs very quickly, and it runs fairly shortly after a deletion. We also have the ability to run it synchronously for use cases where you actually need the data gone from disk, so you can delete images synchronously.

And with that, I'm out of time — we're at time now — but we're going to be around for questions. We have maintainers here — Stephen Day, Mike Brown, Phil Estes — so feel free to come ask those questions. I'll also be at the Docker booth tomorrow at 10:30 if you want to come talk more about containerd. Thank you all for attending. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 1,750
Rating: 5 out of 5
Id: 4f_2u6rIDTk
Length: 36min 42sec (2202 seconds)
Published: Sat Dec 15 2018