Linux Containers on Windows - The Inside Story

Captions
Ladies and gentlemen, welcome to this first talk on day two of DockerCon. I hope you had a great time. My name is Gabriel, I work on the training team at Docker, and as such I'm of course very, very interested in this first talk, because I will have to write about these things. With me I have John Starks; he's a principal engineering lead at Microsoft, and he is going to talk about Linux containers on Windows. Take it away.

Thank you. So first, just a little bit more about myself. I'm one of the container architects on the Windows team at Microsoft, and I've been working for a while now on Windows containers, Windows Server containers, and on the integration work in Docker and containerd to make that happen. Recently I've been focusing a lot more on Linux, because I'm actually the principal engineering lead for the Linux on Windows team. What we do is try to make sure that Windows really is one of the best places for you to develop Linux code, and even deploy Linux code. We work on Linux VM technologies; we work on the Windows Subsystem for Linux, which lets you run Linux binaries natively on the Windows kernel without a VM at all; and of course we continue to do work in Docker. Now we've got this new Linux containers on Windows feature that we've added in the latest release of Windows and are actively integrating into Docker as we speak. Today I'm just going to focus on this latest thing: Linux containers on Windows.

You might say, why are we building something new and calling it that, when we already have Linux containers on Windows, because we already have Docker for Windows. And Docker for Windows is great: it lets you run existing Linux containers on Windows for development purposes. It creates a classic-style Linux VM, manages it for you, and gives you the familiar set of Docker tools to create containers, do docker build, all that kind of thing. But there are a couple of limitations that keep Docker for Windows from being ideal. One is that you can't use Linux containers and Windows containers at the same time; there's this big mode switch where you go from the Linux container mode to the native Windows container functionality that's built into Windows. The second issue is that it's really just for development, and we wanted something where, if you have a Windows Server VM or a Windows Server host, you can just run Linux containers directly on that machine without having to manage your own separate Linux VM.

Of course, you can just run a Linux VM yourself and lose the integration that Docker for Windows gives you, and that is sort of Linux containers on Windows too, right? But again, it doesn't give you that single-daemon experience: a single Docker daemon that can run Windows and Linux containers together. And then we've gotten a lot of questions about whether we should support running Docker directly in the Windows Subsystem for Linux, because hey, we've implemented the Linux syscall ABI, Docker is a Linux application, why can't we just run Docker directly? Well, don't tell anybody, but actually you can: with the latest Windows release, or the latest Insider builds I should say, it is possible, and you can go try it if you like. But there are a few limitations here. One is that WSL, the Windows Subsystem for Linux, is really intended as a mechanism to run Linux developer tools on Windows, to do scripting and some basic automation, things like that; it's not really intended for use in production. So it doesn't solve that first limitation of Docker for Windows: you can use it on your dev box, but you can't really use it on Windows Server. It is available on Windows Server, again for automation and scripting, but it's currently not appropriate for running production web servers and the like. The second problem is that although we've put a lot of love into WSL (I spent all of last week getting some additional flags to the clone syscall working, so I love working on it), it has some rough edges in areas that really need to be polished for Docker. The networking support around iptables and things like that doesn't quite work out, and we have some filesystem performance issues to work through before it's fast enough for some of these Docker workloads. So WSL isn't really ready for running Docker and having that be a great experience.

That brings us to LCOW, or Linux containers on Windows, which is this latest project. Before I go into the architecture, I think we should just see a quick demo. What I have here is almost Docker master (there's one extra PR that we haven't gotten merged yet, but this is basically Docker master), and this is the Windows 10 Fall Creators Update that was released yesterday. I've already loaded some images, and I think this is immediately pretty cool: at the top here we've got the Microsoft Nano Server image, our small Windows Server base image, but then I've got three Linux images on the same daemon, and I can run all of them. There's no mode switch, nothing like that. So here's a Windows container running, and here, let's run an Alpine one as well. Great: Windows containers and Linux containers. And just for fun, let's also get nyancat running; we'll let the cat wear itself out and come back to it in a few minutes.
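For readers following along, the demo corresponds roughly to a session like the one below. The image names are the ones mentioned in the talk; the nyancat image name and the exact commands are illustrative stand-ins, not a record of what was typed on stage.

    PS> docker images                                     # one daemon listing microsoft/nanoserver next to Linux images
    PS> docker run --rm microsoft/nanoserver cmd /c ver   # a Windows container
    PS> docker run --rm alpine uname -a                   # a Linux container, booted in its own utility VM
    PS> docker run --rm -it nyancat                       # stand-in name for the nyancat image used on stage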
So how does this work? What's the basic architecture? First, let's take a quick look at the very familiar components of Linux containers on Linux, which I call LCOL, "alcohol," though I don't think that's going to catch on. We've got a few pieces here, and I've simplified this dramatically, of course. There's dockerd and containerd providing the lower-level container and system management, and when you actually launch a container through dockerd, and then through containerd, it launches a runc process and passes it an OCI runtime specification, and that's what actually creates the process in its new set of Linux namespaces. This is all running on a shared Linux kernel, so if you have multiple containers they all share the same kernel. This is basic stuff.

When we came to do LCOW, Linux containers on Windows, we wanted to preserve as much of this architecture as we could and reuse all the same components, but extend it to work on a Windows host. So what we've done is exploded this picture out and added hardware virtualization; we've added VMs. The green boxes are all the same, but the gold-colored boxes are the new components. At the bottom we have the Microsoft Windows hypervisor; the Windows kernel is on the left, so the left side is essentially the Windows host, and on the right we have a Linux VM. On the Windows side, running as Windows services, we still have dockerd and we still have containerd, but containerd is now calling into HCS, which is what we call the Host Compute Service. This is a service in Windows that we developed initially for Windows containers; it essentially manages running VMs and containers on the Windows container and Hyper-V stack. If you've seen talks about how Windows containers are implemented, HCS should be familiar: it's the same component that dockerd and containerd talk to for Windows containers. Over on the Linux VM side we have a component called GCS, the Guest Compute Service; this is essentially HCS's buddy over in the Linux VM, acting as a proxy for the things HCS wants to do.

So when I created that Linux container a few moments ago, what actually happened (and I'm cheating a little bit, because we don't actually have containerd in this picture right now; that will come in probably a month or two) is that we call down into HCS, and HCS very quickly creates that Linux VM, boots it up, launches the GCS inside it, and then sends a message across a VM transport to the GCS to launch runc and launch the container as usual. We're sending the same OCI runtime spec across; HCS and GCS don't really know anything about that spec, it's just a pass-through. So we're reusing all the existing technology that's been developed for standard Linux containers.
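The spec being handed from containerd through HCS to the GCS is just the standard OCI runtime spec that runc consumes everywhere else; nothing LCOW-specific is added. As a rough reminder of its shape (fields taken from the published OCI runtime spec, heavily abbreviated):

    # On a Linux machine with runc installed, `runc spec` writes a default config.json.
    # Abbreviated, the document looks roughly like:
    #   "ociVersion": "1.0.0",
    #   "process": { "args": [ "sh" ], "cwd": "/", "env": [ "PATH=/usr/local/sbin:..." ] },
    #   "root":    { "path": "rootfs", "readonly": true },
    #   "mounts":  [ { "destination": "/proc", "type": "proc", "source": "proc" }, ... ],
    #   "linux":   { "namespaces": [ { "type": "pid" }, { "type": "mount" }, { "type": "network" }, ... ] }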
The other thing I want to point out is that I've only drawn one container in this picture. If you have multiple containers today, each of those is going to be a separate VM. This is in contrast to something like Docker for Windows, where there's one big VM that hosts all the containers. We think we can get the performance good enough that this is acceptable, but we also see some value, longer term, in being able to run multiple containers in a single VM instance, and that's something we're definitely looking into.

I know what you're thinking: okay, now we've got this VM, surely it's some heavyweight thing I have to manage. We've really tried hard to make sure that's not the case. We've termed the type of VM we launch for this a utility VM. That may be familiar terminology if you've looked into how Hyper-V isolated Windows containers work, and this is essentially a Hyper-V isolated Linux container. What that means is that the utility VM we launch is fresh each time: it's a stateless VM, the root filesystem is immutable and read-only, and every time you start a container we create a new instance of that VM. There's no state that gets left behind or committed into any images; it's stateless in that way. And we've worked hard to keep the payload in that VM as small as possible. We don't need much in the way of kernel modules and such, because the hardware is always exactly the same virtual hardware, and the root filesystem really only needs enough to launch the GCS and runc, plus a few other miscellaneous tools we have to throw in. The VM itself, if you're familiar with Hyper-V on Windows, is based on what we call a generation 2 VM, a legacy-free VM: there's no VGA, there's no PCI, just synthetic devices on the VMBus virtualization transport, and that lets us cut down on kernel boot time and memory use. Finally, if you've used Hyper-V before and launched a VM, you know it statically consumes however much RAM you configured for it, a few gigabytes per VM, and if you then go open a bunch of browser tabs, or play a game, or open Visual Studio, that can be really frustrating, because the VM doesn't participate in the paging decisions the host memory manager makes. With utility VMs that's not the case: the memory is just like any other application memory, so it can be paged, compressed, deduplicated, all that kind of thing.

If you're still not convinced, and you want a little more on why VMs and why we didn't use WSL: reiterating, WSL is not really designed for production, and its compatibility is not complete (I just added that clone flag support last week). If you really want to be sure your Linux software runs the same when you're testing on your dev box as when you go to production, you want to be running the real Linux kernel on your dev box. We think that eventually maybe we'll get there with WSL, the compatibility will be good enough, maybe we could even certify it, but certainly for now a Linux kernel is the best thing at running Linux software. We've already talked about production use. And finally, isolation: there's been a lot of debate about whether a shared-kernel approach is sufficient for something like multi-tenant isolation, and at Microsoft we're really betting on hardware virtualization to give us this capability instead of a shared-kernel approach. It's something we're already doing in Azure; we already rely on the VM boundary being solid to protect our customers and to protect our data centers, so we thought, let's reuse those investments with Linux containers.

Quickly, let's come back to our nyancat. It's actually still running; I'll hit Ctrl-C. What we're going to do is look into this VM a little more. We can see it's still running, there's this container ID here, and I want to introduce a tool you may not be familiar with that ships in-box in Windows, called hcsdiag. It's a diagnostics tool for anything launched through the HCS, so it works on Windows containers, it works on Linux containers, and it even works a little bit on VMs. With hcsdiag list we can see a container template (that's related to the Windows container I launched earlier, don't worry about that one), but this one here is our Linux container. It's an entity HCS knows about because it's running, and hcsdiag gives us a few commands to interact with the container without Docker knowing about it. Some of this is less useful when you have Docker: for example, I can use the exec command, which is kind of like docker exec, but why would I do that when I have docker exec? What is useful is that I can interact with that utility VM, which is something Docker doesn't expose. Here I'm using the console command with -uvm, for utility VM, which gets me a shell into the utility VM itself. I just want to show you the process list, where we can see that the GCS is indeed running, as I mentioned, and if we scroll down there's runc, and of course runc has launched nyancat. You can use this to learn more about the system or to diagnose problems; we use it a lot in development when things go wrong.

The other thing I want to show you is Task Manager. The container is still running; let me sort by name here. I just want to show you the processes that get created when you launch one of these things. For every VM on the Hyper-V stack we have a worker process; this is the thing that actually hosts some of the devices and so on. There are two here, and one of them is that template again, so don't worry about that one; this one here is the worker process for our Linux container, and corresponding to it is a memory process, which is what actually hosts that pageable VM memory I mentioned. We can see it's a little fatter right now than I'd like it to be: it takes 160 MB to run nyancat, and we can do better, but I think that's actually fairly good overhead compared to running a whole VM, where you immediately consume two gigabytes or whatever. If you want to know which worker process is which, that's a little bit of a pain; one trick is to just start killing them until your container dies, and that was the right one. Actually, as I was preparing for this I realized we should just add that to hcsdiag so it's easy to figure out which worker process is which, so maybe next release we'll get that in.
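The hcsdiag session from the demo looks roughly like the sketch below. The subcommands are the ones named in the talk; the exact argument order and the container ID are illustrative and may differ from the shipping tool.

    PS> hcsdiag list                               # everything the HCS is currently managing
    PS> hcsdiag exec    <container-id> /bin/sh     # roughly docker exec, but straight through HCS
    PS> hcsdiag console <container-id> -uvm        # shell into the utility VM itself, which docker doesn't expose
    / # ps                                         # inside the utility VM: the gcs, runc, and nyancat are all visible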
Great, so we've talked about the basic architecture and we've got this VM. How did this VM boot? Where did its filesystem and kernel come from? The basic idea is that we don't ship this stuff in-box in Windows, of course. The hope is that once Docker for Windows is updated to use this technology, or it's available in some other Docker products, a kernel and a root filesystem for that VM will be shipped with it. But if you're doing development, or you want to replace this stuff or play around with it, you have to provide your own. The kernel is basically standard; there are a few patches that haven't made it upstream yet, basically to fix some bugs in the Hyper-V drivers, and we're actively working to get those upstream right now. We've documented the Kconfig that we recommend, and you can tweak it further; it's on our GitHub repo for the GCS component, which, I forgot to mention, is open source, at least the GCS side. For the filesystem itself we support either an initrd, which is really convenient to build but has to be decompressed and loaded into memory during boot, or you can just give us a filesystem image, which we can use directly but which is a little less convenient to construct. That filesystem image, as I mentioned, just contains a minimal init process, the GCS, runc, and some basic tools. If you watched Rolf's talk yesterday on LinuxKit, you know this is basically what LinuxKit was made for: constructing images in exactly this way. So LinuxKit is definitely the easiest way to go; if you haven't seen that talk, check it out once they release all the talks.

As far as booting this thing, we do use UEFI firmware. Compared to something that can boot the Linux kernel directly, we go through the firmware first, so there's a little performance trade-off there, and that's something we'll improve, but essentially the initrd and the kernel are just sitting in the host filesystem; we load them into the address space and off you go. We don't use GRUB, just the EFI stub loader from the Linux kernel. One thing that's kind of disappointing is that because the kernel gets decompressed as it's loaded, we can't actually share that memory across multiple VM instances, so each VM has its own copy of the kernel in memory. That's something I'd love to improve. It looks to me like the Linux kernel on amd64 doesn't actually support an execute-in-place mode yet, but on ARM it does, so if anyone wants to come up and talk about that afterwards, I'd love to get some thoughts on whether that's something we could improve.
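A hedged sketch of how you might produce and install your own kernel plus initrd with LinuxKit. The YAML file name, its contents, the output file names, and the destination directory are assumptions for illustration, not details given in the talk; check the linuxkit and LCOW documentation for the current syntax and paths.

    # Build a kernel + initrd pair from a hypothetical LinuxKit definition that lists
    # the kernel, a minimal init, the GCS and runc:
    linuxkit build -format kernel+initrd lcow.yml
    #   produces lcow-kernel and lcow-initrd.img (output names follow the .yml name)

    # Place them where the Windows host expects the LCOW boot files; the directory
    # below is an assumption:
    #   C:\Program Files\Linux Containers\kernel
    #   C:\Program Files\Linux Containers\initrd.img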
I kind of rushed through that because I wanted to get to I/O. So, great: we can boot this VM and we know what's running inside it, but what about the two tricky parts of containers, networking and storage? How do we get those into the container seamlessly, so you can use volumes and network options and all those things in a way that makes sense?

The good news for networking is that it's pretty straightforward. If you look at a standard Linux container, or a shared-kernel Windows container, it basically works by creating some kind of virtual network adapter on the host and assigning it into a network namespace. All the interesting parts of the networking are really about how that virtual adapter is hooked up to your host networks; whether you're using bridging or overlays or whatever, all the interesting stuff is happening on the host. Extending this to VMs is pretty straightforward: instead of a virtual NIC that pops up on the host, we use a standard virtualized NIC, over VMBus in our case, to expose that same network endpoint to the VM. Once it's there, the GCS can move that NIC into the network namespace for the appropriate container, which leaves the utility VM itself with no networking at all in the initial, root namespace. That's great, because you don't want this utility VM to appear on your network, you don't want it to be an additional attack point on your server, and the utility VM can't accidentally interfere with your container's networking or anything like that.

Of course, one side effect is that this only works in cases where you would have assigned a virtual network adapter to a namespace. If you want to share the host's network namespace, or share a namespace between containers, that's obviously not going to work; we don't have a way to share the TCP stack between a host and a VM, especially a Windows host and a Linux VM. That's probably another good use case for supporting multiple containers in a single VM, because at least then we could share the network namespace between them. So that's networking; we think that's about the best we can do, and it's the easier part.
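Conceptually, what the GCS does with the synthetic NIC inside the utility VM is the ordinary Linux namespace move. A minimal sketch follows; the interface name, namespace name, and address are illustrative, and the real GCS does this through its own code rather than the ip tool.

    ip netns add ctr0                                    # namespace that will belong to the container
    ip link set dev eth0 netns ctr0                      # move the synthetic vNIC out of the root namespace
    ip netns exec ctr0 ip addr add 172.17.0.2/16 dev eth0
    ip netns exec ctr0 ip link set dev eth0 up
    # The root namespace keeps only loopback, so the utility VM itself never
    # shows up on the container network.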
Storage is much trickier; there are a lot more questions about the right way to implement it, and the reason is that storage in VMs presents a few challenges. With standard, shared-kernel containers, if you want storage to be available to a container you essentially just need to get it onto the host, get the filesystem mounted on the host, and then use bind mounts or whatever to attach that storage into the container. With a VM you can do that too: you can use something like SMB or NFS or Plan 9, some kind of network filesystem, to expose any filesystem that's on the host into the VM, and thus into the container. But this file-based approach has downsides. Every filesystem operation has to go to the host, which means extra latency: with the shared-kernel approach every stat, every open, whatever, is just a syscall away, but if we have to cross the VM boundary it's a full VM exit, a very expensive transition, much more expensive than a syscall. Also, the network filesystems in Linux and Windows weren't really designed for this low-latency goal; they expect network latency, not syscall latency, so even if the physics of going across to the host weren't what they are, I think there would still be challenges. Then you have to start playing games: do I really need to go to the host for every operation? Can I cache some things in the guest? And suddenly you start to lose some of the flexibility of having the files available on the host anyway, because if you cache things in the guest, and they get modified in the guest, and then you try to read them from the host, that cached data may not be visible, and vice versa. So there are a lot of challenges around that. At that point you might say: why am I exposing host files to the guest at all? Can I just expose the raw storage primitives, hand a block device to the guest, and use the guest's filesystem to interpret it? That's great: you get better performance and you get compatibility, but now you definitely can't share files between the host and the guest at the same time, because the guest has complete control over that block device.

So those are the trade-offs. Are you okay with the extra latency? Are you okay with the coherency issues? Are you okay with the somewhat odd configuration of exposing a block device? It really depends, and what we've tried to do for this first version is come up with a sane set of compromises and see how far we can get. This is definitely an area where I'm looking for feedback: what works well, where should we give some extra options, where should we just spend time optimizing, to make the storage solution sufficient for all the different use cases out there.

I've been talking about storage generally, but we can really split it into two parts. The first is the root filesystem for the container itself. In Docker on Linux there are a bunch of different graph drivers, a bunch of options for implementing this; we've chosen just one strategy for the first release, and it's based on the observation that, at least for the root filesystem, you don't care too much about the host and the guest being able to access files concurrently. If you want to get files out of a container you can use docker cp, and we can enlighten that to call into the VM and pull the files out; we don't have to be able to access the files directly on the host. So what we came up with is what I think of as a block/file hybrid. We keep the layer concept of something like AUFS or overlayfs, but we expose these layers to the VM each as a separate persistent-memory device with an ext4 partition on it. For the read-only layers we squeeze all the free space out, so each one is minimally sized, and, like I said, we expose it not as a traditional block device but as a persistent-memory device, a pmem device. What that means is that when the guest wants to read from the device, instead of issuing some kind of RPC call to the host, or even a device read, it just accesses physical addresses: the host file is basically memory-mapped into the physical address space of the guest. The cool thing is that the Linux kernel has added support for direct access, DAX mode, for ext4 partitions, so when you memory-map a file from that device into a process it doesn't go through the page cache and doesn't do any additional caching in the VM; it just maps those physical pages directly into the process. If you're doing a read it doesn't go through the guest's cache, and if you're doing a map there's no additional caching, so we're essentially using the host's file cache for all of these files. The nice thing is that we do get memory sharing between multiple containers: even though we're using a separate VM for each container, if you use the same image in multiple containers we can share that cache and reduce the overall memory footprint. Then, once we have all these layers, we take the same approach Docker defaults to: use the overlay filesystem and union them together inside the VM. So we think that works pretty well.
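Put together, the mounts the GCS ends up making inside the utility VM look roughly like the sketch below. Device names, mount points, and the number of layers are illustrative.

    # Each read-only image layer arrives as an ext4 filesystem on its own pmem device,
    # mounted with the DAX option so reads bypass the guest page cache:
    mount -o ro,dax /dev/pmem0 /layers/0
    mount -o ro,dax /dev/pmem1 /layers/1
    #   ...one device per layer...

    # The layers plus a writable scratch area are then unioned with overlayfs:
    mount -t overlay overlay \
        -o lowerdir=/layers/1:/layers/0,upperdir=/scratch/upper,workdir=/scratch/work \
        /rootfs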
What works slightly less well, but I think still works, is volumes and bind mounts. Here we basically said: by default we can't take a block approach, and we can't take an approach that loses coherency with the host, because you really need the files to be immediately available on both the host and in the container. So we have to take a file-based approach, and one that doesn't do a lot of extra caching. What we did is use the Plan 9 filesystem, a network filesystem that's been available in the Linux kernel for a while; actually it's a variant of Plan 9 at this point, designed specifically for the Linux VFS model, with some protocol additions. We've implemented it over a VM network transport: in Linux this is exposed as the vsock address family, and on the host side we have a different address family for the same thing, called AF_HYPERV, but it's essentially the same. It gives us a stream socket to communicate between the guest and the host, and we've implemented this filesystem over it.

Then we had the problem of how to map Linux filesystem semantics onto the Windows filesystem, because by default, if I've got a volume or a bind mount, I'm taking some piece of an NTFS filesystem and mapping it into the VM. The cool thing is that we've already solved this problem once, in the Windows Subsystem for Linux: we have the same problem there, where you want to access your Windows files from Linux tools. So we've shared code here; we use the same mechanisms we use in WSL to decide what Linux metadata to expose and how to respond to Linux operations. If you've used WSL a lot, you know this isn't perfect yet, and it's not perfect in LCOW either, but as WSL gets better, LCOW will get better, and vice versa. We're pretty proud that we could bring these two things together in that way, even though the architectures are otherwise pretty different.

Like I mentioned (well, actually it's hard-coded right now), we disable any kind of caching in the Linux guest, especially around metadata, so essentially every filesystem operation has to go to the host. That gives you correct behavior all the time, and I think what Docker found with Docker for Mac, where they've had a similar problem, is that correct behavior is probably what you want most of the time, but it can be expensive. So in Docker for Mac they've implemented options that let you weaken that coherency when you want better performance and you know the host isn't going to be modifying the files at the same time. We haven't exposed that yet, but architecturally it seems pretty straightforward, so it's definitely something we'll be thinking about, probably early on, to address some of the performance issues. We have other ideas as well, so we're going to keep thinking about this; watch this space for more.
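For reference, the kind of knob being alluded to is what Docker for Mac already exposes on bind mounts; this is Docker for Mac syntax, not something LCOW offers today, and the image and command here are purely illustrative.

    docker run -v "$(pwd)":/app:cached    node:8 npm install   # the container's view may lag the host
    docker run -v "$(pwd)":/app:delegated node:8 npm install   # the host's view may lag the container
    # The default, consistent, corresponds to LCOW's current hard-coded behavior:
    # fully coherent, at the cost of a host round trip per operation.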
I'm going to end with a demo of the storage bits. Let's go back to our nyancat guy here; it's not working... there we go. And let's get back into the utility VM, because I just want to show the mount table for this thing. nyancat is pretty sophisticated: it requires seven layers, and you can see them all here. Each of these is one of those pmem devices I mentioned; you can see it's mounted read-only, which makes sense, and here's that DAX option that tells ext4: don't use the page cache, just map these pages directly through. I did leave out one detail, which is that we still have a SCSI device here. The writable portion, the writable layer for our container's root filesystem, is currently a classic block device; we haven't quite figured out how to get that working with a pmem device, but it isn't as important anyway. You don't really need DAX, and you don't really need to avoid the page cache, for the writable portion, because that's not shared between containers. And then here's the overlay mount, where we've stitched all these layers together into a single view. So that's great.

Now let's show the volume stuff real quick; we'll do this live. Let's create a Linux container with a volume called foo; I'm just going to run it in the background here. And, I think this is kind of neat, let's create another container, but this time a Windows container, and map in the same volume. What we're going to be able to do, using docker exec, is access the thing concurrently from both Linux and Windows. So we write a file from the Linux side (here we've exec'd a command in the Linux container to write into that volume), and now we can come back to our Windows container and actually read that file. Pretty cool: volumes in both Windows and Linux at the same time.
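The volume demo corresponds roughly to a session like this; the container names, file name, keep-alive commands, and Windows paths are illustrative.

    # A Linux container and a Windows container sharing the same named volume:
    PS> docker run -d --name lin -v foo:/data   alpine sleep 1d
    PS> docker run -d --name win -v foo:C:\data microsoft/nanoserver ping -t localhost

    # Write from the Linux side...
    PS> docker exec lin sh -c "echo hello from linux > /data/hi.txt"

    # ...and read it back from the Windows side, concurrently:
    PS> docker exec win cmd /c type C:\data\hi.txt
    hello from linux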
So with that, we can take some questions. I've got some more information here, and I'll be tweeting out some of these links so you don't have to scramble to write them all down right now. You can follow me on Twitter at @gigastarks. We'd love to hear your feedback, and we'd love for you to contribute as well. Thanks very much. [Applause]

Thank you very much. If you have questions, we have microphones in the area, so please come to a microphone or raise your hand and I'll come to you. One question about Docker's swarm mode: is it also possible to run Linux and Windows containers together when I create a service? To my understanding, not yet. There are some additional API changes needed in Docker to expose this multi-platform mode, and I think we'll want to make sure those are completely merged before we tackle the swarm issues, but that's definitely something it would be very interesting to be able to do.

Thanks for the talk; I just have a quick question: can I try this on my Windows 10 laptop with its latest updates? Sorry, the question was whether you can try this with the latest Windows 10. Yes. On this laptop here I'm running the latest Fall Creators Update of Windows 10, which was released yesterday, so if you have the right version of the Docker daemon and the client, and you have a Linux kernel and an initrd, if you get all the pieces together, then absolutely, the platform support is there. I ask because, as I understood it, this was supported on the new Windows Server, version 1709, and I didn't know about Windows 10. Thank you.

One more question, about multi-stage builds: can I mix Windows and Linux within the stages, cross-compiling Go binaries on Linux and then creating a small Nano Server image? Not yet; I mean, it's not a bad idea at all. John Howard has been working hard on the Docker support for this, and he came to me a few days before I left for the conference and said, don't you want to have multi-stage builds working for the conference? It's definitely on our list; it's just a few steps down.

I have a question myself: you mentioned multi-container support in one of those utility VMs. Are you thinking about pods? Yes, that's certainly one of the key motivations for supporting the multi-container VMs, because even if you didn't have anything else in your pod, just one container, you would still need that pause container. So that's something we're definitely thinking about. Okay, thank you very much, have a good lunch, and see you soon. [Applause]
Info
Channel: Docker
Views: 14,845
Keywords: Edge
Id: JZtQnYaO874
Length: 40min 21sec (2421 seconds)
Published: Fri Nov 03 2017