Cgroups, namespaces, and beyond: what are containers made from?

Hi everybody, thanks for coming to this session. When I started learning about Docker two years ago and started digging into it, some of the best resources I found were Jérôme's presentations at the time, to learn about cgroups and namespaces and how you play with them. So today we have another in-depth session with Jérôme Petazzoni from Docker, about namespaces, cgroups, and what containers are made from.

Hola Barcelona, ¿cómo estás? That's all I know in Spanish, so I will have to continue in English. But I make a point, when I go someplace, of starting to speak a little bit in the local language, even if it's just to say: hello, I'm Jérôme, and I don't speak your language, I'm sorry. I did that in Russia when Andre, who you saw just before, invited me there, so I had to learn a few words of Russian; that was hard. So I'm here to talk about containers and what they are made from: namespaces, cgroups, a little bit of copy-on-write.

So, a short intro for those who don't know me yet. I'm Jérôme, I work for Docker, I'm based at the Docker HQ in San Francisco, and I was lucky enough to be with Docker before it was Docker. So I have like five, six years of experience with a project that is only two and a half years old. That's pretty convenient: when I have recruiters calling me, hey, we're looking for someone with five years of Docker experience, yeah, I can do that. I was part of the team that built dotCloud, the PaaS that we had and that eventually became Docker. And also, I'm a member of the House of Bash. We have a few members; Jessie is also a member of the House of Bash, even if she doesn't know about it. We put things in containers and we replace people with tiny shell scripts; that's what we do.

So, the outline for today. First I will have a really quick thing about what a container is; I guess that most of you know, but I just want to make sure we are on the same page. Then I will talk about those container building blocks: namespaces, cgroups, copy-on-write. Then I will talk
about different container runtimes, some that are based on cgroups and namespaces and others that are not based on those things. And finally we will do a little bit of crazy demos, because no talk would be complete without some crazy demos.

So first, what is a container? There's this high-level approach where we say, well, a container is a little bit like a lightweight virtual machine. And then we also say, well, but a container is not a lightweight virtual machine, stop thinking that, because that puts you in the wrong mindset. But when you really don't know, it gives you a little idea of what to expect. It feels like a VM: you could get a shell into it, you could SSH into it (but don't), you have your own process space, so when you do ps, top, htop, you only see your own processes. You do ifconfig or ip addr, you only see your local network interfaces. So it really feels like a VM: you can install packages in it, you can run services. Great.

But at the same time it's also more like chroot on steroids, because it's not a VM: it's just a bunch of normal processes running on a normal kernel. If you are on a machine that has Docker or another container runtime installed and you do ps, you will see all the processes inside all the containers. So it's more transparent than a VM. You can't do stuff like having a different kernel for your container, or having a different OS, because there is only one kernel, and then you put little walls between the processes. Each process is living in its nice little world, where it can only see its own environment and not the rest of the machine.

So how is it implemented? Almost five years ago, a little bit after joining dotCloud, I was starting to debug some problems and I wanted to understand how containers worked. So I took the best tool I had back then, which means grep, and I started to look in the kernel source code: OK, where is LXC? It's nowhere. There is not a single reference to LXC in the whole Linux kernel. It's like, what? So you look for
containers, and then you have tons of things, but those containers are not the containers that you're looking for. Those containers are like ACPI containers; I have no idea what that is, but after a few hours looking at it, it's not those containers. Or you have containers like lists, so it's a kind of container like in data structures, but not the container we're looking for either. So at some point I'm like, OK, do containers really exist? And after digging a little bit more, I realized that I was looking in the wrong place. Containers are not in the kernel. What is in the kernel, however, is those famous cgroups and namespaces.

So let's start with control groups. Control groups let you implement metering and limiting on the resources used by processes. So you can count the memory and limit it, and do the same with CPU and with I/O, either block I/O or network I/O. You can also set some ACL kind of permission management on device nodes, and you can also do what I call crowd control; I will explain what it is in a few slides.

So, some big generalities about cgroups. With cgroups, each subsystem (CPU, memory, and so on) has its own hierarchy, which looks like a tree with nodes, and each process belongs to one node in each hierarchy. So a given process will be in one node for the CPU thing, in one node for the memory thing, in one node for the block I/O thing, and so on. In the beginning, when the machine boots, you only have one node, so it's a tree of one node, and the first process is in that first node. So it looks like there is nothing special. Which, by the way, means that even if you're not using containers on your machine, you still are in a container: your whole machine is a container, with no limits and everything, but it's still a container. So if you think you can go faster by not being in containers, nope: even when you're not in a container, you are in a container. That's how it works.

All right, this is a little example with just like
two hierarchies, CPU and memory. So you can have some things: for instance, in CPU we said we're going to have real-time jobs and then batch jobs, and for memory we'll just have one subcategory for databases. Then in each group we can have more subgroups, and the numbers are PIDs. So that's a picture example.

Now let's talk about those different cgroups. First, the memory cgroup. As I said, we can do accounting: we can count how much memory is used by each process, or rather by each group of processes. We can keep track of every single memory page used by your group of processes. The granularity is a memory page, so it's not down to one byte at a time; it's the memory page, four kilobytes on most architectures.

Those pages are sorted between different kinds: there is file and anonymous. File pages are ones that you can track down to one specific location on disk: if there is something on disk that corresponds to that page, because basically that memory page was loaded from disk, then it's counted as a file page. Why is this important? Because if at some point you're in a rush and you need some memory quick, you can remove that page, because you know it's still on disk, so you don't have to swap it out. Then you have anonymous memory (which is not related to the Tor project that Jessie talked about this morning), which is the memory that does not correspond to something on disk. When you do a malloc, to simplify things, it will come from anonymous memory. That is a little bit more annoying when you want to reclaim that memory, because you have to swap it out first.

All right, and then the kernel will put those things into two pools: there is active memory and inactive memory. It's completely arbitrary. It's not like, oh, this is an active page, this is an inactive page. It's more like: by default we put everything into active, and then when we are getting a little bit short on memory, then we start to put
things into inactive, almost arbitrarily; but each time you touch a page, it goes back to active. So it's a very simple mechanism that will put the pages that are accessed often into the active set, and the ones that are not accessed often into the inactive set.

Each page is charged, so to speak, to a group. When multiple groups are using the same page, they don't exactly split the bill (that was changed some time ago), so basically there is only one group that pays for the page. That means if you look at the memory usage of all the groups, that page is accounted to that one group, and in the other groups the page is invisible. But if that group goes away, because the process is terminated or something like that, then the cost is moved to another group. So when you have pages shared between groups, it's a little bit tricky.

Now, you can set limits. Each group can have its own individual limits, or not; the limits are purely optional. And you have two kinds of limits, soft and hard. The soft limits are not strictly enforced; in fact, I will first explain the hard limits, because it's slightly easier. A hard limit means: if you go above your hard limit, the process gets killed. Maybe you've seen what happens when you're on a Linux machine and you're out of memory: the out-of-memory killer triggers, and it will just like randomly remove processes, and then people are like, oh, that's pretty bad, because suddenly my MySQL disappeared, or that process disappeared, because I was out of memory. So the hard limits for cgroups will do that, but at the level of the cgroup. That means that if one specific container goes out of memory, instead of randomly killing a process somewhere else, it kills a process in that container. Which, by the way, is why we say all the time: put one service per container, try not to put multiple services in the same container. That way you have a good granularity, and you avoid the scenario where this process went away because that process was out of memory.

Soft
limits: they are not enforced, so you can go above your soft limit just fine. So what's the point? Soft limits matter when the memory pressure starts to be really strong, when the machine is like, OK, I need memory, because soon I'll be out of memory and it will be really bad. Then it will look at the processes, or rather the cgroups, that are above their soft limit, and the more you are above your soft limit, the more likely you are to get pages taken from you by the kernel.

All right, you can also set those limits for different kinds of memory. You can set limits for the physical memory, so physical RAM, but also for total memory, so physical plus swap. And you can also set limits for kernel memory, so all the dentries and all the internal kernel structures. Because at some point, yeah, we had limits for resident size and swap and everything, but one process could still use the kernel, well, abuse the kernel, in ways that would use lots of memory, and that was bad. So now we can also set limits on that.

So I talked about the out-of-memory killer, and this improved a lot. Now you can set up an out-of-memory notification system, so that when a container (or memory cgroup, technically speaking) is out of memory, instead of randomly killing something inside that cgroup, we can say, OK, let's stop all the things. So we kind of freeze that cgroup, and then you have a notification, and a program can handle the notification and decide between killing the container, or maybe giving it more memory, or moving the container to another machine. That's the kind of thing we can do.

Some little details. Each time the kernel gives a page to a container or reclaims a page from a container, it has to update those counters, and that has a performance cost. This means that when this is enabled, you have a little performance hit, not exactly on malloc and free, but more on this action of moving pages between the free
pool and the used pool, which is not exactly the same thing. So again, you can think, ha ha, I will not use containers, because that way I don't have this overhead. No, no: this is a global thing, set at boot time, and even if you don't use containers, if this is enabled then this operation will happen and you will pay this overhead. This is one little unfortunate thing: it's not something you can set per cgroup, it's global on the whole machine. So you either boot with it or not, and then that's it; if you want to change it, you have to reboot.

Next cgroup: the HugeTLB cgroup. So who here knows about huge pages? Well, yeah, a bunch of people, great, so you know what this is about. This is a way to limit the amount of memory given through huge pages to processes, because by default a process can use all the huge pages that it wants. That way we can have multiple processes using huge pages, and not one single one eating everything.

Now the CPU cgroup. This lets us track CPU usage, but at the granularity of a whole cgroup, so that's kind of an improvement over just checking one single process. There are a lot of cgroup features like this, because when you want to track things, you want to track a group of processes or a group of threads, and there are operations that you can do on a single process easily, but if you want to do them on a group of processes it's harder and sometimes even impossible. So this gives you a super easy way to say, OK, I'm putting those processes here, and now I have a super easy way to track how much CPU they use on the machine.

You can set weights as well, but you can't set limits, which is often extremely annoying the first time. You're like, I would like to limit that group to 10 percent of CPU, but you can't, so you're like, why the F did they not implement that? The answer is: because it doesn't make sense. At first you're like, wait a minute, of course it makes sense: when I run top, I see that I have something
using ten percent of CPU, so I just don't want it to go above ten percent. No, it really doesn't make sense, trust me. Because if you are using only a small percentage of CPU and you have tons of CPU cycles available, a modern, decent machine will slow down the CPU, because save the planet and everything. So if you only use 10% of the CPU, the CPU slows down, and then to get the same amount of CPU you should use more, but then the CPU will speed up, and that's kind of a mess.

So you could say, OK, OK, maybe the percentage doesn't make sense, fine, let's count the number of cycles that a given group is using, or the number of instructions. That doesn't make sense either, because on most machines, even if under the hood they are RISC machines, outside it's CISC. So some instructions, like loading something in a register, will be super fast. Some instructions, like taking something at the address indicated by a register, multiplying it by the content of something at the address in another register, and storing that at the address pointed to by a third register, are broken down into like half a dozen instructions, and that will be much slower. So counting the number of instructions would not work either. So, well, you can't set CPU usage in percent.

Next thing, the cpuset cgroup. cpuset allows you to pin groups of processes to one CPU, or to a set of CPUs, and that lets you dedicate CPUs to specific tasks. That allows you, for instance, to avoid processes being constantly moved from one CPU to another. If you want, you can dedicate a CPU to one specific process, because it's super important that its latency is as good as possible. It's also great on NUMA systems, non-uniform memory architecture: that's when you have multiple CPUs, well, multiple sockets, and you have banks of memory that are each connected to a specific CPU. So when that CPU wants to talk to the memory up there, it's slower because
it has to go through the other CPUs. In that case it's super convenient to be able to say, OK, this process, or those processes, will stay there, on this CPU and the memory that goes with it. That's why, for instance, sometimes people were seeing stuff like: well, I'm running this huge MySQL database server, and if I only have 60 gigs of RAM it's super fast, but when I have 150 gigs it's slower; that doesn't make sense. That's because of that kind of NUMA consideration.

Block I/O. The blkio cgroup lets you measure and limit the amount of block I/O done by cgroups and containers. It will keep track of the I/O for each group, per block device. It will keep track of reads versus writes, and also synchronous versus asynchronous operations: synchronous being when I'm doing this syscall and I'm getting the result right away, and asynchronous being basically almost all writes. Because when you write, it goes to the page cache, and later the kernel is like, OK, we have to write this thing out, and then it's an asynchronous write.

Which, by the way, means that if you try to set a write limit and then you try to write, you're like: that doesn't work, I set 1 megabyte per second and I wrote 10 megabytes instantaneously, so that doesn't work. It did work: it wrote the 10 megs in memory immediately, but then flushing those 10 megs to the block device will be done at 1 meg per second, so that part will be slower. So this is absolutely great and works wonderfully either if you are actually running VMs in your containers, because then the block layer will typically do direct I/O, or if you are doing direct I/O yourself because you know what you're doing. Otherwise, setting write limits is a good way to have some surprises.

I/O is not only disk; there is also network. Here you have net_cls, for classifier, and net_prio, for priority. Here it's a little bit of a disappointment generally, because people think that they will do echo 1 megabyte per second to
this and get a 1 megabyte per second limit in their container. It doesn't work like that. What you can do with those two cgroups is tell the kernel to put a kind of tag, a kind of mark, on the traffic that comes from a specific container. And then you still have to use tc (traffic control), for instance, with queuing disciplines and things like that, so that the traffic that has been marked in specific ways will be shaped accordingly. So you still have to do some extra work.

Next thing, the devices cgroup. That one lets you control which container can read or write on which device. Thanks to this cgroup, a random container can't open your /dev/sda disk device and randomly read and write, which would be pretty bad. This is used to prevent a container from just screwing up everything on the machine. So generally what you want to do is allow the harmless devices, like /dev/tty, /dev/zero, /dev/null.

/dev/random is a little bit special, because you might know that on a Linux machine /dev/random is fed with entropy. Entropy happens when you have random stuff happening, like disk access or network access; there is some random element in the timing of that, so you gather some entropy. It means that if you're trying to generate strong random numbers, the machine will typically give you like 100 bytes of random things and then it will stop, and you will need to move the mouse, or, if this is a remote server, try to do some disk I/O or anything to refill that. So generally in a container you might want to have /dev/urandom instead of /dev/random. That's a known problem for people doing crypto in containers (and by crypto I mean not just encryption, but generating keys for instance): you can quickly deplete the entropy pool, and then you're like, well, that's weird, generating keys in containers takes a long time. No, it's not because you're in a container; it's because you end up with thousands of keys being generated and depleting your entropy pool.

There are some very
interesting things you can do with the devices cgroup. You can expose /dev/net/tun, which is the thing to make virtual network interfaces, often used by VPN stuff; so you can have a VPN client or server in a container super easily, without polluting the network stack of the host. /dev/fuse, so you can have custom filesystems in containers. /dev/kvm, so you can have VMs in containers, and then it's like the yo dawg, you like containers, so I put Docker in Docker in a VM in Docker, yay. And you can even expose GPUs, with /dev/dri and /dev/video, so you can do Bitcoin mining or any kind of GPU-intensive application in containers. NVIDIA recently released a bunch of containers to make that easy, by the way.

The freezer cgroup. That's what I call crowd control, because with the freezer cgroup you can say, OK, I have a container and I want to do the equivalent of a SIGSTOP on the whole container; I want to freeze that container. You might wonder, why do we need that? Can't we just send SIGSTOP to all the processes in the container? Well, we could, but that would be slightly different. A process cannot block SIGSTOP, but it can know that it has been stopped; and if you are ptracing a process, there will be some kind of interference between ptrace and SIGSTOP/SIGCONT. The freezer cgroup lets you work around that. That way you can freeze the whole container, then unfreeze it later. That's great to do job scheduling; that's also great to do process migration, because you can freeze the whole container, move it, and then unfreeze it, and everything is fine and there is no side effect at all.

Subtleties. The first process that you create on the machine, so basically init or systemd, is created at the top node of each cgroup hierarchy, and then when a new process is created, it's created in the same group as its parent process. So basically, if you do nothing special, all the processes are in
the same cgroup. But you can move processes around, and it's extremely easy. You do that through the pseudo-filesystem, which is typically mounted at /sys/fs/cgroup. If you want to create a cgroup, like if you want to limit the memory of something, you just create a directory, and then you echo the PID of the process that you want to limit to the tasks special file there. That's all you have to do.

There is something I call the cgroup wars, which is that some people think that this interface is not so great. Well, no, the real reason is that if you want to do stuff like, I want to reserve one CPU for an application, all the other users of that interface have to agree: OK, CPU number three will be dedicated to that super-fast process, and nothing else gets to use it. By using a higher-level interface, like saying, OK, everything will go through systemd or cgmanager, you can say this CPU is reserved, and you can make sure that nobody else will be using it.

Next big building block: namespaces. Cgroups were about limiting what you can use, so in quantity; namespaces are about limiting what you can view, so it's more in quality. Yeah, quality versus quantity. There are multiple namespaces here as well: PID, net, mnt, UTS, IPC, user. And again, each process is in exactly one namespace of each kind. So a given process will be in that PID namespace, and that other net namespace, and this other mnt namespace, and so on and so forth. Let's review them.

The PID namespace is the thing that will let a given process see only the other processes in its own PID namespace. When you are in a PID namespace, there is a local PID 1, which is different from the PID 1 of the machine, obviously. But remember what I said earlier: when you are on the machine, you can see the processes inside the containers. Which means that when you are in the container you can see PID 1, and that PID 1 in the container is some other PID outside the container. So you end up with a process that actually has multiple
PIDs, depending on the level that you are in. And if you do containers in containers in containers in containers, at each level you have a PID for this level, so that can be a little bit confusing.

The network namespace is the thing that lets each container have its own network resources: its own localhost, its own eth0, its own routing tables, its own iptables, its own IPVS things (because IPVS has been network-namespace aware for like five-plus years), its own sockets and everything. Something really nice is that you can move a network interface from one network namespace to another. So you can create a network interface somewhere, and then you can move it to another container. You can have a container that sets up some VPN thing, and then move the VPN interface to another container if you want. The typical use case for containers is to use veth virtual interfaces: that's two virtual interfaces that are just like connected with a crossover cable between them. You put one of those interfaces in the container, and the other stays on the machine, on the bridge. So you have a virtual switch inside your Docker machine or container host, and all the containers are connected to that switch.

The mnt namespace: that's basically the thing that lets each single container mount something without having that something visible in the other containers. There are a few nice examples of that. Like if you want each user on the machine to have their own /tmp. Normally /tmp is global for the whole machine, and it's a big security risk, because when you create a temporary file, if someone is really smart they could create it just before you, and bad things could happen. If you're using mount namespaces per user, each user has their own /tmp, and you reduce the security risk a lot.

The UTS namespace: that's just the thing that lets a container have its own hostname. That's pretty simple.
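
The idea that each process sits in exactly one namespace of each kind can be observed from userland. What follows is only a minimal sketch, assuming a Linux host with /proc mounted: it reads the symbolic links under /proc/<pid>/ns, whose targets look like uts:[4026531838]. The helper names parse_ns_link and list_namespaces are made up for this illustration and are not part of any tool mentioned in the talk. Two processes that show the same identifier for a given kind share that namespace.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "uts:[4026531838]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<ident>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, identifier)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % (target,))
    return match.group("kind"), int(match.group("ident"))


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {entry name: namespace identifier} for one process.

    Returns an empty dict when the directory is missing or unreadable
    (for example on a non-Linux system, or for another user's process).
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in names:
        try:
            target = os.readlink(os.path.join(ns_dir, name))
            _kind, ident = parse_ns_link(target)
        except (OSError, ValueError):
            # Skip entries we cannot read or do not recognize.
            continue
        result[name] = ident
    return result


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(name, ident)
```

Running this once on the host and once inside a container should show different identifiers for every namespace that the container does not share with the host.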
The IPC namespace. All right, who knows about IPC here? OK, still a few hands. Who cares about IPC? Oh, still a few people. OK, so I have to talk about it. That's the thing that lets each container have its own IPC resources: semaphores, shared memory, message queues. In the beginning that was not namespaced, so it means that one process (typically Postgres, which up to 9-point-something used IPC resources) could, without the IPC namespace, have one Postgres server colliding with resources from another. So now that's safe.

The user namespace: that's super interesting. It allows you to map UIDs, so basically you can be UID 0 in your container (that's great, I'm root, I can do everything), but outside the container you're really user 1234, and so you can't do anything. As long as you are inside the container, it looks like you can do everything, and so everything kind of works. That is really great for security improvements, but it's more in the line of usable security. Remember what we said this morning during the keynote: awesome security, OK, that's great, but if it's not usable, people will just work around it. They will put passwords on post-it notes, because you asked them to use 15-character passwords, and nobody can remember a 15-character password, except maybe Rain Man, and even then I'm not sure. So user namespaces are a lot about usable security: you can be root inside the container, so everybody is like, ah, you're root in the container, that's terrible; but outside you're not root, and that's just fine. So it's easy to use, because you don't have to deal with new UIDs and remapping and everything, but outside it's still safe.

User namespaces: there are two ways to see that. I saw a security presentation a while ago, and somebody said, well, those Docker people really suck: it's been almost one and a half years since they said that user namespaces would provide good security, and they still don't have user namespaces. I'm going to give you
another version, which is that one and a half years ago we knew that user namespaces would be a great security feature, but it just took a while for user namespaces to be really usable. As we started to get user namespaces into Docker, people realized there were tons of security problems with user namespaces, so those problems had to be fixed first. User namespaces just landed in Docker experimental. Please, if you are among those people who were completely worried when they saw that you had root in containers and everything: test Docker experimental, file bug reports, talk to us, tell us what works and what doesn't. We need you to make sure that it lands in stable as soon as possible. If you don't test it, it will take longer than if you do.

Namespace manipulation: how do we deal with namespaces? Basically, you create namespaces when you create a new process. When you create a new process, you give extra flags to say, I want this new process to be in new namespaces, and then the process has its own things. You can view the namespaces of a process by looking in /proc/<pid>/ns; in this directory you have pseudo-files corresponding to the namespaces. Normally, when the last process of a given namespace goes away, the namespace goes away too. But if you want, you can do bind mounts to retain a namespace. Like you say, OK, I have this network namespace, I'm setting up routes and everything, but I want to reuse it later. Then you can do a bind mount, and even when the container has gone away and the kernel has freed up all the memory and everything, the kernel retains a reference to that namespace, and you can use it later.

Right, the last building block is copy-on-write, and here I'm going to be really quick, because copy-on-write could be a full 45-minute talk, and I can't do a 45-minute talk within a 45-minute talk, except if we do Inception, but I don't have the dream machines and everything, so we'll skip that. Copy-on-write is super
important, and it took me a while to realize that. If you just take cgroups and namespaces, you could say, OK, that's done, we have containers, we can go home. Nope, because one of the things that make containers really great is the fact that you can do docker run something and, boom, you have your container almost instantly, and this is thanks to copy-on-write. You can do docker commit blah blah blah, or docker build, and the build process can be super fast (except if you're using device mapper, but then maybe... well, anyway). So copy-on-write was really essential for the adoption of Docker, and containers generally speaking, and that's why when you think about the building blocks of containers, copy-on-write should be on your mind as well. There are tons of options available: AUFS, overlay, btrfs, ZFS, device mapper, and so on. If you want to know more about that, just look for "Deep dive into Docker storage drivers"; we have a full talk on this topic.

Other details. Something pretty important is orthogonality, which is that all those features can be used independently. So if for some reason you're like, I hate containers, I don't want to use them: well, first, I'm going to warn you, the next five years in computing are going to suck for you. But that being said, if you say, I don't want to use containers, but I just want this memory cgroup thing, or this network namespace thing, that's just fine. If you want, you can cherry-pick one of those features and use it super easily: you just write a few lines of shell, or your favorite language, and you can use those features without having everything else coming with it.

Some missing bits, so some things that I did not talk about but that are really important. Capabilities: capabilities are this mechanism to break down root into multiple things. By default on UNIX, either you're root and you can do everything you want, or you're not root and you can do nothing. With capabilities you say, well, some things are more
important than others. Like, the ability to load a kernel module, that's huge; the ability to bind to a network port below 1024, maybe not that critical. So with capabilities we can break that down into multiple permission bits. And here again the idea is: okay, you are root in the container, but we stripped all the capabilities, so you can't do any of the evil things that root could do if it were so inclined. But then you can really do some things like, hey, this container is going to have a VPN, so we are going to give it CAP_NET_ADMIN and stuff like that, so that it can configure the network interfaces, for instance.

On the topic of security: SELinux and AppArmor. Well, if you want containers that actually contain, you need to use that, and that would be a whole talk on the topic, and there will be a talk on security a little bit later, I think. Something really nice: until recently, if you wanted to do custom security with containers and Docker, it was mostly SELinux, until Jessie wrote something called Bane, which lets you easily write custom AppArmor profiles. That's great, especially if your servers are on Ubuntu and you're like, well, I could use SELinux, but that's going to be really tricky. Now you can use that to generate profiles for your containers if you want some really fine-grained permissions.

All right, container runtimes. Here I want to talk not only about Docker. By the way, everything I said up to now applies to a bunch of container runtimes, not only Docker. There are container runtimes that are based on cgroups and namespaces, including Docker, and then some others that are completely different. So, those based on cgroups and namespaces. LXC: it's one of the first ones. It started as a bunch of userland tools. Like I said, LXC is not in the kernel; LXC is a set of userland tools that leverage cgroups and namespaces. So, the early versions of LXC: that was great, because at least it
existed, so we could use it. You could create containers, it was super flexible, but it had no built-in support for copy-on-write, no easy way to move images around, like the equivalent of docker push and docker pull, and you still needed to really understand how things work and write container profiles and everything. Which was great for sysadmins but terrible for developers, which is why a lot of sysadmins initially were like, do we really need Docker? Can't we just write 50 lines of configuration for each container? The answer is in the question, I think.

Next one: systemd-nspawn. That's something belonging to systemd. The man page of systemd-nspawn says it's for debugging, testing, and building, a little bit like chroot but more powerful. It implements the Container Interface. I don't know what the Container Interface is, but anyway, systemd seems to be the only thing implementing it. So it positions itself as plumbing, but it's kind of weird. Recently, well, not that recently, they added support for Docker images, but apparently the systemd developers are so afraid of Docker they think it's like Beetlejuice: if you say Docker three times in the code base, it might trigger, boom, and rewrite itself in Go. Because this was in the patch: they didn't want to put "docker" in the code, so they put "dkr" instead. That's for real. So if anybody knows why, I really would like to know, because that really provokes doubt on, like, their mental sanity or something.

Anyway, the Docker Engine. We know what it is: a big daemon controlled by a REST API, tons of features, you can build, move images, and everything. Those of you who started with Docker a while ago know that the first versions of Docker used to shell out to LXC, and then eventually we wrote libcontainer to be able to run without LXC. Some people said, okay, Docker does way too many things, we want something smaller that we can kind of put into our own system, and that's how systems like rkt
and runC appeared. It's kind of, let's get back to the basics. So the idea here, if you take runC: it's the Docker Engine, but you remove the API, you remove the build system, you remove a bunch of other things, until you only have the thing to run the container. runC is using libcontainer, the same library as the Docker Engine. It just takes a bunch of files in the local directory and it starts your container, and it has some very unique features that Docker doesn't have yet, like live migration. rkt has the same idea but is built on top of a different specification. But the key idea is: okay, I just want something to start my container, and everything else, nope.

So if you're like, okay, which one is the best? Well, obviously it's Docker. But joking aside, if you are thinking, okay, I have this workload where performance is really important, should I use this or that? Well, under the hood they all use the same kernel features, so there will be no difference at all. So you shouldn't think about performance, but more about usability, what you need, and how it will integrate with the rest of your system.

Now, there are also runtimes that are not based on namespaces and cgroups. OpenVZ: that's also on Linux. It's older, it's super robust, and the security story is solid, because if you use Travis CI, you get root inside of OpenVZ containers and nobody ever managed to break out, so it should be pretty solid. It has tons of really cool features, like ploop, which is like device mapper but in a good way, and checkpoint/restore, so that you can do live migration of containers, and so on and so on. And even though it's old, it's still actively maintained, and there are features that are slowly trickling from OpenVZ to the other container runtimes.

Now, outside of the Linux world: jails and zones, on FreeBSD and Solaris. The key thing with OpenVZ, with jails, with zones, is that the first concern was, okay, we want containers to securely run processes, so security was there
from the beginning. However, orthogonality, this thing of, I just want this feature, I just want this namespace: that was never a concern for those systems. Those systems were built by or for people doing hosting. So if you are a hosting provider, that's great (there is some echo). If you are not a hosting provider, if you are just a developer looking for a container runtime, it's not so great, because it will be less flexible. For instance, the proof is that if you're looking for the equivalent of docker run -ti ubuntu, if you're using jails, zones, or OpenVZ, that's going to be way more complicated and way more lines of code.

Building our own containers. So that's the demo part. First, a warning: don't do this at home, you're likely to burn the carpet and make things happen. This is not for production use, but we will actually build containers. So, okay, that's my shell. Good? Right. All right, so I have an empty Btrfs volume here, and we are going to make containers with that. First I'm going to make sure that my mount points are private, because otherwise, when I'm going to do mounts in my containers, they will bleed out to the host system, and that would be pretty bad. Right. Now I'm going to create directories for images and containers, and now I'm going to create an Alpine image: btrfs subvolume create. That's there. And now I'm going to use Docker here, just to get that base image, just to get a plain tarball of that image, just because I'm lazy. So, docker run -d alpine true, and then I have a container here that has the Alpine base image, and I'm going to use docker export, which will give me a tarball of that container, and I'm going to unpack this inside the Alpine image. All right, so now in images/alpine I have a tiny container image that I can use. But before using it, I will make a snapshot using Btrfs. So I'm not going to actually do anything in that image, but I will create a snapshot, so, very fast: btrfs subvolume snapshot images/alpine containers/tupperware. Right, so now I
have the tupperware container. Best container ever. And just to kind of keep track of where we are, I will create a file here, like, "this is tupperware", and now if I look in the tupperware image, yep, I have this file here that will let us know that we are in the container indeed. So I will do a little chroot. All right, so yep, I'm in this Alpine thing. I could do apk if I wanted. That's great.

All right, so now I'm going to use namespaces. So I do unshare --mount --uts --ipc --net --pid --fork, so that's like, okay, give me all the namespaces except the user namespace. And okay, it looks like nothing happened, but really it did: if I do hostname tupperware and then exec bash, yep, I'm in my tupperware container. Everybody can see from the back? Yep? Okay, cool.

All right, so now I'm in the beginning of a container. It's not exactly a container yet, we just have namespaces. If I do ps, I should only see the processes inside my container. And I see that, but look, there is something weird: the PIDs are not PID 1. What's going on? That's expected, actually. That's because /proc is still the /proc of the system, so I still have the view of the system. And if I want, I could do something like, let's see, look at this PID of unshare. Okay, if I try to kill this process, it tells me "no such process", because I'm in the namespace and that PID doesn't exist in this namespace. But if I mount proc and then I do ps, now I only see the processes inside. Okay, good. So let's remove this /proc thing.

So now I want to get into the file system of my container. I'm going to go to /btrfs/containers/tupperware, and I'm going to use pivot_root. So basically, you read the man page of pivot_root and it tells you, okay, you have to create an oldroot directory, and then you do pivot_root from the thing that will be the new root to oldroot, and it should work. And then it doesn't. And then you read the man page again and again, and you're like, what is going on? And then you learn, okay, there is some completely
undocumented thing, which is that you should be almost at the top of the file system hierarchy, otherwise it doesn't work. Okay, fine. So we will use a bind mount to make that happen. I'm going to bind mount /btrfs/containers/tupperware, to transform that into a mount, if I don't typo, and then I will mount that almost at the top of the hierarchy. By the way, those kinds of little weird quirks explain why stuff like LXC, Docker, and everything is not easy: it's because there are tons of little tiny things like that all over the place.

All right, now I can go to /btrfs, so that's my tupperware container. Fine. And now I can do pivot_root . oldroot. Okay. So, right, let's finish the job. I'm in my tupperware container; let's mount proc, and now if I do ps, yep, all is good. And now if I do mount, I still have tons of mounts from the whole system, so we have to get rid of that. So I'm going to do umount -a. Okay, but it also removes /proc, so let's remount proc again. Right, and I still have oldroot; I have to get rid of it. So I can try umount, but it is busy, so here again I have to use a magic flag to say, yeah, let's do a lazy unmount. Right. Now, okay, we have something that looks like a container. That's great.

And we could do something like, you know, install something in the container. So let's see if we have network. Well, obviously not: we are in the container, with a new network namespace, so we don't have network. So we need to go to the host. Okay, and on the host I'm going to find the PID of my container, and I'm going to create a pair of virtual network interfaces. The container PID is 6902. Okay, so I'm going to do: ip link add name h6902 type veth peer name c6902. So now I have a pair of interfaces, and I will move one of them to the container. Just to see what's happening here: if I do ifconfig, I just have lo, and it's down. Now I do ip link set c$CPID netns $CPID. Right, typos all over the way, and
now if I do this again, okay, now my interface shows up in the container. See how easy it was? That's super convenient, really. Then I'm going to take the other interface, the one that is still on my machine, and I'm going to put it in the Docker bridge. So here I didn't go all the way through creating the bridge and so on and so forth, I'm just reusing it. Okay, so what did I forget here? ip link set... yeah. All right.

And now I'm going to go to the container, and in the container I will set up the network: ip link set lo up, ip link set... so c6902 will become eth0. All right, and I will give it an IP address; I'm just picking a random address in the Docker range. Okay, good. And then I'm adding a default route. All right, and now if I do this, that should have worked. So, I'm wired... no, I'm not wired, I'm on the Wi-Fi network, so I might be tempted to blame the Wi-Fi, but that would be really crappy. So let's just make sure that it's not another IP address somewhere. Can I at least ping my... yeah, I probably forgot something. So, are my interfaces up? Here and here, yep, I have my interface and it's up, so that's good. What's the address of my Docker interface? That's... yep, right, I can ping my Docker machine, but I can't go out. So let's check. Yeah, so that should have worked. If it doesn't, let me double-check my address: 172.17.42.3. Yeah, I might need an extra iptables rule or something.

I'm almost out of time, so the last step I want to show here, and that's one of the important parts, is this: if you know the Alpine image, you know that there is no bash in it, there is just the basic sh. But here in my container I'm still running bash. And that's one of the really important bits when you are setting up your own containers: the last step, the very last step before really handing off control to the container, is to do this chroot / sh. And when I do that, probably with exec, then I'm really in the container. Before doing that, I was still running bash, which is not
installed in the container. That's really important, and there is a kind of complicated handoff that has to happen between your container runtime and your container, because all those operations, like the mount stuff I was doing here, ip this, ip that, and so on and so forth, are done by the container runtime, and so they don't need cooperation from the things inside the container. So that's one of the really important bits, and that's also one of the complicated things that Docker and other runtimes are doing.

I won't go all the way to show cgroups and everything, because we are out of time, but just to give you a list of the things that we haven't touched in the demo: cgroups; devices, because this container could access /dev/sda and corrupt the whole disk if it wanted to; capabilities; SELinux, AppArmor; and a bunch of other little things, including automating all that stuff.

So what I wanted to show here is that, yeah, if we want, just with bash and maybe 20 lines of scripting, we can do our own containers. But just because we can do it doesn't mean that we should do it. And if you think, we'll do our own container runtime, I strongly advise you against it. Andrey, who was speaking just before me, the first time we met, told me: okay, I've seen the Docker project early, it was like 0.5, more than two years ago, and the Yandex team was about to build their own container runtime, and they came to the conclusion that it would be less work to take this Docker project and just use it for their needs rather than reimplement that from scratch. So if the Google of Russia prefers to use a container engine developed by a team of 10 engineers on the other side of the world, maybe you can use it too, especially after two years.

That's all I got. Then we can maybe answer a couple of questions before the next session. Thanks.

Hi, a quick question. Maybe you can tell a couple of sentences about the new seccomp integration, and whether it's going to be possible to put your own
seccomp profiles?

So, what about the seccomp integration and its extensibility? Jessie has an open pull request for that. I haven't seen it yet, because I know it's super recent, but I recommend going to one of her talks and asking her directly; she will be able to tell you exactly.

If you have a question, wave your arm so we can see you. Otherwise it's... no? Going once, going twice. Thank you.
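
The talk mentions that the namespaces of a process show up as pseudo-files under /proc/<PID>/ns. Here is a minimal sketch of reading them, assuming a Linux host with /proc mounted; the function name list_namespaces is made up for this illustration and is not part of Docker or any library.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace name: identifier} read from /proc/<pid>/ns.

    list_namespaces is a hypothetical helper written for this sketch.
    Each entry in that directory is a symlink whose target looks like
    "net:[4026531992]"; two processes that show the same identifier
    for a namespace type share that namespace.
    """
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        # Not Linux, /proc not mounted, or no such process.
        return namespaces
    for name in entries:
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Entries of another user's process may not be readable.
            continue
    return namespaces


if __name__ == "__main__":
    # Print the namespaces of the current process, one per line.
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Running this once on the host and once inside a shell started with unshare should show different identifiers for the namespaces that were unshared, and identical ones for the rest.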
Info
Channel: Docker
Views: 220,630
Keywords: docker, containers, dockercon, Namespace, Cgroups, Software (Industry)
Id: sK5i-N34im8
Length: 54min 24sec (3264 seconds)
Published: Thu Dec 03 2015