Everything you Need to Know about using GPUs with Kubernetes - Rohit Agarwal, Google

Video Statistics and Information

Captions
[Music] Hi everyone, my name is Rohit Agarwal. I'm a software engineer at Google; I work on the Kubernetes Engine team, which is part of Google Cloud, and today I'm going to talk about using GPUs with Kubernetes. Before we begin, a quick show of hands: how many of you are already using GPUs on Kubernetes? Awesome. And how many of you are using GPUs that are not NVIDIA GPUs? Maybe one single person, okay.

In this talk I'm not going to talk about why to use GPUs or when to use GPUs; I'm only going to focus on how to use GPUs on Kubernetes. I'll start by talking about what makes it hard to use GPUs from containers, go into a bit of history about GPU support in Kubernetes, then talk about what you need to do as a user and as an operator to get GPUs working on Kubernetes, and finally I'll end by describing what's missing and where the support is going next.

Okay, so: containers and GPUs. One of the things that makes containers great, and has made them so popular, is that you can package your application and all of its dependencies in a container image, and then you don't have to worry about missing dependencies or conflicts with other applications: you can run that container image anywhere. However, this property breaks down when one of your application's dependencies is a kernel module. Containers use the host kernel, so if one of your application's dependencies is a kernel module, then the host on which your application container runs needs to have that kernel module installed. Applications using NVIDIA GPUs are examples of this: using NVIDIA GPUs requires the NVIDIA kernel module to be installed on the host, and it also requires the presence of NVIDIA's user-level shared libraries, accessible from inside the container. These libraries are used to communicate with the kernel module, and therefore with the GPU devices. So you need the kernel module on the host, and you need the libraries to be accessible from inside the container.

So maybe you can package the libraries in the container image and assume the kernel module will be present on the host? Except the version of the NVIDIA shared libraries needs to be the same as the version of the NVIDIA kernel module. If you package these libraries in your container image, your container image is no longer portable, because it depends on the version of the kernel module on the host: if you package it for a particular version, it will only run on hosts which have the kernel module of that version. This is not good; you don't want to make your container images non-portable, because that defeats one of the main advantages of containers.

So in Kubernetes, the first thing we decided to do to support GPUs was very simple: we decided to let the user deal with these dependencies. All Kubernetes would do is assume that the NVIDIA driver is already present on the host, see the GPU devices present on the host, and expose them as schedulable alpha.kubernetes.io/nvidia-gpu resources. When a container asked for this resource in its resource requirements, Kubernetes would add those devices to the container. That's all it would do: keep track of the GPU devices on the host and attach them to the container when requested.

So now you have devices attached to the container; how do you access them from inside the container? As I said before, to access the device you need those user-level libraries, but if you put those user-level libraries in your container image, the image is not portable. So we recommended that users install both the kernel module and the shared libraries on the host, and then use hostPath volumes to inject those libraries into the container: install them on the host, create a hostPath volume, and inject that volume into the container. This is how it looked. In blue, you see that this container is requesting two GPUs, in the same place where you'd add CPU and memory. At the top right, in red, you see that we are creating a hostPath volume called nvidia-libraries, whose path is the path on the host where the shared libraries are present. And at the bottom, in red, you see that we are mounting this nvidia-libraries volume inside the container at a mount path, which needs to be a path that is on the container's shared library search path.
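A minimal sketch of the kind of pod spec being described here, with a hypothetical image name and host library path (the exact paths depend on where the administrator installed the driver), might have looked like this; note that the alpha.kubernetes.io/nvidia-gpu resource is the old in-tree one and no longer works:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app                  # hypothetical name for illustration
spec:
  containers:
  - name: cuda-app
    image: gcr.io/example/cuda-app          # placeholder image
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2   # the old in-tree GPU resource
    volumeMounts:
    - name: nvidia-libraries
      mountPath: /usr/local/nvidia/lib64    # must be on the container's library search path
  volumes:
  - name: nvidia-libraries
    hostPath:
      path: /home/kubernetes/bin/nvidia/lib64   # wherever the host's driver libraries live
```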
After you do this, your container image is portable and things will work, but this is still terrible, because now your pod spec is no longer portable: the pod spec contains host-specific stuff, and we don't want the pod spec to be non-portable.

There were other undesirable things about this approach. This support for NVIDIA GPUs was in-tree: it was part of the Kubernetes core codebase. That's not ideal, because what happens when other vendors want support for their devices in Kubernetes? We cannot keep expanding the Kubernetes codebase to support all of these things; we cannot add AMD GPUs, Intel GPUs, Xilinx FPGAs, and so on. At the same time, we also cannot play favorites with a particular vendor, because Kubernetes is open source software and is supposed to work well with everything. As a result, we deprecated the in-tree support in 1.10, and in 1.11 we are removing it completely, so this resource will stop working entirely in 1.11.

To replace the in-tree support, we added device plugins to Kubernetes. Device plugins are a way to support generic devices in Kubernetes: an extension mechanism that allows vendor-specific code to remain outside the core Kubernetes tree, but still allows these devices to be consumed natively through the Kubernetes API. They also enable the pod spec to be portable: if you use device plugins, the pod spec will not contain any of the host-specific parts we had in the previous pod spec. This is how it looks now. In blue, you see nvidia.com/gpu: all you need to do is add the resource exposed by the device plugin to your resource requirements, and for NVIDIA GPUs that's nvidia.com/gpu. Once you do that, everything will work for you, and you'll notice there's no host-specific part in this pod spec.
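A minimal sketch of the device-plugin version of that pod spec, with an illustrative image tag:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:9.0-base   # CUDA toolkit in the image, no user-level driver libraries
    resources:
      limits:
        nvidia.com/gpu: 2         # resource exposed by the NVIDIA device plugin
```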
But where is the magic? If there's no host-specific part in the pod spec, how is the container getting access to the shared libraries present on the host? The magic is in the device plugin API: it allows the device plugin to set environment variables inside the container or mount volumes inside the container. So, for example, the cluster administrator can configure the device plugin to point to the path on the host where the shared libraries are present, and when the device plugin sees a pod requesting GPUs, it adds that volume to the container so that the container can access the libraries. With this, your container image is portable, because you don't have user-level shared libraries in the container image, and your pod spec is also portable, because you don't have any host-specific stuff in the spec. All the host-specific stuff is part of the device plugin, which is configured by the cluster administrator, who knows about the hosts. Device plugins were introduced as an alpha feature in 1.8 and went beta in 1.10, and if you are using GPUs with Kubernetes, you should start using device plugins.

So, just to recap, what do you need to do as a user? As a user, you should build your images without the user-level shared NVIDIA libraries, like libnvidia-ml.so and so on; they should not be part of your container image. However, the container image should still contain the CUDA toolkit. The CUDA toolkit does have some dependence on the host driver version — for example, each CUDA version requires a minimum driver version — but this dependence is not as onerous as the one with the shared libraries, where you have to match the exact kernel module version. So you build your container image with the CUDA toolkit but without the user-level shared libraries, and then all you do is request nvidia.com/gpu as part of your resource requirements. If the cluster administrator has set up the cluster correctly, everything should work for you.

There may be one more thing you have to do as a user. Let's say your cluster has multiple types of GPU nodes: some GPU nodes have Tesla P100s and some have Tesla V100s, and you want your application to run on a particular type of GPU — you want to say "I want my application to run on nodes which have V100s." Unfortunately, there is no native way of doing this targeting in Kubernetes. Kubernetes does not natively understand different types of GPUs; it only understands that you're requesting a GPU, not that you're requesting a V100 GPU. The workaround is to ask your cluster administrator to label GPU nodes with the type of GPU present on that node: label the P100 nodes as P100, the V100 nodes as V100, and so on. This is, for example, what we do in GKE, where you can say "I want this pod to run on a node which has Tesla K80s," and then that pod will schedule on a node which has Tesla K80s.
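A sketch of that workaround, assuming the administrator labeled the nodes with a hypothetical `accelerator` label (the actual label key and values are whatever your cluster uses; GKE has its own predefined accelerator label):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    accelerator: nvidia-tesla-v100   # label the administrator applied to V100 nodes
  containers:
  - name: trainer
    image: nvidia/cuda:9.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
```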
Okay, so we learned what you need to do as a user; what do you need to do as a cluster operator to create a functioning GPU cluster? First of all, you need to have nodes with GPUs — these things are very expensive — and, as I said before, if you have multiple types of GPU nodes you should label them so that your users can target particular types of GPUs.

Then, and this is one of the most important things, you need to install the NVIDIA driver. I could talk about just this point alone for a long time, but I'll keep it short here. A few things to keep in mind: parts of the NVIDIA driver are closed source and Linux is GPL-licensed, so keep that in mind. The other thing you want to do is keep up with the driver version required by the latest CUDA release. As I said before, the CUDA toolkit has some dependence on the driver version, so when CUDA releases a new version you want to update to the driver required by that version. Otherwise, what will happen eventually is that your users will start using newer CUDA versions, and if you have not updated the drivers on your nodes, those containers will stop running, because CUDA 9.2 requires, I don't know, some 39x driver which your nodes don't have. In GKE we use a DaemonSet to install the drivers; you can look at that, fork it, and do your own thing.

You'll also need to install the device plugin. There is a device plugin for NVIDIA GPUs available from NVIDIA and there's one available from Google; you can choose whichever one fits your needs. I am hopeful that in the future we will converge them so that you get the best of both worlds, but right now there are two different device plugins and you can use whichever one you want.

Okay, so now you have a cluster with GPUs running; what else might you want to do? One thing is to set resource quotas per namespace: you want to say that my marketing namespace or my production namespace can only use so many GPUs. We added support for doing that in 1.10, and this is how it looks. In purple, what this is saying is that whenever this object is created in a particular namespace, the total number of GPUs that can be used by the pods in that namespace is four. So you can set limits like that per namespace using resource quotas.
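A sketch of such a quota object, with a hypothetical namespace name; the `requests.nvidia.com/gpu` key caps the total number of GPUs requested in the namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: marketing            # whichever namespace you want to cap
spec:
  hard:
    requests.nvidia.com/gpu: 4    # at most four GPUs requested across pods in this namespace
```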
Another thing your users will want once you have a GPU cluster is GPU metrics for their workloads. Kubernetes already supports a couple of important metrics which users care about for their workloads: memory usage and GPU duty cycle. They are collected by cAdvisor, the monitoring agent built into the kubelet, using NVML, NVIDIA's closed-source management library for collecting GPU metrics. These metrics can be accessed through cAdvisor's Prometheus endpoint, or Heapster, or Google Stackdriver, and they were added in 1.9, so you can already use them.

One other very interesting thing, which a lot of you should do if you are not doing it already: because GPUs are expensive, you want to create dedicated nodes for GPU workloads. Why? You want to prevent pods that are not requesting GPUs from getting scheduled on nodes which have GPUs. This helps make sure that when a pod which is actually requesting GPUs arrives, there is some place to schedule it; if the GPU nodes are already busy running non-GPU workloads, you end up in a situation where your GPUs are idle and your pods requesting GPUs are pending, and you don't want to be in that situation because you're just wasting money. Another thing this helps with is aggressively downscaling nodes if you are in a cloud environment. Let's say you don't have any GPU workloads at the moment — your machine learning scientists are away on vacation or whatever. If you prevent pods that are not requesting GPUs from scheduling on GPU nodes, those GPU nodes will be empty when there is no GPU workload, and the cluster autoscaler can scale them down and save you costs.

How do you do this? The way to do this in Kubernetes is to add taints to nodes. You taint the GPU nodes with some taint, and then no pod will schedule on them unless it has a corresponding toleration, and you add that toleration to the pods requesting GPUs. So now you have GPU nodes that are tainted: someone submits a pod that does not request GPUs, it doesn't have that toleration, so it doesn't get scheduled on a GPU node; someone submits a pod requesting GPUs with that toleration, and it gets scheduled on those nodes.

In 1.9 we added an admission controller called ExtendedResourceToleration. This controller recommends that you add a particular taint to your GPU nodes — it tells you what the key of that taint should be — and if you do that, the admission controller will automatically add a toleration for that taint to pods that are requesting GPUs, so those pods get scheduled automatically. Your users will not have to worry about adding a toleration to their pod spec; that happens automatically for them. The pod specs remain portable, and users won't even know there is a dedicated GPU node pool: they just submit their GPU workloads, only the workloads which request GPUs schedule on those nodes, and nothing else lands on them.
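A sketch of what that setup might look like, assuming the recommended taint key is the extended resource name `nvidia.com/gpu` (the exact key and value to use are whatever the admission controller documentation recommends for your version):

```yaml
# Taint on the GPU nodes (could also be applied with kubectl taint):
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
spec:
  taints:
  - key: nvidia.com/gpu        # extended resource name used as the taint key
    value: present
    effect: NoSchedule
---
# Toleration the admission controller would add automatically (or that you
# could add by hand) to a pod requesting nvidia.com/gpu:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-app
    image: nvidia/cuda:9.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
```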
Okay, so we talked about what is already present in Kubernetes for GPUs; what's missing?

There is no support for GPUs in Minikube. A lot of people have expressed interest in this and asked about it, but the development is not done yet, and this is a great area where people can contribute.

There's also no fine-grained quota control. What do I mean by that? I mentioned that you can use resource quotas to cap the number of GPUs that can be used in a particular namespace, but because Kubernetes does not natively understand the types of GPUs — it only understands whether something is a GPU or not — you cannot say "in this namespace I want to allow only four K80 GPUs or four P100 GPUs"; you can only say "I want to allow four GPUs." So there's no fine-grained quota control, and there is some design discussion going on about how to support this.

As I said, some user-level metrics are already present in Kubernetes for GPUs, but more metrics can be added, specifically metrics which are required by cluster administrators: for example, GPU temperature or power usage are not present at the moment. We are designing how to add these metrics in an extensible way without polluting the Kubernetes core.

There's also no support for GPU sharing. When a GPU is attached to a pod or a container, it is exclusively used by that container; no other container can use that GPU at the same time. You also cannot request half a GPU: you get the whole GPU or you get nothing. So there's no GPU sharing, it's exclusive usage right now, and we are thinking about how to do this, but there are just thoughts at this point.

Kubernetes is also not aware of GPU topology, so if, for example, your workload gets scheduled in a way where the topology is wrong, you may not get the performance you want. That's another thing we are actively thinking about, but there's no concrete design yet.

Finally, the autoscaling support in Kubernetes for GPUs is not ideal. There is autoscaling support — if there are GPU pods pending, the autoscaler knows it needs to add nodes with GPUs — but it's not ideal, and the reason is again the same: Kubernetes is aware of GPUs, not of GPU types. A pod is pending and it requires two GPUs; what type of GPU node should the autoscaler add — a K80 node or a P100 node? There are some hacks in the autoscaler which work around this, but that's not an ideal scenario, and we are again working towards fixing this situation.

Okay, so I have a shameless plug now: if you don't want to do all of this, you can run two commands on GKE and you will have a working GPU cluster. The first command creates the GPU cluster and the second command installs the driver — yeah, you still have to install the driver.

Any questions? Yes — so the question is, what if there are multiple types of GPUs on the same node? We have not heard a lot of use cases for that, and I'm not sure there are any plans to immediately support it. I'm hopeful that the design we come up with for making Kubernetes and device plugins aware of GPU types may handle that use case, but it's not a very common one: people usually have heterogeneous clusters, but heterogeneous nodes are not that common.

So the question is, have there been cases where newer or older versions of CUDA don't work on newer versions of the driver? I have not seen them; thankfully the drivers have been backward compatible with the old CUDA versions, which has been very nice.

So the question is, what about power metrics for workloads? We're working on making sure there's an extensible way to add more metrics to Kubernetes workloads, but I think that will happen later: we're hopeful for a design in 1.12 and then some implementation in 1.13.

So the question is, is there a recommendation for a base container image to build your containers on? We recommend NVIDIA's CUDA images, which are available on Docker Hub, and on GitLab I think; they work really well.
So the question is, how does the NVIDIA container runtime fit into all this? For the NVIDIA container runtime you still need the driver installed. I think there are some talks about how to integrate the two, but right now you still need the driver installed for the NVIDIA container runtime to work. And, as I mentioned, there are two device plugins right now: NVIDIA's device plugin uses the NVIDIA container runtime underneath to make the shared libraries available to the container; the Google one doesn't do that yet. So that's how it fits in: it's a way of getting these shared libraries available to the container, and it does some other things as well. Is the question whether you can use GPUs without the NVIDIA container runtime? I see — so the question is, what is the relationship between the Docker daemon, the NVIDIA container runtime, and the driver. The Docker daemon allows you to have multiple container runtimes; by default you use runc. The NVIDIA container runtime is runc plus a prestart hook which adds the things that make your shared libraries available and so on. Kubernetes does not actually require Docker anymore: there is the CRI API, which can talk to CRI-O and to multiple container runtimes. Docker is basically talking to the container runtime underneath to start the container, and the NVIDIA container runtime is just one of those runtimes; it's very close to runc, which is the default runtime in Docker, in CRI-O, and in Kubernetes. The driver is totally separate: the driver is basically a collection of the kernel module, the shared libraries, and some debug utilities. You need to install the driver, which installs those things, and then the NVIDIA container runtime makes the shared libraries available in the container. Docker is just one way to use the NVIDIA container runtime; you can use Kubernetes directly, and you can use CRI-O as well.

So the question is, what are the estimated timelines for the things which are missing? It's really hard to say — it's software estimation, right — but the monitoring work is one of the top priorities of the Resource Management Working Group, which works on device plugins. GPU sharing and GPU topology are in a very early design phase. For Minikube, we're just looking for someone to add the support: you could add it on Linux or on Mac. The Linux one should be fairly easy, because Minikube on Linux can run with a mode called --vm-driver=none (not the NVIDIA driver, the VM driver), but no one has just gotten around to doing it. On Mac I'm not sure — I think it would be hard on Mac; on Windows maybe it'd be easier. The other things were making the autoscaler better and fine-grained quota control; those are also in the design phase, so hopefully in 1.12 you'll start seeing some designs for all of these things, and hopefully later you'll have some alphas and betas.

Yes — so, people have used device plugins, I think, to provide FPGA devices, but I am not very familiar with that; there are examples out there. You should look up the Resource Management Working Group, where we talk about these things; you can talk to more people there and they'll have more insight about this.

Okay, all right, thanks a lot everyone. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 3,971
Id: KplFFvj3XRk
Length: 31min 33sec (1893 seconds)
Published: Sun May 06 2018