Deploying machine learning models on Kubernetes

Captions
Hey there! Today we will talk about how one can deploy a machine learning model on Kubernetes. Since this is a very broad topic, I will just show you a minimal example and refer you to some external documentation and packages if you want extra features. Anyway, I hope you enjoy the video!

Here you can see a sketch of the three major steps that one needs to perform. First of all, we need to turn our model into an API server that is able to receive requests, run inference and finally return a response with the result. The reason we do this is to standardize the way we communicate with our model and to have a layer of abstraction that hides the complexity of the model inference. The second step is to take our API and containerize it. The reason we do this is to make sure that we can run it on any device and that all the dependencies are available. Finally, we take our image and deploy it as a pod on Kubernetes.

I guess now is the right time to talk about some cool features of Kubernetes. First of all, we can deploy a pod on its own; however, it is better to create it with a parent object called a deployment. If you do so, the deployment will make it possible to track version history, it will make sure that the pod is always running, it will allow us to create replicas of the same pod, and many other things. Second of all, we can create another parent object called a service that is going to load balance requests between all of our replicas. In short, the deployment and the service together will allow us to scale horizontally. Kubernetes has many other cool features, but these are the two main ones that we will be focusing on.

Here you can see a table of some popular frameworks that can help you with the three-step process; they also add a lot of features and simplifications. Just a small disclaimer: I'm not familiar with all of them, so feel free to write in the comments if you find some mistakes or if you think I left out some important framework. Also note that I did not include any cloud provider services, and I also did not include any HTTP web frameworks like FastAPI and Starlette.

Okay, so let's start with the hands-on end-to-end example. To make things fast I decided to go for a minimal solution, which is the serving tool built into the Transformers CLI; it was the last row of the table I just showed you. However, before I do this, let me set up my environment. Pyenv is a tool that enables you to change Python versions. Here I create a virtual environment, and we are ready to go.

So let's first start with installing Transformers. We install it with the serving extra, because that pulls in things like FastAPI and uvicorn. Let's inspect the Transformers CLI a little bit. I actually forgot to install one of the deep learning frameworks, so let's get torch.

This is the documentation of the CLI; let me quickly show you that there are a bunch of commands one can use, like convert and run, and for us, not surprisingly, the relevant one is going to be serve. So let's see its documentation. To be able to run it we need to provide a task, which you can see here, and in our case I decided to go for fill-mask. Then we can specify things like the host, port, workers and, most importantly, the model. Let's try to run it.
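As a rough sketch, the environment setup and the serve command look something like this. The model name (bert-base-uncased), the Python version and the exact flag names are my assumptions, so check `transformers-cli serve --help` for your version:

```bash
# Optionally pin a Python version with pyenv, then create a virtual environment.
pyenv local 3.10.10            # hypothetical version; any recent Python should do
python -m venv venv && source venv/bin/activate

# The "serving" extra pulls in FastAPI, uvicorn and friends;
# we also need a deep learning backend, hence torch.
pip install "transformers[serving]" torch

# Start the API server (flag names and defaults may differ between versions).
transformers-cli serve --task fill-mask --model bert-base-uncased --host 0.0.0.0 --port 8888
```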
As you can see, we have a server up and running, and now the big question is: how do you actually use this server? What are the endpoints? To my knowledge there is not really that much documentation on this, so let's go straight to the source code to figure it out. Here I'm on the GitHub of Transformers; let me go to src, and it should be under commands, in this module called serving. First of all, one interesting thing is that if we look at the imports we see FastAPI, Starlette and Uvicorn: these are the web frameworks used under the hood. If we scroll down (I'm just looking for the endpoints), you can see that each of the routes implements some kind of logic. I would imagine that the root route just returns some information about the model we are serving. Tokenization and detokenization are self-explanatory, and finally forward is the one we are going to use: it is literally both the tokenization and the forward pass.

Anyway, let's try out the root endpoint. We literally just send a GET request to it and see what we get back. You can see that the server received the request and returned a response with status 200. The response itself is a JSON and it's not really readable, so let me pipe it into this tool jq and make the pane bigger. As you can see, there are a bunch of details about the architecture and the model we're using. This is as we expected.

Cool, let's try the forward endpoint now and see what we get. First of all, it needs to be a POST request, as we saw in the source code. We paste the URL and we specify headers both for the type of the response body and the type of the request body; not surprisingly, both of them are JSONs. Finally, we need to send over our request body. The way the fill-mask model works, we have this special token called MASK, and we want the model to suggest the best possible words that could fill in the blank.

Let's try to send it over. Of course, what I forgot was to specify the right endpoint, which in our case is forward. And I think I misspecified it again; let me fix this. Sorry about that. We managed to get a response: the server liked our request. Let's again use the jq tool to see what's inside, and we get some answers. Probably this was not a great example, since the model did not really fill in any name, but I guess that's just the way the model works: it thought the most likely option was to just end the sentence. Let's fix the input. Instead of this, let's do "Today is going to be a MASK day". Let's see.

This one is way better. As you can see, it thinks the most likely option is "long", the second most likely option is "great", and so on. That's all we needed, that's all we wanted: we have an API server that receives requests, does tokenization and inference, and then sends back a response. So this first step, as described in the diagram, is done. The curl commands from this part are sketched below.
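A sketch of the two requests, assuming the server listens on port 8888, that the request body uses an "inputs" field and that the mask token is BERT's [MASK] (all assumptions on my part; the exact schema lives in transformers/commands/serving.py):

```bash
# Root endpoint: returns information about the served model.
curl http://localhost:8888/ | jq

# Forward endpoint: tokenization plus the forward pass in one call.
# The "inputs" field name and the [MASK] token are assumptions based on the video.
curl -X POST http://localhost:8888/forward \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"inputs": "Today is going to be a [MASK] day."}' | jq
```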
However, let me point out that we made a lot of simplifications, and there are a lot of cool features one could add on top of this. For example, adaptive batching: right now we are only sending one request at a time, and you can imagine that if the server were used by a lot of people, it might be beneficial to batch the requests and only run the forward pass once or twice, especially if you have a GPU. Also, what's very common for these APIs is to have a "metrics" endpoint that tells us how the server is doing: how many requests it received, what the timings were, and so on. Again, this is a minimal example; frameworks like BentoML give you way more than this.

Now the idea is to containerize our API. I'm going to be using Docker for this; however, there are other technologies like Podman. When building a custom image, one always needs to specify a base image, and since our API is basically using the Transformers CLI, we need to look for an official Transformers image on DockerHub. This one seems to be a great candidate: as you can see, it was updated very recently. I believe a new Docker image is pushed to DockerHub whenever there is a new release on GitHub, and it supports both CPU and GPU. Here are some of the tags; we're going to get the latest one. We pull, and it was pretty fast because I already did this when I was not recording. Let's double check that we have it. As you can see, it is 17.6 GB, which is quite a lot, but it is what it is.

Let's now use it as a base image to create our custom image. We start FROM this one. What we do next is instantiate our model while building the image; it's a hack that lets us store the model weights inside of the Docker image, because instantiating the model automatically triggers the download. Let's do the same thing for the tokenizer (not sure if it's necessary, but whatever). We also need the dependencies for the serving logic, which in our case are just FastAPI and uvicorn. Let's expose a port, and finally let's write a custom entrypoint that is literally going to launch the server. We should be done!

Let's try to build this image and call it "cool-api". It was basically instantaneous; the reason is that I already did this before. If you were doing it for the first time, it would take way longer. Let's verify that we really have it. What's interesting is that the base image was 17.6 GB, and now we are at 18.1 GB, again because we downloaded the model weights inside of the image.

Let's try to run it with port forwarding. We right away get a warning about the platform, and this is something specific to my computer: I'm on a Mac M1, and unfortunately the base image we used does not support my platform. Still, as you can see, the server is running. Let's figure out which port was exposed: it was 55008, so we can do the same thing as before, but instead of sending the request to port 8888 we need to change the port. We managed to get an answer; unfortunately, because of the platform mismatch, this is way slower than the raw API, but from the functionality point of view it is the same. Let me quickly write a separate image where I build everything for my platform, and let me copy-paste it from my notes. Here I'm using conda, but the logic is the same. A rough sketch of the original (non-conda) Dockerfile is shown below.
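Written out as a shell heredoc, the Dockerfile might look roughly like this. The base image tag, the model name and the serve flags are assumptions and may differ from what is shown in the video:

```bash
# Sketch only: base image, model name and flags are assumptions.
cat > Dockerfile <<'EOF'
FROM huggingface/transformers-pytorch-gpu:latest

# Instantiating the model and tokenizer at build time triggers the download,
# so the weights end up baked into the image.
RUN python3 -c "from transformers import AutoModelForMaskedLM; AutoModelForMaskedLM.from_pretrained('bert-base-uncased')"
RUN python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bert-base-uncased')"

# Dependencies for the serving logic.
RUN pip3 install fastapi uvicorn

# Port the server will listen on.
EXPOSE 8888

# Launch the API server when the container starts.
ENTRYPOINT ["transformers-cli", "serve", "--task", "fill-mask", "--model", "bert-base-uncased", "--host", "0.0.0.0", "--port", "8888"]
EOF

docker build -t cool-api .
```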
Let me stop the previous server and run docker build again. By default, docker build uses a file called Dockerfile, but we can also change that manually: we pass --file, if I'm not mistaken, and point it at DockerfileConda. Let's not override the old image; let's call this one version 2. Again, I did this before, so that's why it was so fast. Finally, let's run it, and hopefully it's going to be a little bit faster. I'm not getting the platform warning anymore. Let's again check the port (now it's a little bit different) and do a curl. Now it's pretty fast; you can even time it.

It works literally the same way as the raw API, but now we managed to containerize our application. If you're wondering why this was necessary, why we needed to wrap it: Docker containers are amazing at capturing all the dependencies and they are very portable, so I can upload the Docker image to some repository or some cloud platform, and chances are I can run it right away without any issues. Also, the second image is actually way smaller.

Now let's focus on the third step, which is deploying our Docker image on a Kubernetes cluster. For the purposes of this video I'm going to create a single-node cluster on my laptop using minikube. However, in real life one would have a Kubernetes cluster on premises, or one would use a cloud solution; two of the most famous ones are EKS from AWS and GKE from Google. Anyway, first of all let me start the minikube Kubernetes cluster. It is done, so let me show you what this cluster contains. For this I'm going to be using the command line interface kubectl. There are a lot of controllers and other Kubernetes-specific things.

The idea is that we would like to deploy our Docker image on this cluster in the form of a pod, but before we do so, I need to load the image from my Docker daemon into the minikube Docker daemon. Let me show you what I mean: our "cool-api:v2" is not in the list, so we need to load it. This will take some time. It seems like we are done; let's verify it. As you can see, we have the "cool-api" image here.

Now the idea is to create a deployment, which is basically a parent object for pods. In our case, each of the pods will run a single container with our "cool-api" image, and we can actually create the deployment in a very simple way with one command. I will call our deployment "cool-deploy", we will use our "cool-api" image, and that's more or less everything we need. Let's inspect the side effects of running this command: we have one deployment object, and we also have one pod which was automatically created by the deployment. What's important is that the pod name always ends in a random hash. Deployments are used when the pods don't have any state, so there is no notion of order, and if a given pod dies, the deployment will make sure there is a new one with a completely new hash.

Now we need to create a service, which will take all the pods we have in our deployment (for now just one single pod) and load balance the traffic among them. We basically say: "Hey, look at all the pods that are managed by the deployment cool-deploy." We name our service "cool-service", and then we just need to specify two ports. The first is the port inside of the image that we exposed; you remember this port from the Dockerfile. The second port is just any port we want, and it will eventually be the one exposed on the outside. The commands from this step are sketched below.
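Putting this step together, a rough sketch of the build, minikube and kubectl commands; the port numbers follow the assumptions from the Dockerfile sketch above:

```bash
# Build the platform-specific image from a differently named Dockerfile.
docker build --file DockerfileConda -t cool-api:v2 .

# Start a local single-node cluster and load the image into it.
minikube start
minikube image load cool-api:v2

# Create a deployment that manages pods running our image.
kubectl create deployment cool-deploy --image=cool-api:v2
kubectl get deployments
kubectl get pods

# Expose the deployment through a load-balancing service.
# --target-port is the port exposed in the Dockerfile (assumed 8888 here);
# --port is the port the service itself listens on (8080 is an arbitrary choice).
kubectl expose deployment cool-deploy --name=cool-service --target-port=8888 --port=8080
```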
Everything was created, so let's do a small recap of what we have. We get all the objects: internally there is also a replica set, but don't worry about it. Under the deployment and the replica set there is a single pod, and we also have a service that load balances requests to all of the pods. Let's try the service out. There are multiple ways to do this; the simplest way with minikube is to just do a bit of port forwarding. As you can see, it automatically opened my browser and sent a GET request to the root; you can recognize this info JSON. If we go back to the terminal, we can use this URL and port to send requests to our service, which will eventually end up on the pod. Let's give it a try: let me copy-paste the curl command and change the port. That should be it, and let's see whether things are working. We received a response from our Kubernetes cluster.

Up until now we don't necessarily see any benefits compared to, let's say, the pure Docker solution, so let me demonstrate two very cool features that you get for free if you use Kubernetes. The first one is simply the fact that even if your pod is killed or something happens internally (it most likely won't be you; let's say there's an issue and it crashes), the deployment will always make sure that a certain number of pods is running, depending on how many replicas you chose at the beginning. In our case right now it is one replica. We have one deployment, as we saw before, so let me artificially kill our pod. Let's kill it and run kubectl get pods again; as you can see, five seconds ago the deployment made sure there is a new pod. Just note that the hash is different. Related to this (though it's not something we set up here), we can have a liveness probe: even if the container is running, maybe something is broken on the inside, and Kubernetes can periodically query the pod to know how it's doing; if something is wrong, it can restart it.

That was one of the features; the second one is load balancing. Before we increase the number of replicas, let me get the standard output of the one single pod that we have. This is the standard output (without color) from our single pod, and if we send a request, you can see that we are pinging this single pod. Let's now add two more replicas; again, we can do this with a single command. We just use the scale command and request three replicas. Kubernetes is not going to kill our current single replica; it's going to add two new ones. There are two new replicas now, and since we created a service before, it will automatically load balance the requests between these three replicas. Let's verify that this is the case. The way to do this in Kubernetes is to look at the endpoints of the service: our "cool-service" indeed distributes the load among three different pods.

Let's actually try this out. I'm going to create a separate pane for each of these pods: I already have the first one, so let me create one for each of the others so that I can inspect their standard output, and let's rearrange everything in a nice way. This should be good enough. Here you can see the three pods, and this pane is the original one where we already sent some requests. The commands used in this part are sketched below.
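A sketch of the self-healing and scaling demo, reusing the deployment and service names from above; the pod name is a placeholder you would copy from kubectl get pods:

```bash
# Expose the service locally; without --url minikube opens the browser,
# with --url it just prints the address to use with curl.
minikube service cool-service --url

# Self-healing: delete the pod and watch the deployment recreate it.
kubectl get pods
kubectl delete pod <cool-deploy-pod-name>     # placeholder pod name
kubectl get pods

# Follow the standard output of a single pod.
kubectl logs -f <cool-deploy-pod-name>        # placeholder pod name

# Scale out to three replicas and check that the service sees all of them.
kubectl scale deployment cool-deploy --replicas=3
kubectl get endpoints cool-service
```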
Let's now send a new request. As you can see, this specific request ended up on the second pod. Let's add another one: this one ended up on the third pod, and this one again on the second. Let's send multiple. As you can see, the load is being distributed between the three pods. Since I am on a single-node laptop, this load balancing is not really that powerful, but in real life one would have each of these pods on a separate node, and then one can really scale horizontally.

Anyway, I think that's all I wanted to show you. Again, Kubernetes has many other amazing features that are really relevant, but I thought the two I showed you are pretty useful for people who want to deploy their machine learning models. That's it for today. I hope you learned something new, and if you have any questions or think I forgot about something, feel free to write a comment; I'll be more than happy to read it and reply. And I will see you next time!
Info
Channel: mildlyoverfitted
Views: 15,255
Keywords: docker, fastapi, machine learning, artificial intelligence, kubernetes, hugging face, masked language modelling, BERT, REST API, container, containarization, deployment, service, horizontal scaling, minikube, kubectl, natural language processing, MLOps, DevOps, python, production, serving models
Id: DQRNt8Diyw4
Length: 26min 31sec (1591 seconds)
Published: Sun Mar 19 2023