Hey there! Today, we will talk about how one can deploy a machine learning model on Kubernetes. Since this is a very broad topic, I will just show you a minimal example and refer you to some external documentation and packages if you want some extra features. Anyway, I hope you enjoy the video! Here you can see a sketch of the three major steps that one needs to perform. First of all, we need to turn our model into an API server that is able to receive requests, run inference, and finally return a response with
the result. The reason why we do this is to standardize the way we communicate with our model
and to have a layer of abstraction that hides the complexity of the model inference. The second
step is to take our API and containerize it. The reason why we do this is to make sure that
we can run it on any device and that all the dependencies are available. Finally, we take our
image and we deploy it as a pod on Kubernetes. I guess now is the right time to talk about some cool features of Kubernetes. First of all, we can deploy a pod on its own; however, it is better to create it with a parent object called a deployment. If you do so, the deployment will make it possible to track version history, it will make sure that the pod is always running, and it will also allow us to create replicas of the same pod, among many other things. Second of all, we can create another parent object called a service that is going to load balance requests between all of our replicas. In short, the deployment and the service together will allow us to scale horizontally. Kubernetes has many other cool features, but these are the two
main ones that we will be focusing on. Here you can see a table of some popular frameworks that
can help you with the three-step process and they also add a lot of features and simplifications.
Just a small disclaimer: I'm not familiar with all of them, so feel free to write in the comments if you find some mistakes or if you think I left out an important framework. Also note that I did not include any managed services from cloud providers, and I did not include HTTP web frameworks like FastAPI and Starlette. Okay, so let's start with the hands-on end-to-end
example. To make things fast, I decided to go for a minimal solution, which is an API creation tool inside of the Transformers CLI. It was the last row of the table I just showed you. However, before I do this, let me just set up my environment. pyenv is a tool that enables you to switch between Python versions, and here I create a virtual environment. And yeah, we are ready to go. So let's
first start with installing Transformers with the serving extras, because that will install things like FastAPI and uvicorn. Let's inspect this Transformers CLI a little bit. I actually forgot to install one of the deep learning frameworks, so let's get torch as well.
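For reference, my setup so far looks roughly like this; the Python version and environment name are just placeholders, and the pyenv-virtualenv plugin is assumed to be installed:

```bash
# create and activate a virtual environment (version and name are placeholders)
pyenv virtualenv 3.9.16 transformers-serving
pyenv activate transformers-serving

# install Transformers with the serving extras (FastAPI, uvicorn, ...) plus torch
pip install "transformers[serving]"
pip install torch
```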
This is the documentation of the CLI, and let me just quickly show you that there are a bunch of commands that one can use, like convert and run, and for us, not surprisingly, the relevant one is going to be serve. So let's see its documentation. To be able to run this we need to
provide a task, which you can see here; in our case I just decided to go for fill-mask. Then we can specify things like the host, port, workers, and most importantly the model. Let's try to run it. As you can see, we have a server up and running.
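The command I ran looks roughly like this; the model name is a placeholder for whichever fill-mask checkpoint you want, and host, port, and workers are the optional flags mentioned above:

```bash
# serve a fill-mask model over HTTP (model name is a placeholder)
transformers-cli serve --task fill-mask --model bert-base-uncased --port 8888
```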
Now the big question is: how do you actually use this server? What are the endpoints? To my knowledge, there is not really that much documentation on this, so let's just go straight to the source code to try to figure it out. Here I'm on the GitHub page of Transformers; let me go to src, and it should be under commands, in this module called serving. First of
all, one interesting thing is that we can look at the imports and we see that we're using
FastAPI, Starlette and Uvicorn - these are the underlying web frameworks that we are using
under the hood. And if we scroll down again, I'm just looking for the endpoints ... and
as you can see here, for each of the routes we basically implement some kind of logic. I would imagine that this root route just returns some information about the model that we are serving. Tokenization and detokenization are self-explanatory, and finally, forward is the one we are going to be using: it's literally both the tokenization and the forward pass. Anyway, let's try out this root endpoint. We just literally send a GET request to it and let's see what we get back. You can see that the server received the request and it returned a response with status 200. The response itself is a JSON and it's not really readable, so let me pipe it into this tool jq and let me make the pane bigger. As you can see, there are a bunch of, let's say, details about the architecture and about the model that we're using. This is as we expected.
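Concretely, the call was something along these lines (8888 being the port the server listens on):

```bash
# GET the root endpoint and pretty-print the JSON with jq
curl -s http://localhost:8888/ | jq
```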
Cool, let's try to use the forward endpoint now and see what we get. First of all, it needs to be a POST request, as we saw in the source code. We paste the URL and we specify headers both for the type of the response body and the type of the request body; not surprisingly, both of them are JSONs. Finally, we need to send over our request body, and yeah, the way the fill-mask model works, we just have this special token called MASK and we want the model to suggest the best possible words that could fill in the blank. Let's try to send it over and yeah,
of course, what I forgot was to specify the right endpoint, which in our case will be forward. And I think I misspecified it again. Let me fix this. It's not here. Sorry about this. We managed to get a response: the server liked our request. Let's again use the jq tool to see what's inside, and we get some answers. I mean, this was probably not a great example, since the model did not really fill in any name, but I guess that's just the way it works. So here the most likely option was apparently to just
end the sentence. Let's fix the input. Instead of this, let's just do "Today is going to be a MASK day". Let's see. This one is way better. As you can see, it thinks that the most likely option is "long", the second most likely option is "great", and so on and so on.
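Putting the whole request together, it looks roughly like this. Note that the shape of the JSON body (an "inputs" field) and the exact mask token are assumptions on my side; check the forward route in serving.py and your tokenizer if they differ:

```bash
# POST to the forward endpoint; "inputs" as the field name is an assumption,
# and the mask token depends on the tokenizer (e.g. [MASK] for BERT)
curl -s -X POST http://localhost:8888/forward \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -d '{"inputs": "Today is going to be a [MASK] day."}' | jq
```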
That's all we needed, that's all we wanted. We have an API server that receives requests, does tokenization and inference, and then sends back a response. So this first step, as described in the diagram, is done. However, let me just point out
that we made a lot of simplifications. There are actually a lot of cool features
that one can add to this, for example, adaptive batching. Right now we are only sending one request at a time, and you can imagine that if the server was used by a lot of people, it might be beneficial to actually batch the requests and only run the forward pass once or twice, especially if you have a GPU. That's one of the features. Also, what's very common for these APIs is to have a "metrics" endpoint that will give us a status on how the server is doing: how many requests it received, what the timings were, and so on and so on. Again, this is a minimal example; frameworks like BentoML give you way more than this. Now the idea would be to containerize our
API. I'm going to be using Docker for this; however, there are other technologies like Podman. Anyway, when trying to build a custom image one always needs to specify the base image, and since our API is basically using the Transformers CLI, we need to look for an official Transformers image on Docker Hub. And this one seems to be a great candidate. As you can see, it was updated very recently; I believe a new Docker image is pushed to Docker Hub whenever there's a new release on
GitHub, and yeah, it supports both CPU and GPU. Here are some of the tags; we're going to get the latest one. We pull, and it was pretty fast because I already did this when I was not recording. Let's double-check that we have it. As you can see, it is 17.6 GB, which is quite a lot, but it is what it is.
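The pull and the check look roughly like this; the image name is a placeholder for the official Transformers image I picked on Docker Hub:

```bash
# pull the base image and double-check that it is present locally
docker pull huggingface/transformers-pytorch-gpu:latest   # placeholder image name
docker images | grep transformers
```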
Let's now use it as a base image to be able to create our custom image. We start from this one. What we do here is instantiate our model when we're building this image; it's a hack that allows us to store the model weights inside of the Docker image, because the instantiation automatically triggers the download. Let's do the same thing for the tokenizer. Not sure if it's necessary, but whatever. What you also need to do is get the dependencies for the serving logic, so in our case it's just FastAPI and Uvicorn. Let's now expose a port. And finally, let us write a custom entry point that is literally going to launch the server. We should be done!
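The resulting Dockerfile looks roughly like this. The base image tag, the model name, and the port are placeholders, and the exact entrypoint may differ from what I typed in the video:

```dockerfile
# base image: placeholder tag for the official Transformers image we pulled
FROM huggingface/transformers-pytorch-gpu:latest

# hack: instantiating the model (and tokenizer) at build time triggers the
# download, so the weights get baked into the image
RUN python3 -c "from transformers import AutoModelForMaskedLM; AutoModelForMaskedLM.from_pretrained('bert-base-uncased')"
RUN python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bert-base-uncased')"

# dependencies for the serving logic
RUN pip install fastapi uvicorn

# port the server will listen on
EXPOSE 8888

# launch the server when the container starts
ENTRYPOINT ["transformers-cli", "serve", "--task", "fill-mask", \
            "--model", "bert-base-uncased", "--host", "0.0.0.0", "--port", "8888"]
```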
Let's try to build this image, and let's call it "cool-api". It was basically instantaneous; the reason for this is that I already did this before. If you were doing this for the first time, it would take way longer. Let us just verify that we really have it.
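Roughly, the two commands look like this (the tag "cool-api" is the one from the video; the build context is just the current directory):

```bash
# build the image from the Dockerfile in the current directory and tag it
docker build -t cool-api .
# verify that the image is really there
docker images | grep cool-api
```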
What's interesting here is that the base image was 17.6 GB before, I believe, and now we are at 18.1 GB, again because we downloaded the model weights inside of the image. Let's try to run this. We do port forwarding. We right away get a warning about the platform,
and this is something specific to my computer: I'm on a Mac M1, and unfortunately the base image we used does not support my platform. Still, as you can see, the server is running. Let's figure out what port was exposed. As you can see here, it was 55008, and we can do the same thing as before, but instead of sending the request to port 8888 we need to change the port.
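A sketch of what the run looks like; I'm assuming the exposed port was published to a random host port, which is where the 55008 came from, and the container id is a placeholder:

```bash
# run the container, publishing the exposed port to a random host port
docker run -d -P cool-api
# check which host port was picked (55008 in my case); container id is a placeholder
docker port <container-id>
```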
We managed to get an answer; unfortunately, because of this platform mismatch, this is way slower than running the raw API. But from the functionality point of view it is the same. Let me actually quickly write a separate image where I basically build everything for my platform, and let me just copy-paste this from my notes.
Here I'm using conda, but the logic is the same. Let me stop the previous server and let me docker build. By default, docker build is going to use a file called Dockerfile, but we can also manually change that; we can say --file, if I'm not mistaken, and point it to DockerfileConda. Let's not override the old one; let's call this version 2. Again, I did this before, so that's why it was so fast. And finally, let us run it, and hopefully it's going to be a little bit faster.
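The version 2 build and run look roughly like this (DockerfileConda is the file I pasted from my notes, and I'm again assuming a random host port mapping):

```bash
# build from the alternative Dockerfile and tag the result as version 2
docker build --file DockerfileConda -t cool-api:v2 .
# run it, again publishing the exposed port to a random host port
docker run -d -P cool-api:v2
```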
I'm not getting the platform warning this time. Let's again check the port; now it's a little bit different. Let's do curl. Now it's pretty fast; you can even time it. It works literally the same way as the
raw API, but now we managed to containerize our application. And if you're wondering why it was necessary, why we need to wrap it: well, Docker containers are amazing at capturing all the dependencies and they are very portable, so you know, I can upload the Docker image to some repository or some cloud platform and chances are I can run it right away without any issues. Also, the second image is actually way smaller. Now let's focus on the third step, which is
deploying our Docker image on a Kubernetes cluster. For the purposes of this video, I'm going to create a single-node cluster on my laptop using minikube. However, in real life one would have a Kubernetes cluster on premises, or one would use a cloud solution; two of the most famous ones are EKS (from AWS) and GKE (from Google). Anyway, first of all, let me start the minikube Kubernetes cluster. It is done, and let me actually show you what this cluster contains. For this, I'm going to be using the command-line interface kubectl.
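In terms of commands, that's roughly the following; listing pods across all namespaces is just one way to peek at what the cluster contains:

```bash
# start a local single-node Kubernetes cluster
minikube start
# one way to peek at what the cluster contains (system pods, controllers, ...)
kubectl get pods --all-namespaces
```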
There are a lot of controllers and other Kubernetes-specific things. The idea is that we would like to deploy our Docker image on this cluster in the form of a pod, but before we do so, I just need to load the image from my Docker daemon into the minikube Docker daemon. Let me show you what I mean: our "cool-api:v2" is not here in the list, so we need to load it. This will take some time. It seems like we are done! Let's verify this. As you can see, we now have the "cool-api" image here.
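The minikube image subcommands are one way to do this load step; as a sketch:

```bash
# list the images the minikube node already has, then load ours into it
minikube image ls
minikube image load cool-api:v2
minikube image ls   # cool-api:v2 should now show up
```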
Now the idea is to create a deployment, which is basically a parent object for pods. In our case, each of the pods will contain a single image, which will be our "cool-api", and we can actually create the deployment in a very simple way using one command. I will be calling our deployment "cool-deploy", we will be using our "cool-api" image, and that's more or less everything we need.
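In other words, something like this (the image tag matches the one we loaded into minikube):

```bash
# create a deployment named cool-deploy that runs our image
kubectl create deployment cool-deploy --image=cool-api:v2

# inspect what was created
kubectl get deployments
kubectl get pods
```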
And let us now inspect what the side effect of running this command was. We have one deployment object, and we also have one pod, which was automatically created by this deployment. What's important is that there is always a random hash in the pod name. Deployments are used in cases where the pods don't have any state, so there is no notion of order, and if a given pod dies, the deployment will make sure there's a new one with a completely new hash. Now we need to create a service, which will basically take all the pods we have in our deployment (which is for now just one single pod) and load balance the traffic among these pods. We basically say: "Hey! Look at all the pods
that are managed by the deployment 'cool-deploy'." We name our service "cool-service", and now we just need to specify two ports. The first is the port inside of the image that we exposed; you remember this port from the Dockerfile. And the second port is just any port that we want, and it will eventually be the one that is exposed on the outside. Everything was created.
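As a sketch, the command was something like this. I'm assuming 8888 was the port exposed in the Dockerfile, and 8080 is just an arbitrary choice for the service port:

```bash
# expose the deployment behind a service that load balances across its pods
kubectl expose deployment cool-deploy --name=cool-service \
    --port=8080 --target-port=8888
```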
So let's do a small recap of what we have. We get all the objects. Internally, there's also a replica set, but don't worry about it. Under the deployment and the replica set there is a single pod, and we also have a service that load balances requests to all of the pods. Let's
try the service out, and there are multiple ways to do this. The simplest way with minikube is to simply do a bit of port forwarding. As you can see, it automatically opened my browser and sent a GET request to the root; you will recognize this info JSON. But if we go back to the terminal, we can basically use this URL and this given port to send requests to our service, which will eventually end up on the pod. Let's give it a try. Let me copy-paste the curl command, and here let me change the port. That should be it, and yeah, let's see whether things are working.
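The port forwarding and the request look roughly like this; the URL and port that minikube prints will differ on your machine, and the JSON body is the same assumption as before:

```bash
# minikube opens the service in the browser and prints a local URL for it
minikube service cool-service
# then send the same request as before, just against that URL/port (placeholder)
curl -s -X POST http://127.0.0.1:<minikube-port>/forward \
     -H "Content-Type: application/json" \
     -d '{"inputs": "Today is going to be a [MASK] day."}' | jq
```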
We received a response from our Kubernetes server. Up until now, we don't necessarily see any benefits compared to, let's say, the pure Docker solution, but let me actually demonstrate two very cool features that you get for free if you use Kubernetes. The first one is simply the fact that even if you kill your pod, or something happens internally (it most likely won't be you; let's say there's an issue and the pod crashes), the deployment will always make sure that there is a certain number of pods running, depending on how many replicas you chose at the beginning; in our case, right now, it is one replica. We have one deployment. We saw this before,
and let me artificially kill our pod. Let's kill it and let's do kubectl get pods again, and as you can see, right away (five seconds ago) the deployment made sure that there's a new pod. Just note that the hash is different.
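The kill-and-recover check is just the following; the pod name is a placeholder for whatever kubectl get pods shows:

```bash
# delete the pod by hand; the deployment immediately replaces it
kubectl get pods                               # note the pod name (random suffix)
kubectl delete pod cool-deploy-<random-hash>   # placeholder for the actual name
kubectl get pods                               # a brand-new pod appears, new hash
```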
And related to this (though that's not something we did here), we can have a liveness probe: even if the container is running, maybe something is broken on the inside, and Kubernetes can periodically query the pod to know how it's doing; if there's something wrong, it can again restart it. So that was one of the features, and the second feature is using a load balancer. First of all, before we increase the number of replicas, let me just get the standard
output of the one single pod that we have. This is the standard output, without color, from our single pod, and if we send a request to it, you can see that we are pinging this single pod.
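Getting the standard output of a pod is just kubectl logs; the pod name is again a placeholder:

```bash
# follow the logs of the single pod; -f streams new output as it arrives
kubectl logs -f cool-deploy-<random-hash>
```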
Let us now add two more replicas; again, we can do this with a single command. We just use the scale command and request three replicas. Kubernetes is not going to kill our current single replica; it's just going to add two new ones.
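The single command is roughly:

```bash
# scale the deployment from one replica to three
kubectl scale deployment cool-deploy --replicas=3
kubectl get pods   # the two extra pods show up next to the original one
```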
There are two new replicas, and since we created a service before, it will automatically load balance the requests between these three replicas. Let's verify that this is the case. The way one can do this in Kubernetes is to look at the endpoints of each service: our "cool-service" actually distributes the load among three different pods.
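Which you can check like this:

```bash
# the service keeps a list of pod endpoints that it load balances across
kubectl get endpoints cool-service
```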
Let's actually try this out. I'm going to create a separate pane for each of these pods. I already have the first one; let me create a new one so that I can inspect their standard output. And finally this one. Let's rearrange this in a nice way. This should be good enough. Here you can see the three pods, and this one here is the original one, where we already sent some requests. But let's now actually send a new request. As you can see, this specific request
ended up here on the second pod. Let's add a new one. This one
ended up on the third one. This one again ended up on the
second one. Let's do multiple. As you can see, the load is being distributed between the three pods. Since I am on a single-node cluster on my laptop, this load balancing is not really that powerful, but in real life one would actually have each of these pods on a separate node, and then one can really scale horizontally. Anyway, I think that's all I wanted to show you. Again, Kubernetes
has many other amazing features that are really relevant, but I thought that these two I showed you are pretty useful for people who want to deploy their machine learning models. Anyway, that's it for today. I hope you learned something new, and if you have any questions or if you think I forgot about something, feel free to write a comment and I'll be more than happy to read it and reply to it! And I will see you next time!