Hey there! Today, we will talk about how one can deploy a machine learning model on Kubernetes. Since this is a very broad topic, I will just show you a minimal example and refer you to some external documentation and packages if you want some extra features. Anyway, I hope you enjoy the video! Here you can see a sketch of the three major steps that one needs to perform. First of all, we need to turn our model into an API server that is able to receive requests, run inference, and finally return a response with
the result. The reason why we do this is to standardize the way we communicate with our model
and to have a layer of abstraction that hides the complexity of the model inference. The second
step is to take our API and containerize it. The reason why we do this is to make sure that
we can run it on any device and that all the dependencies are available. Finally, we take our
image and we deploy it as a pod on Kubernetes. I guess now is the right time to talk about some cool features of Kubernetes. First of all, we can deploy a pod on its own; however, it is better to create it with a parent object called a deployment. If you do so, the deployment will make it possible to track version history, it will make sure that the pod is always running, and it will also allow us to create replicas of the same pod, among many other things. Second of all, we can create another parent object called a service that is going to load balance requests between all of our replicas. In short, the deployment and the service together will allow us to scale horizontally. Kubernetes has many other cool features, but these are the two
main ones that we will be focusing on. Here you can see a table of some popular frameworks that
can help you with the three-step process and they also add a lot of features and simplifications.
Just a small disclaimer: I'm not familiar with all of them, so feel free to write in the comments if you find some mistakes or if you think I left out an important framework. Also note that I did not include any managed services from cloud providers, and I did not include HTTP web frameworks like FastAPI and Starlette. Okay, so let's start with the hands-on end-to-end
example. To make things fast, I decided to go for a minimal solution, which is an API creation tool inside of the Transformers CLI. It was the last row of the table I just showed you. However, before I do this, let me just set up my environment. pyenv is a tool that enables you to switch between Python versions, and here I create a virtual environment. And yeah, we are ready to go. So let's
first start with installing Transformers with the serving extras, because that will install things like FastAPI and uvicorn. Let's inspect this Transformers CLI a little bit. I actually forgot to install one of the deep learning frameworks, so let's get torch as well.
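For reference, my setup so far looks roughly like this; the Python version and environment name are just placeholders, and the pyenv-virtualenv plugin is assumed to be installed:

```bash
# create and activate a virtual environment (version and name are placeholders)
pyenv virtualenv 3.9.16 transformers-serving
pyenv activate transformers-serving

# install Transformers with the serving extras (FastAPI, uvicorn, ...) plus torch
pip install "transformers[serving]"
pip install torch
```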
This is the documentation of the CLI, and let me just quickly show you that there are a bunch of commands that one can use, like convert and run, and for us, not surprisingly, the relevant one is going to be serve. So let's see its documentation. To be able to run this we need to
provide a task, which you can see here; in our case I just decided to go for fill-mask. Then we can specify things like the host, port, workers, and most importantly the model. Let's try to run it. As you can see, we have a server up and running.
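The command I ran looks roughly like this; the model name is a placeholder for whichever fill-mask checkpoint you want, and host, port, and workers are the optional flags mentioned above:

```bash
# serve a fill-mask model over HTTP (model name is a placeholder)
transformers-cli serve --task fill-mask --model bert-base-uncased --port 8888
```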
Now the big question is: how do you actually use this server? What are the endpoints? To my knowledge, there is not really that much documentation on this, so let's just go straight to the source code to try to figure it out. Here I'm on the GitHub page of Transformers; let me go to src, and it should be under commands, in this module called serving. First of
all, one interesting thing is that we can look at the imports and we see that we're using
FastAPI, Starlette and Uvicorn - these are the underlying web frameworks that we are using
under the hood. And if we scroll down again, I'm just looking for the endpoints ... and
as you can see here, for each of the routes we basically implement some kind of logic. I would imagine that this root route just returns some information about the model that we are serving. Tokenization and detokenization are self-explanatory, and finally, forward is the one we are going to be using: it's literally both the tokenization and the forward pass. Anyway, let's try out this root endpoint. We just literally send a GET request to it and let's see what we get back. You can see that the server received the request and it returned a response with status 200. The response itself is a JSON and it's not really readable, so let me pipe it into this tool jq and let me make the pane bigger. As you can see, there are a bunch of, let's say, details about the architecture and about the model that we're using. This is as we expected.
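Concretely, the call was something along these lines (8888 being the port the server listens on):

```bash
# GET the root endpoint and pretty-print the JSON with jq
curl -s http://localhost:8888/ | jq
```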
Cool, let's try to use the forward endpoint now and see what we get. First of all, it needs to be a POST request, as we saw in the source code. We paste the URL and we specify headers both for the type of the response body and the type of the request body; not surprisingly, both of them are JSONs. Finally, we need to send over our request body, and yeah, the way the fill-mask model works, we just have this special token called MASK and we want the model to suggest the best possible words that could fill in the blank. Let's try to send it over and yeah,
of course, what I forgot was to specify the right endpoint, which in our case will be forward. And I think I misspecified it again. Let me fix this. It's not here. Sorry about this. We managed to get a response: the server liked our request. Let's again use the jq tool to see what's inside, and we get some answers. I mean, this was probably not a great example, since the model did not really fill in any name, but I guess that's just the way it works. So here the most likely option was apparently to just
end the sentence. Let's fix the input. Instead of this, let's just do "Today is going to be a MASK day". Let's see. This one is way better. As you can see, it thinks that the most likely option is "long", the second most likely option is "great", and so on and so on.
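Putting the whole request together, it looks roughly like this. Note that the shape of the JSON body (an "inputs" field) and the exact mask token are assumptions on my side; check the forward route in serving.py and your tokenizer if they differ:

```bash
# POST to the forward endpoint; "inputs" as the field name is an assumption,
# and the mask token depends on the tokenizer (e.g. [MASK] for BERT)
curl -s -X POST http://localhost:8888/forward \
     -H "Content-Type: application/json" \
     -H "Accept: application/json" \
     -d '{"inputs": "Today is going to be a [MASK] day."}' | jq
```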
That's all we needed, that's all we wanted. We have an API server that receives requests, does tokenization and inference, and then sends back a response. So this first step, as described in the diagram, is done. However, let me just point out
that we made a lot of simplifications. There are actually a lot of cool features
that one can add to this, for example, adaptive batching. Right now we are only sending one request at a time, and you can imagine that if the server was used by a lot of people, it might be beneficial to actually batch the requests and only run the forward pass once or twice, especially if you have a GPU. That's one of the features. Also, what's very common for these APIs is to have a "metrics" endpoint that will give us a status on how the server is doing: how many requests it received, what the timings were, and so on and so on. Again, this is a minimal example; frameworks like BentoML give you way more than this. Now the idea would be to containerize our
API. I'm going to be using Docker for this; however, there are other technologies like Podman. Anyway, when trying to build a custom image one always needs to specify the base image, and since our API is basically using the Transformers CLI, we need to look for an official Transformers image on Docker Hub. And this one seems to be a great candidate. As you can see, it was updated very recently; I believe a new Docker image is pushed to Docker Hub whenever there's a new release on
GitHub, and yeah, it supports both CPU and GPU. Here are some of the tags; we're going to get the latest one. We pull, and it was pretty fast because I already did this when I was not recording. Let's double-check that we have it. As you can see, it is 17.6 GB, which is quite a lot, but it is what it is.
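The pull and the check look roughly like this; the image name is a placeholder for the official Transformers image I picked on Docker Hub:

```bash
# pull the base image and double-check that it is present locally
docker pull huggingface/transformers-pytorch-gpu:latest   # placeholder image name
docker images | grep transformers
```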
Let's now use it as a base image to be able to create our custom image. We start from this one. What we do here is instantiate our model when we're building this image; it's a hack that allows us to store the model weights inside of the Docker image, because the instantiation automatically triggers the download. Let's do the same thing for the tokenizer. Not sure if it's necessary, but whatever. What you also need to do is get the dependencies for the serving logic, so in our case it's just FastAPI and Uvicorn. Let's now expose a port. And finally, let us write a custom entry point that is literally going to launch the server. We should be done!
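The resulting Dockerfile looks roughly like this. The base image tag, the model name, and the port are placeholders, and the exact entrypoint may differ from what I typed in the video:

```dockerfile
# base image: placeholder tag for the official Transformers image we pulled
FROM huggingface/transformers-pytorch-gpu:latest

# hack: instantiating the model (and tokenizer) at build time triggers the
# download, so the weights get baked into the image
RUN python3 -c "from transformers import AutoModelForMaskedLM; AutoModelForMaskedLM.from_pretrained('bert-base-uncased')"
RUN python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bert-base-uncased')"

# dependencies for the serving logic
RUN pip install fastapi uvicorn

# port the server will listen on
EXPOSE 8888

# launch the server when the container starts
ENTRYPOINT ["transformers-cli", "serve", "--task", "fill-mask", \
            "--model", "bert-base-uncased", "--host", "0.0.0.0", "--port", "8888"]
```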
Let's try to build this image, and let's call it "cool-api". It was basically instantaneous; the reason for this is that I already did this before. If you were doing this for the first time, it would take way longer. Let us just verify that we really have it.
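Roughly, the two commands look like this (the tag "cool-api" is the one from the video; the build context is just the current directory):

```bash
# build the image from the Dockerfile in the current directory and tag it
docker build -t cool-api .
# verify that the image is really there
docker images | grep cool-api
```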
What's interesting here is that the base image was 17.6 GB before, I believe, and now we are at 18.1 GB, again because we downloaded the model weights inside of the image. Let's try to run this. We do port forwarding. We right away get a warning about the platform,
and this is something specific to my computer: I'm on a Mac M1, and unfortunately the base image we used does not support my platform. Still, as you can see, the server is running. Let's figure out what port was exposed. As you can see here, it was 55008, and we can do the same thing as before, but instead of sending the request to port 8888 we need to change the port.
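A sketch of what the run looks like; I'm assuming the exposed port was published to a random host port, which is where the 55008 came from, and the container id is a placeholder:

```bash
# run the container, publishing the exposed port to a random host port
docker run -d -P cool-api
# check which host port was picked (55008 in my case); container id is a placeholder
docker port <container-id>
```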
We managed to get an answer; unfortunately, because of this platform mismatch, this is way slower than running the raw API. But from the functionality point of view it is the same. Let me actually quickly write a separate image where I basically build everything for my platform, and let me just copy-paste this from my notes.
Here I'm using conda, but the logic is the same. Let me stop the previous server and let me docker build. By default, docker build is going to use a file called Dockerfile, but we can also manually change that; we can say --file, if I'm not mistaken, and point it to DockerfileConda. Let's not override the old one; let's call this version 2. Again, I did this before, so that's why it was so fast. And finally, let us run it, and hopefully it's going to be a little bit faster.
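The version 2 build and run look roughly like this (DockerfileConda is the file I pasted from my notes, and I'm again assuming a random host port mapping):

```bash
# build from the alternative Dockerfile and tag the result as version 2
docker build --file DockerfileConda -t cool-api:v2 .
# run it, again publishing the exposed port to a random host port
docker run -d -P cool-api:v2
```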
I'm not getting the platform warning this time. Let's again check the port; now it's a little bit different. Let's do curl. Now it's pretty fast; you can even time it. It works literally the same way as the
raw API, but now we managed to containerize our application. And if you're wondering why it was necessary, why we need to wrap it: well, Docker containers are amazing at capturing all the dependencies and they are very portable, so you know, I can upload the Docker image to some repository or some cloud platform and chances are I can run it right away without any issues. Also, the second image is actually way smaller. Now let's focus on the third step, which is
deploying our Docker image on a Kubernetes cluster. For the purposes of this video, I'm going to create a single-node cluster on my laptop using minikube. However, in real life one would have a Kubernetes cluster on premises, or one would use a cloud solution; two of the most famous ones are EKS (from AWS) and GKE (from Google). Anyway, first of all, let me start the minikube Kubernetes cluster. It is done, and let me actually show you what this cluster contains. For this, I'm going to be using the command-line interface kubectl.
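In terms of commands, that's roughly the following; listing pods across all namespaces is just one way to peek at what the cluster contains:

```bash
# start a local single-node Kubernetes cluster
minikube start
# one way to peek at what the cluster contains (system pods, controllers, ...)
kubectl get pods --all-namespaces
```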
There are a lot of controllers and other Kubernetes-specific things. The idea is that we would like to deploy our Docker image on this cluster in the form of a pod, but before we do so, I just need to load the image from my Docker daemon into the minikube Docker daemon. Let me show you what I mean: our "cool-api:v2" is not here in the list, so we need to load it. This will take some time. It seems like we are done! Let's verify this. As you can see, we now have the "cool-api" image here.
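The minikube image subcommands are one way to do this load step; as a sketch:

```bash
# list the images the minikube node already has, then load ours into it
minikube image ls
minikube image load cool-api:v2
minikube image ls   # cool-api:v2 should now show up
```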
Now the idea is to create a deployment, which is basically a parent object for pods. In our case, each of the pods will contain a single image, which will be our "cool-api", and we can actually create the deployment in a very simple way using one command. I will be calling our deployment "cool-deploy", we will be using our "cool-api" image, and that's more or less everything we need.
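In other words, something like this (the image tag matches the one we loaded into minikube):

```bash
# create a deployment named cool-deploy that runs our image
kubectl create deployment cool-deploy --image=cool-api:v2

# inspect what was created
kubectl get deployments
kubectl get pods
```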
And let us now inspect what the side effect of running this command was. We have one deployment object, and we also have one pod, which was automatically created by this deployment. What's important is that there is always a random hash in the pod name. Deployments are used in cases where the pods don't have any state, so there is no notion of order, and if a given pod dies, the deployment will make sure there's a new one with a completely new hash. Now we need to create a service, which will basically take all the pods we have in our deployment (which is for now just one single pod) and load balance the traffic among these pods. We basically say: "Hey! Look at all the pods
that are managed by the deployment 'cool-deploy'." We name our service "cool-service", and now we just need to specify two ports. The first is the port inside of the image that we exposed; you remember this port from the Dockerfile. And the second port is just any port that we want, and it will eventually be the one that is exposed on the outside. Everything was created.
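As a sketch, the command was something like this. I'm assuming 8888 was the port exposed in the Dockerfile, and 8080 is just an arbitrary choice for the service port:

```bash
# expose the deployment behind a service that load balances across its pods
kubectl expose deployment cool-deploy --name=cool-service \
    --port=8080 --target-port=8888
```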
So let's do a small recap of what we have. We get all the objects. Internally, there's also a replica set, but don't worry about it. Under the deployment and the replica set there is a single pod, and we also have a service that load balances requests to all of the pods. Let's
try the service out, and there are multiple ways to do this. The simplest way with minikube is to simply do a bit of port forwarding. As you can see, it automatically opened my browser and sent a GET request to the root; you will recognize this info JSON. But if we go back to the terminal, we can basically use this URL and this given port to send requests to our service, which will eventually end up on the pod. Let's give it a try. Let me copy-paste the curl command, and here let me change the port. That should be it, and yeah, let's see whether things are working.
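The port forwarding and the request look roughly like this; the URL and port that minikube prints will differ on your machine, and the JSON body is the same assumption as before:

```bash
# minikube opens the service in the browser and prints a local URL for it
minikube service cool-service
# then send the same request as before, just against that URL/port (placeholder)
curl -s -X POST http://127.0.0.1:<minikube-port>/forward \
     -H "Content-Type: application/json" \
     -d '{"inputs": "Today is going to be a [MASK] day."}' | jq
```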
We received a response from our Kubernetes server. Up until now, we don't necessarily see any benefits compared to, let's say, the pure Docker solution, but let me actually demonstrate two very cool features that you get for free if you use Kubernetes. The first one is simply the fact that even if you kill your pod, or something happens internally (it most likely won't be you; let's say there's an issue and the pod crashes), the deployment will always make sure that there is a certain number of pods running, depending on how many replicas you chose at the beginning; in our case, right now, it is one replica. We have one deployment. We saw this before,
and let me artificially kill our pod. Let's kill it and let's do kubectl get pods again, and as you can see, right away (five seconds ago) the deployment made sure that there's a new pod. Just note that the hash is different.
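The kill-and-recover check is just the following; the pod name is a placeholder for whatever kubectl get pods shows:

```bash
# delete the pod by hand; the deployment immediately replaces it
kubectl get pods                               # note the pod name (random suffix)
kubectl delete pod cool-deploy-<random-hash>   # placeholder for the actual name
kubectl get pods                               # a brand-new pod appears, new hash
```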
And related to this (though that's not something we did here), we can have a liveness probe: even if the container is running, maybe something is broken on the inside, and Kubernetes can periodically query the pod to know how it's doing; if there's something wrong, it can again restart it. So that was one of the features, and the second feature is using a load balancer. First of all, before we increase the number of replicas, let me just get the standard
output of the one single pod that we have. This is the standard output, without color, from our single pod, and if we send a request to it, you can see that we are pinging this single pod.
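Getting the standard output of a pod is just kubectl logs; the pod name is again a placeholder:

```bash
# follow the logs of the single pod; -f streams new output as it arrives
kubectl logs -f cool-deploy-<random-hash>
```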
Let us now add two more replicas; again, we can do this with a single command. We just use the scale command and request three replicas. Kubernetes is not going to kill our current single replica; it's just going to add two new ones.
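The single command is roughly:

```bash
# scale the deployment from one replica to three
kubectl scale deployment cool-deploy --replicas=3
kubectl get pods   # the two extra pods show up next to the original one
```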
There are two new replicas, and since we created a service before, it will automatically load balance the requests between these three replicas. Let's verify that this is the case. The way one can do this in Kubernetes is to look at the endpoints of each service: our "cool-service" actually distributes the load among three different pods.
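Which you can check like this:

```bash
# the service keeps a list of pod endpoints that it load balances across
kubectl get endpoints cool-service
```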
Let's actually try this out. I'm going to create a separate pane for each of these pods. I already have the first one; let me create a new one so that I can inspect their standard output. And finally this one. Let's rearrange this in a nice way. This should be good enough. Here you can see the three pods, and this one here is the original one, where we already sent some requests. But let's now actually send a new request. As you can see, this specific request
ended up here on the second pod. Let's add a new one. This one
ended up on the third one. This one again ended up on the
second one. Let's do multiple. As you can see, the load is being distributed between the three pods. Since I am on a single-node cluster on my laptop, this load balancing is not really that powerful, but in real life one would actually have each of these pods on a separate node, and then one can really scale horizontally. Anyway, I think that's all I wanted to show you. Again, Kubernetes
has many other amazing features that are really relevant, but I thought that these two I showed you are pretty useful for people who want to deploy their machine learning models. Anyway, that's it for today. I hope you learned something new, and if you have any questions or if you think I forgot about something, feel free to write a comment and I'll be more than happy to read it and reply to it! And I will see you next time!