Choosing the right deployment option for your model can have a significant impact on the future of your AI application. If you've ever worked with open-source models, you know that training a model is just the beginning. After investing time and effort into perfecting your creation, the deployment option you choose can make or break its success. It's a decision that needs careful consideration, because it affects cost, latency, scalability, and more down the line.
So, in this video, I will demystify the most popular deployment options, focusing specifically on serverless deployment, and guide you through a complete exercise for deploying open-source models from Hugging Face, so you can unlock the full potential of your AI models. Let's dive in!

# **Understanding the Tradeoffs: Different Deployment Options**

First, we need to start with a brief overview of the most popular deployment options: cloud-based, on-premise, edge, and a new
serverless alternative. Cloud-based deployment means hosting and running
your AI models on a virtual network of servers maintained by third-party companies such as
AWS, Google Cloud, or Microsoft Azure. It offers scalability and low latency, allowing
you to quickly scale up or down based on demand while providing very fast responses. However, once an instance is up and running,
you're paying for it whether it's in use or not. This means that hosting a model in the cloud
can cost at least a few hundred dollars per month. Models that are larger and require multiple
GPUs can be significantly more expensive, even if you only submit one request per day. This is why I recommend this option only for
mid-sized projects that maintain consistent model usage throughout the day.

On-premise deployment, on the other hand, involves hosting and running your AI models on your own physical servers. This gives you total control over your infrastructure,
which can be particularly appealing for businesses with significant resources or strict data
privacy and security requirements. Although it requires a substantial upfront investment in hardware, the absence of recurring subscription fees means it could be more cost-effective than cloud-based deployment in the long term. Nevertheless, due to the complexity of managing
on-premise infrastructure, this option is only recommended for enterprises or large-scale
projects with significant investments.

As for edge deployment, it means deploying models directly on edge devices like smartphones and IoT hardware. This method allows for real-time or low-latency
predictions and enhances user privacy as data is processed on the device instead of being
sent to a central server. However, it may not be appropriate for complex
models that require significant computational power. This is why a new solution has recently emerged that aims to address all these challenges at once: on-demand serverless model deployment.

# Serverless Deployment: An Efficient Solution

Basically, instead of maintaining and paying
for idle servers, serverless deployment allows you to focus more on core product development
while enjoying the benefit of reduced operational costs and complexity. At the core of this approach is the power
of containerization coupled with an intuitive interface. You deploy your model inside a container and
the clock only ticks when your model is in action. This means that if your model is idle, you are not charged anything. You only pay for the time your model is actually running, down to the GPU-second, which makes this option perfect for applications in the early stages or those with a smaller user base. However, one downside of serverless systems
is the "cold start" issue, which occurs when a serverless function is put to sleep or "made
cold" by the provider to save resources if it hasn't been invoked for some time. When a request comes in after this period
of inactivity, the function has to be "warmed up", causing a slight delay in response time. As for the providers themselves, there are
several options like Replicate, BentoML, Beam, and AWS, which also recently introduced
this feature. However, in this tutorial, we will focus on
the fastest and simplest option for serverless model deployment: Inference Endpoints provided by Hugging Face.

# **A Practical Walkthrough: Deploying a Model from Hugging Face**

In this exercise, we will deploy a fine-tuned
Falcon-7B-Instruct model with QLoRA adapters from one of my previous tutorials, but you
can use the same process for almost any open-source model on Hugging Face. The first step is to save the pretrained model itself. I added additional code at the end of the notebook from that previous video, which essentially merges the LoRA adapters with the original model weights and pushes the result to a new repository on Hugging Face.
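To make that step concrete, here is a minimal sketch of what the merge-and-push code might look like, assuming a QLoRA fine-tune of Falcon-7B-Instruct with the PEFT library; the repository names below are placeholders, not the exact repositories from the previous video.

```python
# Minimal sketch: merge LoRA/QLoRA adapters into the base model and push the result.
# Repository names are placeholders -- adjust them to your own setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "tiiuae/falcon-7b-instruct"          # original base model
adapter_repo = "your-username/falcon-7b-qlora"       # repo with your trained LoRA adapters
merged_repo = "your-username/falcon-7b-merged"       # destination for the merged weights

# Load the base model in half precision (merging is not done in 4-bit/8-bit).
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,   # required for Falcon's custom modelling code
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)

# Attach the LoRA adapters and fold them into the base weights.
model = PeftModel.from_pretrained(base_model, adapter_repo)
model = model.merge_and_unload()

# Push the standalone merged model (and tokenizer) to a new Hub repository.
model.push_to_hub(merged_repo)
tokenizer.push_to_hub(merged_repo)
```

Once merged, the model no longer depends on PEFT at runtime, so the endpoint can load it like any regular Transformers checkpoint.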
~~After uploading your model, copy the special files required to run it, such as configuration_RW.py, modelling_RW.py, config.json, and handler.py.~~ ~~If you want to run your model in 8-bit, ensure that you set load_in_8bit to True in the handler.py file. Additionally, set trust_remote_code for both the model and the tokenizer. You will find the full code for this file at the end of the fine-tuning notebook from the previous video.~~ ~~Next, create a requirements.txt file and
add any packages that you used during fine-tuning.~~ ~~Keep in mind that these steps are only necessary if you fine-tuned your model with a technique like QLoRA beforehand. If you want to run a default model from Hugging Face, you can skip these steps and simply deploy from your original model's repository.~~ After completing the above steps, you are
almost halfway there. To finish, click on "Inference Endpoints"
under "Deploy" and select your desired deployment options. To make your endpoint serverless, change the automatic scale-to-zero setting from "never" to scale down to zero replicas after 15 minutes of inactivity. However, keep in mind that this approach will also introduce the cold start problem we discussed earlier.
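As a side note, if you prefer to script this step rather than click through the UI, newer versions of the huggingface_hub library expose a create_inference_endpoint helper. The sketch below is only an approximation: the instance type, size, vendor, and region values are placeholders, and depending on your library version the 15-minute scale-to-zero window may still need to be configured in the web interface.

```python
# Hedged sketch: creating a GPU endpoint that scales to zero, via huggingface_hub.
# Instance/vendor/region values are placeholders -- check the Inference Endpoints
# catalog for the options actually available to your account.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "falcon-7b-merged-endpoint",                  # endpoint name (placeholder)
    repository="your-username/falcon-7b-merged",  # the merged model repo from earlier
    framework="pytorch",
    task="text-generation",
    vendor="aws",
    region="us-east-1",
    accelerator="gpu",
    instance_size="x1",                           # placeholder size
    instance_type="nvidia-a10g",                  # placeholder GPU type
    type="protected",
    min_replica=0,                                # allow scaling down to zero when idle
    max_replica=1,
)

endpoint.wait()    # block until the endpoint is up and running
print(endpoint.url)
```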
~~Now, if you added requirements.txt or any other additional files, select a default container type; otherwise, keep it on Text Generation Inference or on your model's default task.~~ Now, simply click "Create Endpoint", and after
a few minutes your model should be live. Test it a few times using the web interface,
and if everything is working as expected, you are ready for production. To call this endpoint from your application,
you can use the Hugging Face Inference Python client. To begin, install huggingface_hub and import the InferenceClient. Specify your endpoint URL and obtain your API token from your account settings. Then, specify the generation parameters and call the text-generation method, adjusting it according to your specific model type. If you want to stream responses, set the stream parameter to True.
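Putting those steps together, a minimal version of the client code could look like this; the endpoint URL and token are placeholders, and the generation parameters are just example values.

```python
# Install first: pip install huggingface_hub
from huggingface_hub import InferenceClient

# Placeholders: use your own endpoint URL and a token from your Hugging Face settings.
ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

client = InferenceClient(model=ENDPOINT_URL, token=HF_TOKEN)

# Generation parameters -- adjust to your model and use case.
response = client.text_generation(
    "Explain serverless model deployment in one paragraph.",
    max_new_tokens=200,
    temperature=0.7,
    repetition_penalty=1.1,
)
print(response)

# To stream tokens as they are generated, set stream=True and iterate over the output.
for token in client.text_generation(
    "Explain serverless model deployment in one paragraph.",
    max_new_tokens=200,
    stream=True,
):
    print(token, end="", flush=True)
```

With stream=True, text_generation returns an iterator of tokens instead of a single string, which is what enables the streaming behavior mentioned above.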
# Conclusion

Overall, the future trends in AI model deployment
point towards more flexible approaches. As there has been a shift in backend development
towards serverless and microservices in recent years, I believe that this shift will also
extend to model deployment. The only current limiting factor is the "cold
start" problem, which is challenging due to the large sizes of model weights. However, I am sure that we will soon see substantial
improvements in this area as well. Recent rumors even suggest that OpenAI is
considering opening an app store for AI models, which might be quite similar to Inference Endpoints on Hugging Face. So, let me know your thoughts on the future
of model deployment. Have you already tried deploying your own
models to production? If so, which options have you selected? And as always, if you want to learn more about
leveraging the power of AI, don't forget to subscribe.