The Best Way to Deploy AI Models (Inference Endpoints)

Video Statistics and Information

Captions
Choosing the right deployment option for your model can have a significant impact on the future of your AI application. If you've ever worked with open-source models, you know that training a model is just the beginning. After investing time and effort into perfecting your creation, choosing the right deployment option can make or break its success. It's a decision that needs careful consideration, because it can have a significant impact on cost, latency, scalability, and more down the line. So, in this video, I will demystify the most popular deployment options, focusing specifically on serverless deployment, and guide you through a complete exercise for deploying open-source models from Hugging Face, so you can unlock the full potential of your AI models. Let's dive in!

# **Understanding the Tradeoffs: Different Deployment Options**

First, we need to start with a brief overview of the most popular deployment options: cloud-based, on-premise, edge, and a new serverless alternative.

Cloud-based deployment means hosting and running your AI models on a virtual network of servers maintained by third-party companies such as AWS, Google Cloud, or Microsoft Azure. It offers scalability and low latency, allowing you to quickly scale up or down based on demand while providing very fast responses. However, once an instance is up and running, you're paying for it whether it's in use or not. This means that hosting a model in the cloud can cost at least a few hundred dollars per month. Models that are larger and require multiple GPUs can be significantly more expensive, even if you only submit one request per day. This is why I recommend this option only for mid-sized projects that maintain consistent model usage throughout the day.

On-premise deployment, on the other hand, involves hosting and running your AI models on your own physical servers. This gives you total control over your infrastructure, which can be particularly appealing for businesses with significant resources or strict data privacy and security requirements. Although it requires a substantial upfront investment in hardware, without recurring subscription fees it can be more cost-effective than cloud-based deployment in the long term. Nevertheless, due to the complexity of managing on-premise infrastructure, this option is only recommended for enterprises or large-scale projects with significant investments.

Edge deployment means deploying models directly on edge devices like smartphones and IoT hardware. This method allows for real-time or low-latency predictions and enhances user privacy, as data is processed on the device instead of being sent to a central server. However, it may not be appropriate for complex models that require significant computational power.

This is why a new solution has recently emerged that aims to address all these challenges at once: on-demand serverless model deployment.

# Serverless Deployment: An Efficient Solution

Basically, instead of maintaining and paying for idle servers, serverless deployment allows you to focus more on core product development while enjoying the benefits of reduced operational costs and complexity. At the core of this approach is the power of containerization coupled with an intuitive interface. You deploy your model inside a container, and the clock only ticks when your model is in action. This means that if your model is idle, you are not charged anything.
You only pay for the time your model is actually running, down to the GPU second, which makes this option perfect for applications in the early stages or those with a smaller user base. However, one downside of serverless systems is the "cold start" issue, which occurs when a serverless function is put to sleep or "made cold" by the provider to save resources if it hasn't been invoked for some time. When a request comes in after this period of inactivity, the function has to be "warmed up", causing a slight delay in response time.

As for the providers themselves, there are several options like Replicate, BentoML, Beam, and AWS, which also recently introduced this feature. However, in this tutorial, we will focus on the fastest and simplest option for serverless model deployment: Inference Endpoints provided by Hugging Face.

# **A Practical Walkthrough: Deploying a Model from Hugging Face**

In this exercise, we will deploy a fine-tuned Falcon-7B-Instruct model with QLoRA adapters from one of my previous tutorials, but you can use the same process for almost any open-source model on Hugging Face.

The first step is to save the pretrained model itself. I added additional code at the end of the notebook for that previous video, which essentially merges the LoRA adapters with the original model weights and pushes them to a new repository on Hugging Face (see the sketch at the end of this section).

~~After uploading your model, copy the special files required to run your model, like configuration_RW.py, modelling_RW.py, config.json, and handler.py.~~ ~~If you want to run your model in 8-bit, ensure that you set load_in_8bit to True in the handler.py file. Additionally, add trust_remote_code for both the model and the tokenizer. You will find the full code for this file at the end of the fine-tuning notebook from the previous video.~~ ~~Next, create a requirements.txt file and add any packages that you used during fine-tuning.~~ ~~Keep in mind that these steps are only necessary if you fine-tuned your model with a technique like QLoRA beforehand. If you want to run a default model from Hugging Face, you can skip them and simply deploy from your original model's repository.~~

After completing the above steps, you are almost halfway there. To finish, click on "Inference Endpoints" under "Deploy" and select your desired deployment options. To make your endpoint serverless, change the automatic scale-to-zero setting from "never" to "after 15 minutes" of inactivity. However, keep in mind that this approach will also introduce the cold start problem we discussed earlier. ~~Now, if you added requirements.txt or any other additional files, select the default container type; otherwise, keep it on Text Generation Inference or on your model's default task.~~ Now, simply click "Create Endpoint", and after a few minutes your model should be live. Test it a few times using the web interface, and if everything is working as expected, you are ready for production.

To call this endpoint from your application, you can use the Hugging Face inference Python client. To begin, install the huggingface_hub package and import the inference client. Specify your endpoint URL and obtain your API token from your account settings. Then, specify the generation parameters and call the text generation method. Make sure to adjust this last method according to your specific model type. If you want to stream responses, set the stream parameter to True.
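As a rough illustration of those client calls, here is a minimal sketch using the InferenceClient from the huggingface_hub library. The endpoint URL, token, prompt, and generation parameters below are placeholders, not values from the video:

```python
# Minimal sketch: calling a Hugging Face Inference Endpoint from Python.
# ENDPOINT_URL and HF_TOKEN are placeholders -- use your own endpoint URL
# and the API token from your Hugging Face account settings.
# Requires: pip install huggingface_hub

from huggingface_hub import InferenceClient

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder

client = InferenceClient(model=ENDPOINT_URL, token=HF_TOKEN)

prompt = "Explain serverless model deployment in one sentence."

# Generation parameters are illustrative; adjust them for your model.
response = client.text_generation(
    prompt,
    max_new_tokens=200,
    temperature=0.7,
    repetition_penalty=1.1,
)
print(response)

# Streaming: with stream=True, text_generation yields tokens as they arrive.
for token in client.text_generation(prompt, max_new_tokens=200, stream=True):
    print(token, end="", flush=True)
```

And for the merge-and-push step mentioned at the start of the walkthrough, a hedged sketch (not the exact notebook code) using the peft library might look like the following; the adapter and target repository names are hypothetical:

```python
# Hedged sketch: merging QLoRA (LoRA) adapters into the base model and pushing
# the merged weights to a new Hugging Face repository. Repo names are placeholders.
# Requires: pip install transformers peft accelerate

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "tiiuae/falcon-7b-instruct"
ADAPTER_REPO = "your-username/falcon-7b-instruct-qlora"   # placeholder
MERGED_REPO = "your-username/falcon-7b-instruct-merged"   # placeholder

# Load the base model in half precision (merging is done on unquantized
# weights), attach the adapters, then fold them into the base weights.
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

# Push the merged model and tokenizer to the new repository.
model.push_to_hub(MERGED_REPO)
tokenizer.push_to_hub(MERGED_REPO)
```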
# Conclusion

Overall, the future trends in AI model deployment point towards more flexible approaches. Just as backend development has shifted towards serverless architectures and microservices in recent years, I believe this shift will also extend to model deployment. The only current limiting factor is the "cold start" problem, which is challenging due to the large sizes of model weights. However, I am sure that we will soon see substantial improvements in this area as well. Recent rumors even suggest that OpenAI is considering opening an app store for AI models, which might be quite similar to Inference Endpoints on Hugging Face.

So, let me know your thoughts on the future of model deployment. Have you already tried deploying your own models to production? If so, which options have you selected? And as always, if you want to learn more about leveraging the power of AI, don't forget to subscribe.
Info
Channel: VRSEN
Views: 6,444
Keywords: AI, artificial intelligence, model deployment, serverless deployment, open-source models, Hugging Face, cloud-based deployment, on-premise deployment, edge deployment, scalability, cost-efficiency, latency, AI application, AI tutorial, AI video guide, AI trends, AI models, machine learning, ML deployment, GPU utilization, containerization, cold start problem, AWS, Google Cloud, Microsoft Azure
Id: VdKdQYduGQc
Length: 5min 47sec (347 seconds)
Published: Fri Jul 14 2023