How to Host an LLM as an API (and make millions!) #fastapi #llm #ai #colab #python #programming

Video Statistics and Information

Captions
hey guys, in today's video we're building something very important. You know that LLMs are disrupting every single business use case, and businesses are becoming more effective and efficient with the help of LLMs. Now, there are LLMs like OpenAI's that are closed source, meaning you don't have the code for them, but there are also open-source LLMs that you can deploy and use on your own after fine-tuning them. So in today's video we're discussing a very important piece: how do you take an LLM and serve it as an API, so that you or your users can start using the API and, in effect, the LLM itself? Today's video is part of my AI projects playlist, where I'll be adding a lot of AI and LLM projects. We've already done a video on how to train an AI model: we used Llama 2 and Google Colab, and I showed you how to fine-tune a model. Today we're taking it a step forward: we're going to host the LLM as an API, and I'll show you how to do it on Google Colab. So let's go step by step and start building.

Before we get started, I just want to tell you that there is a 53 killer Golang projects playlist on my channel, so if you want to learn Golang by building actual projects, go through it; the projects are curated in increasing order of difficulty, and you'll come out an expert-level Golang developer. Similarly, I have a 50 Rust projects playlist that I keep adding videos to; it's not at 50 yet, only 34 videos, but it will get there, again with videos in ascending order of difficulty, so you can become an expert in Golang and Rust.

Okay, we're using Llama 2 7B. I'll be sharing this Google Colab notebook with you, which has all the documentation along with the code, so you'll be able to run it yourself; I've shown you how to use Google Colab in a previous video. The first step is installing llama-cpp-python with pip, and we pass some CMake arguments to the install. You'll see cuBLAS in there, which means we want cuBLAS for GPU-accelerated linear algebra operations; we'll need that a little later, so make sure you pass these CMake arguments. On the next line we pip install the remaining libraries, so let's go through them one by one: FastAPI, a framework for building HTTP APIs; uvicorn, a lightning-fast ASGI server; python-multipart, for multipart form requests in Python; transformers, the Hugging Face library I've shown you in a previous video; pydantic (if you're a Python developer you know this one) for data validation and settings management; and TensorFlow, Google's famous open-source machine learning framework.
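For reference, here is roughly what those install cells look like. This is a sketch: the exact CMake flag and package pins may differ from the notebook you download.

```python
# Colab install cells (a sketch; flags and versions may differ from the notebook).
# Build llama-cpp-python with cuBLAS so linear algebra runs on the GPU:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

# The rest of the stack: FastAPI and uvicorn for the server, python-multipart
# for form requests, plus transformers, pydantic, and tensorflow:
!pip install fastapi uvicorn python-multipart transformers pydantic tensorflow
```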
The reason we're using FastAPI is this: we're going to take our LLM (Llama 2 7B in our case), install it, host it, and build a full API server with FastAPI that interfaces with the LLM. We'll also use something called ngrok. If you look ngrok up, it enables you to expose APIs: it gives you a reverse proxy, firewall, API gateway, global load balancing, and so on, and it can turn your localhost into a publicly reachable API server. Because we're working inside Google Colab and want to turn it into an API server, we'll use ngrok here for ingress management. So the plan is: bring ngrok into the project, bring in FastAPI, create the API server, load the LLM, and have the API interface with it.

The next step is all about getting ngrok, and I've given you the updated scripts for it. If your code doesn't work, it likely means your ngrok is old, something like 2.3.4; you need version 3.0.0 or above for this project. If you're not sure which version you have, open the terminal in your Google Colab and run ngrok to check. At first it says "ngrok: command not found" (I had changed my server in the background, so it didn't recognize ngrok), but after running the install commands again it knows what ngrok is. The -v flag isn't recognized, but "ngrok version" prints 3.6.0, which is the version I'm using, so make sure you're also on a recent version. I've added the latest lines of code you need to get ngrok working.

Next we add our authtoken. Please note that I'll be changing or deleting the authtoken shown here after the video, so there's no point trying to use mine. To get your own, go to ngrok, log in with Google or whatever else you like, open your authtoken page, and copy and paste the token. This sets the ngrok authtoken, which we'll obviously need to create our server.
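A minimal sketch of that setup, assuming the standard ngrok v3 Linux build and the v3 "config add-authtoken" command; the token value is a placeholder you must replace with your own.

```python
# Fetch and unpack an ngrok v3 build (v2.x will not work with this project):
!curl -sO https://bin.equinox.io/c/bNyj1mQVY4c/ngrok-v3-stable-linux-amd64.tgz
!tar -xzf ngrok-v3-stable-linux-amd64.tgz
!./ngrok version   # should print 3.x, e.g. 3.6.0

# Register your own authtoken from the ngrok dashboard (placeholder below):
!./ngrok config add-authtoken YOUR_NGROK_AUTHTOKEN
```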
The next step is creating the FastAPI app, so let me take you through the code line by line. We're creating a file called app.py. Note that you don't have to use Google Colab for this project; you could run it elsewhere, you'd just have to split it into multiple files. Colab simply makes it easy to build everything up step by step in one place. The first line is a Jupyter magic command that writes the following code to a file named app.py. Then come the import statements: FastAPI, which we need to create the web API, and HTTPException for handling HTTP errors; BaseModel from pydantic, which we installed earlier, for creating our data models; hf_hub_download from Hugging Face, to download the model; Llama, the model class from llama-cpp-python, which we also installed earlier; and TensorFlow as tf, because we'll use it to check GPU availability in a moment.

One thing to remember is that you need the T4 runtime here. In case you don't know, on the right you go to "Change runtime type" and switch to T4; that's what this program requires. We then set up the parameters, the model repo and the model file name: the file is Llama 2 7B and the repo is the Llama 2 7B GGUF repo, which is the one we need. From these we get the model path, and we load the Llama 2 model with that path, n_gpu_layers=64, and n_ctx=2000. At this point we have access to our AI model, so we print it out and prompt it with "Hello" using max_tokens=1, so it generates just a single token as a quick check.

Then there's the line that creates the app: you call FastAPI() and your app becomes available in the app variable. It's almost like Golang Fiber or Node.js Express; you do the exact same thing there, creating an app variable that gives you access to the API library. Next is the class TextInput, which defines the data format expected by the POST endpoint: a JSON object with an "inputs" key, a string, and an optional "parameters" dictionary.

Then you have your endpoints, the root endpoint and the generate endpoint. The root endpoint is just an API for checking status: we use TensorFlow (tf) to test whether the GPU is available or unavailable, and return a status saying "I'm alive" along with the GPU message. The other route, /generate, is the one we'll actually use when we start the API server and test it. So this is almost like building your own server with two routes: the root route and the generate route. The generate route takes in data of type TextInput and returns a dictionary. We print the type of the data and the data itself, capture the parameters it carries in a variable called params, and the function, called generate_text, gets the response from the Llama 2 model; the whole purpose of our FastAPI server is to interface with the Llama 2 model, and that's what happens here. The prompt comes from data.inputs, we send it to the model along with the params, and we get a response back. If you've worked with any LLMs, you'll know the response usually contains multiple choices; we take the first choice, pull out its text, and that's our model output, which generate_text returns as the generated text. We also have some exception handling for the same /generate route.
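Here is a condensed sketch of that app.py. The n_gpu_layers=64 and n_ctx=2000 values come from the walkthrough; the Hugging Face repo id and GGUF filename are assumptions, so substitute the ones from the actual notebook.

```python
%%writefile app.py
from typing import Optional
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import tensorflow as tf

# Download the quantized Llama 2 7B GGUF weights (repo id and filename assumed).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
)
llm = Llama(model_path=model_path, n_gpu_layers=64, n_ctx=2000)
print(llm("Hello", max_tokens=1))  # quick check: generate a single token

app = FastAPI()

class TextInput(BaseModel):
    inputs: str
    parameters: Optional[dict] = None

@app.get("/")
async def root():
    # Root route: report whether TensorFlow can see a GPU.
    gpu = "available" if tf.config.list_physical_devices("GPU") else "unavailable"
    return {"status": f"I'm alive! GPU is {gpu}"}

@app.post("/generate")
async def generate_text(data: TextInput) -> dict:
    try:
        print(type(data), data)
        params = data.parameters or {}
        # Interface with the Llama 2 model: prompt in, first choice's text out.
        response = llm(data.inputs, **params)
        return {"generated_text": response["choices"][0]["text"]}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```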
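Now the next stage is starting our FastAPI server, and this section is a bit important, so I've zoomed in even further and removed my face so it doesn't cover any of the lines. Let's go through the code step by step. The first line imports subprocess, for running shell commands; time, for time-related operations; HTML from ipywidgets, for displaying HTML content; and display from IPython.display, for displaying widgets in the notebook (very soon we'll have a widget to display here, and I'll tell you about it in a moment). We initialize the widget, t = HTML(...), with an initial value of "0 seconds" and a description indicating that the server is starting up with this elapsed time, and then display the widget in the notebook. We also initialize a flag to True; we'll set it to False later, and I'll tell you why. So we start with the flag being True and the timer set to zero. Then we have a try/except block: in the try block we make a curl request to localhost:8000, and if it succeeds, the flag is set to False to indicate the server is already running. If it fails, we start the server using uvicorn, which we installed earlier, and redirect the server output to a log file named server.log.

Then we enter a while loop: the timer starts, and until it reaches 600 seconds, which is 10 minutes, we stay in this loop. Inside the loop we attempt a curl request to localhost:8000 to check whether our FastAPI server is running. If the request fails, we sleep, which means waiting for one second, increment the timer, update the value of the HTML widget to display the elapsed time, and continue looping. If the request succeeds, we set the flag to False, indicating the server is running, and exit the loop. If we've timed out, meaning we've passed 600 seconds without the server starting successfully, we print a message saying it took more than 10 minutes. And finally we make one more curl request to localhost:8000 to confirm whether the server is running even after all this. What you get when you run it is the widget saying "server is starting up, elapsed time: 50 seconds", and then the output "I'm alive, GPU available". Where does that come from? From the previous section: it's exactly what the server is supposed to return, the "I'm alive" status with the GPU availability message we created in the FastAPI app.

A sketch of that startup cell, assuming uvicorn is launched via subprocess and health-checked with curl; the widget text and exact commands are illustrative.

```python
import subprocess, time
from ipywidgets import HTML
from IPython.display import display

# Widget showing elapsed startup time.
t = HTML(value="<b>0 seconds</b>", description="Server starting up, elapsed:")
display(t)

not_running = True
timer = 0
try:
    # If this curl succeeds, a server is already listening on port 8000.
    subprocess.run(["curl", "-s", "http://localhost:8000"],
                   check=True, capture_output=True)
    not_running = False
except subprocess.CalledProcessError:
    # Otherwise start uvicorn in the background, logging to server.log.
    log = open("server.log", "w")
    subprocess.Popen(["uvicorn", "app:app", "--port", "8000"],
                     stdout=log, stderr=log)

# Poll for up to 600 seconds (10 minutes) until the server answers.
while not_running and timer < 600:
    result = subprocess.run(["curl", "-s", "http://localhost:8000"],
                            capture_output=True)
    if result.returncode == 0:
        not_running = False
    else:
        time.sleep(1)
        timer += 1
        t.value = f"<b>{timer} seconds</b>"

if timer >= 600:
    print("Timed out: the server took more than 10 minutes to start.")

# Final check; should print the root route's "I'm alive" GPU status.
print(subprocess.run(["curl", "-s", "http://localhost:8000"],
                     capture_output=True, text=True).stdout)
```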
Now, finally, we'll use ngrok to create a public URL for the FastAPI server. We've created our FastAPI app with its two routes, the root route and the generate route, and we've started the FastAPI server that serves those routes; now it's time to create a public URL. As I told you, if you have something running on localhost and you want a public URL so that anybody can access it, you use ngrok; it's basically for ingress. In this section we start by importing subprocess, time, sys, and json, which we'll use in a minute. Then we use get_ipython to run a terminal command, and what we tell it is: hey, start an ngrok tunnel that forwards traffic from port 8000 to a randomly assigned ngrok URL. Basically, what ngrok does is convert whatever you're running on localhost into a publicly servable API by generating a random URL for you. We're running the FastAPI HTTP server on port 8000, and we want it forwarded to whatever ngrok gives us; that's why we run this command. The ampersand might throw you off if you haven't seen it before, but it only tells the shell that this command should be allowed to run in the background. Then time.sleep waits for one second to ensure ngrok has enough time to start and allocate the tunnel. The tunnel details are available at localhost:4040/api/tunnels; that's where we'll get the curl output with the ngrok URL through which we'll be able to access the API publicly. So we make a curl request to localhost:4040/api/tunnels to retrieve information about the ngrok tunnel (this endpoint provides details about the active tunnels managed by ngrok) and parse the output with json.loads; that's why we imported json, while time was for time.sleep and subprocess was for capturing the curl output. The ngrok URL, one like the URL you see in the output here, is available in tunnels[0]["public_url"], and then we store it: the percent sign gives you a Jupyter magic command, so %store saves the ngrok URL, and we print it out. This is the URL ngrok has created for us, and through it we'll be able to access the API.
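A sketch of that cell. The port-4040 inspection endpoint and the "tunnels"/"public_url" JSON shape are standard ngrok behavior, but treat the exact commands as illustrative.

```python
import subprocess, time, json
from IPython import get_ipython

# Start the tunnel in the background ('&'), forwarding our port 8000.
get_ipython().system_raw("./ngrok http 8000 &")
time.sleep(1)  # give ngrok a moment to start and allocate the tunnel

# Ask ngrok's local inspection API which public URL it assigned.
curl_out = subprocess.check_output(
    ["curl", "-s", "http://localhost:4040/api/tunnels"])
ngrok_url = json.loads(curl_out)["tunnels"][0]["public_url"]
%store ngrok_url
print(ngrok_url)  # e.g. https://<random-subdomain>.ngrok-free.app
```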
Now the next job is testing the API. We import requests, and with it we can hit the API at the ngrok URL. Here's what we do: we define the data to send in the POST request, where inputs is "tell me how to make a chocolate cake" and the parameters set temperature to 0.1; temperature is basically between 0 and 1, and 0.1, being close to zero, means there will be less ambiguity in the response. We also want only 200 tokens, which is very little, so the output might get clipped, but that's okay; we don't want to use too many tokens, because that puts a lot of burden on the server. Then we use the requests package to make a POST request to the ngrok URL from the previous code block, hitting the /generate API. If you remember, going back all the way up, we have a /generate API, and that's what gets the response from the Llama 2 model we loaded earlier, along with the parameters. So we make the request to the URL with our JSON data and get back a response. If the status code is 200, we parse the JSON into result and print "generated text is" followed by the result; if there was an error and it wasn't 200, we print "request failed with status code" and whatever status code came back in the response.

And this is the generated text we got back. When you run this, the request is "tell me how to make a chocolate cake", and the response starts "I'm not sure if I can, but I will try my best to..." and so on, and then it just got cut off. Luckily we only allowed 200 tokens, so beyond that we didn't want more output from it anyway.
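A sketch of the test cell; the parameter names ("temperature", "max_tokens") are assumed to be the ones llama-cpp-python accepts, so check them against the notebook.

```python
import requests

# The prompt plus generation parameters described above.
data = {
    "inputs": "Tell me how to make a chocolate cake",
    "parameters": {"temperature": 0.1, "max_tokens": 200},
}

response = requests.post(f"{ngrok_url}/generate", json=data)
if response.status_code == 200:
    result = response.json()
    print("Generated text:", result["generated_text"])
else:
    print("Request failed with status code:", response.status_code)
```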
After that, we can shut down uvicorn and ngrok, so the server shuts down, the ngrok URL goes away, and the API is no longer available to us. The link to this notebook will be in the description of the video; make a copy of it, get your own ngrok authtoken, put it in, and test out this awesome project.

Now, the thing is, you can quickly build a project like this over a weekend and deploy it to the internet as well. This is the whole LLM and AI opportunity of 2024 that people are talking about on Twitter, YouTube, and LinkedIn: people are taking these open-source pre-trained LLMs, fine-tuning them, and making them available through an API to the rest of the world, and just like that they have an LLM-powered application. It's as simple as that. You don't have to rely on OpenAI or Anthropic or any of these paid tools; as I showed you, you can build your own tool and your own API over a weekend and make it available to the rest of the world with a URL. I won't call this production level, of course; I'd say it's the small proof of concept you can put out publicly. If you want to build production-level AI software, I have a 6 AI + Go projects advanced course, and the link will also be in the description. We build six killer projects: a Discord bot, a Whisper API bot, a Telegram bot (all of them using APIs, AI, and Kubernetes), an AI assistant, a Terraform AI assistant, and a terminal AI assistant. The best part is that it's about 26 hours of content, and we go through extremely detailed planning exercises for each of the projects before we build them. So if you're looking for a job in 2024, if you want to upskill yourself, show a good amount of AI on your CV and resume, and learn how to build production-level AI software, this is the course for you. 2024 is going to be the year of AI, so make sure you check it out and start learning. All right, thank you so much for watching this video. Share it with your friends, because you don't get such awesome content for free online. Not sure if I've already mentioned it, but this video is going to be part of the AI projects playlist I'm building up, so you can expect a lot of projects in this series. Thank you so much for watching, and see you in the next video.
Info
Channel: Akhil Sharma
Views: 1,277
Keywords: llm, python, llama2, colab, fastapi, ngrok, hosting llm, how to use llm as API, how to host an llm, LLM API
Id: duV27TUwH7c
Length: 22min 39sec (1359 seconds)
Published: Thu Feb 15 2024