Deploy Your Private Llama 2 Model to Production with Text Generation Inference and RunPod

Video Statistics and Information

Captions
In this video you're going to learn how to deploy your own private Llama 2 model. We'll build a private REST API, and I'll show you how to prompt the model as well. Can you deploy Llama 2 on your own private machine? Yes, you can. I'm going to show you how to deploy the Llama 2 7B chat model on a single GPU using RunPod. I'll also cover the Text Generation Inference library and how to deploy it on RunPod, then we'll make some API requests using the requests library, and finally I'll show you how to stream responses from your Llama 2 model using the text-generation client library. Let's get started.

There is a complete text tutorial available for MLExpert Pro subscribers, and in it you'll also find a link to the Google Colab notebook. If you want to support my work, please consider subscribing to MLExpert Pro. Thanks.

The library we're going to use to run inference for the Llama 2 model is called Text Generation Inference. It's created by Hugging Face, it's open source, and importantly it's already used in production at Hugging Face itself: it powers Hugging Chat, the Inference API, and Inference Endpoints, so it's well maintained and battle-tested. It supports token streaming using server-sent events, so we can stream responses from our large language models. It supports quantization with bitsandbytes and GPTQ (something we're not going to use in this video, but you can try it out), and it loads weights with safetensors, which keeps the model files compact. There is a long list of optimized architectures this library supports, and, as the last point in that list says, it's optimized for Llama 2, which is great. What the library actually provides is a Docker container you can use to run inference, and that's what we'll host on RunPod.

To host the Text Generation Inference Docker container I'm going to use RunPod, which is one of the options for renting GPUs. Of course you can use AWS, Azure, or something else, but I found RunPod pretty easy to set up, and I'll show you how in this video. Once you sign up for a RunPod account, you'll be presented with a screen where you can choose your GPU and the data center location. In my case I'm going to use the European Union Romania data center. You can also select a network volume, and there are other filters as well: memory, CPU count, disk type, CUDA version, and so on. I'm going to use the RTX A4500, which has 20 gigabytes of VRAM, with the on-demand option; there are also longer rental plans for a fixed period. There is a Community Cloud option as well, which is basically people renting out their own GPUs, but I'm going to use Secure Cloud. You also need an API key, which we'll use inside the Google Colab notebook. To get yours, go to Settings, then API Keys, and create a key with the + API Key button.
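Before the notebook walkthrough below, here is a minimal sketch of the setup cell, assuming the three client libraries named in the video (runpod, text-generation, and requests) and that the runpod client reads the key from a module-level attribute; pin whichever versions you need:

```python
# Install the client libraries used throughout the notebook (Colab-style cell).
# Package names follow the libraries mentioned in the video; versions are not pinned here.
!pip install -Uqq runpod text-generation requests

import runpod

# Paste the key created under Settings -> API Keys on runpod.io.
# Assumption: the runpod client picks up the key from this attribute.
runpod.api_key = "YOUR_RUNPOD_API_KEY"
```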
In the Google Colab notebook, which is already running, I've installed the dependencies: the runpod client library, the text-generation client library that comes with Text Generation Inference, and an upgraded requests library. Next, you can list the IDs of the GPUs available on RunPod, and from those I'm going to take the NVIDIA RTX A4500. Then come the imports: requests, runpod, and the Client from text_generation.

The first thing you need to do is paste your API key into the notebook. Then I create a GPU instance with the create_pod function (see the sketch below). I give it a name and pass the image of the Text Generation Inference Docker container, using one of the recent versions; the latest is 1.0, but they changed the licensing with that release, so I'm using an earlier tag. I also pass the GPU type ID, the data center, and Secure Cloud. One of the most important arguments is the model itself: I'm using a repository on Hugging Face by TheBloke, which is essentially a copy of the original Llama 2 weights that doesn't require the access approval you'd otherwise need to download the model, so thank you, TheBloke, for that. If you want to be sure everything is in order, go ahead and request and download the model from the official Llama repository instead; you can pass your Hugging Face API token through the environment variables so the container can download it. Another important setting is the volume size: I'm passing in 50 gigabytes, because the model weights alone take roughly 14 gigabytes across two files, plus the tokenizer, which is tiny by comparison.

Once you run this, go back to RunPod and after a short while you'll see the instance deployed and running, using about 93% of the GPU memory. If you open the logs, you should see confirmation that the two weight files were downloaded, followed by a warm-up phase where it pushes some tokens through the model to make sure inference works. You'll also see that it pulled the complete Docker image, so you could deploy the same container on your own machine. If everything looks like this, it should be running smoothly, and we can continue with the Colab notebook.

Back in the notebook you get some information about the pod: the image name and a unique identifier for your machine. The pod exposes a server, and on that server you'll find a Swagger UI. If you open it, you can look at the two endpoints we're going to use, generate and generate_stream, and go through the parameters you're allowed to pass, the most important one being the inputs field, which holds the prompt.
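Stepping back to the create_pod call described above, a rough sketch might look like this. The argument names follow the runpod Python client as I understand it and may differ between SDK versions; the TheBloke repository id, TGI image tag, data center id, and proxy URL scheme are all assumptions, so double-check them against the RunPod and Hugging Face pages before running:

```python
import runpod

# Assumed ungated copy of the Llama 2 7B chat weights, and an assumed TGI image tag
# from before the 1.0 licensing change.
MODEL_ID = "TheBloke/Llama-2-7b-chat-fp16"
TGI_IMAGE = "ghcr.io/huggingface/text-generation-inference:0.9.4"

pod = runpod.create_pod(
    name="llama-2-7b-chat",
    image_name=TGI_IMAGE,
    gpu_type_id="NVIDIA RTX A4500",        # 20 GB of VRAM
    cloud_type="SECURE",                   # Secure Cloud rather than Community Cloud
    data_center_id="EU-RO-1",              # assumption: id of the EU Romania data center
    gpu_count=1,
    volume_in_gb=50,                       # the weights alone are roughly 14 GB
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
    docker_args=f"--model-id {MODEL_ID}",  # passed through to the TGI container
    env={"HUGGING_FACE_HUB_TOKEN": "YOUR_HF_TOKEN"},  # only needed for gated repositories
)

print(pod["id"])  # unique identifier of the machine

# Assumption: the TGI server (and its Swagger UI) is reachable through RunPod's HTTP proxy.
SERVER_URL = f"https://{pod['id']}-80.proxy.runpod.net"
```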
Back in the Swagger UI, among the parameters you can see temperature, sampling options, repetition penalty, and so on, along with a sample response you'd get from the API. There is also the generate_stream endpoint, which works over server-sent events. That's essentially the documentation you get from the Swagger UI; there is a health check as well, and you can try the requests right from there. Let's execute the health check and see what it tells us.

The first thing I'm going to do is properly format the system prompt and the prompt for Llama 2. There is a template you have to use to produce the proper format, and it comes from the official Llama repository: if you go through the method called chat_completion, you'll see it wraps your prompt in special tags. It starts with a begin-instruction tag and closes with an end-instruction tag; the constants at the top of the file define those tags, along with begin-system and end-system tags that wrap your system prompt. You're also given the default system prompt that the library provides. To run the Llama 2 model properly, we need to use this same format, so I'm going to create a function called generate_prompt that takes the prompt and the system prompt and wraps them in the instruction tags, with the system tags and system prompt nested inside, followed by the prompt itself.

To test this, I run the cell and then make our first request to the model's API. The parameters are best_of set to one, since we only want a single sequence, and a very low temperature. Unfortunately you're not allowed to pass a temperature of exactly zero; I'm not sure why, but the API throws an error if you do. I'm also passing max_new_tokens of 512, and note that Llama 2 supports up to a 4,096-token context length, so keep that in mind when choosing these parameters. Here is the prompt I'm going to use: "Write an email to a new client to offer a subscription for a paper supply for one year", and I'm passing in the default system prompt. The request took about 10 or 11 seconds; the response code is 200, and this is the email generated by the Llama 2 model. It's quite good: it lays out a bullet-point list of benefits and closes with "to get started, simply reply to this email". I'd say the prompting works much better here than in my previous video, where the model didn't perform as expected; the prompting format really matters for this model.

Now I'm going to change the default system prompt to "You are a salesman and beet farmer known as Dwight Schrute from The Office", because I want the model to reply the way Dwight would. I'll use the same prompt, but this time with the changed system prompt, and run it again; a sketch of the whole flow is shown below.
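Pulling together the prompt template and the /generate request described above, here is a sketch. The tag constants follow the official Llama repository's chat format; the default system prompt is shortened here, and the helper names, SERVER_URL, and the Dwight persona text are illustrative assumptions:

```python
import requests

# Special tags from the official Llama repository's chat format.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# The official default system prompt is longer; this is a shortened stand-in.
DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)


def generate_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap the system prompt and user prompt in Llama 2's instruction tags."""
    return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{prompt} {E_INST}"


prompt = generate_prompt(
    "Write an email to a new client to offer a subscription for a paper supply for one year."
)

# SERVER_URL is the pod's proxy URL from the earlier sketch.
response = requests.post(
    f"{SERVER_URL}/generate",
    json={
        "inputs": prompt,
        "parameters": {
            "best_of": 1,         # single sequence
            "temperature": 0.01,  # the API rejects exactly 0
            "max_new_tokens": 512,
        },
    },
)
print(response.status_code)  # expect 200
print(response.json()["generated_text"])

# Same prompt, but with a Dwight Schrute persona as the system prompt.
DWIGHT_SYSTEM_PROMPT = (
    "You are a salesman and beet farmer known as Dwight Schrute from the TV show The Office."
)
dwight_prompt = generate_prompt(
    "Write an email to a new client to offer a subscription for a paper supply for one year.",
    system_prompt=DWIGHT_SYSTEM_PROMPT,
)
dwight_response = requests.post(
    f"{SERVER_URL}/generate",
    json={"inputs": dwight_prompt, "parameters": {"temperature": 0.01, "max_new_tokens": 512}},
)
```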
This will take a bit of time again, of course, and this time the model may output a lot more tokens. It took about 12 seconds, a bit slower than the first request. Let's look at the response: the subject line is something like "The Awesome Paper Supply Subscription Offer", and it opens with "Greetings from Dunder Mifflin Scranton", so the model clearly knows about the TV show. Then: "As a valued member of the Dunder Mifflin community, I'm excited to offer an exclusive opportunity to subscribe to our top-notch paper supply for the next one year." That does sound like something Dwight might say, although the emojis are a bit much; it closes with "Happy subscribing, my friend. Warm regards, Dwight Schrute, beet farmer and salesman extraordinaire." Some of this is pretty good, but you'll probably want to dial down the emojis.

Another way to call the model, instead of using the raw API, is the Text Generation Inference client from the library itself. You just pass in the server URL, and I'm also passing a timeout of 60 seconds, since the default of 10 seconds can be too low when you're waiting for larger responses. I create the same request: I pass in the generated prompt, which is the same one we've used so far, along with the max_new_tokens parameter. This took about 14.5 seconds, and the response is pretty much the same, since we're using essentially the same temperature.

Finally, we can use generate_stream. This returns an iterator you can loop over and print the text as it arrives. You can see that the streaming works, and I'd say it's pretty fast. All of this is hosted on a single GPU that costs about 33 to 34 cents per hour to run this one 7B model, which is pretty cool. The last thing I'll show you is how to terminate the machine: this deletes the pod so you don't pay for anything beyond this point. A consolidated sketch of the client, the streaming call, and the cleanup follows below.

That's it for this video. You've seen how to deploy your private Llama 2 model: we used the 7-billion-parameter chat model and deployed it on a single GPU, took a look at the Text Generation Inference library, made some API requests against it, and then used the text-generation client to call the model and stream responses with generate_stream. Thanks for watching, please like, share, and subscribe, and join the Discord channel linked in the description. See you in the next one. Bye.
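To recap those last steps, here is a sketch of the streaming client and the cleanup, under the same assumptions as the earlier sketches (SERVER_URL, the pod handle, generate_prompt, and DWIGHT_SYSTEM_PROMPT come from those cells, and terminate_pod's exact signature may vary across runpod SDK versions):

```python
import runpod
from text_generation import Client

# The default timeout is 10 seconds, which can be too short for longer generations.
client = Client(SERVER_URL, timeout=60)

prompt = generate_prompt(
    "Write an email to a new client to offer a subscription for a paper supply for one year.",
    system_prompt=DWIGHT_SYSTEM_PROMPT,
)

# Single response, equivalent to calling the /generate endpoint.
response = client.generate(prompt, max_new_tokens=512)
print(response.generated_text)

# Streaming over server-sent events: print each token as it arrives.
for event in client.generate_stream(prompt, max_new_tokens=512):
    if not event.token.special:
        print(event.token.text, end="", flush=True)

# Delete the pod when you're done so you stop paying for the GPU.
runpod.terminate_pod(pod["id"])
```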
Info
Channel: Venelin Valkov
Views: 14,662
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: FdcXJ7d3WQU
Length: 17min 20sec (1040 seconds)
Published: Mon Jul 31 2023