Go Production: ⚡️ Super FAST LLM (API) Serving with vLLM !!!

Captions
Running and serving LLMs is not an easy task, and that's exactly where a new library called vLLM comes in and helps you massively. It's easy to set up and it is really fast, thanks to a new algorithm called PagedAttention. So in this video we're going to see how to set up vLLM for LLM inference and serving; it's basically an LLM inference engine. I'm going to show the entire demo in Google Colab, and I'm also going to show you how to hit the endpoint, just to prove the whole thing works. I've picked a very simple model here, a BLOOM model from Hugging Face, because I'm fitting the entire thing into Google Colab and have limited GPU memory; that's also why you'll see my GPU usage fluctuate a lot. If you're doing this on an actual server, or for a professional or enterprise use case, you can load any model your memory allows.

Before we move forward, one important note: vLLM is released under the Apache 2.0 license. Unlike Hugging Face TGI, which was recently in some controversy because they changed its license, this is a completely open source, permissive license that you can freely use.

Before we look at the hands-on code, let's look at the inference throughput. With vLLM the throughput is massively higher: for example, where plain Hugging Face gets you around 6 requests per minute and TGI around 61 requests per minute, vLLM can handle around 154 requests per minute. You can see these numbers for different machines, GPUs, and servers, and across the board vLLM shows roughly 8 to 15 times higher throughput than the typical Hugging Face inference. That is possible because of a new technique called PagedAttention.

PagedAttention borrows a concept from computer science: computers have traditionally used paged memory to manage RAM, and that idea has been around for a very long time. PagedAttention applies that paged-memory concept to the KV cache, which is an integral part of any attention-based model. KV here is the key-value data stored for LLMs, especially autoregressive decoders like the text generation models we use. By managing that cache in pages, vLLM uses the remaining memory efficiently, and that's where the speedup comes from. If you're interested in a separate video about PagedAttention I can do one with its own example, but in simple terms, vLLM has a better memory management system.
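To make that paged-memory idea a little more concrete, here is a toy Python sketch of the concept only, not vLLM's actual implementation: instead of reserving one large contiguous region per sequence, the KV cache is split into fixed-size blocks, and a per-sequence block table maps a sequence to whichever physical blocks happen to be free. All names and sizes below are made up for illustration.

# Toy illustration of paged KV-cache allocation (concept only, not vLLM internals).
BLOCK_SIZE = 16                  # tokens per block, an arbitrary choice for this example
free_blocks = list(range(100))   # pool of physical cache blocks
block_tables = {}                # sequence id -> list of physical block ids it owns

def append_token(seq_id: int, num_tokens_so_far: int) -> None:
    """Allocate a new physical block only when a sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if num_tokens_so_far % BLOCK_SIZE == 0:   # all current blocks are full
        table.append(free_blocks.pop())       # grab any free block; no contiguity needed

def free_sequence(seq_id: int) -> None:
    """Return a finished sequence's blocks to the pool so other requests can reuse them."""
    free_blocks.extend(block_tables.pop(seq_id, []))

# Two sequences of different lengths share the same physical pool.
for step in range(40):
    append_token(seq_id=1, num_tokens_so_far=step)
for step in range(10):
    append_token(seq_id=2, num_tokens_so_far=step)
print(block_tables)   # e.g. {1: [99, 98, 97], 2: [96]}
free_sequence(1)      # those three blocks are immediately reusable by new requests

Because blocks are only allocated when a sequence actually needs them and go back to the pool as soon as it finishes, far less memory sits reserved but unused, which is what lets vLLM batch more concurrent requests.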
Now, getting into the implementation. Making vLLM work is very simple: all you need to do is pip install vllm. I'm running this on Colab on a T4 GPU. Typically you are not going to do this on Colab, and I want to get that straight; a lot of people object to Colab these days for various reasons, and I respect that opinion, but I make tutorials on Colab because it's easier to share the notebook with somebody who, like me, doesn't own an NVIDIA GPU. You don't have to do this on Colab; I'm making this tutorial so you can do it at your company, in a product you're building, or anywhere else you have a cloud GPU provider.

So the first step is to install vLLM with pip install vllm, which takes a while. After that there are different ways to use it. You can use it within Python: from vllm import LLM, define the set of prompts you want to use, load the model through the LLM class to create an LLM object, and generate. Which models can you load at this point? It currently supports Baichuan, BLOOM, Falcon, GPT-2, GPT BigCode, GPT-J, GPT-NeoX, LLaMA and Llama 2, MPT, and OPT. The reason I picked a BLOOM model is that I had to fit this into a Google Colab notebook; with enough memory I would have picked a cutting-edge model like Llama 2 for the demo. So the fact that I'm using GPT-2 or a BLOOM model is not a reflection on vLLM, it's a reflection of how much memory I've got; I wanted to make that clear. Once you have the LLM object ready, you simply generate the output and print it. That is how you would use vLLM in a Python environment.
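Here is a minimal sketch of that in-Python usage, assuming the small bigscience/bloom-560m checkpoint so it fits a modest GPU; the exact model name, prompts, and sampling values are illustrative choices, not necessarily what the notebook uses.

from vllm import LLM, SamplingParams

# Prompts to complete; the small BLOOM checkpoint below is an assumption chosen to fit a T4-class GPU.
prompts = ["San Francisco is a", "The capital of France is"]

# Sampling settings roughly in the spirit of the demo.
sampling_params = SamplingParams(temperature=0.7, max_tokens=32)

# Load the model once; vLLM pulls the weights from Hugging Face if they are not cached.
llm = LLM(model="bigscience/bloom-560m")

# Generate completions for all prompts in one batch.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)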
But like I said, most of the time you want to serve the model to end users, which means you want to create an API endpoint. The easiest way to do online serving is with an OpenAI-compatible server, meaning the endpoints have a structure very similar to the OpenAI endpoints; that uses vllm.entrypoints.openai.api_server. So option one is to use it within Python, which gives you some efficiency, and option two is to serve it as an API endpoint, which works like this.

I'm going to use localtunnel to tunnel the localhost link. If you know Google Colab, it's a computer hosted somewhere in a Google data center, but you don't get access to everything; for example, you cannot open its localhost link, and if I click it nothing happens, I just get an error. So what we're doing is serving the BLOOM model on localhost and tunneling that link to an external internet address. When you open the tunnel for the first time it asks for an IP address, which is why I'm fetching the Colab machine's IP address; you'll have to enter it there.

Having said that, the way you run the server is simple: python -m vllm.entrypoints.openai.api_server, then specify the host you want to run on, the port, and the model, and there are a lot of other parameters you can add. Because I'm running on Colab, I then tunnel that port, which gives me this public link. When I click the link, nothing much happens; it says "detail not found", not a full 404 but a JSON response, which means something is working. What we can ultimately do is make an API call to this endpoint, which is OpenAI-compatible and very similar to the OpenAI completions endpoint if you've used the OpenAI API. You can make a curl command on any computer and see how it works.

To show this demo I'm going to take the URL and go to a service called Hoppscotch, which is very similar to Postman but with a web interface. I've got the API endpoint here, the tunneled link where we're exposing and serving this LLM plus the completions path, and certain body parameters: the model name, the prompt, the max tokens, and the temperature. There's nothing else because we're not sending an Authorization header here. When I send this, it gives me a JSON response: for the prompt "San Francisco is a", it completes with something like "a city of about 1,500 people". Let me ask a different thing: "Question: write a joke about Elon Musk. Answer:", with a maximum of 10 tokens, and send it. You can see it makes the call; this is the prompt it was given and this is the response we got: "Elon Musk is a great... is a man who is a great engineer". The number of prompt tokens we gave is 11, the completion tokens are 10, so the total is 21. Input 11, output 10, total 21. You can play with this; the format is very similar to the OpenAI completions API. So if your company is already using OpenAI but you don't want to spend a lot of money on it, I think a good option is to use vLLM as the engine with one of these models and compare how much it costs to run. A Python sketch of this same call appears at the end of this section.

To quickly summarize what we're doing here: we're using a new inference engine called vLLM that supports a lot of Hugging Face models out of the box. The reason we want the vLLM engine is that it's actually fast, super fast, for LLM inference, and you can see the benchmarks: it's faster than plain Hugging Face and faster than TGI, the text generation inference library that Hugging Face released some time ago. It does a really good job because of its new technique called PagedAttention. And running it is very simple: install it with pip install vllm, then either use it within Python (from vllm import LLM, load one of the compatible models, and generate the output) or, if you want to expose and serve the model as an API endpoint, run python -m vllm.entrypoints.openai.api_server with the host, the port, and the model you want to expose. That's it; you'd be able to serve the model. Because I'm on Colab I can't show you the localhost link directly, so I tunnel it to a different link, and I've shown that link in Hoppscotch, which is very similar to Postman, an API client that helps you hit API endpoints easily.
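Here is a sketch of that same completions call made from Python with the requests library instead of Hoppscotch. The base URL is a placeholder for whatever localtunnel address (or localhost) you end up with, the port and model name are assumptions that must match how you started the server, and /v1/completions is the OpenAI-compatible path the vLLM server exposes.

import requests

# Placeholder: replace with your tunneled URL, or http://localhost:8000 when running locally.
# The server is assumed to have been started with something like:
#   python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model bigscience/bloom-560m
BASE_URL = "https://your-tunnel-subdomain.loca.lt"

payload = {
    "model": "bigscience/bloom-560m",  # must match the model the server was launched with
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0.7,
}

# OpenAI-compatible completions endpoint; no Authorization header needed for this local setup.
response = requests.post(f"{BASE_URL}/v1/completions", json=payload)
data = response.json()

print(data["choices"][0]["text"])   # the generated completion
print(data["usage"])                # prompt_tokens, completion_tokens, total_tokens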
I'm going to show you one final demo before we wrap up this video. Just remember this link; it's the same one I've been using, and you can always watch the server logs here. I'll go back and start typing "Question: write a YouTube title"... oh no, this is a base model, so you can't give it instructions like that. Instead I'll use "This is a YouTube title for 1littlecoder who is amazing" with a trailing space, since I don't want it to be a question, ask for 15 tokens, and send it. It should take a little longer... oh, it's already finished, since I said max 15 tokens. The completion is "a great video maker": 15 prompt tokens in, 6 completion tokens, 21 tokens total. It's that simple and efficient to serve your LLM.

So if anybody tells you that running LLMs is hard: it is hard, but solutions like vLLM make it really easy and really simple, because it does a lot of the heavy lifting for you. You get state-of-the-art models, really good throughput, their own PagedAttention technique, optimized CUDA kernels, and a lot of the other things people typically build around API serving, like continuous batching of incoming requests.

I hope this video tutorial was helpful in taking your LLM skills to the production level. If you have any questions, let me know in the comment section. I'll share this Google Colab notebook in the YouTube description so you can click it and get started; being a Colab notebook doesn't mean you can only use it in Google Colab. If you've got an NVIDIA GPU you can do it on a local machine, serve it, and use it within your different applications. Let me know in the comments what you think about vLLM. I'm really happy that it's an open source project; kudos to the developers. Thank you very much, and see you in another video.
Info
Channel: 1littlecoder
Views: 25,691
Keywords: ai, machine learning, artificial intelligence
Id: G7rXlZR68SQ
Length: 11min 52sec (712 seconds)
Published: Wed Aug 16 2023