Boost Your AI Predictions: Maximize Speed with vLLM Library for Large Language Model Inference

Video Statistics and Information

Captions
How can you speed up your large language model predictions? In this video we are going to have a look at vLLM, a library that introduces something called PagedAttention and can significantly speed up your LLM throughput. Let's get started.

This is the official web page of the vLLM project, and the authors claim that it offers easy, fast, and cheap LLM serving using something they call PagedAttention, which we're going to look at in a minute. The library is a nice way to serve your large language models, and it is actually used in production: if you've played around with the Chatbot Arena or the Vicuna demo, you've essentially been trying out the vLLM library behind those projects.

Why is this library important? The authors claim that its throughput is much better than the standard Hugging Face Transformers library and the Hugging Face Text Generation Inference library. You can see that vLLM can achieve up to three and a half times more throughput than Text Generation Inference, which is pretty substantial. The project page shows benchmarks of how it performs with different models, and the secret sauce is PagedAttention.

So why is PagedAttention needed, and what is it? When you feed input tokens to the LLM, the attention mechanism produces attention key and attention value tensors, and all of those are stored in GPU memory. You can think of the GPU memory as a storage device for these resulting tensors, which can be very large; they are referred to as the KV cache. PagedAttention tries to solve the problem that this cache takes up a lot of GPU memory: for a single sequence in LLaMA 13B, the original LLaMA model, it takes about 1.7 GB of VRAM, and high fragmentation of this memory leads to slow throughput. The authors claim that existing systems waste 60 to 80% of this memory due to fragmentation. You can go through the example on the project page and see how PagedAttention fits the key and value tensors into memory in a much less fragmented way; essentially it organizes them better, and with that you get the substantial throughput increase. The authors claim this reduces fragmentation a lot.

Getting started with vLLM is quite simple: you just need to install the library with pip install vllm. On the official documentation page they have a very good offline batched inference example: you import the SamplingParams class and the LLM class, define a list of sample prompts, load the model, and get the outputs. We're going to do that in the notebook that I'm going to show you in a minute. They also have a very nice API server example built on FastAPI; you can run it with the vLLM entrypoints API server module and query it with a simple REST client. Another very important thing about vLLM is that you can create an OpenAI-compatible server, so you might point your OpenAI calls directly at some of the open-source models supported by the library.
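Here is a rough sketch of the offline batched inference pattern described above, following the shape of the quickstart example; the prompts and the small model id are placeholders for illustration, and argument names can shift slightly between vLLM versions.

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Placeholder prompts for illustration.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Decoding settings applied to every prompt in the batch.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small placeholder model; any supported Hugging Face model id works here.
llm = LLM(model="facebook/opt-125m")

# One call generates completions for the whole batch.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The two servers mentioned above are started from the command line, along the lines of `python -m vllm.entrypoints.api_server --model <model>` for the FastAPI example and `python -m vllm.entrypoints.openai.api_server --model <model>` for the OpenAI-compatible one; check the documentation for your vLLM version, since module paths can change.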
Another important thing is that they have a list of supported models, and they support a pretty large set of them: you can see LLaMA and Llama 2, Mistral, and the new Yi model, which maybe we're going to go through in another video.

I have a Google Colab notebook running with a Tesla T4 GPU, the free Google Colab GPU that you get. Here I'm installing the latest versions of the torch library, the Transformers library, accelerate, and then vLLM, again the latest version. The first thing I'm going to do is compare what you can get from Llama 2, the 7 billion parameter chat model. I import all of the requirements and log in within the notebook so I can get access to the Llama 2 7B chat model. Here is how I load the model using the text-generation pipeline: I specify the Llama 2 model, ask for it to be cast to float16, and the device "auto" parameter puts it on the GPU for us.

After the download is complete, I prepare some prompts; these are the prompts whose outputs we're going to compare. Here they are: what are the pros and cons of ChatGPT versus open-source LLMs, write an email to a new client, give me some investment advice, and write a function in Python that calculates the square of the sum of two numbers. These are the texts we're going to work with. Then I use the rather new API from the Hugging Face library: I apply the chat template of the Llama model using the tokenizer and specify a simple system prompt, "You are a friendly chatbot who always responds as a superhuman intelligence AI." I pass in each text and apply the chat template to it, so I create a prompt for every text I have right here, and these are the prompts we're going to compare our results on. You can see that the chat template application converts the texts into nicely and properly formatted prompts for the Llama 2 model. These are the prompts we're going to use, and I store them in text files.

The first run is the Hugging Face prompt: I use the first prompt, about the differences between ChatGPT and open-source models, and you can see that this takes about 20 seconds of wall time, with a user time of 16.9 seconds. These are the parameters I'm using: max_new_tokens equal to 256, so the model is cut off after that, sampling enabled, and a very low temperature. This is the response, and in a moment we'll compare it to the vLLM inference.
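A minimal sketch of the Hugging Face side of this setup, assuming the gated meta-llama/Llama-2-7b-chat-hf checkpoint and a transformers version recent enough to have apply_chat_template; the build_prompt helper and the example question are mine for illustration, not the notebook's exact code.

```python
import torch
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, pipeline

notebook_login()  # the Llama 2 checkpoints are gated, so you need to authenticate

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Text-generation pipeline in float16, placed on the GPU by accelerate.
pipe = pipeline(
    "text-generation",
    model=MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

SYSTEM_PROMPT = (
    "You are a friendly chatbot who always responds as a superhuman intelligence AI."
)

def build_prompt(text: str) -> str:
    """Format a user message with the Llama 2 chat template via the tokenizer."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

prompt = build_prompt("What are the pros and cons of ChatGPT versus open-source LLMs?")

# Single-prompt generation with the settings mentioned in the video.
output = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.1)
print(output[0]["generated_text"])
```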
For the vLLM inference, I read the same prompt that I was just using, and this is how you instantiate the model: I specify that I want the same dtype, float16, and vLLM goes ahead and downloads the model again. I set the sampling parameters to the same temperature and the same maximum number of tokens, and this is the result: this time I get the response within 14.1 seconds. Of course, you can try a lot more prompts and compare the results on your own, but the user time here is about 3 seconds faster, which looks really interesting. The response is again limited to the same number of tokens, so it should give you pretty much the same response length.

For the batch inference, this is a way to essentially sample the model with all of the prompts that we have so far. I specify the same settings and run the batch through the original Hugging Face text-generation pipeline, and you can see that this takes about 1 minute and 2 seconds of user time, which is much slower than a single prompt; the complete wall time is 1 minute and 14 seconds. The responses are very good, pretty much the same as the ones I had previously, and you can go through them on your own time within the Google Colab notebook; all of them look quite all right, at least to me.

For the vLLM batch outputs, you can see that I get the responses within 41 seconds of user time, or 43 seconds of wall time, and let's compare that to the Hugging Face wall time of 1 minute and 14 seconds. So at least here vLLM is giving you a very nice speed-up compared to the original Hugging Face library (see the sketch below). The responses again look quite all right; they're a bit different, but we're using a temperature of 0.1.

The vLLM library appears to provide some really nice performance improvements, at least compared to the original Hugging Face pipelines, and you can try the library on your own. Let me know what results you get down in the comments below. Thanks for watching, guys, please like, share, and subscribe, and I'm going to see you in the next one. Bye.
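For completeness, a sketch of the vLLM side of the batch comparison under the same assumption about the model id; the prompt list and the timing code are illustrative, and in the notebook the prompts would be the chat-templated ones built earlier.

```python
import time

from vllm import LLM, SamplingParams

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumed model id, same as above

# Plain questions here for illustration; the notebook uses chat-templated prompts.
prompts = [
    "What are the pros and cons of ChatGPT versus open-source LLMs?",
    "Write an email to a new client.",
    "Give me some investment advice.",
    "Write a Python function that calculates the square of the sum of two numbers.",
]

# Same decoding settings as the Hugging Face pipeline run.
sampling_params = SamplingParams(temperature=0.1, max_tokens=256)

# Load the model in float16; vLLM manages the KV cache with PagedAttention.
llm = LLM(model=MODEL_ID, dtype="float16")

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)  # the whole batch in one call
elapsed = time.perf_counter() - start

for out in outputs:
    print(out.outputs[0].text.strip(), "\n---")
print(f"Batch generation took {elapsed:.1f} seconds")
```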
Info
Channel: Venelin Valkov
Views: 4,107
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: TAMwQ7Fsnz4
Length: 10min 53sec (653 seconds)
Published: Mon Nov 27 2023