So one of the key issues we all face when working with large language models locally, or even when serving them in the cloud ourselves rather than using an API, is getting them to run at a speed that's actually usable, so that users feel a response is worth waiting for. One of the challenges has been that if you're serving 8-bit or 16-bit models from the Hugging Face platform, they're often just too slow to be practical, even on an A100 GPU. There are a whole bunch of reasons for this, and there are a lot of techniques around, things like fast attention and other mechanisms that people have been using. I think some of the MPT models have been really quite good at actively trying to speed up inference, but unfortunately a lot of models haven't. And while Hugging Face has created their own Text Generation Inference library, it can be a real pain to use, especially on something like Colab or locally.
So this brings me to the topic of this video, which is a new serving library called vLLM. I've been using it over the past few weeks, and I must say it's very impressive, both in the speeds you can get and in the ability to run models that were previously very slow and tedious to run quite easily now. Okay, so what is vLLM?
They're calling it "easy, fast and cheap LLM serving with PagedAttention". I'm not going to go hugely into PagedAttention, but it's basically what's driving a lot of this speed-up. First off, just by comparing the speed and how many tokens per second people are getting, you can see they're outperforming what's currently out there by a wide margin. The way they're doing this is with this thing called PagedAttention. They looked at what the key bottleneck is, and it largely comes down to the cached memory that's usually called the KV cache. What they're essentially doing is avoiding allocating memory when they don't need to, so that memory gets used much more efficiently than before. When they do this well, it allows not only for an overall speed-up but also for better speeds on different kinds of sampling techniques, like beam search, and for generating multiple outputs. This is all covered in their blog post; if you're really interested, go through it and have a good read.
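As a side note on that sampling point: in the vLLM versions from around the time of this video, the knobs for multiple outputs and beam search live on `SamplingParams`. The values below are illustrative assumptions, and the beam-search flag has been reworked in later releases, so treat this as a sketch rather than the definitive API.

```python
from vllm import SamplingParams

# Several independent samples for the same prompt, e.g. to pick the best of a few drafts.
multi_sample = SamplingParams(n=3, temperature=0.8, top_p=0.95, max_tokens=256)

# Beam search: in the vLLM versions from around this time it required best_of > 1
# and temperature=0; the flag has changed in newer releases.
beam_search = SamplingParams(
    n=2,
    best_of=4,
    use_beam_search=True,
    temperature=0.0,
    max_tokens=256,
)

# Either object is then passed to LLM.generate(prompts, sampling_params);
# the paged KV cache is what keeps the extra parallel sequences cheap in memory.
```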
Now, the thing that initially impressed me and got me really interested in this was that LMSYS aren't just academics who never use this stuff in production; they're actually the people running the Chatbot Arena, which I did a video on a couple of months back, and obviously they're also aligned with the creators of the Vicuna model. You can see they've had a lot of interest in running a variety of models so that people can compare and contrast the differences between them and evaluate them.
To do that, they've had to run a lot of models in the cloud and serve them to a lot of people. And this is where I think what they've done is really interesting: they mention that they were initially using the Hugging Face backend for this, but vLLM now gets them about 30 times higher throughput than that original Hugging Face backend. That's pretty amazing: even as their traffic has increased, it seems their compute needs have actually gone down. They've put up a GitHub repo with all the details about it.
Currently they support a number of different models, including the LLaMA models and the MPT models. And you can see from their comparisons against Hugging Face Text Generation Inference and the plain Hugging Face Transformers library that they do so much better.
Another thing I really like is that they've got some nice documentation. And another idea that I think is really interesting, and definitely useful to people in industry, is that while you can obviously run this in a notebook or something like that, you can also run your own API server for it, which is really good. One of the cool things about the API server is that they provide an OpenAI-compatible server. Basically, by running it, you get a drop-in replacement for the OpenAI API protocol. So even if you were planning to use OpenAI for a lot of things but want a quick way to test out open-source or commercially usable models, this lets you keep your existing OpenAI code and just point the traffic at a different URL, serving one of these other models in the same way you'd be accessing OpenAI. That's hugely useful if you're in an organization that's already using OpenAI.
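To give a rough idea of how that drop-in replacement works, here's a minimal sketch. It assumes the OpenAI-compatible server has been launched locally on vLLM's default port and that you're using the pre-1.0 `openai` Python client; the exact launch command, port and model name are assumptions that may differ for your setup or vLLM version.

```python
# Start the OpenAI-compatible server in a terminal first (assumed command and port):
#   python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-13b-v1.3

import openai

# Point the existing OpenAI client at the local vLLM server instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"  # assumed default host/port
openai.api_key = "EMPTY"                      # vLLM doesn't need a real key

# Same completion call you'd make against OpenAI, just served by the local model.
response = openai.Completion.create(
    model="lmsys/vicuna-13b-v1.3",
    prompt="Write a short email to Sam Altman.",
    max_tokens=256,
    temperature=0.7,
)

print(response["choices"][0]["text"])
```

The point is that nothing else in your existing code has to change; swapping between OpenAI and a local open-source model is just a matter of changing the base URL and the model name.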
Okay, let's jump in, have a look at some of the code and see how it turns out. So if we jump in here, you can see I'm going to be using the latest Vicuna 13B v1.3 model, but you could really pick any model; if you've got a smaller GPU, try it out on one of the 7-billion-parameter models, something like that. The main thing I'm bringing in here is just the vLLM install, and then we can bring it in, set it up and start generating. They've got some standard example code for generating, and I've put in the system prompt and set things up to do the generation here as well.
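Here's a minimal sketch of that setup, roughly following vLLM's quickstart pattern of an `LLM` object plus `SamplingParams`. The model name, the sampling settings and the Vicuna-style system-prompt wrapper are my assumptions rather than the exact code from the video, so adjust them to whatever model and template you're actually using.

```python
# pip install vllm

from vllm import LLM, SamplingParams

# Load the model once; vLLM handles the KV-cache management (PagedAttention) internally.
llm = LLM(model="lmsys/vicuna-13b-v1.3")

# Assumed sampling settings, including the 512-token cap used in the examples below.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# Assumed Vicuna-style prompt template with a system prompt baked in.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(user_message: str) -> str:
    return f"{SYSTEM_PROMPT}\n\nUSER: {user_message}\nASSISTANT:"

prompts = [
    build_prompt("What is the capital of England?"),
    build_prompt("Write an email to Sam Altman."),
]

# Generate for all prompts in one batched call.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
    print("-" * 40)
```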
I'm not going to go through the generated content too much here; we're just going to look at the times. If you remember back to some of my previous videos, generations like this, loading the model in 8-bit with plain Hugging Face, could often take up to two minutes. Here you can see this one generating in 13 seconds, and the capital-of-England question generating in 2.34 seconds. One of the prompts I test a lot is writing an email to Sam Altman. You can see the quality of the output hasn't changed compared to the model at all, but we're getting generation times of 13 seconds for 512 tokens, and you can see another version of this here generating in under 13 seconds as well.
So have a play with this yourself and see what you want it to do. Where I think this really shines is not just in these normal sorts of prompts, but where you've got a model and you want to use it for a very specific task. One of the tasks we have as an example of this is: convert the following to JSON. I've got some information in there, and I'm basically just doing an extraction to get it out into JSON format. You can see here it's done that, and it's done it in 1.67 seconds, which is blazing fast.
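The exact document and prompt from the video aren't reproduced here, so the sketch below is just an illustration of that kind of extraction task: the input record, the JSON keys, the prompt wording and the sampling settings are all made up, with the same Vicuna checkpoint assumed as before, plus a timer around the call to show how you'd measure the generation time.

```python
import time

from vllm import LLM, SamplingParams

# Reuse the llm object from the earlier snippet if you already have it loaded;
# it's recreated here only so this sketch stands on its own.
llm = LLM(model="lmsys/vicuna-13b-v1.3")
sampling_params = SamplingParams(temperature=0.1, max_tokens=256)

# Hypothetical input text; the actual example from the video isn't shown.
record = (
    "John Smith is 34 years old, lives in London and works as a data engineer. "
    "His email is john.smith@example.com."
)

prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"
    "USER: Convert the following to JSON with the keys name, age, city, job and email. "
    f"Return only the JSON.\n\n{record}\nASSISTANT:"
)

start = time.time()
outputs = llm.generate([prompt], sampling_params)
elapsed = time.time() - start

print(outputs[0].outputs[0].text.strip())
print(f"Generated in {elapsed:.2f} seconds")
```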
So this allows you to do a lot of things with a decent-quality model that you're maybe not going to use for everything. Maybe you're still going to use OpenAI and GPT-4 for certain things, but for a lot of tasks you can get them done with a sort of medium-size or small-size local model, running it for inference like this to get your results out.
The final example, you know, is writing the three-day trip to London. Again, the quality of the output here is being determined by the Vicuna model, so we could have picked a Wizard uncensored model or any of those models that you might have a favorite for, but look at the generation times: they're much faster than the plain Hugging Face way or even the Text Generation Inference way of doing this. Anyway, have a play with it.
It's certainly something that you can use for production. Like I said, one of the things I really like about it is that it has an API you can just drop in as a replacement for OpenAI, which makes it very easy, if you've got a team working on this, to test models out quickly and not have to change their backend code all the time.
As always, if you've got any questions, please put them in the comments below. If you like the video, please click like and subscribe. I will talk to you in the next video. Bye for now.