So one of the key issues we all face when working with large language models locally, or even when serving them in the cloud ourselves rather than using an API, is getting them to run at a speed that's actually usable, so that users feel a response is worth waiting for. One of the challenges has been that if you're serving 8-bit or 16-bit models from the Hugging Face platform, they're often just too slow to be practical, even on an A100 GPU. There are a whole bunch of reasons for this, and there are a lot of techniques around, things like fast attention and other mechanisms that people have been using. I think some of the MPT models have been really quite good at actively trying to speed up inference, but unfortunately a lot of models haven't. And while Hugging Face has created their own Text Generation Inference library, it can be a real pain to use, especially on something like Colab or locally.
So this brings me to the topic of this video, which is a new serving library called vLLM. I've been using it over the past few weeks, and I must say it's very impressive, both in the speeds you can get and in the ability to run models that were previously very slow and tedious to run quite easily now. Okay, so what is vLLM?
They're calling it "easy, fast and cheap LLM serving with PagedAttention". I'm not going to go hugely into PagedAttention, but it's basically what's driving a lot of this speed-up. First off, just by comparing the speed and how many tokens per second people are getting, you can see they're outperforming what's currently out there by a wide margin. The way they're doing this is with this thing called PagedAttention. They looked at what the key bottleneck is, and it largely comes down to the cached memory that's usually called the KV cache. What they're essentially doing is avoiding allocating memory when they don't need to, so that memory gets used much more efficiently than before. When they do this well, it allows not only for an overall speed-up but also for better speeds on different kinds of sampling techniques, like beam search, and for generating multiple outputs. This is all covered in their blog post; if you're really interested, go through it and have a good read.
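As a side note on that sampling point: in the vLLM versions from around the time of this video, the knobs for multiple outputs and beam search live on `SamplingParams`. The values below are illustrative assumptions, and the beam-search flag has been reworked in later releases, so treat this as a sketch rather than the definitive API.

```python
from vllm import SamplingParams

# Several independent samples for the same prompt, e.g. to pick the best of a few drafts.
multi_sample = SamplingParams(n=3, temperature=0.8, top_p=0.95, max_tokens=256)

# Beam search: in the vLLM versions from around this time it required best_of > 1
# and temperature=0; the flag has changed in newer releases.
beam_search = SamplingParams(
    n=2,
    best_of=4,
    use_beam_search=True,
    temperature=0.0,
    max_tokens=256,
)

# Either object is then passed to LLM.generate(prompts, sampling_params);
# the paged KV cache is what keeps the extra parallel sequences cheap in memory.
```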
Now, the thing that initially impressed me and got me really interested in this was that LMSYS aren't just academics who never use this stuff in production; they're actually the people running the Chatbot Arena, which I did a video on a couple of months back, and obviously they're also aligned with the creators of the Vicuna model. You can see they've had a lot of interest in running a variety of models so that people can compare and contrast the differences between them and evaluate them.
To do that, they've had to run a lot of models in the cloud and serve them to a lot of people. And this is where I think what they've done is really interesting: they mention that they were initially using the Hugging Face backend for this, but vLLM now gets them about 30 times higher throughput than that original Hugging Face backend. That's pretty amazing: even as their traffic has increased, it seems their compute needs have actually gone down. They've put up a GitHub repo with all the details about it.
Currently they support a number of different models, including the LLaMA models and the MPT models. And you can see from their comparisons against Hugging Face Text Generation Inference and the plain Hugging Face Transformers library that they do so much better.
Another thing I really like is that they've got some nice documentation. And another idea that I think is really interesting, and definitely useful to people in industry, is that while you can obviously run this in a notebook or something like that, you can also run your own API server for it, which is really good. One of the cool things about the API server is that they provide an OpenAI-compatible server. Basically, by running it, you get a drop-in replacement for the OpenAI API protocol. So even if you were planning to use OpenAI for a lot of things but want a quick way to test out open-source or commercially usable models, this lets you keep your existing OpenAI code and just point the traffic at a different URL, serving one of these other models in the same way you'd be accessing OpenAI. That's hugely useful if you're in an organization that's already using OpenAI.
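To give a rough idea of how that drop-in replacement works, here's a minimal sketch. It assumes the OpenAI-compatible server has been launched locally on vLLM's default port and that you're using the pre-1.0 `openai` Python client; the exact launch command, port and model name are assumptions that may differ for your setup or vLLM version.

```python
# Start the OpenAI-compatible server in a terminal first (assumed command and port):
#   python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-13b-v1.3

import openai

# Point the existing OpenAI client at the local vLLM server instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"  # assumed default host/port
openai.api_key = "EMPTY"                      # vLLM doesn't need a real key

# Same completion call you'd make against OpenAI, just served by the local model.
response = openai.Completion.create(
    model="lmsys/vicuna-13b-v1.3",
    prompt="Write a short email to Sam Altman.",
    max_tokens=256,
    temperature=0.7,
)

print(response["choices"][0]["text"])
```

The point is that nothing else in your existing code has to change; swapping between OpenAI and a local open-source model is just a matter of changing the base URL and the model name.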
Okay, let's jump in, have a look at some of the code and see how it turns out. So if we jump in here, you can see I'm going to be using the latest Vicuna 13B v1.3 model, but you could really pick any model; if you've got a smaller GPU, try it out on one of the 7-billion-parameter models, something like that. The main thing I'm bringing in here is just the vLLM install, and then we can bring it in, set it up and start generating. They've got some standard example code for generating, and I've put in the system prompt and set things up to do the generation here as well.
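Here's a minimal sketch of that setup, roughly following vLLM's quickstart pattern of an `LLM` object plus `SamplingParams`. The model name, the sampling settings and the Vicuna-style system-prompt wrapper are my assumptions rather than the exact code from the video, so adjust them to whatever model and template you're actually using.

```python
# pip install vllm

from vllm import LLM, SamplingParams

# Load the model once; vLLM handles the KV-cache management (PagedAttention) internally.
llm = LLM(model="lmsys/vicuna-13b-v1.3")

# Assumed sampling settings, including the 512-token cap used in the examples below.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# Assumed Vicuna-style prompt template with a system prompt baked in.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(user_message: str) -> str:
    return f"{SYSTEM_PROMPT}\n\nUSER: {user_message}\nASSISTANT:"

prompts = [
    build_prompt("What is the capital of England?"),
    build_prompt("Write an email to Sam Altman."),
]

# Generate for all prompts in one batched call.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
    print("-" * 40)
```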
I'm not going to go through the generated content too much here; we're just going to look at the times. If you remember back to some of my previous videos, generations like this, loading the model in 8-bit with plain Hugging Face, could often take up to two minutes. Here you can see this one generating in 13 seconds, and the capital-of-England question generating in 2.34 seconds. One of the prompts I test a lot is writing an email to Sam Altman. You can see the quality of the output hasn't changed compared to the model at all, but we're getting generation times of 13 seconds for 512 tokens, and you can see another version of this here generating in under 13 seconds as well.
So have a play with this yourself and see what you want it to do. Where I think this really shines is not just in these normal sorts of prompts, but where you've got a model and you want to use it for a very specific task. One of the tasks we have as an example of this is: convert the following to JSON. I've got some information in there, and I'm basically just doing an extraction to get it out into JSON format. You can see here it's done that, and it's done it in 1.67 seconds, which is blazing fast.
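The exact document and prompt from the video aren't reproduced here, so the sketch below is just an illustration of that kind of extraction task: the input record, the JSON keys, the prompt wording and the sampling settings are all made up, with the same Vicuna checkpoint assumed as before, plus a timer around the call to show how you'd measure the generation time.

```python
import time

from vllm import LLM, SamplingParams

# Reuse the llm object from the earlier snippet if you already have it loaded;
# it's recreated here only so this sketch stands on its own.
llm = LLM(model="lmsys/vicuna-13b-v1.3")
sampling_params = SamplingParams(temperature=0.1, max_tokens=256)

# Hypothetical input text; the actual example from the video isn't shown.
record = (
    "John Smith is 34 years old, lives in London and works as a data engineer. "
    "His email is john.smith@example.com."
)

prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"
    "USER: Convert the following to JSON with the keys name, age, city, job and email. "
    f"Return only the JSON.\n\n{record}\nASSISTANT:"
)

start = time.time()
outputs = llm.generate([prompt], sampling_params)
elapsed = time.time() - start

print(outputs[0].outputs[0].text.strip())
print(f"Generated in {elapsed:.2f} seconds")
```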
So this allows you to do a lot of things with a decent-quality model that you're maybe not going to use for everything. Maybe you're still going to use OpenAI and GPT-4 for certain things, but for a lot of tasks you can get them done with a sort of medium-size or small-size local model, running it for inference like this to get your results out.
The final example, you know, is writing the three-day trip to London. Again, the quality of the output here is being determined by the Vicuna model, so we could have picked a Wizard uncensored model or any of those models that you might have a favorite for, but look at the generation times: they're much faster than the plain Hugging Face way or even the Text Generation Inference way of doing this. Anyway, have a play with it.
It's certainly something that you can use for production. Like I said, one of the things I really like about it is that it has an API you can just drop in as a replacement for OpenAI, which makes it very easy, if you've got a team working on this, to test models out quickly and not have to change their backend code all the time.
As always, if you've got any questions, please put them in the comments below. If you like the video, please click like and subscribe. I will talk to you in the next video. Bye for now.