Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // LLM 3 Talk 3

Video Statistics and Information

Captions
Dude, you're blowing up these days. How you doing, Timothée? There he is. All right, cool, can I hear you? Yes, there it is, it works. So for people that do not know you, and it's pretty hard not to these days: you were a researcher doing all kinds of crazy stuff at Facebook, and then you said, you know what, I could do this on my own, I'm going to go start a company. You guys raised a massive round that was all the talk a few months ago, and now you've released an open source model which a lot of people are loving because, dare I say the buzzword-bingo word again, it is SOTA. So I'm excited for you to talk all about your learnings as you've gone through creating these models and then serving them, and I know you've got a presentation prepared for us.

Yeah, so I think I filled out the abstract a bunch of months ago and then wrote this presentation in the last two days. What I'm going to talk about is whatever I found useful in understanding what's important for model inference. A lot of this talk is based on things I found online or found out experimenting, first with the Llama release and now with this one at Mistral. I think we've always been more focused on the cost of inference rather than the cost of training, so this talk describes what goes into the cost of inference, how to manage throughput and latency, and what matters for those.

For some context, this won't be a surprise: lots of people want to deploy large language models, and I'm going to talk about deploying your own large language models with open source tools. There are ways to use great public APIs, and you're welcome to do that as well, but that's not what I'm interested in here. We'll dive into the details of what matters to deploy a 7 billion parameter model. Lots of what I'm going to say also applies at larger sizes, but then you need to use more GPUs and there are a few things to add; I'll give a reference that goes into all those details, but most of what I'll say should still apply.

We'll start with what metrics matter, then what drives these metrics, both at the hardware level and the software level. Then I'll go through a bunch of tricks that can be used to make things better; some of them, as far as I know, are still not implemented everywhere, and lots of them are. And then, yesterday, I spent the day just running a bunch of models on a bunch of different hardware to get curves, because I think real-life examples are good, so I'll talk about those numbers, and then I'll conclude.

OK, so first, what metrics are we interested in? First is throughput, expressed in queries per second; we want to maximize this for batch jobs, or to allow more users to use our service. Second is latency, expressed in seconds per token: how many seconds it takes to output the next token. This drives how fast, how snappy, your application will feel; in ChatGPT it's pretty fast, and the smaller the model, the easier it is to get this really quick. We want to minimize this for user experience. A good threshold to keep in mind is 250 words per minute, which I think is the average reading speed, so as long as your latency is below this, your users won't get bored. And then cost, of course: cheaper is better.
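As a quick conversion of that reading-speed threshold into the seconds-per-token metric defined above (the 1.3 tokens-per-word ratio is a common rough assumption for English text, not a number from the talk):

```python
# Convert the ~250 words/minute reading-speed rule of thumb into a per-token
# latency budget. tokens_per_word is an assumed rough ratio for English text.
words_per_minute = 250
tokens_per_word = 1.3
tokens_per_second = words_per_minute * tokens_per_word / 60
print(f"{tokens_per_second:.1f} tokens/s -> "
      f"{1000 / tokens_per_second:.0f} ms/token budget")
# ~5.4 tokens/s, i.e. roughly 180 ms per token before readers start waiting
```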
OK, so now I'll dive deeper into what drives these metrics. A short note: in everything that follows I'm only going to talk about the autoregressive part of decoding, where we have batch-size tokens that we forward through the network to figure out the next batch of tokens. This excludes the first part of processing, when the query comes in with a possibly large prompt, sometimes called the prefill, where we forward lots of tokens at once through the network. That part of the processing is usually already quite optimized, so it's a bit less challenging.

With that in mind, we're interested in inference for a model of size P; you can assume P is seven billion, because we like that size. To perform one step of inference, we need roughly two times the number of parameters times the batch size in flops, where flops is floating point operations. And to do these flops we need to load the entire model into the part of the GPU that actually runs the computation, so we need to load the entire model once, which is roughly the number of parameters in memory movement.

What's interesting about these two quantities is that the first one is bound by the hardware flops, the number of floating point operations per second your GPU can achieve, and it is linear in the batch size; that's the growing line on my figure. The amount of memory movement doesn't change with the batch size. That's true until you have a very, very large batch size, but as I said, that case is already pretty well optimized, so we don't really care about it. So we have a constant quantity, the size of the model divided by the memory bandwidth, which is the minimum amount of time needed to load everything from memory once, and we have to redo this at every step; and we have a growing quantity which depends on the batch size. They cross at an interesting point which doesn't depend on anything but the hardware: this quantity, two times the total number of flops the hardware can achieve divided by the memory bandwidth, is about 400 on an A10G and an H100.

This batch size B* is super interesting because below it we're pretty much wasting flops: our computations are memory bound, we're just waiting on things to load onto the GPU and computing too fast, and the latency in that part of the graph is constant. If we go beyond B*, the latency starts to increase and we're compute bound. What's really nice about B* is that at exactly this batch size we're in a regime where the latency is optimal, so user experience is optimal, but we're also not wasting any flops, so we're paying the right amount of dollars. However, our ideal batch size B* is around 400, which seems to be quite a lot.
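To make that concrete, here is a minimal sketch of the back-of-the-envelope calculation. The peak-flops and bandwidth figures are assumed dense fp16 spec-sheet values for an A10G and an A100, not numbers from the talk; with dense peaks the crossing point lands around 150-200, and using the roughly 2x higher structured-sparsity peak puts it in the ballpark of the ~400 quoted above.

```python
# Back-of-the-envelope roofline for one decoding step of a 7B model in fp16.
# Hardware figures are assumed spec-sheet values; swap in your own GPU.
P = 7e9        # parameters
BYTES = 2      # bytes per weight in fp16

HW = {
    "A10G":      {"flops": 125e12, "bw": 600e9},
    "A100-80GB": {"flops": 312e12, "bw": 2.0e12},
}

def step_time(batch: int, flops: float, bw: float) -> float:
    compute = 2 * P * batch / flops   # ~2 flops per parameter per generated token
    memory = P * BYTES / bw           # every weight is streamed once per step
    return max(compute, memory)       # whichever bound dominates

for name, h in HW.items():
    b_star = h["flops"] / h["bw"]     # batch size where compute time catches up
    print(f"{name}: B* ~ {b_star:.0f}, "
          f"B=1 latency ~ {1e3 * step_time(1, **h):.0f} ms/token")
```

At batch size 1 the step time is essentially just the time to stream the 14 GB of weights, which is where the flat, memory-bound part of the latency curve comes from.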
So let's run a bunch of numbers for a 7B model; I'll take Llama here, since there are a few differences with Mistral. We have a model dimension of 4K and 32 layers. The model size is easy to compute: two bytes per weight in fp16, so 2 x 7B, which is 14 gigabytes in memory. Then we have the KV cache. The KV cache is used to store computations, so that when we decode a new token we don't have to rerun all of them. Its size has a factor of two because we store both the K cache and the V cache, times two bytes because we're in fp16, times the number of layers, and we have to save this for each element in the batch, one entry per position in the sequence, times the model dimension. If we plug actual numbers into this formula, we find that we need about two gigabytes of memory per element in the batch, for a maximum sequence length of 4K.

So we figure out that on an A10G, and I don't know why there is a G there, but on an A10 with 24 GB of memory, we have a maximum batch size of about five, which isn't much, and on a much bigger A100 with 80 gigabytes of memory we only get a maximum batch size of about 33, which is still way below 400. So it seems that for all practical use cases, when we're doing inference with a 7B model, our decoding is going to be severely memory-bandwidth bound. This also shows something we've been very careful with from the start at Mistral: the size in memory of your model and of your KV cache really impacts the maximum batch size you can run, and this maximum batch size is what makes things efficient or not.

I'll now dive into a few tricks that have been done before, but that I like and are just good ideas. Some of them have made their way into Mistral, some are more at the deployment software level, and some haven't made their way into Mistral yet.

Grouped-query attention is a way to reduce the KV cache size by using fewer keys and values per query. It was used in Llama 2, but only for the larger model sizes, not the 7B. In standard multi-head attention you have exactly as many keys and values as you have queries; in grouped-query attention, one key-value pair is associated with a group of queries. In Mistral we use four queries per key-value pair, so the amount of flops you do stays the same, but you only pay a quarter of the cost in memory. It's a simple trick that doesn't really hurt performance, so it's just a good thing to do.

Then there is quantization. We haven't worked specifically on it, but it developed quite quickly, especially after the Llama release, where lots of great off-the-shelf solutions were used by many people in the open source world to provide int8 or int4 versions of the models. With int8 you divide the model size by two; with int4 you divide it by four. It doesn't change the optimal batch size, because that ratio depends only on the hardware, nothing else. In terms of computation speed it should be about 2x, but we've found that hard to reach with the shapes of our models and of a bunch of other models; we found it more reasonable to expect about 1.5x in terms of pure flops. 2x seems hard to reach, but we haven't spent that much time on it. With int8 you also mechanically increase the memory available for the KV cache, and one thing you immediately divide by two is the time spent loading the model from memory, so if you're in the memory-bound regime, everything will be twice as fast, which is nice. What's also nice is that there is no loss of precision, or very little, with int8; everything seems to work just as well. There is some loss of performance with int4, but it seems it can be recovered with QLoRA, or if you only care about specific uses then it can work as well, and it'll be much cheaper to serve.
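Here is a minimal sizing sketch that puts those numbers together, including the effect of GQA and int8 weights described above. The shapes are Llama-2-7B-like, and the sketch ignores activations and allocator overhead, so treat the outputs as ballpark figures.

```python
# Rough serving-memory budget: weights plus KV cache per sequence.
BYTES = 2                    # fp16
N_LAYERS, HIDDEN = 32, 4096  # Llama-2-7B-like shapes (assumed)
SEQ_LEN = 4096               # maximum context we want to serve

def kv_cache_per_seq(gqa_ratio: int = 1) -> float:
    # 2 tensors (K and V) * bytes * layers * tokens * key/value dimension
    return 2 * BYTES * N_LAYERS * SEQ_LEN * (HIDDEN // gqa_ratio)

def max_batch(gpu_mem_gb: float, gqa_ratio: int = 1,
              weight_bytes: float = 14e9) -> int:
    free = gpu_mem_gb * 1e9 - weight_bytes   # memory left after the weights
    return int(free // kv_cache_per_seq(gqa_ratio))

print(kv_cache_per_seq() / 1e9)     # ~2.1 GB per sequence with multi-head attention
print(max_batch(24))                # ~4-5 sequences on a 24 GB A10
print(max_batch(80))                # ~30 on an 80 GB A100, still far below B*
print(max_batch(24, gqa_ratio=4))   # 4-to-1 GQA shrinks the cache 4x
print(max_batch(24, gqa_ratio=4, weight_bytes=7e9))  # int8 weights free more room
```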
Another great trick is paged attention; this is from the vLLM folks at Berkeley. The KV cache without paged attention is rectangular: you allocate a big old rectangle of memory where one dimension is the batch size, the maximum number of sequences your model can work on at once, and the other dimension is the maximum sequence length you allow people to use. When a new sequence comes in, you allocate an entire row just for that user, which is a bit sad: maybe 10% of your users will use the full row, while most of them will just make short requests, so you end up wasting a lot of precious space in device memory. What paged attention does is allocate blocks in GPU memory: you first load your model, so you know how much space is left, and then you fill everything else with memory blocks. These blocks can hold up to, say, 16 or 32 tokens, and when a new sequence comes in you allocate as many blocks as it needs for the prompt and then slowly grow them as needed. In this drawing you can see that sequences are not necessarily allocated on contiguous blocks; the orange, blue, and green ones, for example, are not contiguous, and that doesn't matter. This gives much better granularity and control over the memory allocation: everything that is completely free on the right can be used for new sequences, and as soon as a sequence finishes decoding you can just release its blocks. That's very nice; I think at the time they claimed something like a 20x throughput increase compared to the standard implementation, which doesn't sound that far off.

One trick we've added in Mistral is sliding-window attention: we've trained our model to only use the past K tokens in the cache. This is great because it gives us a fixed cache size, so once a sequence grows beyond the sliding window we can just rotate in the cache and start overwriting, and this doesn't matter. If you need more insight into why this still lets us use context lengths bigger than the sliding window, we've written a short description of it in the blog post and on the GitHub. The good implementation is to see the KV cache as a rotating buffer: in this drawing, at time t we insert into the last position of the cache, and then at time t+1 we grow beyond the sliding window, so we just overwrite. This is really easy to implement because positions in the cache never matter; everything position-related is encoded with the positional embeddings, so it's all good, and it works well.
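A toy version of that rolling buffer, just to show why overwriting is safe; the array shapes and names are purely illustrative, not Mistral's actual implementation.

```python
# Rolling-buffer KV cache for sliding-window attention: once a sequence grows
# past the window, each new token overwrites the oldest cache slot.
import numpy as np

WINDOW = 4          # sliding-window size (Mistral 7B uses 4096)
HEAD_DIM = 8        # toy head dimension

k_cache = np.zeros((WINDOW, HEAD_DIM))
v_cache = np.zeros((WINDOW, HEAD_DIM))

def write_kv(pos: int, k: np.ndarray, v: np.ndarray) -> None:
    slot = pos % WINDOW   # rotate: where an entry sits in the cache never
    k_cache[slot] = k     # matters, because positions are carried by the
    v_cache[slot] = v     # positional encodings, not by the memory layout

for pos in range(10):     # decode 10 tokens through a window of 4
    write_kv(pos, np.full(HEAD_DIM, float(pos)), np.full(HEAD_DIM, float(pos)))

print(k_cache[:, 0])      # -> [8. 9. 6. 7.]: only the last 4 tokens survive
```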
Another trick is continuous batching. As I said, the prefill phase processes many more tokens simultaneously than the decoding phase, so you can try to batch those tokens together with the decoding tokens. One thing I've noticed in both vLLM and TGI is that they do not chunk the prefill phase: if one user sends a prompt with, say, 4K tokens, it will increase the latency for everyone, because we'll spend a bunch of time working on those tokens all at once, and it's a bit of a waste because we're no longer in the optimal regime with low latency and optimal compute. So one thing that could be added to this software is chunking of the prefill, where you only process K tokens at once; this would let you be much more granular in how you allocate your resources and batch decoding and prefilling much better.

OK, what other tricks do I have? Code. At these model sizes, code is quite important, and you can usually see that the Python overhead is large. I haven't profiled vLLM and TGI exactly, but they run Python code, and my experience is that it usually has overhead at these sizes. There are mitigations that don't lose too many of the benefits of Python: the xformers repo has an amazing example of using CUDA graphs to achieve no overhead; NVIDIA has been teasing TensorRT-LLM, which is another way of basically tracing your inference and then using pattern matching to make everything faster automatically for you, which sounds nice; and you can also use the right custom kernels, like fusions that reduce memory bandwidth, so that instead of shuffling things around in memory, things like activations are computed while the data is already loaded. Usually you find those online and just plug them in.

So, to summarize: what drives throughput and latency is the fixed flops-to-memory-bandwidth ratio of your hardware, which gives a minimal batch size B* that avoids wasting flops, and this depends only on the hardware, not really on the model, unless you use an exotic architecture that is not the Transformer. We have limited on-device memory, which makes it not completely trivial to reach the optimal batch size. And the two open source libraries I've checked for deploying models still run Python code, which at these sizes incurs a bunch of overhead; I also looked at FasterTransformer, which has no overhead but is much harder to deploy. Lots of this information is taken from a great blog post, so feel free to go there and get into deeper details if you want.

Now let's talk about the throughput-latency plane, which is how I usually look at these metrics. In this plane we have latency on the x-axis and throughput on the y-axis, and the direction we're interested in is up and to the left, where we have better throughput and less latency. If you buy better hardware, it will shift your curve; whoops, I think I skipped ahead. For fixed hardware, the regime at the bottom left is fixed latency, the memory-bound regime, and then as your batch size increases you get into the linear, flops-bound regime. If you buy better hardware it will cost you more, but everything shifts up and to the left, which may or may not be interesting. And if you get better code, or better models, most of the impact happens in the low-latency regime, where you increase throughput; it has less impact at large batch sizes, because things there are already easy to optimize.
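A minimal sketch of where that curve shape comes from, using the same roofline model and the same assumed A10G numbers as before; real measurements sit below this because it ignores attention, KV-cache reads and framework overhead.

```python
# Throughput/latency curve from the roofline model: per-token latency is flat
# while memory-bound, then grows linearly once compute-bound, and throughput
# is simply batch_size / latency.
P, BYTES = 7e9, 2
FLOPS, BW = 125e12, 600e9   # assumed A10G dense fp16 peak and memory bandwidth

def latency(batch: int) -> float:   # seconds per decoding step
    return max(2 * P * batch / FLOPS, P * BYTES / BW)

for batch in (1, 8, 32, 128, 512):
    lat = latency(batch)
    print(f"batch={batch:4d}  latency={1e3 * lat:6.1f} ms/token  "
          f"throughput={batch / lat:8.0f} tokens/s")
```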
So, some example results, with some disclaimers: I did this yesterday, quite quickly. Provisioning Mistral and Llama was easy, and then I ran the vLLM benchmarking script. I don't know if these results are the absolute best we can get, but I think they're directionally correct, and I copy-pasted matplotlib plots, so you might be blind after this.

This is Mistral versus Llama: just the change to GQA gives us about 1.5x the throughput on vLLM, with the black bar being the human reading speed. I'll go through this quickly, I'm almost done. This one is changing the hardware, an A10 versus an H100, with the same model, and we can see that even though the H100 is much more expensive, it's also very fast, so it can be worth it to change hardware instead of buying more of the old hardware.

So, to conclude: it's really easy to serve a small model on small instances with open source code; it works quite well without doing anything. I think I can get Mistral 7B on an A10 to serve a million requests for something like $15 a day, which isn't that much, and changing the precision would pretty much double the number of requests served. The open source deployment solutions have done an amazing job at being very usable; I think there is a lot of work still to be done on the actual model code part. And I've given a bunch of tricks that have been or will be implemented, so I guess things will just keep getting faster for everyone, which is nice. That's it, if you've got any questions.

Oh, there are questions, my man. Merci beaucoup, that was awesome, I really appreciate it. I think I knew it was going to be a good talk when you came out and told us you wrote the abstract a month ago but wrote the actual presentation last night. There are some awesome questions coming through in the chat right now. The first one is about the best way to decide which processor to use for a certain model, and I just want to tag on top of that, because it's in the same vein: when would you use a dedicated AI accelerator system, like those from SambaNova?

I haven't tested dedicated AI hardware. I've tested a bunch of GPUs, and I haven't even run any of the models on my MacBook, because I haven't found a use for it yet, but I might. For users, if you just want to chat with the model, it's much better to run it on your MacBook; that's just much cheaper. The indication I gave, the lowest bound for when it becomes useful to use an A10, is a million requests a day; that's about $15, and if you can afford $15 a day then go for it, it'll be easy to deploy and it'll just work. Then, to know what size of hardware to use: since it's so easy to deploy everywhere, my strategy is to start with the cheapest and move up if I don't get the throughput or the speed I want.

That's awesome. There are so many great questions coming through here, so I'm going to go by the ones that are most upvoted. Somebody's asking: why are companies finding it a strategic goal to open source LLMs? You guys just did it, what was the rationale and thinking around that?

I mean, it's a good 7B. We found it fun, and there are a lot more things in the works, so this lets the community develop things on top of our work. It's been great for Llama; its use just exploded. I mentioned the quantization work, but all of the great serving methods really got a boost from that, because it became super easy to deploy good models. Llama.cpp, for example, which lets people run things on a Mac, was amazing. It's just a great community to feed great models to.
Awesome. So, to reduce the overhead of Python, do you recommend using Mojo? Have you played around with that at all?

Not at all. My first experience trying to reduce overhead was CUDA graphs; it was a bit painful to debug at the time, but it's gotten better, and I think the xformers example is a great showcase of this. torch.compile might at some point also be a good way of doing this; I don't know where they are with variable sequence lengths and everything, but I'd really recommend CUDA graphs first, that would be my go-to right now.

Excellent. All right, this one is a bit different, but feel free to take the fifth: how do you think about ethics in your work at Mistral, in particular when it comes to selecting training data?

We don't talk about anything regarding training; we only talk about what we release and what people can deploy. On ethics, we like to think it's better to put moderation systems on top of the models rather than try to really bake it in immediately.

I am in the camp that thanks you for that, and I know there are a lot of people who also thank you for that. I just wanted to mention that it's so nice to see that as an option now, because up until now it feels like we haven't had that option.

Yeah, it's a bit of a different position, but I think it's reasonable. It also comes from our position of being a bit more enterprise-focused in terms of business: in an enterprise setting it's quite easy to convince people that they can moderate things however they want, and if they don't expect internal users to do weird things, that's fine as well.

So you mentioned the GPUs, the A100s and the H100s, and that trade-off, and you showed that graph going up and to the right, whether that's important to you or not. But somebody is asking: what's the best way to decide which processor is best for a certain model?

Yeah, I just try them in increasing order. One thing I mentioned is that in my graph I think it was worth it, cost-wise, to use an H100 rather than a bunch of A10s. The problem is also often availability, so I'd go in order of cost and in order of availability and try them. If you try them for twenty minutes it's quite cheap, and that's about the maximum time it takes to run your benchmarks; then you get exactly the cost and performance for your use case, which I think is better.

All right, last one for you, and this speaks to your French roots: what would be the most efficient strategy if we wanted an LLM to be multilingual, understanding French for example? Today the datasets are mainly in English, so fine-tuning is not as efficient with non-English data.

Yeah, everything that gives new capabilities to an LLM has to do with the data, so step one would be to get data in the language you target. I think all LLMs are trained on Wikipedia, which is a great base, and that's why, even without much effort, the model can speak a bit of French. On making them more multilingual, I think the trade-off is that if you start making them better at French, you'll lose a bit in other languages, maybe not noticeably, but if your goal is to target benchmarks, then you might lose in English because you've dropped 0.1% while gaining 10% on all the other languages.
So yeah, first get some multilingual data, then train on it, and if you lose 0.1% in English, maybe that's not so bad, because you'll gain so much in the other languages.

Dude, awesome, and thank you so much for coming on here and doing this, this is really cool. I love how it coincided; I know we talked about getting you on here like two months ago, and it coincided with the release of the model, so I'm very happy to see what you all are doing. I'm excited that you dropped a few little breadcrumbs saying there's more to come; I would expect nothing less from you, to be honest, and I appreciate you giving this talk. Thanks.
Info
Channel: MLOps.community
Views: 6,186
Keywords: Mistral AI, LLM inference
Id: mYRqvB1_gRk
Length: 30min 25sec (1825 seconds)
Published: Wed Oct 25 2023