Getting Started with TensorRT-LLM

Captions
Okay, hi, welcome to this Jupyter notebook tutorial on TensorRT-LLM. Today we're going to walk through the installation process and set up TensorRT-LLM on your Windows or Linux computer; you can also run this notebook on Google Colab. We're going to talk about the (fairly) newly released SDK for boosting inference speed specifically for large language models. We'll go through this notebook, which essentially just runs through the installation steps from the official documentation, and we're going to optimize the BLOOM 560-million-parameter model, look at the different techniques this toolkit uses to speed up inference, and also look at performance, execution time, and other metrics.

If you have a Unix machine you can simply install Jupyter and run the notebook from there. If you don't have a Unix machine, only a Windows laptop, you can also upload the notebook to Google Colab and run it there. I happen to have a Windows machine, so I use WSL: I open a new terminal, run bash, and from there you can either launch jupyter lab or jupyter notebook directly, or go to the official repo of the SDK and run its Docker command, which spawns an Ubuntu container with all the packages you need. From there we install the remaining dependencies.

The first cell essentially just installs the NVIDIA Container Toolkit. We use the apt route; I think it's the easiest and the one most people are familiar with. After the NVIDIA Container Toolkit is installed we can clone the TensorRT-LLM repo. I already did all of that, so I don't have to run these cells, but feel free to run them alongside this tutorial. Once you've got the repo, we also install the other dependencies; we're essentially just running the commands from the README, nothing special.

Once the environment is prepared we can download the BLOOM model. BLOOM is essentially just another off-the-shelf LLM; this one has 560 million parameters and, as you can imagine, it's simply an LLM trained to do general tasks. We start by installing the requirements through pip, then define a path, delete anything that might already be in that subdirectory, and download the model; that's going to take a couple of minutes.

Once that's done we can start converting the model. The BLOOM model we downloaded is in the Hugging Face format, and we can use the Python script NVIDIA already provides, convert_checkpoint.py. It wants the model path, which we defined earlier, and an output path, which we call fp16/1-gpu. On Google Colab you only have access to one GPU, and I also only have my RTX 4080, but it's possible to use a cluster with multiple GPUs and even multiple nodes. Another interesting argument is the data type: as you may know, the weights in a neural network, and other values such as the activations, can be stored at different precisions, and here we establish a default precision for our engine, float16. The interesting thing about precision is that, depending on your hardware architecture, values often run faster on the GPU when they are quantized down. We're going to see later how that pans out in terms of inference speed, but the takeaway is that you can quantize the precision, for example to float16 or INT8, and that boosts inference speed.
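For reference, the conversion cell looks roughly like the sketch below. This is a hedged sketch based on the bloom example in the TensorRT-LLM repo; the script location, flags, and paths are assumptions that can differ between releases, so check examples/bloom/README.md for your version.

    %%bash
    # Hypothetical conversion cell: turn the downloaded Hugging Face BLOOM-560M
    # checkpoint into a TensorRT-LLM checkpoint with float16 as the default dtype.
    # Paths and flags are illustrative only.
    python TensorRT-LLM/examples/bloom/convert_checkpoint.py \
        --model_dir ./bloom/560M \
        --dtype float16 \
        --output_dir ./bloom/560M/trt_ckpt/fp16/1-gpu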
So we run that cell. You can see it loads the model and converts it from the Hugging Face format to the TensorRT-LLM format; loading the weights took 5 seconds and the whole conversion took 26 seconds. You can also see a rank printed here, so I assume there can be multiple nodes and multiple workers.

The next thing we do is run the trtllm-build command. It wants the path of the checkpoint we just produced, so this path and this path are the same. We also tell it to use a plugin called GEMM, which stands for general matrix multiplication; the plugin also asks for a precision, which needs to be the same one we used above. I guess we all know why matrix multiplications are important for inference speed; they're the bread and butter of AI computation. Last but not least there's the output directory, which is going to be the same as the input, and we run that. Keep in mind what's basically happening here: TensorRT-LLM is optimizing this specific model for your specific hardware. If you were to run the model on different hardware, you obviously have to compile it again, i.e. run this command again, and the same goes for changing the model or its parameters. But you can serialize the resulting engine and load it up later. We can even see it here: serializing the engine took 7 seconds and the total build time was 1 minute and 14 seconds. Again, you do this once for your specific setup and GPU, and you can then load the engine in different scenarios or Python environments.

The third version of the model we want to compile is basically the same as the one we just built, a TensorRT-LLM-optimized version, but this time using INT8 quantization: we scale down the precision of only the weights to INT8. It's the same convert_checkpoint command we ran before, with the model path and dtype float16, but now with weight-only quantization enabled. If we look into the conversion code, we can see that it quantizes the weights of the various GEMMs to INT4 or INT8; that choice is made automatically and has nothing to do with the dtype we pass here. So what's the difference? If you look at the different variables we could quantize in this diagram, we only quantize the weights, i.e. everything on the left, the factors the input gets multiplied by. There are other values inside the network, for example the activations, and those can stay in float16 while the weights are quantized down to INT4 or INT8. So we do the same steps as before, and the result is another TensorRT-LLM-optimized model, but with INT8 quantization for the weights only.
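Roughly, the engine build and the weight-only INT8 variant might look like the following. Again a hedged sketch: the flag names and directory layout are assumed from the repo's bloom example and are not guaranteed for every release.

    %%bash
    # Hypothetical cells: build the float16 engine from the converted checkpoint,
    # then redo the conversion with weight-only INT8 quantization and build that
    # engine too. Flags and paths are illustrative and version-dependent.
    trtllm-build \
        --checkpoint_dir ./bloom/560M/trt_ckpt/fp16/1-gpu \
        --gemm_plugin float16 \
        --output_dir ./bloom/560M/trt_engines/fp16/1-gpu

    # Same conversion script as before, but quantizing only the weights to INT8.
    python TensorRT-LLM/examples/bloom/convert_checkpoint.py \
        --model_dir ./bloom/560M \
        --dtype float16 \
        --use_weight_only \
        --weight_only_precision int8 \
        --output_dir ./bloom/560M/trt_ckpt/int8_wo/1-gpu

    trtllm-build \
        --checkpoint_dir ./bloom/560M/trt_ckpt/int8_wo/1-gpu \
        --gemm_plugin float16 \
        --output_dir ./bloom/560M/trt_engines/int8_wo/1-gpu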
While this cell is running, we can take a look at the release notes of this SDK, because they describe an essential part of what helps TensorRT-LLM make these models faster. One piece is FlashAttention, which, very much simplified, leverages knowledge about your hardware to do the computations in the attention layers of the LLM efficiently. The other is masked multi-head attention, which also plays into parallelism; I'll link two articles or YouTube videos that explain both mechanisms, I think they're very interesting. There are other improvements in the highlights, such as multi-GPU, multi-node inference. A big one is in-flight batching: usually a neural network waits for one input, that input goes through the entire pipeline, we get an output, and only then does the next input start processing. With in-flight batching you feed in an entire batch, and before the first item is even fully finished, other items can already enter the pipeline and start being processed.

Our cell has finished in the meantime: serialization took 5 seconds and the total time was 1 minute and 1 second, in the same ballpark as the non-quantized version.

Cool, now we have our three candidate models and we want to look into benchmarking, and NVIDIA also has an excellent script here that we can use. We run the three models through a summarization task, and for summarization you can use a metric called ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation; it gives you a score that hints at the quality of a summary. I know this isn't super in-depth; it's not going to be a scientifically accurate evaluation or a proper benchmark, it's just for us to get some results.

So we run the three models through the summaries using the script NVIDIA provides. We don't have the ability to load the results directly into Python variables, so I use the magic command in Jupyter that captures the standard output, and then we parse the values out of that output so we can plot them. As you can see, I'm also using the Linux time command; that's not going to be super accurate either, but we'll just use the real (wall-clock) time, which should give us some intuition about the speed-up. Just to give you an idea of what the output looks like when we don't capture it: loading the engine took around 5 seconds, you can see the input it tries to summarize, all the scores in text form, and the output from the time command. Those are the scores we want to parse into Python so we can plot them. That actually didn't take too long: we define a helper function that parses this whole text block, run it, and save the execution time, the ROUGE scores, and, for the TRT-optimized models, the latency, total tokens, and tokens per second.
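For intuition, the capture-and-parse idea can be sketched like this. The %%capture cell magic and the time command are real, but the helper function, its regular expressions, and the exact output format of the benchmark script are assumptions for illustration only.

    # Hypothetical parsing helper. A cell such as
    #
    #   %%capture summarize_output
    #   !time python TensorRT-LLM/examples/summarize.py ...   # flags omitted
    #
    # stores everything printed to stdout/stderr in `summarize_output`; we then
    # pull the numbers we care about out of that captured text.
    import re

    def parse_summarize_output(text: str) -> dict:
        """Extract rough metrics from the captured benchmark output."""
        metrics = {}
        # ROUGE scores tend to be printed as lines like "rouge1 : 15.6"
        # (the exact format may differ between versions).
        for name in ("rouge1", "rouge2", "rougeL"):
            match = re.search(rf"{name}\s*[:=]\s*([\d.]+)", text)
            if match:
                metrics[name] = float(match.group(1))
        # The Linux time command reports wall-clock time as e.g. "real 0m42.317s".
        match = re.search(r"real\s+(\d+)m([\d.]+)s", text)
        if match:
            metrics["real_seconds"] = int(match.group(1)) * 60 + float(match.group(2))
        return metrics

    # Usage, assuming the %%capture cell above has already run:
    # metrics = parse_summarize_output(summarize_output.stdout + summarize_output.stderr)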
Okay, very exciting: our first graph. We can already see the time comparison between the Hugging Face model, the default TensorRT-LLM model, and the quantized model, and, needless to say, it's impressive: TRT cuts the inference time in half, at least for this scenario and setup, and the INT8 quantization gives a small additional decrease in execution time. Then we have the ROUGE-1 score, where higher is better, and this is the interesting part: the Hugging Face model sits at about 15, TRT is even better than that, and the INT8 model, which uses less precision, is better still. Not to go too deep into the theory, and I'm no expert in information theory and information loss, but it's a fascinating phenomenon that quantizing the precision down can sometimes even give you better results. Absolutely amazing.

To go a bit more in depth on ROUGE: you can think of ROUGE-1 as asking, roughly, do the words in my reference summary also appear in the model's output summary (there's a short code sketch of this idea at the end of these captions). ROUGE-2 is a bit more involved: it asks whether the bigrams, i.e. word sequences of length two, also match, and the same idea extends to three-, four-, five-, or general n-grams. ROUGE-L is based on the longest common subsequence between the output and the reference, so it rewards longer in-order matches; I hope I explained that correctly.

Last but not least, a slightly different pair of plots: latency and tokens per second. For our TRT models the latency is a little over 10 seconds for the default model and a bit lower for the INT8 model, and tokens per second is, unsurprisingly, higher for the quantized model. Tokens per second is usually the metric that most Medium articles and research papers optimize towards, so that's interesting to see.

And that's basically it. From here you could experiment: start by downloading different models from Hugging Face (as far as I know you can use any Hugging Face model), experiment with quantization again, and have a look at the examples README from NVIDIA; this Jupyter notebook was basically built off that page. Two interesting techniques you could try out if you're curious are INT8 KV caching and SmoothQuant. So if you're interested, try it out, plot different metrics, maybe run it through different benchmarks, experiment a lot, plot a lot of graphs, and share it with us. Thank you so much for watching, I hope you enjoyed this tutorial and you're ready to get started with TensorRT-LLM.
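As a footnote to the ROUGE discussion above, here is a tiny, self-contained illustration of the unigram-recall idea behind ROUGE-1. Real implementations also handle tokenization, stemming, precision, and F-measure; this sketch is only meant to build intuition.

    from collections import Counter

    def rouge1_recall(reference: str, candidate: str) -> float:
        """Fraction of reference unigrams that also appear in the candidate summary."""
        ref_counts = Counter(reference.lower().split())
        cand_counts = Counter(candidate.lower().split())
        # Each reference word counts at most as often as it occurs in the candidate.
        overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
        return overlap / max(sum(ref_counts.values()), 1)

    # Example: three of the four reference words appear in the candidate, so 0.75.
    print(rouge1_recall("the cat sat outside", "the cat sat on the mat"))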
Info
Channel: Long's Short-Term Memory
Views: 957
Id: TwWqPnuNHV8
Length: 14min 20sec (860 seconds)
Published: Thu Mar 07 2024