How to Quantize an LLM with GGUF or AWQ

Captions
I'm going to show you how to quantize large language models using GGUF or AWQ. If you're not familiar with quantization, it's what allows us to fit large language models onto smaller devices or smaller GPUs. I'll take you through the various methods available and the state-of-the-art ones that give you top performance, then I'll go right through an example for AWQ and another for GGUF, which is used for laptops like a Mac M1 or even for Windows. Let's start things off with a little presentation on how to quantize an LLM with GGUF or AWQ.

So first, why quantize? Most of you watching probably know, but quantization allows you to fit a model on a smaller GPU. For example, loading Llama 70B would normally take two or maybe three A100 GPUs with 80 GB of VRAM each, but if you quantize you can run Llama 70B on an A6000 with 48 GB of VRAM, or on a single A100. You can also run a quantized Llama 7B on a Mac M1; in fact, on a very powerful Mac you can probably even run Llama 70B on an M2 if you use GGUF, whereas normally even running 7B would take maybe an A100 with 40 GB. So you can run on much smaller devices, which obviously saves money and also makes things more accessible for smaller companies and individuals who are running tests.

What is quantization? Really briefly (there's another video on AWQ where I describe it more, and an earlier video from Trelis Research you can check out), the basic idea is that models are trained with 32 bits, or 16 bits, representing each number, and that's a lot of bits. That full precision may not be needed after training, so quantization simplifies the model down: it approximates 32-bit or 16-bit numbers with a 4-bit representation, which saves a lot of storage space, so you can fit the model onto your GPU with far fewer total bytes. Typically the models are thought of in 16-bit format, so if you quantize to 4 bits you can think of it as roughly a 4x reduction in the VRAM you need (maybe you only get 3x, but something like that).

So which quantization should you use? There's probably a lot more nuance, but I'm trying to simplify this down for the video. I'd say for a laptop you want GGUF; if you're using a Mac, it runs really well with GGUF, which was formerly called GGML and is supported by the llama.cpp repository. If you're running any other way, then I'd recommend AWQ, which is activation-aware quantization. It's one of the latest quantization methods and it tends to have advantages over GPTQ, which is what I would have recommended a month ago because it was probably the best option at the time. Lastly, for fine-tuning I'd probably recommend bitsandbytes with the NF4 data type; that's typically what I use. I would love to move to AWQ for fine-tuning, but it doesn't have LoRA support available yet, which I'll talk about in the next slide. So overall I'd say: if you're running on a laptop, use GGUF, and if you're running on a GPU, use AWQ.
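As a rough sketch of the memory arithmetic behind those numbers (weights only; the KV cache, activations, and framework overhead add several gigabytes on top, and real 4-bit formats also store scales and zero points, so in practice you land closer to a 3x to 4x saving):

```python
# Back-of-envelope estimate of weight memory at different precisions.
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8
    return n_params_billion * bytes_per_weight  # 1e9 params * bytes / 1e9 = GB

for n in (7, 70):
    print(f"{n}B params: fp16 ~ {weight_memory_gb(n, 16):.1f} GB, "
          f"4-bit ~ {weight_memory_gb(n, 4):.1f} GB")

# 7B params:  fp16 ~ 14.0 GB,  4-bit ~ 3.5 GB
# 70B params: fp16 ~ 140.0 GB, 4-bit ~ 35.0 GB  (hence 70B fitting on a 48 GB A6000)
```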
Now I'm going to give you an overview of these four types of quantization. The two we'll focus on today are AWQ and GGUF, and those are the ones I'll go through in detail, but it's good to understand how they all compare.

The first thing to think about is data dependence, which is the question of whether you need a data set in order to quantize. If you do use a data set to quantize, it helps you identify which activations, or which weights in the matrix, are important and which ones to prioritize for quality. However, if you do need a data set, then your results are only going to be predictably good on similar data sets: if you quantize with data set A, you may get poor performance on data set B. So there are pros and cons to using a data set. AWQ and GPTQ both rely on data sets; GGUF and bitsandbytes do not. GPTQ is, in a sense, extra dependent on the data set, because not only does the data set influence which weights to protect (as in AWQ), but GPTQ further calculates the loss associated with quantization and reweights accordingly. That reweighting, which is kind of a second-order effect, is also data dependent, and that probably makes GPTQ even less robust if you use it on data that isn't like what was used for quantization.

The second topic, related to data dependence, is quantizing on the fly: can you take a full-precision model and only quantize it when you're doing inference? With bitsandbytes NF4 this is possible. There are packages like text-generation-inference (if you look at any of my videos on API or server setup, I use text-generation-inference) that will take a full model and quantize it on the fly when you load it, so you don't need to quantize the model in advance, which is very nice. Of course, this is harder with AWQ and GPTQ because they are data-set dependent, which means you'd have to choose a data set and go through the quantization, so it's harder to do on the fly. With GGUF it's possible in principle to quantize on the fly and load the result onto your computer, but it's less common; it's not that it's impossible, but I haven't seen it done too often. It's more common to first quantize the GGUF model, download it, and then run inference.

Next up we have speed. Speed is, I'd say, fast for GGUF and AWQ and slower for bitsandbytes and GPTQ. GPTQ isn't really all that much slower than AWQ, but if you want good quality you need to use a GPTQ quantization that has act-order turned on. Act-order means activation order: doing the quantization according to the order of the size of the activations, which protects the more important activations. If you don't have act-order turned on when quantizing with GPTQ, you'll hamper the results in quite a lot of cases. So basically, if you run GPTQ with good quality, you're going to see a significant slowdown, because having act-order turned on requires reordering during inference, and that tends to slow things down.

The next thing is LoRA fine-tuning: can you do a LoRA fine-tune in quantized form? The most straightforward option to fine-tune in is bitsandbytes, where LoRA is possible. It's also possible with GPTQ. It's now possible with GGUF too, although I'd say it's probably less straightforward because it's less common, so there are fewer scripts available that do it. AWQ does not yet have LoRA fine-tuning; it would be really nice to have that, though.
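For reference, here is roughly what that bitsandbytes NF4 loading looks like: the same on-the-fly 4-bit load that inference servers use and that LoRA fine-tuning builds on. This is a minimal sketch, assuming transformers, bitsandbytes, and accelerate are installed; the model id is a placeholder.

```python
# Load a full-precision checkpoint in 4-bit NF4 on the fly with bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # placeholder: any causal LM on the Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # the NF4 data type discussed above
    bnb_4bit_compute_dtype=torch.float16, # do the matmuls in 16-bit for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)
# Nothing quantized is saved to disk: the weights are converted to 4-bit as they
# load, which is exactly the "on the fly" behaviour described above.
```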
The next question is whether you can merge adapters. When you do a LoRA fine-tune you have an adapter and you have the frozen base model, and it would be nice to combine the two. It's convenient because people then only have to download one model, but it also saves you an addition operation on the GPU, so it's nice if you can merge. Unfortunately, it's not very straightforward to merge in any approach, really. In GGUF there may actually be a way to merge; I need to read more through the LoRA fine-tuning part of the repo. In bitsandbytes you can merge into the unquantized base model, which is not bad; it affects the accuracy a little bit, because ideally you would merge onto a de-quantized base model, but now that's getting technical. In AWQ and GPTQ I'm not aware of ways to merge the models, unfortunately.

Next up is whether you can save the model in quantized form, and I think this has been the big benefit of GGUF, and also of GPTQ and AWQ: it's easy to save the model and put it onto Hugging Face in a format that can easily be uploaded and downloaded. This isn't possible with bitsandbytes at the moment (there have been pull requests open for quite a while), which means that if you want to share your bitsandbytes-quantized model on Hugging Face, you have to merge it back to a base model, push the full base model, and then let people quantize on the fly when they run inference on that model.

So overall, the reason I'm going to focus here on GGUF and AWQ is that GGUF is built on a C library, llama.cpp, a fairly fundamental library that doesn't have a lot of abstraction to it, so it's quite efficient, particularly on Macs, and it's used very widely for doing inference on laptops. My choice of AWQ over GPTQ is because AWQ is faster than GPTQ when act-order is turned on, so for similar performance results (or sometimes better, if you're working on a data set that's not the same as the quantization data set), you'll get slightly better performance with AWQ and generally better speed than GPTQ. For bitsandbytes NF4 you can take a look at any of my fine-tuning videos; the reason I'm not going to demonstrate that quantization here is that it's done on the fly, so there's no need to prepare those models in advance in quantized format.
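Circling back to the adapter-merging point above, here is a minimal sketch of the usual bitsandbytes-style route: merge the LoRA adapter into the unquantized base model with PEFT and save the result. The base id and adapter path are placeholders.

```python
# Merge a LoRA adapter into an unquantized base model, so users only need to
# download a single checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "your-org/base-model"       # placeholder: the frozen base model
adapter_dir = "path/to/lora-adapter"  # placeholder: output of the LoRA fine-tune

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_id)
merged.save_pretrained("merged-model", safe_serialization=True)  # safetensors
tokenizer.save_pretrained("merged-model")
```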
I'll now show you how we do the quantization for AWQ. I'm going to use RunPod because I want a GPU that's compatible with AWQ, which relies on the more modern GPUs, i.e. not a T4 like you get in a free Colab environment. So I log into RunPod, go to Secure Cloud, and select an A6000. It's important when you deploy to make sure you give yourself enough disk space. This is way more space than we need for the model we're going to run, because it's small, but you don't want to end up with insufficient space while you're running the notebook and then have to go back and resize the volume.

The model we're going to quantize for AWQ is TinyLlama. It's a small version of Llama; this is the 1-trillion-token checkpoint, and it's going to be trained for three trillion tokens in total. You should check out the video on TinyLlama if you want to learn more, but this model is publicly available, it's not gated, and it's what we're going to work with today.

RunPod is just about loaded here, and once it has started up I'm going to connect and go to the Jupyter notebook. The pod is now loaded, we have the option to connect to JupyterLab, and once I'm connected I'm going to upload the notebook I have for quantizing with AWQ. I've just uploaded the AWQ notebook and we're ready to get started.

The first thing I'll do is install AutoAWQ. After that I'll install Transformers from source, which means the development version; that's actually required right now. It may have changed if you're watching this video quite some time after release, in which case you could just do pip install transformers, but right now we need the latest development version for this to work correctly. After that we log into Hugging Face. This is necessary because we're going to push a model up to the Hugging Face Hub in quantized form, so I'll run that cell and log in, which means putting in my credentials. That's something you might decide to do before running the installation, but either way is fine.

After that we import and load the model of interest, which is the TinyLlama model; you can see the model name there. The model and the tokenizer will be loaded, and then quantization will start; AutoAWQ is what looks after the quantization. Note that I've chosen safetensors equals True. I highly recommend using the safetensors format: it's much quicker to download, so your downloads of the model will be a lot faster (your uploads are probably quicker too), and it's better than the basic PyTorch format for security as well, because safetensors doesn't allow people to inject code, so you can download safetensors models knowing they're less vulnerable.

Up here we've completed the install, and now I'll put in one of my Hugging Face tokens. I've authenticated, so that's done. Next up we start to load the model; it's probably going to be quite quick because it's only about a 1-billion-parameter model. You can see the model is 4 GB in size in float16 format; after quantization it will be less than 1 GB. Once the model is loaded we still have to go through quantization; we'll see a progress bar for that momentarily, and when it's complete we'll have the opportunity to upload files to Hugging Face. For that to work we need to have first created a repository on Hugging Face. I'm going to create a repository called TinyLlama under Trellis; I've actually already created it, so I can just show you right here. My standard is to name it with -AWQ at the end.

You can see that data is being downloaded: AWQ requires data to quantize. The data is used to decide which activations are relevant, and the quantization then protects those activations in the matrices. Of course, activations are the product of the matrix in the language model and the input, so that's why you need some data to provide inputs so the activations can be calculated. You can see that quantization is underway; it's currently at 5% (let me increase the size of my screen here). It's moving forward in 22 steps, and that's because there are 22 layers in the TinyLlama model, I believe.
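The core of that notebook looks roughly like the following. This is a sketch based on the AutoAWQ examples; the quant_config keys and save arguments may differ between AutoAWQ versions, and the model id and output folder are placeholders.

```python
# Quantize a model to 4-bit AWQ with AutoAWQ, using its built-in calibration data.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "your-org/tiny-llama-checkpoint"  # placeholder: the FP16 base model
quant_dir = "tiny-llama-awq"                 # placeholder: local output folder

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# This step downloads a small calibration set, measures activations, and
# quantizes the weights while protecting the most important channels.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_dir, safetensors=True)  # save in safetensors format
tokenizer.save_pretrained(quant_dir)
```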
Once that's done, we'll be ready to push the model to Hugging Face, then upload the configuration and tokenizer files, and after all of that we'll try to run the model in quantized form. For that I'll copy in the repo I've just created, and then we'll be able to run inference and see what the output is to the question of what planets are in our solar system.

Let's take a look up here to see how the quantization is going; we're at 27%. One thing I can show you in the meantime, as good practice, is that it's worthwhile creating a model card. Here is the base model, and I'm going to edit the base model's card. Of course I don't have permission to commit those edits, but I can just copy them. I'll create a model card and paste in the base model's card, and above it I'll say "AWQ version of TinyLlama at one trillion tokens", then "original model card follows below", and that should be pretty much it. I can add some tags, like awq and tiny llama, and then commit; the commit message can just be "card". So we now have a model card, and once we've pushed all the files we'll see them appearing under Files.

First we have to let the quantization complete, which it nearly has; it's just going to take a second here. Quantization is now complete and the model files are being pushed. Again, it's a fairly small file, which is why it won't take long to push up to the Hub: it's about 766 megabytes, so nicely under a gigabyte. The model has been pushed, and now we push all the other files, which go up almost instantaneously. If I go back to the repo I should see all the files appearing, which they have, and the model is in safetensors format, which is what we want.

Now we're running things in reverse: we're downloading the files and then we're going to try to run inference. All of the files have been downloaded (it was very quick because we're downloading safetensors), the layers are being replaced with the quantized versions, and next we run a prompt. Here is the output, and it's actually really good. This is just the 1-billion-parameter model, I've asked what planets are in the solar system, and it's getting the eight planets, which is really nice. I wouldn't worry about this last part; it's just that I probably haven't correctly set the tokenizer to recognize the end-of-sequence token being used. But this is pretty amazing in terms of quality. So that's a quick overview of AWQ, folks.
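For completeness, loading the pushed AWQ checkpoint back and asking the planets question looks roughly like this. The from_quantized arguments have shifted between AutoAWQ releases, so treat this as illustrative rather than exact; the repo id is a placeholder.

```python
# Load the quantized AWQ checkpoint and run a quick generation test on the GPU.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_repo = "your-org/tiny-llama-awq"  # placeholder: the repo pushed above

model = AutoAWQForCausalLM.from_quantized(quant_repo, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_repo)

prompt = "What planets are in our solar system?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```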
Next up, we're going to quantize using GGUF, which produces models for laptops like Macs. It also works well on GPUs, although it's probably not quite as optimized as something like AWQ in terms of performance. So I'm going to upload the script for GGUF quantization. We're right here in the script; GGUF used to be called GGML, so I'm going to update that naming right here, and now we'll go ahead with our GGUF quantization. We'll do it on the exact same model, the 1B model, and we'll again log into Hugging Face so that we can push to the Hub, so I'll put in my token here and there we go.

You can also run this in Google Colab. A free Colab notebook works if you're quantizing a 7B model, and you can use a Pro notebook with high RAM for 13B. For a larger model it's hard even with the A100 from Colab, which only has 40 GB of VRAM, so you'd probably need to come over to RunPod or a similar service. We're not going to use Google Drive, and we're not going to set the cache directory to Google Drive; we'll just set the cache directory to the root directory, which is the storage on RunPod. This bit of code here is sometimes needed to make sure commands run in Google Colab, but I won't need it right now.

Next, I'm going to install some of the libraries needed to run GGUF. With GGUF, the quantization is done by llama.cpp, so effectively we're going to install llama.cpp; we'll install some tokenizer libraries as well, and then we'll be able to start quantizing. Here I'm just importing everything I need, and next I'll set the model name that we want, which is going to be this model here. Here's the model we want, so I'll copy that and put it in as the model name, and we're going to load it. You can load it to CPU or GPU; it shouldn't really matter too much, so I'll just set that to auto, and we can start downloading the model. I actually made a mistake there: I need to set the model name before I run the cell, so let's try that again. Now it's downloading the model, 4 GB as we expected.

Once the model is downloaded, we clone llama.cpp, which is what does the quantization, and when that's cloned we cd into that directory and save the model into it. That should all be done now: llama.cpp is cloned, and you can see there's a models folder with the PyTorch model that has just been downloaded inside it. There's one more thing we need to do, which is to get the tokenizer information required for the quantization, so I've just run that cell and all of the required tokenizer information has been copied in. We can move on now to compiling llama.cpp; we'll make it using OpenBLAS, which is a linear algebra library. Once that's done (we don't really need to list these files), we install any other requirements for llama.cpp, and then finally we can move on to the quantization. The first step is converting to float16, a 16-bit format, and once that's done we convert the float16 model into a 4-bit model, which is the Q4_K quantization here. So we're ready to just run that cell; we've already got the tokenizer installed, so that's fine.

Finally, once all of that's done, we're going to push the model to Hugging Face, and that starts with creating a repository. I'll go back up towards the top of the script, copy this name here, and go over to Hugging Face. I'll create a new model under Trellis and put -GGUF at the end, then create that model. I'll also create a model card; I can take a shortcut here by going to Trellis and grabbing the AWQ model card we already created, pasting that in, changing AWQ to GGUF of course, and committing the model card. Now that that's done, hopefully we're getting close to quantization. This is all the installation output for llama.cpp, and you can see we're currently converting into bf16, which is brain float, a 16-bit format; once that's done we'll be able to quantize down to 4 bits. Let's see whether that's really an error message or a successful quantization. It looks like we've completed the quantization there, and it looks like we've completed the upload of all of these files as well.

So let's take a look: we have uploaded all of the GGUF files. In fact, a lot of these don't really need to be uploaded, although it's handy to have them there. You can see we have a 668-megabyte file for the GGUF quantization. If you want to run the GGUF file, take a look at the TinyLlama video; it includes a notebook and a run-through of how to run GGUF files on your laptop.
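For orientation, the llama.cpp steps described above look roughly like this when driven from Python. This is a sketch only: the convert script and quantize binary have been renamed across llama.cpp releases (newer versions ship a convert_hf_to_gguf.py script and a llama-quantize binary), and the local model folder is a placeholder.

```python
# Rough sketch of the clone / build / convert / quantize steps for llama.cpp.
import subprocess

def run(cmd: str) -> None:
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("git clone https://github.com/ggerganov/llama.cpp")
run("make -C llama.cpp LLAMA_OPENBLAS=1")         # build with OpenBLAS, as above
run("pip install -r llama.cpp/requirements.txt")

model_dir = "llama.cpp/models/tinyllama"  # placeholder: HF checkpoint saved here

# 1. Convert the PyTorch checkpoint to a 16-bit GGUF file.
run(f"python llama.cpp/convert.py {model_dir} --outtype f16")

# 2. Quantize the 16-bit GGUF down to 4-bit (Q4_K_M).
run(f"./llama.cpp/quantize {model_dir}/ggml-model-f16.gguf "
    f"{model_dir}/tinyllama-q4_k_m.gguf Q4_K_M")
```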
If you want access to any of the scripts from today, the AWQ or GGUF quantization scripts, you can purchase a copy through the link below. They're also included in the advanced fine-tuning repo, for which you can purchase access and also avail of supervised fine-tuning, embedding, and data set preparation scripts. Of course, you can get quantization scripts in more raw form from the repos for llama.cpp and AWQ, and I'll be putting links to those below as well. Let me know any questions on quantization in the comments. Cheers, folks.
Info
Channel: Trelis Research
Views: 6,537
Keywords: quantization, awq, gptq, how does awq work, how does gptq work, how does quantization work, llama 2 awq, llm awq, quantize with awq, llama 2, llama quantization, llama 2 quantized, llama 2 quantized model, deep learning andrew ng, how to quantize llama, how to quantize, quantize llama, quantize awq, quantize gguf, quantize ggml, how to quantize a language model, how to quantize an llm, quantize llama 2 model, quantize llama 2, quantize mistral, quantize falcon
Id: XM8pllpBVA0
Length: 26min 20sec (1580 seconds)
Published: Tue Oct 03 2023