GGUF quantization of LLMs with llama.cpp

Captions
Would you like to run LLMs on your laptop and on tiny devices like mobile phones and watches? If so, you will need to quantize your LLMs. llama.cpp is an open-source library written in C and C++ that lets us quantize a given model and run LLMs without GPUs. In this video I'm going to demonstrate how to quantize a fine-tuned LLM on a MacBook and run inference on the same MacBook. I'll quantize the fine-tuned Gemma 2-billion-parameter model that we fine-tuned in one of my previous videos, but you can use the same steps to quantize any LLM of your choice. So without further ado, let's get started.

In my previous video on fine-tuning, we fine-tuned a Gemma 2B model, so I'm going to push that model to the Hugging Face Hub. We can do that with trainer.model.push_to_hub, giving it a name such as gemma-2b-ft-v1. Once we execute the push, the model should be available on the Hugging Face Hub. I'm not going to run this command here because I've already pushed it and the model is on the Hub. Note that only the LoRA parameters are pushed, not the entire model: if you look at the files available in the Hub after the push, you can see only the adapter configuration and the adapter weights, and they are just about 14 MB. Now that we have pushed the model to the Hub, we can move on to quantizing it.

For quantizing the model we are going to use the llama.cpp library. As the project describes itself, the main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud. It's written in C and C++ without Python, though there's a little bit of Python involved, which I'll go through when we actually use it. The beauty of it is that it supports a variety of quantization schemes, including 1.5-bit, 2-bit, 4-bit, 6-bit and 8-bit integer quantization, and it lets us use whatever hardware we have: for example, if you have a CPU alongside a GPU, you can run hybrid inference using both. And if you are a user who switches between platforms, or you work across several of them, you're in for a treat, because it is supported on Mac, Linux, Windows, Docker and FreeBSD.

So let's install llama.cpp locally to get started with the quantization. But first, hats off to Georgi Gerganov, the creator of llama.cpp; he has around 12,000 followers on GitHub. On top of llama.cpp he has also created ggml, and if you visit his page you can see quite a few C/C++ projects, including whisper.cpp and llama.cpp, that may be useful to you. We start by cloning the repository; after cloning it and cd-ing into llama.cpp, I simply run make to build it. That completes the installation of llama.cpp. It's that simple: you just run the make command and you're done.
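For reference, the build step described above boils down to a few shell commands; a minimal sketch, assuming the upstream GitHub repository (newer llama.cpp releases have since moved to a CMake-based build, but at the time of the video a plain make was enough):

```bash
# Clone and build llama.cpp; `make` compiles the CLI tools used later
# in this tutorial (main, quantize, ...) for the local machine.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```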
Before proceeding with the quantization itself, I'd like to quickly walk through the structure of my project. We have the generative AI repository, which is the main folder for what we are doing. I have a folder for the base model, so if we want to check out the base model from the Hugging Face Hub, we can put it there. Then I have llama.cpp inside that repository; that's the copy we just built with make. I also have a folder for the PEFT model; the PEFT model is the fine-tuned model from my previous tutorial. And I have a folder for the merged model; the merged model is created by merging the PEFT model with the base model. That's basically how LoRA works: we have the base model, we have the adapter weights of the fine-tuned model, and merging the two creates the merged model. Lastly, I have a folder for the quantized model, which holds the GGUF files, and of course the notebook we use to run the quantization.

Let's get started with the notebook. We begin with the preliminaries, which is to access the Hugging Face ecosystem: we need to log in with our access token, which I've already done. Then we import snapshot_download in order to download an entire repository. The basic difference between hf_hub_download and snapshot_download is that hf_hub_download fetches specific files from the Hub, so if you want a particular file from within a repository, that's the one to use; but we want the entire fine-tuned model repository, so I'm going to use snapshot_download, which downloads a whole repository at a given revision. That's exactly what we want. I'm going to download the Gemma 2B fine-tuned v1 model that we fine-tuned and pushed to the Hugging Face Hub in my previous tutorial, passing that repository as the model ID and setting the local directory to the PEFT model folder I just showed you. Running snapshot_download pulls all the files from the repository into our local directory.

Once we have done that, let's try converting the LoRA weights straight to the GGUF format and see what happens. This shouldn't work, and that's indeed what I found: it's a greedy shortcut, but I tried converting the PEFT model directly and asked it to output a Gemma fp16 GGUF file, and it turns out the converter couldn't find a config.json. That's the main reason we have to merge the base model with the PEFT model, which is what we'll do now.

The steps to follow are: load the base Gemma 2B model; load the fine-tuned model using the PEFT library (this is important: we shouldn't load it with the Transformers library, but with the PEFT library); use the merge command to merge the base model with the PEFT model; and finally save the merged model into a local directory, which I've called merged model. We use AutoModelForCausalLM.from_pretrained to pull the Gemma 2B base model, with the same parameters we set during fine-tuning. Once we have the base model, the next step is to load the fine-tuned model with PeftModel.from_pretrained: we pass the base model and the repository of the fine-tuned model to get the PEFT model, and then we call merge_and_unload to merge the two sets of weights. Once they are merged, we simply save the model into the merged model directory. Because the model has to work together with a tokenizer, we also need to save the tokenizer; we can reuse the tokenizer of the base model, so we load the Gemma 2B tokenizer and save it into the same directory, the merged model directory.
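Here is a minimal sketch of the download-and-merge flow described above. The adapter repository name (your-username/gemma-2b-ft-v1), the base model ID (google/gemma-2b) and the local directory names are illustrative assumptions; substitute your own:

```python
import torch
from huggingface_hub import login, snapshot_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

login()  # paste your Hugging Face access token when prompted

# Pull the whole fine-tuned (LoRA adapter) repository to a local folder.
ADAPTER_REPO = "your-username/gemma-2b-ft-v1"  # assumed repo name
snapshot_download(repo_id=ADAPTER_REPO, local_dir="peft_model")

# Load the base Gemma 2B model, then stack the LoRA adapter on top of it
# with the PEFT library (not plain Transformers).
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float16
)
peft_model = PeftModel.from_pretrained(base_model, "peft_model")

# Fold the adapter weights into the base weights and save the result.
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("merged_model")

# The GGUF converter expects the tokenizer files next to the weights.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
tokenizer.save_pretrained("merged_model")
```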
Once we have the merged model, we invoke the convert-hf-to-gguf.py script that ships with llama.cpp. We pass the directory where the merged model is stored, set the output type to floating point 16, and name the output file fp16-ft (ft for fine-tuned), with the .gguf extension. Once we run that, the conversion starts; this first step converts the given weights to float16. After that, we can quantize down to different sizes, such as 8-bit, 4-bit or even 2-bit. With the fp16 .gguf file in hand, we use llama.cpp's ./quantize command: we pass as input the fp16 .gguf file we just created and provide an output file name. I've named the output file ft-Q4_K_M: Q stands for quantization, 4 for 4-bit, K for the k-quant scheme and M for medium. There are several quantization options to choose from, but for now let's just do a 4-bit k-quant by passing Q4_K_M to tell it to perform a 4-bit quantization. We can see the quantization start, and we eventually get the quantized file, which we should now be able to query and get a response from.

Now that we have the quantized model, let's query it with a prompt. To do that we invoke ./main and pass the model with -m (in this case an 8-bit quantized GGUF of the fine-tuned model), the number of output tokens we want to produce, and finally the prompt, which is provided with -p. As we have been asking the model, "What should I do on a trip to Europe?", let's ask the same question and see what it comes up with. It says that there are so many options for a trip to Europe that it can be overwhelming, that if you have a theme in mind you should definitely do some research, and it goes on. You can also see the speed of execution, purely on the CPU of a MacBook laptop: it's not running on a GPU, or even on a desktop.

One of the mistakes I made while quantizing this model was to directly query the fp16 fine-tuned model; because it's quite heavy and big, I ended up crashing the system. So please don't do that: on a laptop, make sure you query the 8-bit or 4-bit model, depending on how much memory you have.
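The two conversion commands described above, sketched with illustrative file and directory names (the converter script was named convert-hf-to-gguf.py in llama.cpp versions from around the time of the video, and the quantize binary is called llama-quantize in newer builds):

```bash
# 1) Convert the merged Hugging Face checkpoint to a 16-bit GGUF file.
python llama.cpp/convert-hf-to-gguf.py merged_model \
    --outtype f16 \
    --outfile quantized_model/gemma-2b-ft-fp16.gguf

# 2) Quantize the fp16 GGUF down to 4-bit (Q4_K_M: 4-bit k-quant, medium).
./llama.cpp/quantize \
    quantized_model/gemma-2b-ft-fp16.gguf \
    quantized_model/gemma-2b-ft-Q4_K_M.gguf \
    Q4_K_M
```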
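And a sketch of the inference command itself: the prompt mirrors the one used in the video, while the model path and the 256-token limit are illustrative (./main is called llama-cli in newer llama.cpp builds):

```bash
# Query the quantized model on the CPU: -m selects the GGUF file,
# -n caps the number of generated tokens, -p supplies the prompt.
./llama.cpp/main \
    -m quantized_model/gemma-2b-ft-Q4_K_M.gguf \
    -n 256 \
    -p "What should I do on a trip to Europe?"
```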
That brings us to the end of quantization with llama.cpp. There is quite a lot more we can do with a quantized model: for example, we can benchmark it to find out how it compares to a model quantized in a different way, say an 8-bit quantized model against a 4-bit one, both for inference and for speed, and we can integrate the quantized LLM into a RAG system so that the model is much more context-aware and performs far better. We'll look at those examples in upcoming videos, but for now I'm signing off. I hope this video was useful. Thank you very much, and I will see you in the next one. Take care.
Info
Channel: AI Bites
Views: 1,503
Keywords: machinelearning, deeplearning, transformers, artificial intelligence, AI, deep learning, machine learning, educational, how to learn AI
Id: j7ahltwlFH0
Length: 12min 9sec (729 seconds)
Published: Fri Mar 22 2024