Run an LLM on your Windows PC | Convert Hugging Face model to GGUF | Quantization | GGUF

Video Statistics and Information

Captions
Hi guys, in this video we are going to learn how to convert a Hugging Face model into GGUF. We will upload this model to Hugging Face, then download it back to our PC and try to run inference on it using LM Studio.

First, let's understand why we are even converting an HF model into a GGUF model. An HF model is comparatively large, and it typically needs a GPU to get inference out of it. When we convert it to GGUF it gets smaller, and it can run on a CPU. That's why GGUF models are getting popular: you can run them on your laptop, whether it's a Mac, Windows, or any device where you only have a CPU.

Now let's understand why the model gets smaller and how we do it. For example, if we create a deep learning network, there will be nodes in hidden layers, all connected to each other. I'll quickly sketch one so I can explain what I'm trying to say. All these connections have weights associated with them, and these weights are usually stored as 32-bit floating-point numbers. A way to make this 32-bit model smaller is to reduce the precision: we can convert it to 16 bits, 8 bits, 4 bits, 3 bits, even 2 bits. That's why the size gets reduced, and that's also why the accuracy gets reduced, because we are losing information; we are losing precision. And because the complexity is also reduced, the model gets faster. So that's another reason GGUF models are getting popular.

Now I'll show you one more example just to make things clear. Think of your Hugging Face model as a book with 50,000 words. Someone gets inspired by your book and writes a version with only 10,000 words.
Then someone gets inspired by that version and writes one with 5,000 words, and so on, until at last someone writes something that is not even a book anymore, because it only has 200 words. Now think of how much information got lost in that whole process: the book was written with 50,000 words and it has been reduced to 200. The facts might be the same, the incidents being described might still be there, but a whole lot of information has been lost. The same thing applies to quantization, where you are basically compressing a model: the facts stay roughly the same, but you are compromising on size, accuracy, and complexity. I hope you have understood it.

Let's move on to the technical part, where we are going to quantize the model using Google Colab, upload it to Hugging Face, and then download it back to our PC and chat with it.

Okay, so we are in Google Colab, and as you can see we have a notebook for converting the Hugging Face model to GGUF, which is also called quantization. I'll show you the runtime: this notebook is on a Python 3 CPU runtime. We don't need a GPU, at least for this process, because we are using TinyLlama, a small model, to convert into GGUF. On the model card of TinyLlama 1.1B Chat v0.3 we can see the files and versions: model.safetensors and model.bin are both around 4.4 GB.
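The lossy-compression idea above can be sketched numerically. Here is a minimal, hypothetical example (not from the video's notebook) that quantizes a float32 weight vector down to 8-bit integers and back, showing both the size reduction and the precision loss:

```python
import numpy as np

# A toy "layer" of weights stored in 32-bit floating point.
weights = np.random.default_rng(0).normal(size=1024).astype(np.float32)

# Symmetric 8-bit quantization: map the float range onto int8 levels.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)   # 1 byte per weight
dequantized = q.astype(np.float32) * scale      # approximate recovery

print(weights.nbytes)  # 4096 bytes at 32 bits per weight
print(q.nbytes)        # 1024 bytes at 8 bits per weight: 4x smaller
# The recovered values are close but not exact -- that is the lost precision.
print(np.abs(weights - dequantized).max() <= scale)  # True
```

Real GGUF quantization schemes are more elaborate (per-block scales, mixed bit widths), but the trade-off is the same: fewer bits per weight means a smaller, faster, slightly less accurate model.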
Once we quantize it and get the GGUF, we'll see how big it is; we can compare after the whole notebook has executed. One thing I wanted to mention: this might take around 15 to 20 minutes, so I'll fast-forward the video. I'll keep running each cell and explain wherever it's needed.

We are also going to clone llama.cpp; this is the package we will use to quantize the model. As you can see, we have llama.cpp; we'll go into it and save the model in the models path, so you can see this models folder, and we are going to save our TinyLlama model here. Then we download all the required tokenizer files; you can see they have been downloaded as well. We install a few essential packages, build with LLAMA_OPENBLAS=1, install the remaining requirements, and then convert the model.

As you can see, I don't have any model in my Hugging Face profile yet, so I'll create a new model repository for this GGUF. I'll take the name of the model from the notebook, copy and paste it here, and append "GGUF" to the name. Now I'll create a model card; I have one ready, so I'll just copy and paste the whole thing. We'll see more files here once we upload our model. You can see the repository has been created successfully, and the model has been converted successfully, so let's go ahead and upload it. We'll quickly log in to Hugging Face so we can push the model to the Hugging Face Hub; it asks for a login, so we'll create a new access token for this upload, and here we'll upload all the files of the model.
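As a rough sanity check on what to expect from the conversion, here is a small hypothetical helper (not from the video's notebook) that estimates on-disk model size at different bit widths. In recent llama.cpp versions the conversion itself is typically a convert script (e.g. `convert_hf_to_gguf.py`) followed by the `llama-quantize` binary, though the exact names have changed across releases, so treat those as assumptions:

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameters * bits per weight, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# TinyLlama has roughly 1.1 billion parameters.
n = 1.1e9
print(round(estimated_size_gb(n, 32), 2))  # 4.4  -- matches the ~4.4 GB FP32 checkpoint
print(round(estimated_size_gb(n, 16), 2))  # 2.2  -- after converting to an FP16 GGUF
print(round(estimated_size_gb(n, 4), 2))   # 0.55 -- with 4-bit quantization
```

Actual GGUF files come out slightly larger than these estimates because of tokenizer data, metadata, and per-block quantization scales.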
Okay, so that's it. We can see a few of the model files have already been uploaded, and now the GGUF model file itself is uploading. And there we go: all the model files have been uploaded successfully.

The next step is to download this whole model from Hugging Face and try to get inference out of it. We are going to use the LM Studio software here. We'll search for the same model we quantized, with "GGUF" in the name; the moment we hit the search button, we'll have all the GGUF models available. We'll go to the last section, and as you can see, we've got the model we quantized, and I have already downloaded it. So let's quickly chat with the model and see how it replies. The intention here is to show how to quantize a model, publish it on Hugging Face, download it, and use it. Because the model is quite small, we can't expect very good answers from it; if you use a bigger, more capable model, it will definitely give you much better results. I hope you liked the video. Thank you, bye.
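Beyond the chat window, LM Studio can also expose a local OpenAI-compatible server (by default at http://localhost:1234/v1), which lets you query the downloaded GGUF model from code. A minimal sketch, assuming the server is running and using a hypothetical model name:

```python
import json
import urllib.request

# Request body following the OpenAI-compatible chat completions shape.
# The model name below is hypothetical -- use whatever you loaded in LM Studio.
payload = {
    "model": "tinyllama-1.1b-chat-v0.3-gguf",
    "messages": [{"role": "user", "content": "What is quantization?"}],
    "temperature": 0.7,
}

request = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the LM Studio local server is running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```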
Info
Channel: Ayaansh Roy
Views: 1,017
Id: e3iimDDaaEY
Length: 13min 20sec (800 seconds)
Published: Sat Jan 27 2024