🔥🚀 Inferencing on Mistral 7B LLM with 4-bit quantization 🚀 - In FREE Google Colab

Video Statistics and Information

Captions
Hey everyone, how are you? In this video I'm going to do inference only for the Mistral 7B Instruct model, and as you probably know, this model is all in the news recently for its super powerful results given its small 7-billion-parameter size. In this video I'm going to do the inference in Google Colab's free version, so the GPU memory consumed will be something like 5.5 GB and the RAM will also be around 5 GB, which is a very small amount of compute for a model of this size.

First I need the sharded version of this model. Let's see why. If I go to Hugging Face and just type "Mistral 7B Instruct" in the search box, you will see so many Mistral 7B repos. Let's just take the first one. Because this is a non-sharded version, if you go to its Files and versions tab, the size of the PyTorch model files is 10 + 5, about 15 GB, and on top of that you also need to add some overhead. So it will not fit in Google Colab's free version, because the RAM in the free tier is around 12 GB, and that's the reason I need the sharded version. So you go to Hugging Face again and just type "sharded", and I get two sharded versions. I'm just going to take the first one, from bn22. In that sharded repo, if I go to Files and versions, I can see many different files, all sharded and numbered like 001, 002, 003 and so on, and each one is around 1.5 GB. That's the beauty of sharded models.

Just to give you a bit more detail on sharded checkpoints: when loading a full pretrained model whose weights are not sharded, you need the full model in memory. For almost all the recent 7-billion-parameter open-source large language models, that may be around 15 GB plus, which we just saw, and in Google Colab's free version that will certainly give you an out-of-memory error. Even worse, if you are using torch.distributed to launch distributed training, each process will load the pretrained model and keep its own copy in RAM. Hence a sharded checkpoint is a type of checkpoint optimized for these distributed environments. In a sharded checkpoint, a model's state dictionary, that is, all its parameters and states, is divided or sharded across multiple files, so if you're doing distributed training, the sharded checkpoints can be distributed across multiple devices. In Hugging Face Transformers, since version 4.18, model checkpoints that end up taking more than 10 GB of space are automatically sharded into smaller pieces. That is, in those cases, instead of having one single checkpoint, when you call model.save_pretrained and pass the save directory, you will end up with several partial checkpoints, each of which is less than 10 GB, and an index that maps parameter names to the files they are stored in. While saving the sharded checkpoints you can control the maximum size of each shard with a parameter in Hugging Face called max_shard_size, but that's a different topic. The main advantage of doing this for big models is that each shard of the checkpoint is loaded after the previous one, capping the memory usage in RAM to the model size plus the size of the biggest shard.
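A minimal sketch of the sharding behaviour described above, assuming the `mistralai/Mistral-7B-Instruct-v0.1` repo is the non-sharded one the video clicks on; the output directory name and shard size are illustrative, and this is not something you would actually run in the free Colab tier, since it loads the full model first:

```python
# Sketch only: re-save a model as sharded checkpoints with an explicit shard size.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Since transformers 4.18, checkpoints larger than 10 GB are sharded automatically;
# max_shard_size lets you cap the size of each shard explicitly.
model.save_pretrained("./mistral-7b-sharded", max_shard_size="2GB")

# The directory now contains numbered shard files, typically named like
#   pytorch_model-00001-of-00008.bin, pytorch_model-00002-of-00008.bin, ...
# plus an index file (pytorch_model.bin.index.json) mapping parameter names
# to the shard each one lives in. When loading, shards are read one after
# another, so peak RAM is roughly the model size plus the largest shard.
```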
All right, coming back to our Colab. The first thing you need to do is import all the necessary libraries: torch, and from Transformers AutoModelForCausalLM, AutoTokenizer and BitsAndBytesConfig, because I need to do quantization on the model. I'll be using 4-bit quantization; otherwise this low RAM and GPU usage would not be possible.

So for that, the first thing I need to do is define the config, and here I am doing that with a little utility method called load_quantized_model. What it does is define this BitsAndBytesConfig and also the model itself. Let's quickly go through the BitsAndBytesConfig parameters.

The first one is load_in_4bit. This parameter is for loading the model in 4-bit precision, which means that the weights and activations of the model are represented using 4 bits instead of the usual 32 bits. This can significantly reduce the memory footprint of the model: 4-bit precision models can use several times less memory than full-precision models and can be up to 2x faster. However, if you need the highest possible accuracy, then you may want to use the full-precision model. But on almost all home GPUs, and also on Colab's premium GPUs, for running a 13B or even a 7B model it's better to go with quantization.

The next one is bnb_4bit_use_double_quant=True. This parameter enables double quantization, also called nested quantization, which applies a second quantization after the initial one. It saves an additional 0.4 bits per parameter.

Then bnb_4bit_quant_type="nf4". This parameter specifies the type of 4-bit quantization to be used; in this case NF4 refers to NormalFloat4, which is the default quantization type.

And the last one is bnb_4bit_compute_dtype=torch.bfloat16. This parameter determines the compute data type used during the computation; it specifies the use of the bfloat16 data type for faster training. The compute dtype can be chosen from options like float16, bfloat16, float32, etc. This configuration is needed because, while 4-bit bitsandbytes stores weights in 4 bits, the computation still happens in 16 or 32 bits, and here any combination can be chosen, like float16, bfloat16 or float32. Matrix multiplication and training will be faster if one uses a 16-bit compute dtype.
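A sketch of the quantization setup the video walks through; the helper name load_quantized_model mirrors the one mentioned in the video, but the exact function body is my reconstruction, and the device_map argument is an assumption that is not spelled out in the video:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_quantized_model(model_name: str):
    """Load `model_name` in 4-bit precision with the settings described in the video."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4-bit precision
        bnb_4bit_use_double_quant=True,         # nested quantization, saves ~0.4 bits per parameter
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization type
        bnb_4bit_compute_dtype=torch.bfloat16,  # computation still runs in bfloat16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_4bit=True,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config,
        device_map="auto",  # assumption: let accelerate place the model on the GPU
    )
    return model
```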
Now one question you can ask is: does 4-bit precision quantization need any particular hardware? To answer that, note that this method is only compatible with GPUs, hence it is not possible to quantize models in 4-bit on a CPU. Among GPUs, there should not be any hardware requirements for this method; therefore pretty much any GPU can be used to run 4-bit quantization, as long as you have CUDA 11.2 or higher installed. Keep also in mind that the computation is not done in 4-bit; the weights and activations are compressed to that format and the computation is still kept in the desired or native dtype.

All right, once we have our BitsAndBytesConfig fully defined, the rest of the code is pretty easy. I'm defining the model here and naming the model; our model name is the one we just saw on Hugging Face, bn22's sharded Mistral-7B-Instruct-v0.1. I'm setting load_in_4bit to True, using bfloat16 as the torch_dtype, and quantization_config is just the entire configuration we defined. This method returns the model.

Then I also need to do the initialization of the tokenizer, and there's another utility method I'm using for initializing the tokenizer with the specified model name. The tokenizer again comes from AutoTokenizer.from_pretrained, passing the same model name. That's it. Now I just need to execute these two methods, and the next two lines do exactly that: model is the output of load_quantized_model, and tokenizer comes from the tokenizer initializer. Now my model and tokenizer are fully ready. It may take 5 to 10 minutes, because you need to download almost 15 GB into the Google Colab instance.

After that comes the inference. Here I am passing a single sentence as my text prompt, and because it's an instruction-tuned model, you need to wrap the prompt: first [INST] in square brackets, then inside it your prompt, whatever question you want to ask, and at the end the closing [/INST] with the forward slash. Here I'm asking a rather simple question, "How will AI replace engineers?", and that's my text. Then encoded is the output after applying the tokenizer: I pass the text, that is the prompt, with return_tensors="pt" (pt representing PyTorch) and add_special_tokens=False, and model_inputs will be this encoded output. Then I need to generate the IDs.
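A sketch of the tokenizer setup and prompt encoding described above, assuming the load_quantized_model helper from the previous sketch; the repo id is written the way the video names it (bn22's sharded Mistral 7B Instruct), so double-check the exact spelling on Hugging Face, and the move of the inputs to the GPU is my assumption rather than something the video shows:

```python
from transformers import AutoTokenizer

MODEL_NAME = "bn22/Mistral-7B-Instruct-v0.1-sharded"  # sharded repo mentioned in the video

def initialize_tokenizer(model_name: str):
    """Initialize the tokenizer for the given model name."""
    return AutoTokenizer.from_pretrained(model_name)

model = load_quantized_model(MODEL_NAME)   # from the sketch above
tokenizer = initialize_tokenizer(MODEL_NAME)

# Mistral-Instruct expects the prompt wrapped in [INST] ... [/INST]
text = "[INST] How will AI replace engineers? [/INST]"
encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
model_inputs = encoded.to("cuda")  # assumption: inputs must live on the same device as the model
```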
So model.generate: I pass my model_inputs from the previous line, keep max_new_tokens at 200, and set do_sample to True. After I get the generated IDs, I also need to decode them, because these are just IDs, or tensors; I need to decode them to get the actual text. Decoding is done by applying tokenizer.batch_decode and passing the generated IDs. That's it; now just print the first element of the decoded output, and that's my answer.

So this is "How will AI replace engineers": "AI is becoming increasingly prevalent in many industries, and some engineering jobs may be affected. While AI may augment the skills of engineers in certain cases, it is unlikely to replace engineers entirely. In reality, AI is often used to augment the work of engineers and to help make better..." So it gave this entire answer, and the whole answer looks pretty reasonable and pretty complete.

Let's see the resources used in Google Colab. You just click on the Disk and RAM indicator, and we can see that system RAM was only 4.5 GB out of a maximum of 12.7 GB allocated by free Google Colab, GPU RAM was only 6 GB, and disk usage is of course reasonable at 40.4 GB, because the model itself is around 16 GB. So you can see that around 4.5 GB of system RAM and 6 GB of GPU RAM was enough to do inference on Mistral 7B.
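A sketch of the generation and decoding step described above, under the same assumptions as the previous sketches:

```python
# Generate up to 200 new tokens with sampling, then decode the ids back to text.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=200,
    do_sample=True,
)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])  # the first (and only) sequence in the batch
```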
Info
Channel: Rohan-Paul-AI
Views: 8,482
Keywords: machine learning tutorial, machine learning algorithms, machine learning projects, Deep Learning, natural language processing, hugging face, Deep Learning for NLP, huggingface, large language models, gpt 4, llm, langchain, langchain tutorial, large language models tutorial, open source llm models for commercial use, open source llm, generative AI, Langchain, llm training, large language models explained, large language model tutorial, fine tuning language model
Id: eovBbABk3hw
Length: 11min 41sec (701 seconds)
Published: Thu Oct 05 2023