Understanding: AI Model Quantization, GGML vs GPTQ!

Video Statistics and Information

Captions
There is so much talk about quantization. Do you understand it? What are GGML and GPTQ? This video is going to explain.

Let's start with weights in neural networks. Weights are the parameters of a neural network that determine how the network learns and makes predictions. They are real numbers associated with each connection between neurons, and they are what allow the network to learn the relationship between the input data and the desired output. In a neural network, each neuron receives inputs from other neurons; each input is multiplied by a weight, the weighted inputs are summed, and the sum is passed through an activation function, which ultimately determines whether the neuron fires or not (see the first sketch below). Initially the weights are randomly initialized, but as training goes on they are adjusted according to the optimization method you have selected. This is the foundation of a technique called backpropagation, which changes the weights in a way that minimizes the error.

Now, how are weights related to quantization? These stored weights are what the model is; we call them the model weights. They can be stored at different precisions: as 32-bit floating point, 16-bit floating point, or even 8-bit floating point, 8-bit integer, or 4-bit integer. We casually call a weight "a number," but under the hood that number can have different data types and precisions, and depending on the precision a lot of things change, including how long the network takes for inference and the size of the network, ultimately the model itself. Just to back up: a model is nothing but a neural network with its weights. The model is going to be huge when the weights are stored at a higher precision, the most accurate representation of that floating-point value, but the model's size comes down as you reduce the precision.

That is where a very interesting technique comes in, called quantization. Quantization in neural networks is the process of reducing the precision of the weights, biases, and activations of a network in order to reduce the size and computational requirements of the model, and in many cases this can be done without significantly impacting accuracy. When I say quantization, the first thing that might come to your mind is: "Hey, you are going to reduce the precision of the weights; what impact will that have on model accuracy?" That's a very good question. Also, quantization happens at two different stages: it can be post-training quantization, or quantization-aware training. What we mostly see these days, like TheBloke's GGML and GPTQ models, is post-training quantization. What is post-training quantization? It is the process of quantizing an already-trained neural network, and it can be done simply by rounding weights or activations to a lower precision, though this can also lead to some loss of accuracy.
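Backing up for a moment, here is a minimal sketch of the neuron computation described above: inputs multiplied by weights, summed with a bias, and passed through an activation function. The specific input values, weights, and the choice of sigmoid are illustrative, not from the video.

```python
import numpy as np

def sigmoid(z):
    """Squash the weighted sum into (0, 1); values near 1 mean the neuron 'fires'."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: three inputs arriving from upstream neurons.
x = np.array([0.5, -1.2, 3.0])   # inputs from other neurons
w = np.array([0.8, 0.1, -0.4])   # one weight per connection
b = 0.05                         # bias term

z = np.dot(w, x) + b             # sum of all the weighted inputs
print(sigmoid(z))                # activation function decides the output
```

During training, backpropagation computes how the error changes with respect to each of these weights and nudges them to reduce it.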
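And here is a rough sketch of what "rounding weights to a lower precision" means, using the simplest symmetric int8 scheme. The weight values are made up for illustration; real post-training quantization schemes (including GGML's and GPTQ's) are more sophisticated than this.

```python
import numpy as np

# A made-up float32 weight tensor.
w = np.array([0.42, -1.37, 0.08, 2.91, -0.55], dtype=np.float32)

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)   # stored weights: 1 byte each instead of 4

# At inference time the integers are rescaled back (dequantized).
w_dequant = w_int8.astype(np.float32) * scale

print(w_int8)         # e.g. [ 18 -60   3 127 -24]
print(w_dequant - w)  # small rounding error: the accuracy cost of quantization
```

Each weight now takes 1 byte instead of 4, a 4x size reduction, at the cost of the small rounding error printed on the last line.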
In post-training quantization you might take a model stored at, say, 16-bit floating-point precision and convert it to 4-bit or 8-bit integers. By doing this you are quantizing the model in such a way that its weights shrink, the size of the model comes down, and it also improves in terms of the hardware required to run it, but this can also lead to a loss in accuracy.

So now the quantization part is clear. Just to back it up again: a neural network is a bunch of numbers, stored in the form of weights, biases, and activations and used in matrix multiplications. When these numbers are stored at a higher floating-point precision, like 32-bit floating point, the model takes a lot of space and also requires more computation when it has to predict, which is what we call inference. By reducing the precision of the model we are, one, reducing its size, and two, making it more performant on lower-end hardware: it requires less compute and less power to do all the matrix multiplications. Again, this can also lead to accuracy loss.

With all of that in mind, the two popular kinds of post-training-quantized models we see are GGML and GPTQ models. What are these? GGML and GPTQ models are models quantized to reduce their size and computational requirements by reducing the model weights to a lower precision. What are the key differences between GGML and GPTQ? GGML models are optimized for CPU while GPTQ models are optimized for GPU; that means a GGML model's inference speed is faster on CPUs, and GPTQ is faster on GPUs. The inference quality is said to be similar. I did read a Hugging Face community article whose experiments scored GPTQ a little lower than GGML, but I don't have any benchmarks to prove that, so just keep in mind that GGML and GPTQ are ideally supposed to have similar inference quality. As for model size, GGML models are supposed to be slightly larger than GPTQ models. Both kinds are easy to run with readily available tooling (GGML via llama.cpp-style loaders, GPTQ via Hugging Face Transformers integrations). In general, the rule of thumb is: if you have a CPU without any NVIDIA GPU, use the GGML model; if you have an NVIDIA GPU, even if it's not the most powerful machine in the world, you can use the GPTQ model (see the loading sketch below). I hope this helps you understand all these things people are discussing these days: GPTQ, GGML, quantization, reducing model size. If you have any questions, let me know in the comment section; otherwise, see you in another video. Happy prompting!
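As a sketch of that rule of thumb, here is one way to load each kind of model. The repository and file names are illustrative placeholders in the style of TheBloke's uploads, not verified; the APIs used are llama-cpp-python's `Llama` class for GGML files and AutoGPTQ's `AutoGPTQForCausalLM.from_quantized` for GPTQ repos, each of which must be installed separately. Treat this as a starting point, not a definitive recipe.

```python
# --- CPU path: a GGML model via llama-cpp-python (pip install llama-cpp-python) ---
from llama_cpp import Llama

# Path to a 4-bit GGML file downloaded locally; the file name is illustrative.
llm = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin")
out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])

# --- GPU path: a GPTQ model via AutoGPTQ (pip install auto-gptq) ---
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative repo id on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(repo)
# use_safetensors=True assumes the repo ships safetensors weights; check the repo card.
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

inputs = tokenizer("What is quantization?", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```

The two paths are independent: run the first block on a CPU-only machine, the second on a machine with an NVIDIA GPU.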
Info
Channel: 1littlecoder
Views: 18,099
Keywords: ai, machine learning, artificial intelligence
Id: ZKdMbQq5T30
Length: 6min 59sec (419 seconds)
Published: Fri Aug 04 2023