LoRA explained (and a bit about precision and quantization)

Captions
Hi everyone, and welcome to another video. Based on the recent voting, this video will be all about low-rank adaptation (LoRA), a method for fine-tuning very large deep learning models. But not only that: we will also cover some foundations on how to deal with large models in general. As most of you probably know, model sizes, especially for large language models, have been scaled to crazy numbers; GPT-4 has reportedly already reached around 1.8 trillion parameters. Many people want to fine-tune these large models on their custom datasets, which is of course a challenging task: on one hand you need to tune a lot of parameters, which takes time, and on the other hand the GPU requirements are massive. So in this video we will talk about techniques to cope with large models, and especially about LoRA, a popular method for parameter-efficient fine-tuning.

To build some foundational knowledge, let's quickly take a look at some concepts regarding precision and quantization. If you are already familiar with these topics, you can skip this part and jump right to the LoRA timestamp.

The weight matrices in neural networks are made up of floating-point numbers. Not sure if you have ever realized this, but usually these values are stored in the float32 data type. So what does this mean? In computer science studies you learn how a computer internally represents floating-point numbers using only zeros and ones: bits are reserved for the sign, the exponent, and the fraction of the number. Here is an example of 7.5 plus some additional digits represented in 32 bits. One obvious optimization is to lower the precision by using other data types. For example, we can switch to half precision, which uses only half of the bits to represent a number and therefore requires only half of the memory. The downside is that we lose precision: as you can see in this example, we cannot represent as many digits as in the 32-bit case, and this loss in precision, in the form of rounding errors, can accumulate quickly.

Given this information, it's straightforward to calculate the actual model size in gigabytes: you simply multiply the size of the data type by the number of weights in the model. This gives a rough estimate of how much memory is required to run the model; for training you need even more, because you additionally have to store gradients and optimizer statistics for each weight. As an example, BLOOM is a 176-billion-parameter model, which corresponds to roughly 350 gigabytes of memory for inference (in half precision: 176 billion parameters × 2 bytes ≈ 352 GB). This means you need several large GPUs just to run the model.

The impact of using other levels of precision for model training has been evaluated intensively in the literature. Usually, half precision still works reasonably well for training neural networks, but this of course depends on the dataset. There is also a trend towards mixed precision, which means that different parts of the network operate on different data types. Now, what about going even lower than half precision? Many papers have reported that very low levels of precision don't really work out of the box. There is, however, a trend towards quantization, which allows you to go very low, even down to integers only, and still maintain the model's performance. These methods do not simply drop half of the bits, which would lead to an information loss; instead they calculate a quantization factor that allows them to maintain the level of precision. Here's an example of a half-precision matrix converted to int8 using quantization; a small sketch of one such scheme follows below.
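To make the idea of a quantization factor concrete, here is a minimal sketch of one common scheme, absmax quantization (my own illustrative example; the video does not specify which scheme is shown on the slide):

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Quantize a float matrix to int8 using a single absmax scale factor."""
    scale = 127.0 / np.max(np.abs(x))        # the quantization factor
    x_int8 = np.round(x * scale).astype(np.int8)
    return x_int8, scale

def dequantize(x_int8: np.ndarray, scale: float) -> np.ndarray:
    """Map the int8 values back to floats (with some rounding error)."""
    return x_int8.astype(np.float32) / scale

# Example: a small half-precision weight matrix
W = np.array([[0.12, -1.50, 0.33],
              [2.04,  0.07, -0.98]], dtype=np.float16)

W_int8, scale = absmax_quantize(W.astype(np.float32))
W_restored = dequantize(W_int8, scale)
print(W_int8)
print(np.abs(W.astype(np.float32) - W_restored).max())  # small reconstruction error
```

The scale factor is stored alongside the int8 matrix, so the original values can be recovered approximately instead of simply being truncated.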
How exactly this works is content for another video, as there exist different quantization techniques; at this point it's just to emphasize that it's possible to reduce the size of a model by lowering the precision. There is also a very nice recent paper in which the authors ran 35,000 experiments to compare combinations of model size and k-bit quantization; the conclusion is that 4-bit quantization is almost universally optimal.

It's obvious that fewer digits mean less memory, but it's not only about memory: lower-precision models are also faster to train on most GPUs, because it takes less time to read the data. Halving the precision typically gives around a two-times speed improvement in terms of FLOPS during training. FLOPS stands for floating-point operations per second and is a common measure to compare the speed of hardware: the maximum number of floating-point operations, like multiplications, that the hardware is capable of. Here you can see that the performance of GPUs has been increasing over time in terms of FLOPS, which means the hardware can execute the matrix multiplications needed for deep learning ever faster. Here's an example from an NVIDIA benchmark that shows that smaller precisions increase the FLOPS.

All right, now we know two ways to make huge models smaller, namely lowering the precision and applying quantization. Besides these data type improvements, there's an emerging research trend of parameter-efficient fine-tuning techniques, which means fine-tuning large models using fewer weights than the total number of weights. Let's have a look at a few of these methods.

The traditional way of transfer learning was to simply freeze all weights and add a task-specific fine-tuning head. The downside of this, however, is that we only get access to the output embeddings of the model and can't learn on internal model representations. An extension of this are adapter layers, presented in a Google Research paper from 2019, which insert new modules between the layers of a large model and then fine-tune on those. In general this is a great approach, but it leads to increased latency during inference, and the computational efficiency is generally lower. A very different idea, specifically designed for language models, is prefix tuning, presented by Stanford researchers. This is a very lightweight alternative to fine-tuning which simply optimizes the input vectors of a language model. Essentially this is a way of prompting by prepending specific vectors to the input of the model; the idea is to add context that steers the language model. Of course, prefix tuning only allows controlling the model to some extent, so sometimes a certain degree of parameter tuning is still necessary. This finally leads us to LoRA, probably the most commonly used fine-tuning approach, which we will discuss in more detail in the following minutes; it performs a rank decomposition on the weight update matrices. Of course there exist more techniques, and a great place to work with them is the Hugging Face library with implementations of parameter-efficient fine-tuning techniques, called PEFT. Later in this video I will also give you a simple example.

Let's first discuss what LoRA, so low-rank adaptation, actually means. The rank of a matrix tells us how many independent row or column vectors exist in the matrix; more precisely, it's the number of linearly independent rows, which always equals the number of linearly independent columns. Here's an example (see the short sketch below). This number is an important property in various matrix calculations, from solving equations to analyzing data.
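As a quick illustration of rank (a toy example I added, not the one from the video), the matrix below has three columns, but the third is just the sum of the first two, so its rank is only 2:

```python
import numpy as np

# Three column vectors, but the third column equals the sum of the first two,
# so only two of them are linearly independent.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0]])

print(np.linalg.matrix_rank(A))  # -> 2
```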
A low rank simply means that the rank is smaller than the number of dimensions; in this example we have three dimensions but a rank of two. Low-rank matrices have several practical applications because they provide compact representations and reduce complexity. And finally, adaptation simply refers to the fine-tuning process of a model.

Now, what's the motivation behind LoRA? LoRA is motivated by a paper published in 2021 by Facebook research that discusses the intrinsic dimensionality of large models. The key point is that there exists a low-dimensional reparametrization that is as effective for fine-tuning as the full parameter space. Basically this means certain downstream tasks don't need to tune all parameters; instead, updating a much smaller set of weights can achieve good performance. Here is an example for fine-tuning BERT: they show that using a certain subset of parameters, namely around 200, it's possible to achieve ninety percent of the accuracy of full fine-tuning. Using such an accuracy threshold is how they define the intrinsic dimension, so basically the number of parameters needed to achieve a certain accuracy. Another interesting finding, evaluated on different datasets, is that the larger the model, the lower the intrinsic dimension. In theory this means that these large foundation models can be tuned on very few parameters and still achieve good performance, mostly because they have already learned a broad set of features and are general-purpose models.

Based on these results, the LoRA paper, presented by Microsoft researchers, proposes the idea that the change in model weights, ΔW, also has a low intrinsic dimension. As we know from before, the dimension is related to the rank of a matrix, so LoRA suggests fine-tuning through a low-rank matrix. More formally, this is done through a rank decomposition, as expressed by the equation W = W0 + ΔW = W0 + B·A. Here W0 are the original model weights, which stay untouched, and B and A are both low-rank matrices whose product is exactly the change in model weights ΔW. An important note: the goal is not to find a decomposition of a given ΔW into B and A, but rather the other direction; we construct ΔW by multiplying B and A. That also means they need to be initialized in such a way that ΔW equals zero at the start of training, which is done by setting B to zero and sampling the weights of A from a normal distribution. Let's have a look at an example (sketched below): the shape of the weight update matrix is 4 × 4, and it's constructed as the product B · A, where B and A are both low-rank matrices with rank 2.
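A minimal sketch of this construction (my own illustration of the 4 × 4 example; the initialization scale is an assumption):

```python
import numpy as np

d, r = 4, 2   # dimension of the weight matrix and the chosen rank

# LoRA-style initialization: B starts at zero and A is sampled from a normal
# distribution, so delta_W = B @ A is exactly zero at the start of training.
B = np.zeros((d, r))
A = np.random.normal(scale=0.01, size=(r, d))

delta_W = B @ A
print(delta_W.shape)              # (4, 4)
print(np.allclose(delta_W, 0))    # True: the model starts out unchanged
```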
With only four dimensions the implications are not too obvious yet, but imagine the shape of W is 200 × 200: then it's much more efficient to fit two matrices of shapes 200 × 2 and 2 × 200 instead of the full quadratic matrix. This decomposition can be applied to any dense neural network layer, but in a Transformer it's typically applied to the attention weights. In the forward pass the input is then multiplied with both the original model weights and the rank decomposition matrices, and the outputs are simply added together. Because of this, the implementation of LoRA is fairly easy: in addition to the regular forward function, which we can see at the top here, we now also send the inputs through the low-rank matrices and scale the result with a scaling factor; the output is simply added to the output of the frozen model (a minimal sketch of such a layer follows below). The only trainable parameters are A and B, the low-rank matrices.

But why is this scaling factor used? Looking at the details in the paper, we can see that the output of B and A is scaled with alpha divided by the rank. The rank in the denominator corresponds to the intrinsic dimension, which means to what extent we want to decompose the matrices; typical numbers range from 1 to 64 and express the amount of compression on the weights. Alpha is a scaling factor: it simply controls the amount of change that is added to the original model weights, so it balances the knowledge of the pre-trained model against the adaptation to the new task. Both the rank and alpha are hyperparameters. I found this GIF which shows an example of scaling the ratio from 0 to 1 for an image generation model: using zero it will produce the output of the original model, and using one the fully fine-tuned model. In practice, if you want to fully add LoRA, this ratio should be 1; it could also be larger than one if you want to put more emphasis on the fine-tuned weights. If your LoRA model tends to overfit, a lower value might help; if the fine-tuning doesn't really work, the ratio should be increased. The reason why alpha is divided by the rank is most likely that you want to decrease the magnitude of the weight updates, because with a higher rank you will have more values. But why is this scaling added in the first place, why do we need to balance this at all? The authors state that this scaling helps stabilize other hyperparameters, like the learning rate, when varying r. So in practice you might want to try different levels of decomposition by varying the rank, and thanks to the scaling you don't have to tweak the other parameters too much.

Talking about the rank, what is the optimal rank to choose? In the LoRA paper different experiments have been conducted that show that a very small rank already leads to pretty good performance; increasing the rank does not necessarily improve the performance, most likely because the data has a small intrinsic rank, but this certainly depends on the dataset. A good question to ask when choosing the rank is: did the foundation model already see similar data, or is my dataset substantially different? If it's different, a higher rank might be required. Different experiments run by the authors indicate that LoRA significantly outperforms other fine-tuning approaches on many tasks; here it's compared with techniques like prefix tuning and different adapters.

So let's quickly summarize the main benefits of LoRA. Because of the rank decomposition, we have much more convenient computational requirements during training: there are fewer weights to tune, the training is faster, and less memory is needed.
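Here is a minimal sketch of such a LoRA layer in PyTorch (my own simplified illustration, not the reference implementation from the paper or from a library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)       # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        in_f, out_f = base_linear.in_features, base_linear.out_features
        # B starts at zero, A is normally initialized, so B @ A (= delta W) is zero at first.
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))
        self.scaling = alpha / r                      # the alpha / rank scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Regular (frozen) forward pass plus the scaled low-rank update.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Usage: wrap an existing layer, e.g. an attention projection.
layer = LoRALinear(nn.Linear(200, 200), r=2, alpha=2.0)
out = layer(torch.randn(4, 200))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable
```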
Another beautiful thing is that we can simply merge the ΔW weights from the rank decomposition with the original model weights by adding them together. We end up with a new model without any overhead during inference, unlike in the case of adapters. Finally, a cool feature is that we can simply switch between different LoRA weights when fine-tuning for different downstream tasks, so this provides us with a sort of model zoo for a specific foundation model.

Finally, let's talk about how we can implement LoRA in practice. A lot of work has been done by the Hugging Face team to enable easy usage of this technique. The repository PEFT, which stands for parameter-efficient fine-tuning, provides implementations of all popular fine-tuning techniques, including LoRA. So luckily we don't have to manually apply a low-rank decomposition to every single layer; instead we can make use of the function get_peft_model, which does this job for us. In a config we can even specify certain target modules, for example the key, query, and value matrices of a Transformer, and here you also find the mentioned hyperparameters alpha and the rank. We can then call a function that prints the total number of parameters and the trainable LoRA parameters; in this example we can see that only 0.19 percent of the original model weights will be trained. So overall this is a very convenient library and allows training huge models on a single GPU (a small sketch of this workflow follows below). Going back to the beginning of this video where I talked about quantization, you now also have the option to combine quantization techniques with LoRA to further reduce the hardware requirements. The paper called QLoRA, presented earlier this year, additionally adds 4-bit quantization to the pre-trained model weights; that means we don't pass the input through the original model weights but instead through a quantized version of them.

All right, that's it for this overview. I hope this was helpful to get familiar with the topic, and I would be happy to see you again in a future video.
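For reference, a minimal sketch of the PEFT workflow described above (the base model name and the target module names are assumptions and depend on the architecture you fine-tune):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model; any causal LM from the Hugging Face Hub works similarly.
base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

config = LoraConfig(
    r=8,                                   # rank of the decomposition
    lora_alpha=16,                         # the alpha scaling hyperparameter
    target_modules=["query_key_value"],    # which weight matrices get LoRA applied (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()         # reports total vs. trainable parameter counts
```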
Info
Channel: DeepFindr
Views: 29,500
Keywords: LoRA, Low-rank adaptation, QLoRA
Id: t509sv5MT0w
Length: 17min 6sec (1026 seconds)
Published: Sat Aug 26 2023