Low-Rank Adaptation - LoRA explained

Video Statistics and Information

Captions
To build a custom model for our application, we start with a pre-trained language model and fine-tune it on our own dataset. This used to be fine until we reached the large language model regime and started working with models such as GPT, LLaMA, Vicuna and so on. These LLMs are quite bulky, so fine-tuning a model for different applications such as summarization or reading comprehension means deploying a copy of the model for each application, and the size of these models is only increasing, almost on a weekly or monthly basis. So the deployment of these bulky LLMs is getting increasingly challenging.

One solution proposed for this problem is adapters. Adapters are trainable additional modules plugged into the neural network, mostly Transformers, and during fine-tuning only the parameters of these adapter modules are updated while the pre-trained model stays frozen. But because adapters are additional parameters, they introduce latency during inference: for a batch size of 32, a sequence length of 512 and about half a million trainable adapter parameters, a fine-tuned (or LoRA) model takes about 149 milliseconds for inference, but with adapters it is 2 or 3% higher. So how does LoRA achieve this feat? Let's find out in this video.

Before that, I would like to give a quick shout-out to our X account, where we share high-impact papers and research news from top AI labs in both academia and industry. If you wish to keep up to date with AI every single day, just hit the follow button on X.

LoRA stands for low-rank adaptation. So what does that mean? For any neural network architecture, let's not forget that the weights of the network are just large matrices of numbers, and every matrix comes with a property called the rank: the rank of a matrix is the number of linearly independent rows or columns of that matrix. To understand it, let's take a simple 3x3 matrix. The rank of this simple 3x3 matrix is one. Why? Because the first and second columns are redundant: they are just multiples of the third column. In other words, those two columns are linearly dependent and don't bring any meaningful information. Now, if we simply change one of the values to, say, 70, the rank becomes two, as we now have two linearly independent columns.

Knowing the rank of a matrix, we can do a rank decomposition of that matrix into two matrices. Going back to our 3x3 example, it can simply be written as the product of two matrices, one with dimension 3x1 and the other with dimension 1x3. Notice that we only have to store six numbers after decomposition instead of the nine numbers in the 3x3 matrix. This may sound like a small saving, but in reality neural network weights have very high dimensions, say 1024x1024, and with a rank of two this boils down to a really small number of values that we need to store, and hence to multiply when we actually want to do some computation, which is a large reduction. So wouldn't it be nice if these weights actually had a low rank, so that we could work with the rank decomposition instead of the entire weight matrix?
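To make the rank and rank-decomposition ideas above concrete, here is a minimal NumPy sketch (my own illustration, not code from the video; the specific matrix entries are only illustrative) that checks the rank of a 3x3 matrix, factors it into a 3x1 times 1x3 product, and counts the parameter savings for a hypothetical 1024x1024 weight at rank 2.

```python
import numpy as np

# Columns 1 and 2 are multiples of column 3, so the rank is 1.
# (Concrete numbers are illustrative, not taken from the video.)
M = np.array([[ 2.0,  4.0,  1.0],
              [ 6.0, 12.0,  3.0],
              [10.0, 20.0,  5.0]])
print(np.linalg.matrix_rank(M))      # 1

# Changing a single entry breaks the linear dependence and raises the rank to 2.
M2 = M.copy()
M2[0, 0] = 70.0
print(np.linalg.matrix_rank(M2))     # 2

# Rank-1 decomposition: M equals a 3x1 matrix times a 1x3 matrix,
# so we store 6 numbers instead of 9.
col = M[:, 2:3]                      # 3x1 (the third column)
row = np.array([[2.0, 4.0, 1.0]])    # 1x3 (the column multipliers)
assert np.allclose(col @ row, M)

# For a 1024x1024 weight, a rank-2 factorization needs 2 * 1024 * 2 = 4,096
# numbers instead of 1024 * 1024 = 1,048,576.
d, r = 1024, 2
print(d * d, 2 * d * r)
```

Running this prints the ranks 1 and 2 from the example and shows that a rank-2 factorization of a 1024x1024 weight stores 4,096 numbers instead of 1,048,576.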
It turns out that is indeed the case with pre-trained models, as shown by earlier work: it empirically shows that common pre-trained models have a very low intrinsic dimension. In other words, there exists a low-dimensional reparameterization that is as effective for fine-tuning as the full parameter space.

Let's say we start with a pre-trained model with weights W0. After fine-tuning, let the weights be updated to W0 + ΔW. If the pre-trained model has low-rank weights, it is a fair hypothesis that the fine-tuning update ΔW is also low rank, and LoRA goes with this assumption. Because ΔW is low rank, we can decompose it into two low-rank matrices A and B whose product BA equals ΔW, and fine-tuning then becomes W0 + BA instead of W0 + ΔW, which is one and the same thing.

With that perspective, when we train the model with an input x, the input passes through both the pre-trained weights and the rank-decomposition matrices A and B. The weights of the pre-trained model remain frozen, but we still use the output of the frozen model during training: the outputs of the frozen model and the low-rank branch are summed to obtain the output latent representation h. Mathematically, the input x is multiplied with both W0 and the BA matrix and the results are summed, h = W0x + BAx.

Now you may ask, what about latency during inference? If we slightly rearrange the above equation, we notice that we can merge the weights BA into the pre-trained weights W0, so for inference it is this merged weight W0 + BA that is deployed, thereby overcoming the latency bottleneck.

Another concern is the deployment of LLMs, as they are quite bulky, say about 50 or 70 GB. Suppose we have to fine-tune for two tasks, namely summarization and translation. We don't have to deploy the entire model for every fine-tune: we can simply train the LoRA layers specific to a task, for example summarization, and deploy those for summarization, and likewise deploy the LoRA layers specific to translation. Thus LoRA overcomes both the deployment and the latency bottlenecks faced by modern-day large language models.

In terms of applying this to Transformers, we know that Transformers have two main modules: multi-headed self-attention and the multi-layer perceptron (MLP). The self-attention module is composed of query, key, value and output weight matrices. In the paper, the authors limit their study to adapting only the attention weights for downstream tasks and keep the MLP modules frozen, so those are not trained on downstream tasks; in other words, LoRA is applied only to the self-attention module.

We have been talking about using LoRA for adaptation, and one of its key hyperparameters is the rank, which is something we have to choose. So what is the optimal rank for LoRA? It turns out, to everyone's surprise, that a rank as small as one is sufficient when adapting both the query and the value matrices. However, when adapting the query matrix alone, it needs a larger rank of, say, four or eight or even 64.

Moving on to how we can practically use LoRA, there is the official implementation from Microsoft, released as loralib and available under the MIT license. Another option is the Hugging Face library called PEFT, which stands for parameter-efficient fine-tuning and is available under the Apache 2.0 license. PEFT also includes a few other methods such as prefix tuning and prompt tuning, and LoRA is one of the earliest implementations in the library.

I think that pretty much covers the important bits about LoRA. I hope this video was useful in understanding how LoRA works. I hope to see you in my next video; until then, take care.
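To connect the equations in the transcript to code, here is a minimal PyTorch sketch (my own illustration, not the official loralib or PEFT implementation) of a LoRA linear layer: a frozen pre-trained weight W0, trainable low-rank factors A and B computing h = W0x + BAx, and a merge step that folds BA into W0 so inference needs only a single matrix multiply. The zero initialization of B and the alpha/r scaling follow the LoRA paper; the class name, method names and hyperparameter values here are my own choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=4, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)   # pre-trained weight W0
        self.base.weight.requires_grad_(False)           # frozen during fine-tuning
        # Low-rank factors: B starts at zero so BA = 0 at the start of training,
        # i.e. the adapted model initially behaves exactly like the pre-trained one.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # h = x W0^T + x (B A)^T  -- only A and B receive gradients.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

    @torch.no_grad()
    def merge(self):
        # Fold B A into W0 so inference uses a single matmul (no extra latency).
        self.base.weight += (self.B @ self.A) * self.scale

layer = LoRALinear(1024, 1024, rank=2)

# Pretend training has updated the factors, so the merge check is non-trivial.
with torch.no_grad():
    layer.B.copy_(torch.randn_like(layer.B) * 0.01)

x = torch.randn(8, 1024)
before = layer(x)                 # frozen branch + low-rank branch
layer.merge()
after = layer.base(x)             # merged weight alone reproduces the output
print((before - after).abs().max())   # ~0: any difference is float32 rounding noise
```

Because B starts at zero, fine-tuning begins exactly at the pre-trained behaviour, and only the small A and B matrices accumulate gradients, which is where the parameter and storage savings come from; swapping in a different pair of A and B gives a different task without redeploying W0.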
Info
Channel: AI Bites
Views: 2,581
Keywords: machinelearning, deeplearning, transformers, artificial intelligence, AI, deep learning, machine learning, educational, how to learn AI
Id: X4VvO3G6_vw
Length: 10min 42sec (642 seconds)
Published: Thu Dec 14 2023