Fine-Tune Language Models with LoRA! OobaBooga Walkthrough and Explanation.

Video Statistics and Information

Captions
Hey YouTube, today we're going to be going over one of the most powerful tools we have in our toolset for training and fine-tuning large language models: LoRAs. We're going to cover how LoRAs work, what they are, and how we actually implement and use LoRAs to fine-tune our large language models using oobabooga's text generation web UI. If you'd like to skip the conversation about how they work, you can skip to this timestamp, but if you'd like to learn how they work under the hood, stay tuned. So let's get started.

Why LoRA is a big deal is that it's essentially doing a dimensionality reduction. During fine-tuning tasks we find that these generative models are really just tuning some lower-dimensional set of their parameters, so why retrain the entire network, or a layer, when we can just create our own lower-dimensional representation? This takes advantage of a concept in linear algebra called rank decomposition, where the most common type we're taught is singular value decomposition. To understand why this works, we just need to understand one concept in linear algebra: when can we multiply matrices?

When we say an M-by-N matrix, we're talking about how many rows and columns there are in the matrix, so a 2-by-2 matrix has four values: you get the number of values by multiplying the dimensions, M times N. To multiply two matrices, the only thing we need is that their inner dimensions match. For example, a 4-by-1 and a 1-by-4 can be multiplied because their inner dimensions match, and we end up with a 4-by-4.

If we extend this to rank decomposition, it says that if we have a matrix A which is M-by-N, we can represent it with three matrices: dimensions M-by-R for B, R-by-R for C, and R-by-N for D, and if we multiply them together we get an M-by-N matrix back. This is powerful because memory matters. If we start with a 4-by-4 matrix whose first two rows are 1 2 3 4 and 2 4 6 8 (every row a multiple of the first), we can represent it exactly with two vectors, a 4-by-1 and a 1-by-4; multiplying them together gives the 4-by-4 back, and instead of having to store 16 values we're storing eight.

But we don't need to represent things exactly; approximation is also okay. If we start with a matrix A that's 500-by-100, it has 50,000 values in it, but instead of storing all of that we could approximate A with a matrix B that has dimensions 500-by-10 and a matrix D that has dimensions 10-by-100. The inner dimensions match, and if we multiply them together we get a 500-by-100 matrix back, but we're only having to store 6,000 values. These memory savings get bigger and bigger as the dimensions of the matrices grow.

How these LoRA matrices play into the generative network is that we attach the low-rank matrices to the feed-forward portion of our transformer model. They could attach in other places, but we typically attach them to the feed-forward portion. We hold the weights in the feed-forward layer fixed, and instead we train and update the weights inside the LoRA matrices. All the LoRA has to do is have its outer dimensions match the dimensions of the feed-forward network, which in this case are N, but the inner dimension can be R, which can be much, much less than N; and remember, multiplying the dimensions gives the total number of elements, so that can be much, much smaller, and we're not dealing with the same memory constraints. When the two multiplications are done, by the feed-forward layer and by the LoRA, we just add their results together and influence the output weights. This kind of dimensionality reduction is done all over the place, through things like principal component analysis or autoencoders, and we're just now applying it to our large language models and generative models.
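To make the memory arithmetic concrete, here is a minimal Python/NumPy sketch of the two examples above. The numbers (the rank-1 4-by-4 matrix and the 500-by-100 approximation at rank 10) come from the video; the random matrix and the use of truncated SVD to build the factors are illustrative assumptions.

```python
import numpy as np

# Exact case: a 4x4 rank-1 matrix is an outer product of two vectors,
# so its 16 values can be stored as 8.
b = np.array([[1], [2], [3], [4]])   # 4x1
c = np.array([[1, 2, 3, 4]])         # 1x4
A = b @ c                            # 4x4; first rows are 1 2 3 4 and 2 4 6 8
assert A.shape == (4, 4) and b.size + c.size == 8

# Approximate case: a 500x100 matrix (50,000 values) kept as a 500x10
# factor and a 10x100 factor (6,000 values). Truncated SVD gives the
# best rank-10 approximation; how good it is depends on how quickly
# the singular values decay.
W = np.random.default_rng(0).normal(size=(500, 100))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 10
B = U[:, :r] * S[:r]                 # 500x10, singular values folded in
D = Vt[:r, :]                        # 10x100
W_approx = B @ D                     # a 500x100 matrix again
print(B.size + D.size)               # 6000 stored values instead of 50000
```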
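And here is a minimal sketch of the attachment idea itself: a frozen linear layer with a trainable low-rank update added to its output. This illustrates the mechanism described above, not the exact implementation the web UI uses; the class name and initialization are assumptions (zero-initializing B so the update starts as a no-op is the convention from the LoRA paper).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # feed-forward weights stay fixed
        in_f, out_f = base.in_features, base.out_features
        # Outer dimensions match the base layer; the inner dimension r is tiny.
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # r x in
        self.B = nn.Parameter(torch.zeros(out_f, r))        # out x r, starts at zero
        self.scale = alpha / r                   # alpha controls the LoRA's influence

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The frozen path and the low-rank path are simply added together.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage sketch: wrap an existing feed-forward projection.
ff = nn.Linear(4096, 4096)
lora_ff = LoRALinear(ff, r=8, alpha=16.0)
y = lora_ff(torch.randn(2, 4096))                # shape (2, 4096)
```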
So now let's move forward and see how we can actually use LoRAs to fine-tune your language model. Today we're going to be using the oobabooga text generation web UI, and we're just going to go into the training tab. Once we're under this tab we're presented with a lot of different options, but before we set any of these (and we will go over them) we want to make sure we have our training data. That can be presented in two forms: as a raw text file or as a formatted dataset. The raw text file is just that, a splattering of text that we can use to give the network new information or teach it something new. With the formatted dataset, we can teach the network something new about its behavior, in this kind of instruction/output format. In this case I have the user asking "hello, I would like to speak to you about your car's extended warranty" and the assistant replying "what about my car's extended warranty?" These will be in the description below, because I just want to help everybody know how to format these things.
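As a rough illustration, a formatted dataset in the alpaca style used later in the video might look like the snippet below, built from the transcript's own example dialogue. The exact field names depend on the format template you select in the UI, so treat this as an assumed example rather than the actual file from the video's description.

```json
[
  {
    "instruction": "Hello, I would like to speak to you about your car's extended warranty.",
    "input": "",
    "output": "What about my car's extended warranty?"
  }
]
```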
Now we can actually start talking about what these hyperparameters do and how they affect your training.

The first thing we want to do is give it a name; in my case I'm calling it Aemon's example. You can also copy the parameters from another set of LoRAs, or from your own previous LoRAs.

Batch size plays into the epochs: remember, for each epoch the network will see the totality of your training set, but the batch size says how much of it you want it to see at once. Higher values can lead to overfitting or memorization issues, while lower values can lead to longer training periods. We can also set the micro batch size: if you're having VRAM issues, a micro batch can help save memory by chunking things out a little further.

The epoch count describes how many times the network is going to see the totality of the dataset during the training process. The learning rate is how quickly the weights are going to be updated, and the default value tends to work pretty well. The learning rate scheduler updates the learning rate during the training process; I tend to use linear or cosine, but you may experiment with that for your dataset to see what performs best.

The LoRA rank describes the dimensionality of the LoRA matrices. Higher values can lead to significantly finer results but use a lot more memory, whereas smaller values use a lot less memory and can still give good results; it really depends on the complexity of your dataset and how much change in behavior you're looking for. The LoRA alpha describes how much influence the LoRA has on the feed-forward portion of the network: higher values result in much greater impact, lower values in less, so if you think your LoRA is having too much of an impact you can always try backing it off. The cutoff length also has a big impact on memory; it's how much text is presented at once, so if you are having issues with training memory, backing that off can help quite a bit.

The other thing about the training data is that once we go to load it, we want to make sure the files have the correct extension: our formatted training data needs to be a .json, and our raw text just needs to be a .txt. Once we have that loaded we can go ahead and select it; we want to have both a training and a validation set, so we select our validation set as well, and then we just need to tell it what format we're going to be using. In my case I'm going to be using an alpaca format. Then we set how many steps we want to take before evaluating the network's performance.

There are some advanced options as well before we start, namely the LoRA dropout. The rest of these options I don't think need to be touched very much, but the dropout can be important because it helps to prevent overfitting, so you might have to toy with this value if you notice that your network is overfitting or memorizing. Once we have all that, we just click on start LoRA training, and that's all we have to do.

If this was helpful, please like and subscribe, and please let us know in the comments what you'd like to learn about next. Join us next time, when we're going to be continuing our conversation about how large language models work under the hood, specifically positional encoding and embedding spaces.
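For readers who want to see how the same knobs appear outside the web UI, here is a hedged sketch using the Hugging Face PEFT library, which exposes the rank, alpha, and dropout hyperparameters discussed above. The specific values, target module names, and model identifier are illustrative assumptions, not recommendations from the video.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values only; tune rank, alpha, and dropout for your dataset.
config = LoraConfig(
    r=8,                 # LoRA rank: inner dimension of the low-rank matrices
    lora_alpha=16,       # LoRA alpha: how strongly the update influences the layer
    lora_dropout=0.05,   # LoRA dropout: helps prevent overfitting/memorization
    target_modules=["q_proj", "v_proj"],  # assumed module names; these vary by model
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder id
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```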
Info
Channel: AemonAlgiz
Views: 21,164
Keywords: LoRA, Language Model, Fine Tuning, Linear Algebra, NLP, AI, Machine Learning, Text Generation, Memory Optimization, Transformer Model, open source llm, Dimensionality Reduction, Rank Decomposition, Singular Value Decomposition, Web UI, Hyperparameters, Training Data, Model Validation, Dropout, Neural Networks, Natural Language Processing
Id: 7pdEK9ckDQ8
Length: 9min 21sec (561 seconds)
Published: Wed May 10 2023