Landmark Attention Training Walkthrough! QLoRA for Faster, Better, and Even Local Training.

Video Statistics and Information

Captions
Hey YouTube, today we're going to talk about how to use landmark attention to fine-tune your models for greater context. We'll start with how to set up oobabooga to correctly utilize landmark attention, then how to actually fine-tune your models with landmark attention and the hyperparameters that affect it, and then how to execute the fine-tune, merge it together, and use it in oobabooga. So let's get started.

We want to be able to run our model in something like oobabooga, which now does support landmark attention. To get it to properly support it, we just need to set a few parameters: lower the repetition penalty, set "Truncate the prompt" length to 8192, set the chat prompt size to 8192 as well, and uncheck "Add the bos_token to the beginning of prompts." Then, under the Model tab, we want to select "trust-remote-code," because the Transformers library does not support this right out of the box.

Now let's move on to getting our fine-tuning setup. The first thing we're going to do is navigate to this repository, which will be linked in the description below. It's a fork of the official landmark attention repository that implements QLoRA, which means we will be training the network quantized instead of as a 16- or 32-bit floating-point network. That means we should be able to train even 7 billion or 13 billion parameter models locally, as long as you have enough VRAM: a 3080, 3090, 4080, or 4090 should be able to handle those 7 or 13 billion parameter models.

First we're going to check out the code and clone it down: click on Code, hit copy, go to the command line, and run git clone to pull the repository. Then cd into it and run conda create -n landmark, which will create a conda environment for us. I already have this set up, so I'm going to overwrite my environment. Once the environment has been created, go ahead and run pip install -r requirements.txt and let that install. In my case that was pretty quick because I already had it set up, but if you haven't run these before it will take a little while. If you run into a problem with the Transformers library saying that a method it expects doesn't exist, just run pip uninstall transformers and then pip install -r requirements.txt again; that should resolve the issue. The setup steps are sketched below.
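As a quick reference, here is a minimal sketch of those setup steps. The repository URL and directory name are placeholders for the fork linked in the video description, and pinning a Python version for the conda environment is an assumption the video doesn't make:

```
# Clone the QLoRA fork of the landmark attention repository (the real URL is
# in the video description; a placeholder is used here).
git clone <repository-url-from-description>
cd <repository-directory>

# Create and activate a conda environment. The video only runs
# "conda create -n landmark"; pinning a Python version here is an assumption.
conda create -n landmark python=3.10
conda activate landmark

# Install the fork's pinned dependencies.
pip install -r requirements.txt

# If Transformers complains about a missing method, reinstall it from the
# pinned requirements as described above.
pip uninstall transformers
pip install -r requirements.txt
```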
Now that we have it installed, let's move on to actually fine-tuning a model. Before we start, let's go over the hyperparameters that you can set. The model name or path is the model you plan to train: if you have the model locally, you just give it your local path, but if you want to pull something down from Hugging Face, you do it just like you would in oobabooga. The output directory is the local directory you want to save your model to, and the cache directory is where you want your Hugging Face training data saved.

The per-device train batch size is how much of the training set you want the model to see at once. Four is pretty small and it will take a while for the network to actually learn the patterns, but higher values like 256 or 512 have a propensity for memorization, so you typically want to try sizes between 16 and 32, or maybe 64. The gradient accumulation steps control how often the gradients flow back into your LoRAs; there's not a lot of reason to change this, but you could try experimenting with 4, 6, or 12. The learning rate is how quickly the network updates its weights; this probably shouldn't be changed, but for larger networks like 33 billion parameters you might try lowering it. I really don't recommend changing the weight decay. The logging steps are how often you want things logged; given how long this takes, I recommend logging every step so you can make sure your loss is going down like you expect it to. The warm-up ratio is kind of an odd concept: it controls how quickly the full training set comes into play, ramping up linearly as you go through epochs, which lets the model warm up to the training. Max steps is how many steps you want to take before completing your training; 200 is probably a good value to start with. We don't want to change bf16 or tf32. Group by length groups examples of similar lengths together, which does help here. LoRA r is your LoRA dimensionality: remember, these are low-rank decompositions, and 32 is the inner dimension here, but for local training, especially if you only have 16 GB of VRAM, I had to pull the LoRA dimension down pretty low; with 24 GB you could probably get away with 16 or maybe even 32. Finally, LoRA alpha is how much influence you want the LoRA to have on the network overall, and a good rule of thumb is a quarter or a half of your LoRA dimension.

Now let's get on to actually executing a fine-tune. Once you've selected the model you want to train and hyperparameters you're happy with, we can just run the training command line and it will handle the rest; it will be in the description below, it's also on the repository, and a rough sketch follows below. We can't really use custom datasets right now, just the dataset that's used by default, but that should change pretty soon, especially for structured data: the developer of this repository has talked about wanting to support structured training, because it should help models, especially instruction-tuned models like Wizard, learn to pay greater, and the right, attention to the landmark tokens.
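Here is a rough sketch of what that training command might look like, built from the hyperparameters discussed above. The script name and exact flag spellings are assumptions in the style of Hugging Face training arguments, the learning rate is a placeholder the video never specifies, and weight decay, warm-up ratio, bf16, and tf32 are simply left at the repository defaults as advised; check the fork's README for the real command:

```
# Hypothetical training invocation -- script name (train.py) and flag names
# are assumed, not taken verbatim from the repository. Batch size 16,
# logging_steps 1, max_steps 200, lora_r 16, and lora_alpha 8 follow the
# guidance in the walkthrough; the learning rate is a placeholder.
python train.py \
    --model_name_or_path ./models/my-base-model \
    --output_dir ./output/landmark-qlora \
    --cache_dir ./hf_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --logging_steps 1 \
    --max_steps 200 \
    --group_by_length True \
    --lora_r 16 \
    --lora_alpha 8
```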
If you'd like to try running this locally, I encourage it; I think it should work pretty well, especially on 3080s, 3090s, 4080s, or 4090s, though it will take a very long time. Even on H100s it takes two to three hours to train these, so I would expect 10 to 14 hours in total on local hardware, if it manages to complete. Let's go ahead and execute this and see what you should expect: you should see it start loading the sharded checkpoints, and once that's done it will pull in the dataset and then actually start the fine-tuning process. Once that finishes, we'll want to merge the LoRAs into the network, so we'll come back and do that.

Once we're done fine-tuning, all we have to do now is merge our LoRAs into our model. That's because our LoRAs are attached to each layer, so if we try to run it in oobabooga or any of the other chat apps, it won't run correctly. All we have to do is run the merge command line (a sketch appears at the end of this transcript): we run python merge_peft.py, give the base model name or path the model we'd like to merge into, give the PEFT model path the exact location our LoRA was saved to (in my case it saved into a checkpoint directory containing the adapter model), set the output directory we want to save the merged model to, and hit enter. Once this finishes running, it will have merged the LoRA in and we can run it in oobabooga, and once it's done we'll see how well it runs.

Now that we have our model in oobabooga (you load it just like you would any other model), we can give it a test and see how well it performs. What I have here is a bunch of text about engineering, and at the very beginning I have a passcode that I want the model to remember, so let's see how well it does. There are about 4,000 total tokens here, and this model typically only handles 2,048, so let's see how it performs. It takes a little while for this to finish, but once it does we should hopefully see that it remembered the passcode. There it goes, and it did. The nice thing about this is that we can give the model significantly larger context than we typically could and still have it retain that context. That's really all we have to do here, and if you just want to try out the wider context for yourself, feel free to pull down WizardLM with TheBloke's quantization, which will be in the description below.

That's it! If this was helpful, please like and subscribe, and let us know in the comments what you'd like to hear about next. Tune in next time, when we'll be talking about how we could adapt landmark attention to allow more dynamic control over our context. See y'all next time.
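As referenced above, here is a sketch of the merge step described in the walkthrough. The script name and flag spellings are transcribed from the spoken audio and may not match the repository exactly, and the paths are placeholders:

```
# Merge the trained LoRA adapters into the base model so oobabooga can load it.
# Script name (merge_peft.py) and flag names are approximations of what the
# video describes -- verify them against the repository.
python merge_peft.py \
    --base_model_name_or_path ./models/my-base-model \
    --peft_model_path ./output/landmark-qlora/checkpoint-200 \
    --output_dir ./models/my-merged-model
```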
Info
Channel: AemonAlgiz
Views: 2,660
Keywords: Landmark attention, fine-tuning models, oobabooga, hyperparameters, model training, AI model performance, large context, machine learning, AI tutorials, context awareness, transformer libraries, LoRA, Q LoRA, quantized network training, hugging face, Model setup, gradient accumulation, learning rate, model fine-tuning, local model training, parameter models, natural language processing, AI development, training data, LoRA dimensionality, landmark tokens, chatbot development
Id: lCJbO8ERZuU
Length: 9min 48sec (588 seconds)
Published: Thu Jun 15 2023