Landmark Attention Training Walkthrough! QLoRA for Faster, Better, and Even Local Training.

Video Statistics and Information

Captions
Hey YouTube, today we're going to talk about how to use landmark attention to fine-tune your models for greater context. We'll start with how to set up oobabooga to correctly utilize landmark attention, then how to actually fine-tune your models with landmark attention and the hyperparameters that affect it, and then how to execute the fine-tune, merge it together, and use it in oobabooga. So let's get started.

We want to be able to run our model in something like oobabooga, which now does support landmark attention. To get it to properly support it, we just need to set a few parameters: lower the repetition penalty, set "Truncate the prompt" length to 8192, set the chat prompt size to 8192 as well, and uncheck "Add the bos_token to the beginning of prompts." Then, under the Model tab, we want to select "trust-remote-code," because the Transformers library does not support this right out of the box.

Now let's move on to getting our fine-tuning setup. The first thing we're going to do is navigate to this repository, which will be linked in the description below. It's a fork of the official landmark attention repository that implements QLoRA, which means we will be training the network quantized instead of as a 16- or 32-bit floating-point network. That means we should be able to train even 7 billion or 13 billion parameter models locally, as long as you have enough VRAM: a 3080, 3090, 4080, or 4090 should be able to handle those 7 or 13 billion parameter models.

First we're going to check out the code and clone it down: click on Code, hit copy, go to the command line, and run git clone to pull the repository. Then cd into it and run conda create -n landmark, which will create a conda environment for us. I already have this set up, so I'm going to overwrite my environment. Once the environment has been created, go ahead and run pip install -r requirements.txt and let that install. In my case that was pretty quick because I already had it set up, but if you haven't run these before it will take a little while. If you run into a problem with the Transformers library saying that a method it expects doesn't exist, just run pip uninstall transformers and then pip install -r requirements.txt again; that should resolve the issue. The setup steps are sketched below.
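As a quick reference, here is a minimal sketch of those setup steps. The repository URL and directory name are placeholders for the fork linked in the video description, and pinning a Python version for the conda environment is an assumption the video doesn't make:

```
# Clone the QLoRA fork of the landmark attention repository (the real URL is
# in the video description; a placeholder is used here).
git clone <repository-url-from-description>
cd <repository-directory>

# Create and activate a conda environment. The video only runs
# "conda create -n landmark"; pinning a Python version here is an assumption.
conda create -n landmark python=3.10
conda activate landmark

# Install the fork's pinned dependencies.
pip install -r requirements.txt

# If Transformers complains about a missing method, reinstall it from the
# pinned requirements as described above.
pip uninstall transformers
pip install -r requirements.txt
```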
Now that we have it installed, let's move on to actually fine-tuning a model. Before we start, let's go over the hyperparameters that you can set. The model name or path is the model you plan to train: if you have the model locally, you just give it your local path, but if you want to pull something down from Hugging Face, you do it just like you would in oobabooga. The output directory is the local directory you want to save your model to, and the cache directory is where you want your Hugging Face training data saved.

The per-device train batch size is how much of the training set you want the model to see at once. Four is pretty small and it will take a while for the network to actually learn the patterns, but higher values like 256 or 512 have a propensity for memorization, so you typically want to try sizes between 16 and 32, or maybe 64. The gradient accumulation steps control how often the gradients flow back into your LoRAs; there's not a lot of reason to change this, but you could try experimenting with 4, 6, or 12. The learning rate is how quickly the network updates its weights; this probably shouldn't be changed, but for larger networks like 33 billion parameters you might try lowering it. I really don't recommend changing the weight decay. The logging steps are how often you want things logged; given how long this takes, I recommend logging every step so you can make sure your loss is going down like you expect it to. The warm-up ratio is kind of an odd concept: it controls how quickly the full training set comes into play, ramping up linearly as you go through epochs, which lets the model warm up to the training. Max steps is how many steps you want to take before completing your training; 200 is probably a good value to start with. We don't want to change bf16 or tf32. Group by length groups examples of similar lengths together, which does help here. LoRA r is your LoRA dimensionality: remember, these are low-rank decompositions, and 32 is the inner dimension here, but for local training, especially if you only have 16 GB of VRAM, I had to pull the LoRA dimension down pretty low; with 24 GB you could probably get away with 16 or maybe even 32. Finally, LoRA alpha is how much influence you want the LoRA to have on the network overall, and a good rule of thumb is a quarter or a half of your LoRA dimension.

Now let's get on to actually executing a fine-tune. Once you've selected the model you want to train and hyperparameters you're happy with, we can just run the training command line and it will handle the rest; it will be in the description below, it's also on the repository, and a rough sketch follows below. We can't really use custom datasets right now, just the dataset that's used by default, but that should change pretty soon, especially for structured data: the developer of this repository has talked about wanting to support structured training, because it should help models, especially instruction-tuned models like Wizard, learn to pay greater, and the right, attention to the landmark tokens.
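Here is a rough sketch of what that training command might look like, built from the hyperparameters discussed above. The script name and exact flag spellings are assumptions in the style of Hugging Face training arguments, the learning rate is a placeholder the video never specifies, and weight decay, warm-up ratio, bf16, and tf32 are simply left at the repository defaults as advised; check the fork's README for the real command:

```
# Hypothetical training invocation -- script name (train.py) and flag names
# are assumed, not taken verbatim from the repository. Batch size 16,
# logging_steps 1, max_steps 200, lora_r 16, and lora_alpha 8 follow the
# guidance in the walkthrough; the learning rate is a placeholder.
python train.py \
    --model_name_or_path ./models/my-base-model \
    --output_dir ./output/landmark-qlora \
    --cache_dir ./hf_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --logging_steps 1 \
    --max_steps 200 \
    --group_by_length True \
    --lora_r 16 \
    --lora_alpha 8
```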
If you'd like to try running this locally, I encourage it; I think it should work pretty well, especially on 3080s, 3090s, 4080s, or 4090s, though it will take a very long time. Even on H100s it takes two to three hours to train these, so I would expect 10 to 14 hours in total on local hardware, if it manages to complete. Let's go ahead and execute this and see what you should expect: you should see it start loading the sharded checkpoints, and once that's done it will pull in the dataset and then actually start the fine-tuning process. Once that finishes, we'll want to merge the LoRAs into the network, so we'll come back and do that.

Once we're done fine-tuning, all we have to do now is merge our LoRAs into our model. That's because our LoRAs are attached to each layer, so if we try to run it in oobabooga or any of the other chat apps, it won't run correctly. All we have to do is run the merge command line (a sketch appears at the end of this transcript): we run python merge_peft.py, give the base model name or path the model we'd like to merge into, give the PEFT model path the exact location our LoRA was saved to (in my case it saved into a checkpoint directory containing the adapter model), set the output directory we want to save the merged model to, and hit enter. Once this finishes running, it will have merged the LoRA in and we can run it in oobabooga, and once it's done we'll see how well it runs.

Now that we have our model in oobabooga (you load it just like you would any other model), we can give it a test and see how well it performs. What I have here is a bunch of text about engineering, and at the very beginning I have a passcode that I want the model to remember, so let's see how well it does. There are about 4,000 total tokens here, and this model typically only handles 2,048, so let's see how it performs. It takes a little while for this to finish, but once it does we should hopefully see that it remembered the passcode. There it goes, and it did. The nice thing about this is that we can give the model significantly larger context than we typically could and still have it retain that context. That's really all we have to do here, and if you just want to try out the wider context for yourself, feel free to pull down WizardLM with TheBloke's quantization, which will be in the description below.

That's it! If this was helpful, please like and subscribe, and let us know in the comments what you'd like to hear about next. Tune in next time, when we'll be talking about how we could adapt landmark attention to allow more dynamic control over our context. See y'all next time.
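As referenced above, here is a sketch of the merge step described in the walkthrough. The script name and flag spellings are transcribed from the spoken audio and may not match the repository exactly, and the paths are placeholders:

```
# Merge the trained LoRA adapters into the base model so oobabooga can load it.
# Script name (merge_peft.py) and flag names are approximations of what the
# video describes -- verify them against the repository.
python merge_peft.py \
    --base_model_name_or_path ./models/my-base-model \
    --peft_model_path ./output/landmark-qlora/checkpoint-200 \
    --output_dir ./models/my-merged-model
```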
Info
Channel: AemonAlgiz
Views: 2,660
Keywords: Landmark attention, fine-tuning models, oobabooga, hyperparameters, model training, AI model performance, large context, machine learning, AI tutorials, context awareness, transformer libraries, LoRA, Q LoRA, quantized network training, hugging face, Model setup, gradient accumulation, learning rate, model fine-tuning, local model training, parameter models, natural language processing, AI development, training data, LoRA dimensionality, landmark tokens, chatbot development
Id: lCJbO8ERZuU
Length: 9min 48sec (588 seconds)
Published: Thu Jun 15 2023