Fine tuning LLMs for Memorization

Video Statistics and Information

Captions
I'm going to walk you through the steps and the theory for fine-tuning a language model for memorization. This is where you have a custom dataset and you want the language model to remember, ideally in great detail, the content of that custom dataset when you use it later for inference.

For the agenda in this video, we'll start off by talking about the reversal curse. This will help you understand how language models work and why we have to fine-tune the way we do when we want the model to memorize content. I'll then show a toy training example that will help you understand the kind of dataset you need in order to get effective memorization. Next I'll show the steps of building a synthetic question-and-answer dataset: usually you'll be starting with a PDF or a text document, but you'll need to get that data into the right format and probably also expand the data to make sure you really get the model to memorize. I'll then talk fairly briefly through some of the hyperparameter choices in fine-tuning a model, specifically choosing the batch size, the learning rate and the number of epochs, and I'll also talk briefly about which model you should start to fine-tune. Then I'll walk you through all the steps on an A6000 GPU on RunPod: I'll walk through the notebook, show you the results, and finish up by highlighting how the model performs before and after fine-tuning and comparing it with some different hyperparameter and model choices.

If you just start off with a document and extract the text from it, you're probably going to have a hard time getting the model to memorize that content. If you have very many documents considering the content from different angles, that will allow the model to build a statistical representation, and you'll probably get reasonable memorization. So what I'm going to talk about now is how to generate a dataset that presents the data from different perspectives; those different perspectives allow the model to build a complete statistical representation of the knowledge and so give more accurate memorization.

To help understand this, we can look at what's called the reversal curse. Let me go over here to GPT. I'm asking a simple question: what is the name of Tom Cruise's mother? And the answer is that Tom Cruise's mother was named Mary Lee Pfeiffer South; she was a special education teacher and played a significant role in his early life before she passed away in 2017. Now look at this: if I flip that around and ask what is the name of Mary Lee South's son (and I add "do not search the internet" so that GPT doesn't check online), the answer is that the son is named Jonathan South. So why is GPT able to tell me the mother of Tom Cruise but not able to tell me Mary Lee South's son? The reason is that in the training dataset, which is largely the worldwide web, it's pretty much always Tom Cruise's name that appears first, with further information appended afterwards, for example the name of his mother or parents or other facts about Tom Cruise. Very rarely on the internet will some fact or sentence start with Mary Lee South and then follow up with Tom Cruise.

This helps you understand that if you just have one document (and if it's a concise document, that's even worse), each piece of information is only going to be presented once, and only in one specific order, and that's not going to be enough to get a robust representation. If you ask a question slightly differently, in a different order, you're probably not going to get the answer you expect, because the model only ever saw the information
presented in the training data's specific order.

So the whole key to getting good memorization is that it's not enough to just take a sentence. You need to take that sentence, or rather the knowledge in it, and represent it many times in different ways, because then the model has a broader dataset that gives a better statistical representation and is more likely to answer questions correctly when those questions come from different angles.

To make this concrete, let's look at a toy training example. Say you have a PDF, and within the PDF there's a sentence: "if the ball is propelled forward, the defending team is awarded a penalty". What I'm saying is that this alone is not enough for accurate memorization; you need that sentence in different flavours, for example "the defending team gets a penalty if the ball is propelled forward", or "what's the result if the ball is passed forward? A penalty." So to get good memorization you need a way to take a starting dataset and expand it into a much broader dataset that ideally hits the questions from different angles.

The next question is: given an input dataset, so given this blue sentence, how can we systematically create an expanded dataset? The way I'm going to propose here is to do it synthetically, using a language model. Here's how we might do that. We take the same phrase, which is going to be the raw text that goes to the language model, and I've added in the content or context, which is "international touch rugby rules". With this context and text, I ask the language model to create a nuanced question and answer, and I say the question must include the context. In this way I can send the raw data in to the language model and get back a question and an answer.

Now, that's just one question and answer, but you can think a bit further: by varying exactly what you're asking for, you can get different questions and answers. For example, you could have the language model create a simple question and answer rather than a nuanced one, or you could have it create a question and answer where the answer has its words reversed relative to how they appear in the original text. By adding different requests to the language model, you can see how, from the same snippet of raw text, many different question-and-answer pairs can be created.

Let's make the data expansion a little more concrete. You start with a text document that has about a million words, or let's say a million tokens. You decide what size chunks you're going to split it into, so you have these chunks of your original dataset, and for each chunk, which is maybe 500 tokens, you'll ask a language model to create five questions. So five questions are generated to represent each chunk of raw input data. As I mentioned, to get some variance in those questions and to hit the dataset from different angles, you can make different requests, for example asking for a nuanced question and answer or a simple question and answer. And to go one step further, you can run the same request to the language model at different temperatures, so for every chunk you can get five questions at temperature 0, five questions at temperature 0.5, and five questions at temperature 1.
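To make that expansion idea concrete, here is a minimal sketch (this is not the repo's create_qa.py, which comes later in the video): it chunks the raw text with tiktoken and asks an OpenAI-compatible chat model for question-and-answer pairs at several temperatures. The model name, chunk size, question count and prompt wording are all placeholders.

```python
# Sketch only: expand raw text into Q&A pairs at several temperatures.
# Assumes an OpenAI API key is configured; model name and prompt are illustrative.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text, chunk_tokens=500):
    """Split text into chunks of roughly chunk_tokens tokens."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_tokens])
            for i in range(0, len(tokens), chunk_tokens)]

PROMPT = (
    "Context: {context}\n\nINPUT TEXT:\n{chunk}\n\n"
    "Provide {n} question and answer pairs based on the text above. "
    "Each question must include enough context to stand alone. "
    "Vary the style; include some nuanced questions. "
    "Respond in plain text, one line per question and one line per answer."
)

def generate_qa(chunk, context, n_questions, temperature):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model choice
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT.format(
            context=context, chunk=chunk, n=n_questions)}],
    )
    return resp.choices[0].message.content

raw_text = open("train.txt").read()      # the raw text extracted from the PDF
temperatures = [0.01, 0.5, 1.0]          # the video uses nine values up to 1.2
for chunk in chunk_text(raw_text):
    for t in temperatures:
        print(generate_qa(chunk, "international touch rugby rules", 5, t))
```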
What that does is add further spread to the range of questions you're getting and help to hit things from slightly different angles.

I'm going to go fully through a demonstration of data preparation and then a fine-tuning, and I'm going to do it on the rules of touch rugby, which is a topic that's not particularly well understood by large language models because there isn't a lot of data online about it. I'll compare performance before fine-tuning with performance after fine-tuning on these expanded datasets. I'm going to create the dataset synthetically; there's a link in the slides where you'll be able to see it on Hugging Face. I'll ask the language model to create one question per 60 tokens, so for my 500-token chunks there will be an expectation of about five questions, and I'll generate those five question-and-answer pairs nine times, at temperatures evenly spaced from 0.01 all the way up to 1.2. So I'm expanding my dataset by a factor of nine by repeating the same request nine times at different temperatures, which gives me slightly different questions and answers. The model I'm going to fine-tune is OpenChat 3.5, a 7-billion-parameter model, and at the end I'll show you how its performance compares to some other models, including Mixtral, Solar and the Qwen 14B model.

Before we move to the script, I want to cover a few hyperparameter choices. The first one is choosing the batch size, and here I want to explain a rather subtle point. When you're training a language model you can choose to put in just one row of data, but you can also put multiple rows through the GPU in parallel at the same time. This is efficient because GPUs are very good at parallel processing: they can evaluate the forward pass on multiple rows of data at once, which speeds up training. When the GPU has processed those rows in parallel, you take the loss from each row and those losses get combined before doing a backward pass. But there's some nuance in the difference between doing a backward pass on the result of one row versus aggregating all of the forward passes and then backward-passing the sum of those losses.

Let me make that more concrete. Start with the case where you're using batch size one: you're only putting one row of data in for the forward pass and then the backward pass. So you start off at some point; imagine this point represents where the weights currently are in the model. You start off with the model weights at point 1 and you do a forward pass through the first row of data: you calculate the expected tokens, compare them to the actual next tokens, and calculate a loss. Once you've calculated that loss, you back-propagate it through the network to update your weights, and you end up at point 2. So going through one row of data results in updating the weights once, according to the loss that was calculated using row A of the data. Now you do the same thing again, but starting with the model weights at point 2: still with batch size one, we forward-pass using the second row, row B, and when we calculate the loss on that row we back-propagate, which updates the weights and brings us to point 3, or point 3A as I'm calling it here. So you can see that with batch size one the model's weights are adapting to each row individually: they're
following each row in the training process, and that's how the model weights move if you think of it as movement on a multi-dimensional surface.

Now let's see how that's different when you increase the batch size. At a batch size of two, you calculate the forward pass for two rows of data in parallel, and they're calculated independently: the forward pass on row A on its own would move your language model in one direction, whereas the forward pass on row B, done in parallel, would move it in another. Rather than first forward-passing on A, updating the weights, and then doing row B, what happens instead is that you forward-pass through A and B and sum up the losses of both; that's what this diagonal line is here, the effect of summing the losses from A and B. You then take a step in that combined direction, which brings you to the point I've called 3B, and point 3B is not the same as the point you would reach by moving row by row individually.

So let me say that once more. With a batch size of one, you're moving the language model according to each row: you're exploring the surface one row at a time and updating every time. As you increase your batch size, you're combining the losses from the different rows and then taking a step, so in a sense you're averaging across multiple rows. That averaging is a good thing in that it's less noisy, so you'll see less noise in your training, but the drawback is that you're no longer updating with the very specific information of each row. If one row says something about the colour red and another says something about a hippopotamus, you're moving according to the average update, rather than first taking in the information about the red and then the information about the hippo. This ends up being a bit nuanced, because when you want to memorize, you may actually want to move very specifically according to the update of each individual row, so you may get clearer memorization if you just use a batch size of one and do a more granular update. Whereas if you're training on a very large amount of data, then in order to avoid too much noise from jumping with every single row, you'll start to put multiple rows together in a batch, both for speed and for stability.

To sum all of that up: the benefits of a smaller batch size are more granular learning and fitting, because you're doing this step-wise update, and lower VRAM usage, because you only put a smaller batch on the GPU at a time. The downsides are that your training is going to be slower, and you'll tend to see more overfitting, because you're literally updating the model very specifically for each row of data that you have.
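To make that batch-size point concrete, here is a tiny toy example in PyTorch (nothing to do with the actual training script): with batch size one, the weight moves after every row, so the second update sees the already-moved weight; with batch size two, the two losses are summed at the starting weight and a single step is taken, which lands at a different point.

```python
import torch

# Toy comparison: per-row updates (batch size 1) vs one update on summed losses (batch size 2).
torch.manual_seed(0)
rows = [torch.tensor([2.0]), torch.tensor([-3.0])]
targets = [torch.tensor([1.0]), torch.tensor([0.0])]
lr = 0.1

def loss_fn(w, x, y):
    return ((w * x - y) ** 2).mean()

# Batch size 1: step after every row; the second update starts from the moved weight.
w1 = torch.tensor([1.0], requires_grad=True)
for x, y in zip(rows, targets):
    loss = loss_fn(w1, x, y)
    loss.backward()
    with torch.no_grad():
        w1 -= lr * w1.grad
    w1.grad = None

# Batch size 2: sum both losses at the starting weight, then take one step.
w2 = torch.tensor([1.0], requires_grad=True)
loss = sum(loss_fn(w2, x, y) for x, y in zip(rows, targets))
loss.backward()
with torch.no_grad():
    w2 -= lr * w2.grad

print(w1.item(), w2.item())  # the two end points differ
```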
Moving swiftly on, let's talk about the learning rate. My guidance here is to start at 1e-4, and you can generally increase that value so long as your training and validation losses don't jump around too much. If you see your training loss jumping wildly, or your validation loss jumping wildly (ideally your validation loss should be pretty smooth), that means you need to lower your learning rate. Conversely, if your training loss is extremely smooth, you've probably got some room to increase your learning rate a little.

Next up is choosing the number of epochs. The tip I'm going to give you is for a relatively short training run, where you can afford to run the entire run multiple times and see what the optimal number of epochs is. I usually start by running with a constant learning rate, so my learning rate is flat. If you do this over many epochs, you'll eventually find that your validation loss starts to increase, because your model starts to overfit. I still like to go to that point and beyond, because then I can see the point at which my validation loss starts to increase and note how many epochs it took to get there. For example, maybe my eval loss starts to go up after two epochs, so I'll note down two epochs and then rerun, but this time changing from a constant learning rate to a cosine or linear schedule. What that does is drop the learning rate over the run, so that as we approach the optimum, say a local minimum, we take smaller and smaller steps, which allows us to get a little closer and more precise in hitting the bottom of that local convexity.

Last up as an optimization, I want to emphasize the importance of choosing the right model for your application. This might sound obvious, but you should check the performance of a few different models on the application you're going for with no fine-tuning at all. Before you do anything, no fine-tuning, no data, not even any RAG, see what the performance is on some sample questions. I would ask questions on touch rugby of raw GPT-4, raw Qwen, raw Mixtral, and see how each model performs, and in a lot of cases it's going to be logical to start fine-tuning the model that performs best with no fine-tuning at all. Note that this is not just about the model's strength; it's also about how much of the knowledge the model happens to have been trained on overlaps with the knowledge you want to refine. It may well be that a weaker model happened to be trained on touch rugby and so performs better on the questions, and that might be easier to train than the strongest model that knows absolutely nothing about touch rugby.

Okay, we're ready now to move to the data preparation step. I'm going to be preparing my synthetic question-and-answer dataset, and I'll be doing so with the help of the advanced fine-tuning repository. You can purchase lifetime access to this on Trelis.com, and I'm also making available for purchase just the scripts from the dataset preparation and fine-tuning video I'm making today. So you can either buy only the scripts from today, or you can get lifetime access to the repo, which includes scripts for DPO, Mamba, now memorization, chat fine-tuning, embeddings, function calling, long-context fine-tuning, quantization, and supervised and unsupervised fine-tuning. We'll be working out of the memorization branch here; let me open it up. This branch has a number of scripts that let you prepare the expanded synthetic dataset, and it contains the fine-tuning scripts, which I'll also go through: there's the main memorization script, and I've also uploaded a copy that supports quantization in case you want to fine-tune a Mixtral model. As usual, you'll want to git clone the repository, and you'll want to start by creating a data folder with some kind of a PDF. Here is the PDF I'll be working with; it contains the rules of touch rugby.

According to the README, the first step is to convert that PDF from PDF format into raw text. After activating the virtual environment and doing a pip install of the requirements, I run python pdf_to_text.py, and yes, we've converted the training PDF to text: here we have the raw train.txt, and it's literally just the raw text pulled out of that PDF.
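The PDF-to-text script itself isn't shown in detail; a minimal equivalent using pypdf (an assumption, the repo may use a different library and different file paths) would be roughly:

```python
# Minimal PDF-to-text sketch (the repo's script may differ); assumes pypdf is installed.
from pypdf import PdfReader

reader = PdfReader("data/train.pdf")                      # path is a placeholder
pages = [page.extract_text() or "" for page in reader.pages]
with open("data/train.txt", "w") as f:
    f.write("\n".join(pages))
```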
Next up, we're going to convert that raw text into chunks, and with those chunks we'll ask for questions; the generated questions then go into a training file. So my goal now is basically to create this file: it's a very large file of questions and answers, and you can see that every second line is a question, followed by an answer. What the script does is chunk the raw text and ask the language model to return questions.

Let's take a quick look at how that works by looking at create_qa.py; within it, I want to show you what the prompt looks like. Down in the script there's a snippet which is my prompt, so I'll copy it over here and we can read it together. It goes as follows: "Provide [questions per chunk] question and answer pairs based on the text above" (so that would be, I believe, five question-and-answer pairs, with the chunk of text injected above). "The questions should include sufficient information for the answer without the user having any further context." What you don't want is a question that just says "what is the rule about the 5-metre line?"; that's a bad question, because there's no context for whether it relates to touch rugby or some other sport, so you need the language model to make sure the questions carry the context within them. "The answers need not necessarily borrow verbatim from the input text, but they should maintain the meaning. Vary the style and format of the questions; include some tricky and nuanced questions. In certain answers, reverse the order of words compared to how they appear in the input text. Respond in plain text, on a new line for each question and answer. Do not include question numbers." And then there's an example of a question-answer pair. This is valuable: as always, I'm doing one-shot prompting, putting in an example of the format in which I want the question and answer returned to me.

The idea now is to run that on my raw input data, and I'll show you a wrapper around this create-QA script. The wrapper allows me to repeat the script at different temperatures; you can see it will rerun the script at temperatures between 0.01 and 1.2. So I'll run it by typing python create_qa.py, and now it asks me to enter one sentence providing the context of the dataset I'm training on. This gets injected at the start of every snippet of text so that the language model knows the context; for context I'm just going to put "international touch rugby rules". Next I'm asked for the number of times to iterate over QA generation: this is the number of different temperatures I want to run at, evenly spaced between 0 and 1.2. If you go much higher than 1.2, the answers become so chaotic that you just see instability in the responses and you get bad synthetic data. I'm going to go for one here, which will put the temperature right in the middle of that 0-to-1.2 range, but if you do nine it will be much better, and you'll see later that more data is needed to get good memorization.
So I've just put in one, and now it asks whether to process one chunk or all chunks. The reason for this is that you'll want to test quickly and cheaply on the first chunk, see if it's working, and only then run the full dataset. Here I'm just going to type one, because I don't want to waste too much time in the video. The last question is whether to run with OpenAI or RunPod; you can set up a RunPod endpoint (check out the one-click-llms repo for one-click templates), but I will say you need a strong enough model, probably Mixtral at least, maybe Llama 70B. Here I'm going to use OpenAI.

Because I've selected one chunk, it's going to print everything out, which is nice because we can talk through what it's saying. Right up here it tells me it's setting eight questions per 500-token chunk, so that's the number of questions per chunk. The total tokens across all chunks, although we're only processing one, is about 10,000, so this is a text of roughly 10,000 tokens, and here's the cost: with GPT-4 it's 84 cents, and with GPT-3.5 Turbo it's about 2 cents. Here's what the prompt looks like: we have the input text, and note how I wrap it in INPUT TEXT tags; that really helps the model understand very specifically what "input text" refers to. The context is "international touch rugby rules", then I literally have that 512-token chunk of text wrapped within the INPUT TEXT tags as well, and then the prompt: "provide eight question answer pairs based on the text above", et cetera, as we already went through. Here's an example of a question-answer pair; actually I give two pairs here, so you can see two pairs of questions and answers with the right formatting, again wrapped as an example to help the language model understand exactly what I'm referring to.
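As a rough sketch of where those chunk, question and cost figures come from (per-token prices are deliberately left as inputs rather than hard-coded, since they change; the output-size guess is an assumption):

```python
# Rough size/cost estimator for the synthetic Q&A generation (sketch only).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = open("train.txt").read()

chunk_tokens = 500
tokens_per_question = 60
total_tokens = len(enc.encode(text))
n_chunks = -(-total_tokens // chunk_tokens)                    # ceiling division
questions_per_chunk = max(1, chunk_tokens // tokens_per_question)  # 500 // 60 = 8

def estimate_cost(price_per_1k_input, price_per_1k_output, n_temperatures=9):
    # Very rough: every chunk is sent once per temperature; output size is
    # guessed at half the chunk length. Plug in current prices for your model.
    input_tokens = total_tokens * n_temperatures
    output_tokens = (chunk_tokens // 2) * n_chunks * n_temperatures
    return (input_tokens / 1000) * price_per_1k_input \
        + (output_tokens / 1000) * price_per_1k_output

print(total_tokens, "tokens,", n_chunks, "chunks,", questions_per_chunk, "questions per chunk")
```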
So that script has been run; we've just run with a single temperature, so I can now check out both train.txt and test.txt. In train.txt I have a series of eight questions and answers, which you can see here: "which edition of touch football rules are presented by Touch Football Australia", et cetera, and you can see it's well formatted. You can also see that we have a test dataset; I think the ratio is roughly 10 to 1, so very roughly there'll be one piece of data in the test set, which we'll use for validation, for every ten in the training set.

The next step, once you have all of these questions and answers (and you'll have run it, say, nine times at different temperatures and on the full dataset), is to get them into CSV format. I actually want a very specific format, a conversational format: an array of messages, where the questions are set as the user content and the assistant content is the answer. Furthermore, the questions are actually quite short relative to the context length of the language model; each pair is probably only around 100 tokens. So what we can do is combine five questions in series within the same conversation: you'll see that in a conversation we have the user, then the assistant, and after that assistant we again have the user, because I'm putting five question-and-answer pairs into the same conversation. This just speeds up training, because we get a longer context length within each row of data. To get from the train.txt set of questions into CSV format, I run python qa_to_csv.py, which is pretty quick. Once it's in CSV you'll see, for example, that because we only created one test data point, there's just one row of data in the test file, and it only has one user and one assistant turn; whereas the train set has two rows of data, because we have eight original questions and answers: five in the first row, and the second row is shorter because it only has three.

The last step, once we have the data in that format, is to push it to Hugging Face: python push_to_hf.py, where I put in a Hugging Face authentication token and the repo name, and that pushes the data up to the Hugging Face Hub. And here we are on Hugging Face, on the Trelis repository for touch rugby rules memorization: you can see we have a training split and a test split, about 5:1 in ratio here actually, and the training split has 303 rows. Here's a sample row, all set up as messages, and the beauty of having the data set up as messages is that we can just use the tokenizer's apply_chat_template function, which formats our prompts exactly as they need to go into training and inference. So it's very nice if you can format your data as messages with alternating user and assistant turns; it makes the training script we'll see later a lot easier.
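The qa_to_csv.py conversion described above isn't reproduced in the video; a sketch of the same idea, alternating question/answer lines grouped five pairs per conversation into a "messages" column, might look like this (file names and the grouping size come from the description; everything else is an assumption):

```python
# Sketch: turn alternating Q/A lines into a "messages" column, five pairs per row.
import csv
import json

def qa_lines_to_rows(path, pairs_per_row=5):
    lines = [line.strip() for line in open(path) if line.strip()]
    pairs = list(zip(lines[0::2], lines[1::2]))        # (question, answer) pairs
    rows = []
    for i in range(0, len(pairs), pairs_per_row):
        messages = []
        for question, answer in pairs[i:i + pairs_per_row]:
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
        rows.append(messages)
    return rows

for split in ["train", "test"]:
    with open(f"{split}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["messages"])
        for messages in qa_lines_to_rows(f"{split}.txt"):
            writer.writerow([json.dumps(messages)])
```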
Okay, with that we're ready to move to the fine-tuning itself. To get started, I'm going to open up a one-click template on RunPod, which lets me start a CUDA 12.1 instance. I'm going to start up an A6000; let me select one here with 48 GB of VRAM. That's sufficient to fully fine-tune something like OpenChat or Mistral; if you want to fine-tune Mixtral you'd have to use quantization, which is possible, and there's a script in the advanced fine-tuning repo for that. Of course you could use larger or more parallel GPUs too if you want to fine-tune larger models. Everything should be set up fine here: 50 GB of disk is enough space, since OpenChat itself is about 15 GB and you want room to save another copy for the fine-tuned model, so 50 is plenty, two times 15 is 30, which leaves plenty of headroom. We'll continue, deploy that RunPod instance and open up a Jupyter notebook.

Here I am in Jupyter Lab, and I've got the notebook open for memorization via chat fine-tuning. First off, if you're going to run this in Google Colab using the free T4 GPU, you won't be able to use flash attention, so you'll have to comment it out when we load the model below. Furthermore, anywhere you see bf16 (brain float 16), you'll need to replace it with fp16, because bf16 is only supported by newer GPUs. That's why I've run on an A6000: it has the Ampere architecture, which lets us run with flash attention. The first thing to do is the installation, so typically I'll run this cell here; all of the versions have been frozen, so breaking changes as these packages get updated shouldn't create any issues. Once I've run the installation, I like to restart the kernel (Kernel, then Shut Down All Kernels) and then pick up and start running the cells from right below.

Within this script it's possible to run with Unsloth to accelerate fine-tuning by about 2x. That works with Llama- or Mistral-style models; OpenChat is a Mistral-based model, so it will work. If you're using other model architectures, like say DeepSeek, you should probably stick with the baseline fine-tuning approach, which is what I'm going to show here. I'll typically log in to Hugging Face so I can push my fine-tuned model at the end, then I'll enable this environment variable, which allows high-speed downloading from and uploading to the Hugging Face Hub. Next I set the model, which is OpenChat 3.5, and move on to loading. When loading the model, you can see I have the quantization configuration commented out, because I'm going to fine-tune in 16-bit precision. I've also set the torch dtype to bfloat16, because that gives better quality than float16, which is what you'd have to use on an older GPU like the T4 in Colab. You'll also see flash attention is enabled, because I'm using a newer GPU. The tokenizer is then loaded here, and next I like to run a quick check that all of my parameters are on the GPU and not on the meta device, because I want to train entirely on the GPU.
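The loading step just described, roughly sketched (the exact repo id and keyword arguments depend on your transformers version, so treat this as illustrative rather than the notebook cell verbatim):

```python
# Sketch of the model/tokenizer loading described above (not the exact notebook cell).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openchat/openchat-3.5-0106"   # assumed repo id; check the exact one you want

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # use torch.float16 on older GPUs (e.g. Colab T4)
    attn_implementation="flash_attention_2",  # comment out if flash-attn isn't available
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quick check that no parameters were left off the GPU (e.g. on the meta device).
print({p.device for p in model.parameters()})
```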
Next up we're going to set up LoRA for the fine-tuning. We're not going to fine-tune all the parameters; instead we create some smaller adapter matrices and train those while the base model stays frozen, and at the end we merge the product of those LoRA matrices back onto the original model to create the fine-tuned model. Here I'm just setting up a function so I can see the trainable parameters, and I'll often recommend running this cell to print the model: it shows you a list of modules and lets you pick which ones you want to fine-tune. As is typical, we'll fine-tune the attention modules, and we'll also fine-tune the linear layers; note that if you're fine-tuning Mixtral you have to comment the linear layers out, because they're sparse and it gets messy if you try to fine-tune them. One other note: if you're going to use a quantized model, you do need to prepare the model for k-bit training, so you need to run this block of code here. Just above it you can see gradient checkpointing, which you can enable to reduce VRAM during training; it keeps checkpoints, which I think adds a little computation but saves on VRAM. Then we apply the LoRA, creating those small trainable matrices alongside the main model, and we're ready to move on and set up the tokenizer.

I like to print the tokenizer to inspect it: check the vocab size and the beginning- and end-of-sequence tokens. In this case I also like to apply the chat template, just to see which chat template is set up in the tokenizer and make sure there is one; this is the format that OpenChat uses, so I can see how my prompts will be prepared for training and inference. Next I set the pad token. My preference is to use a dedicated pad token if one already exists in the tokenizer, but since there isn't one in this case, I'm setting it to the unk token; you can see down here we're using the unknown token for padding. Then I print out a summary of the special tokens, including the pad token, and the number of tokens in the tokenizer, which is unchanged because I haven't added any extra pad tokens. The next step here would usually be to set the embedding and norm layers as trainable, but that's really only needed if you're changing the chat template or extending the context length, so I didn't run those cells; I don't think it's necessary, and we can just train with LoRA.

Now we set up evaluation. This part requires a little human ingenuity, and that's around setting up questions: it's really important to have some questions you're confident indicate the quality of the fine-tuning. I've set up some manual questions about touch rugby here and provided answers. You can use these questions and answers to evaluate raw models with no fine-tuning, which tells you whether they have any insight into touch rugby already, and then you'll use them after the fine-tuning to determine whether it has been effective. So before doing the fine-tuning, we run through all of these questions just to see the performance. I won't go through all eleven of them, but you can see, for example, that OpenChat does not know a lot about touch rugby: asked how many players are on the field in each team, it says there are twelve, and the answer is six. OpenChat, I believe, gets about one out of eleven questions correct, so it does not know a whole lot about touch rugby, but it's a model that ends up tuning very well, as we'll see in the results later on.

Next we load the dataset; it's the one I showed on Hugging Face. We can print out a few samples if you wish, but there's really very little handling to do, because it's already in the messages format. We have it set up as an array of messages, and that allows us to format the text very easily using the SFT trainer: all we have to do is pass the data into the SFT trainer, with no other formatting, and that makes things very easy.
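A sketch of loading that dataset and checking how the chat template renders a row (the dataset and model repo ids below are placeholders; substitute the actual ones):

```python
# Sketch: load the pushed dataset and preview how the chat template formats a row.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("Trelis/touch-rugby-rules-memorisation")      # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-0106")  # assumed model id

print(dataset)  # expect a train split and a smaller test split

example = dataset["train"][0]["messages"]  # list of {"role": ..., "content": ...} dicts
print(tokenizer.apply_chat_template(example, tokenize=False))
```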
So let's get to the trainer setup. I'm going to train for one epoch; I did play around with two, but one was about what I needed to get the best results. The context length is 512, and I've checked that when I have five Q&As per row they're never longer than 512 tokens. I'm going to use a gradient accumulation of one and a batch size of one. As we discussed earlier, a batch size of one gives very granular updates and hopefully helps with memorization. Gradient accumulation is a way to virtually increase the batch size: even if the GPU can't fit parallel batches, we take one batch, compute the loss, and then instead of back-propagating we run another batch, add the losses together, and only then back-propagate. Typically, increasing gradient accumulation has pretty much the same effect on quality, whether good or bad, as increasing the batch size, but changing the gradient accumulation does not change the amount of VRAM you need. It does change the training time, because accumulating batches means doing backward passes less frequently, which saves a little time during training. In any case, I want the most granularity, so I've set both of these to one.

I've got a custom callback here which is just helping me do a little bit of logging; nothing major happening there. Then we're on to the trainer itself: you can see I'm passing in the dataset, and the key field is the messages field you can see here; those messages will have the tokenizer's chat template applied to them automatically. I'm running for one epoch, which I set earlier as a value of one, and there's not a whole lot else to show you. Oh yes, I did initially run with a constant learning rate scheduler, the same learning rate throughout, and I figured out that one epoch was about the right amount for the validation loss to start rising, so then I reran it using cosine, and that gave me a very nice training curve that asymptotes down towards a low value of validation loss. As you can see, my learning rate is 1e-4; I didn't end up needing to tweak that much at all. Down here there's a parameter to add noise during fine-tuning; in certain fine-tunes this can improve performance and reduce overfitting. I'll talk about it in the ablations towards the end of the video; long story short, I didn't find that it improved or hurt the training performance here.
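Pulling the settings just described into one place, here is a rough sketch of the LoRA and SFT trainer setup. Argument names vary across trl and peft versions, and the LoRA rank/alpha values and target module names are illustrative assumptions rather than the notebook's exact values; the model and dataset are the ones loaded in the earlier sketches.

```python
# Sketch of the LoRA + SFT trainer setup described above; not the notebook verbatim.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,            # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention + linear layers
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="openchat-touch-rugby",
    num_train_epochs=1,
    per_device_train_batch_size=1,        # granular, row-by-row updates
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",           # constant first to find the epoch count, then cosine
    max_seq_length=512,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,                          # model loaded in the earlier sketch
    args=training_args,
    train_dataset=dataset["train"],       # rows with a "messages" column
    eval_dataset=dataset["test"],
    peft_config=lora_config,
)
trainer.train()
```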
With the training running, I had the validation loss falling and the training loss falling, so everything looked good. It's easier to see if we visualize that as a graph, which we can do here: you can see my training loss falls fairly smoothly, and so does the evaluation loss.

Once trained, we can look at the evaluation after training. Let's look at that same question, how many players are on the field for each team in touch rugby, and the answer is that each team has six, so it's getting the answer correct. Let's check another one: how many substitutions are allowed, and the answer is there's no limit to the number of substitutions allowed during a game. Of these eleven questions, the model now gets eight out of eleven correct, which is a very big increase from one out of eleven.

Now let's look at one question it gets wrong. In touch rugby, does a forward pass result in a roll ball, a scrum, or something else? It says a forward pass results in a roll ball, which is incorrect; it's actually a penalty. To highlight why this question is so difficult: if we go back to the raw training data, which is just the text extracted from the PDF, and search for "forward pass", that combination of words never actually appears. The term "forward pass" is commonly known by people who play rugby or touch rugby, but it's not strictly in the raw data, and because it's not there, it's also not in the synthetic dataset, which makes it pretty hard for the model to pick up. There's a related question down here, about what happens if a player makes a pass after a touch has been made; not necessarily related, but it's a tricky question. It's about what we'd call a touch-pass: if you pass the ball after being touched, you lose possession and it's a penalty to the other team. Here, though, the model says the pass is considered dead and the touch counts, and again this is just difficult because it's not described in the same terms as in the original rules dataset. So that's the summary for this OpenChat model: we've brought the performance from one out of eleven up to eight out of eleven.

Let's now look at some ablations, that is, changes to the hyperparameters and models, and see how they affected performance. To ground ourselves, let's recap the training I just showed you: this was synthetic data with one question per 60 tokens, and I expanded the dataset by running the same request to the language model at nine different temperatures, so it's not really one question per 60 tokens, it's nine questions per 60 because of that expansion, and the base run was on OpenChat 3.5. Given that OpenChat scored eight out of eleven, what kinds of changes did I try, and what effects did I see?

The first ablation was running with just a 1x expansion: in other words, I only created the dataset at a single temperature, close to zero, and in that case I only got four out of eleven instead of eight out of eleven. Why is that? Well, by creating datasets at different temperatures I obviously had far more data, and data coming from different directions, so I think the quantity of data is quite important here: more data, especially good data, helps performance. The next ablation was, instead of running with a batch size of one and gradient accumulation of one, running with a batch size of four and gradient accumulation of eight. That's a very big increase, to a virtual batch size of 32, which means I'm averaging the losses from 32 rows of training data and then back-propagating on that basis. And indeed, while it reduces overfitting, it very strongly hurt the memorization performance here, so it does seem that the smaller, more granular updates give better results. Maybe a batch size of two with a smaller gradient accumulation would have done just as well, but what this shows is that a very large virtual batch size really hurts performance in this case. Last of all, as I mentioned, I tried adding a little bit of noise to the embeddings, and I got the same performance, so I don't see a big effect or benefit there.
The other ablations I looked at were different models. Instead of starting with OpenChat (which actually had two out of eleven, I think I incorrectly said one earlier, so fine-tuning moved it from two up to eight), I took a look at the Mixtral model. Mixtral started from a stronger base point: it was getting four questions correct out of eleven with no fine-tuning and no context provided, and after fine-tuning it got up to nine. So you can see the benefit of using a somewhat stronger model, and by stronger I mean not just stronger in terms of parameters but also, it seems, stronger in terms of the data it was trained on, at least from a touch rugby standpoint. I also ran Qwen, which was very weak: it didn't get any questions correct as an un-fine-tuned model, and it only got up as far as three when fine-tuned. And last, Solar 10.7B, which is actually a Llama-architecture model with some extra layers, where the weights are copied from Mistral: it combines roughly the bottom two-thirds of a Mistral model with the top two-thirds, so the middle layers are effectively duplicated, and then further fine-tuning is done on top. It's a very interesting model that performs well for its size. In any case, Solar 10.7B did not perform very well as a baseline without fine-tuning, and it improved significantly with fine-tuning, up to a score of six out of eleven.

As a final point, I want to show some benchmarking comparing how OpenChat does relative to GPT-3.5 and GPT-4. I did not fine-tune GPT-3.5, and I don't think it's possible right now to fine-tune GPT-4, but what I did was compare the answers of the GPT models with no data whatsoever, just asking the questions. Indeed, both of the OpenAI models know something about touch rugby: GPT-3.5 scored six out of eleven with no fine-tuning, and GPT-4 scored seven out of eleven. Then, when I put the 10,000 tokens of context into the prompt, which is the rule book itself, not the synthetic data about the rule book, I got ten out of eleven correct with GPT-3.5 and all eleven out of eleven correct with GPT-4. The one question GPT-3.5 got wrong is one where I asked how far back the defenders need to stand after a penalty, and whether that distance is different from after a touch; the correct answer is ten metres for a penalty and seven metres after a touch. GPT-4 gets that correct, but GPT-3.5, even when given the full rule book in context, still thinks you have to back up seven metres in both cases, which is incorrect. But broadly speaking, I think it's clear that these OpenAI models simply know more about touch rugby, so even if you were to fine-tune them, you'd expect them to do better, just on the basis that they're starting from a very strong starting point.
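All of the scores above, before and after fine-tuning and for the GPT baselines, come from manually grading answers to the same small set of hand-written questions. A minimal harness for generating the model's answers to grade by hand might look like this (sketch only; the questions shown are illustrative paraphrases of two of the eleven used in the video, and model/tokenizer are the ones loaded earlier):

```python
# Sketch: generate answers to the hand-written evaluation questions for manual grading.
import torch

eval_questions = [
    "How many players from each team are on the field in international touch rugby?",
    "In international touch rugby, how far back must the defending team stand at a penalty?",
]  # illustrative; the notebook uses eleven such questions with reference answers

model.eval()
for question in eval_questions:
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(inputs, max_new_tokens=100)
    answer = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    print(question, "->", answer)
```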
And that wraps up this video on fine-tuning for memorization. Remember: check how a few different models perform on the task you're interested in before doing any fine-tuning, and see if you can get a head start by using a model that already knows a little about your domain. Second, make sure your dataset covers the questions and answers, the knowledge you need, from different standpoints. Maybe you have some very rich raw documents that are complementary and give you different perspectives, but often you just have one rule book that is written in a very concise way and doesn't repeat the information, especially not from different perspectives. In that case you'll probably want to generate some Q&A, and do it in a way that gives you those different vantage points. As usual, let me know any questions below in the comments. Cheers, folks.
Info
Channel: Trelis Research
Views: 3,931
Keywords: fine-tuning custom dataset, fine-tuning llm custom data, fine-tuning for memorization, fine-tune memorization, fine-tuning memorize custom dataset, fine-tune memorize data, fine-tuning memory, fine-tuning for detail, fine-tuning for knowledge, fine-tuning for specific knowledge, openchat fine-tuning, fine-tuning scripts, custom dataset fine-tuning scripts, how to prepare dataset for fine tuning, ai fine tuning, llm fine tuning tutorial
Id: _GkHZQYFOGM
Length: 46min 51sec (2811 seconds)
Published: Mon Mar 04 2024