Prepare Fine-tuning Datasets with Open Source LLMs

Captions
A very effective way to prepare fine-tuning datasets is to use a language model. In my previous videos on fine-tuning, I showed how to use OpenAI to clean datasets and to put them into a question-and-answer format that is very effective for fine-tuning chat models. The problem with using OpenAI to clean or prepare datasets is that you're not allowed to use it if you're going to prepare or train a commercial model. So in this video I'll show you how to do all of the data cleaning and preparation using open-source models, in particular Llama 2.

I'll split the video into a few parts. First, I'll go through the prompt preparation to generate questions and answers from an input dataset. Then I'll show you how to set up a Llama 2 70B server on RunPod. Lastly, I'll run some automated scripts that convert a dataset into a question-and-answer set using that server. As usual, I'll finish with some pro tips.

Later in this video I'll be making use of the llama-fine-tuning private repository on GitHub. This repository has a lot of helpful code for supervised fine-tuning, unsupervised fine-tuning, and fine-tuning for function calling, and I'll be including the scripts that let you use Llama 2 to prepare datasets rather than using OpenAI as in my other videos. Hopefully I'll give you enough detail, even if you don't purchase the repo, to build everything up yourself.

In this video, as in the fine-tuning tutorials, I'll be using a raw dataset that is a set of rules for touch rugby. Llama is not familiar with the rules of touch rugby because it's a somewhat obscure sport, so it provides a nice test for fine-tuning a model like Llama 2. My goal is to take a raw dataset, which is the text conversion of a PDF of the rules, and convert it into a question-and-answer dataset just like this: here I have a question, "In the context of touch rugby, what's the dead-ball line?", and here is an answer. To repeat: I want to convert this raw dataset into a question-and-answer dataset that I can then use for supervised fine-tuning of a chat model. Previously I showed how this can be done with OpenAI, but today we want to do it using Llama 2, the 70B model. I'm first going to show you how to do that in chat format, and then I'll do it in an automated way using scripts, after I've set up a server on RunPod.

Here, just for demonstration purposes, I've got Chat UI running and I'm connected to a text-generation-inference instance that's running Llama 2 70B. Again, I'm trying to convert some raw text into Q&A, and here's the format I'm going to use: I'll take a chunk of text (the input text), then I'll provide some context for what this text is about, then I'll provide my request to convert it into Q&A, and lastly I'll give an example, so that the LLM has something to work with and we improve the accuracy of the response.
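To make that structure concrete, here is a minimal sketch of how the four pieces (raw chunk, context, request, example) could be assembled into a single prompt. The function name and exact wording below are illustrative paraphrases of what's said in the video, not the repo's actual create_qa strings.

```python
# Illustrative sketch of the four-part prompt: raw chunk, context, request, example.
# The wording paraphrases the video; the repo's actual strings may differ.
def build_qa_prompt(chunk: str) -> str:
    context = (
        "The text above is an excerpt from the International Playing Rules "
        "for the sport of touch rugby (touch football)."
    )
    request = (
        "Provide 5 question and answer pairs based on the text above. "
        "Each question must begin with 'In the context of touch rugby, ...'. "
        "Answers should borrow verbatim from the text above. "
        "In providing each question, consider that the reader does not see or "
        "have access to any of the other questions for context. "
        "Vary the style and format of the questions. "
        "Respond in plain text on a new line for each question and answer. "
        "Do not include question numbers."
    )
    example = (
        "In the context of touch rugby, what does 'the half' refer to?\n"
        "The half refers to the player who takes possession following a rollball.\n"
        "In the context of touch rugby, what is the purpose of the playing rules?\n"
        "The purpose is to provide a standardised set of rules for the sport of touch football."
    )
    return (
        f"{chunk}\n\n{context}\n\n{request}\n\n"
        f"Here is an example of two question answer pairs:\n{example}"
    )
```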
So let's start with the input text. I'll go to the raw training data, take a chunk of the text, and copy it in as the input text. Note that I'm not doing any cleanup; I'm just putting it in raw. Next I'm going to put in the context. Here I'm making use of a script I'll show later on, create_qa, and I'll just search for "context", which is up at the top of the file. I'll copy that snippet over. This is important because it frames the context for the questions being generated: we don't want to ask a question where it's unclear whether it refers to soccer or touch rugby, so this is how we set the context.

Next, we want to put in the request to generate a Q&A set, so I'll again copy the request from my script. It's quite long because I have very clear instructions for what I want. Let me read it out: provide five question and answer pairs based on the text above (the raw text); the questions must begin with "In the context of ..." (this is how I inject context into the question); the answers should borrow verbatim from the text above (I don't want any hallucination); in providing each question, consider that the reader does not see or have access to any of the other questions for context (again, I'm reinforcing context); vary the style and format of the questions, which improves training; respond in plain text on a new line for each question and answer; do not include question numbers; here's an example of two question answer pairs.

Last of all, I have to put in that sample response, which again I'll copy from my script. Here's the sample: "In the context of touch rugby and the International Playing Rules 2020, what does 'the half' refer to?" (I'll format this just for demonstration purposes) "The half refers to the player who takes possession following a rollball." And the next question: "What's the purpose of the playing rules?" "The purpose is to provide a standardised set of rules for the sport of touch football."

OK, so now we have the full request prepared: the raw text, the context, the request, and the sample, so this should lead to a good response. Let's hope for the best. I'm running on Llama 2, the 70B model, and I'm putting in just one request; later we'll send in multiple in parallel. Now we can see that raw text being converted into question and answer pairs. You'll notice there is an initial little piece of text that's undesirable (we just want an immediate response with the Q&A); that is handled in the scripts. But you can see it's working quite well: we have some questions and some answers, and we have one, two, three, four, five, so it's performing as we expect. The idea in a larger script is to repeat this for every chunk of the text. There are maybe around ten thousand tokens' worth of text in the touch rugby rules, so we just repeat this prompt in order to generate a full dataset.

Before I show you how to run all of this in script format, we need to get our Llama 2 70B server up and running. We're going to use RunPod for that. If you'd like to support the channel, you can do so by creating an account using the affiliate link below, or you can just head over to runpod.io and sign up or log in. Once you're logged in, head over to Secure Cloud. What I like to do here is use two RTX A6000s; I think this is a good compromise of speed and price. I'll click deploy, search for Trellis Research, and select the Llama 70B template. This is a pretty low-click setup: all I have to do now is click continue and click deploy, and the server will be up and running. Once it is, you'll want to grab the RunPod ID (this is the pod that's running the 70B model), take it back over to the script, and put it into your .env file. I've created a sample.env; you'll need to rename it to .env and paste in your RunPod ID.
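As a rough sketch, the .env entry and the endpoint built from it might look like the following. The key name RUNPOD_POD_ID and the port are assumptions; check sample.env in the repo and your pod's exposed ports for the exact values.

```python
# Minimal sketch: read the pod ID from .env and build the inference endpoint.
# The key name and port are assumptions, not the repo's confirmed values.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # expects a line like RUNPOD_POD_ID=abc123xyz in .env
pod_id = os.environ["RUNPOD_POD_ID"]

# RunPod exposes pods through a proxy URL; text-generation-inference serves /generate.
api_url = f"https://{pod_id}-8080.proxy.runpod.net/generate"
```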
All right. Now that I've shown you in chat format how to convert from a raw dataset to Q&A, I'm going to do the same using a script that calls the API. Keep in mind that I have my RunPod API up and running at this point, so I'm ready to make calls. I call the script by running python create_qa.py, and here I'm told that six questions will be generated for every 166 tokens; if I want more granularity for training, I can adjust that in the script. The API I'm going to use is RunPod, and I'm just going to process one chunk. This is a feature so that you can test the script without having to run on the full dataset. When you run on one chunk, the full chunk is printed to the console so you can check it for accuracy. What I'm expecting is for the first chunk of this raw training dataset to be converted into question-and-answer format, and that should appear fairly promptly. And so we have the questions appearing as expected, which is perfect, because this clean question-answer format is what we'll run another script on to convert it into a dataset that looks just like this train.csv: a prompt-completion dataset that has the question as the prompt and the answer as the completion.

So that's running on one chunk. If we now run on all of the chunks, there's a little nuance I want to mention; I'll just type in "all". When we run on all the chunks, the script obviously has to repeat the same operation for each chunk of the text to generate a longer Q&A dataset. What we can do to speed that up is make multiple concurrent requests, and those requests are then batched together by the API automatically. I'm actually running eight requests in parallel, so I'm multiplying my throughput significantly. It's not quite an 8x throughput increase, because the API responds slightly more slowly when you send through a larger batch, but it's much better than a linear increase in time; in fact, the total response time for a batch of eight might be about double the time for a batch of one. We'll give that a moment to go through the full text, since it has to run multiple batches of eight, and then come back to see the answers.

Here we have a full Q&A set generated for our input. You can see we have probably around 30 questions for the raw training data I put in (this is a subset, probably only about a third, of the entire rules), but everything is formatted as expected and in perfect shape, so we can run python qa_to_csv.py, and that script simply takes these Q&A pairs and converts them into a training dataset. You can see we're actually getting about 46 questions coming out, all nicely formatted as a question followed by an answer, which is exactly what we want if we're going to fine-tune a chat model.

I've shown you today how to use an open-source model like Llama 2 70B to convert a raw dataset into questions and answers for supervised fine-tuning. So here come the pro tips. As I mentioned, parallelization of requests is valuable for increasing throughput: rather than running one request at a time, you can ping your RunPod Llama 2 server many times in parallel so that you get faster completion of your Q&A dataset. I've found I can get very similar speed to what I'd get with OpenAI, at least on the default rate limits for OpenAI servers.
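Here is a minimal sketch of that parallelization idea (not the repo's actual create_qa.py), assuming the api_url and build_qa_prompt helpers from the earlier sketches and a standard text-generation-inference /generate endpoint:

```python
# Send several chunks to the server concurrently; TGI batches them on the GPU.
# Assumes api_url and build_qa_prompt from the sketches above.
from concurrent.futures import ThreadPoolExecutor

import requests


def generate_qa(chunk: str) -> str:
    payload = {
        "inputs": build_qa_prompt(chunk),
        "parameters": {"max_new_tokens": 512, "temperature": 0.7},
    }
    resp = requests.post(api_url, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["generated_text"]


def generate_all(chunks: list[str], workers: int = 8) -> list[str]:
    # Eight requests in flight at once: each batch takes roughly twice as long
    # as a single request, but processes eight chunks, so throughput is much higher.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(generate_qa, chunks))
```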
A more nuanced point I want to mention is the size of the chunks you send to your RunPod server. This is basically a question of VRAM. When you're running your pod, which in my case has two GPUs, you can keep an eye on the GPU memory used. Most of that memory is attributable to the base model when it's loaded, but it also increases as you send in a larger batch size or a longer context. So if you send samples with longer context, or more samples within each batch, that takes up more of your VRAM. There's a trade-off: if you increase your batch size to get faster throughput, or if you increase your context size, you will come up against VRAM limits, which means you'll either have to use more GPUs (four or eight instead of two) or simply hold the batch size back to something like eight and the context length back to about 500. It is nice if you can use a longer context length, because it means the language model can take a larger chunk into account when generating questions, but for a lot of texts, chunks of 250 or 500 tokens are going to be fine.

Last of all, some notes on licensing. The whole idea of this video is that it gets you away from the licensing limits of using OpenAI for preparing datasets; remember, you can't use OpenAI to do any training of models that will compete with OpenAI's models. Llama gets around that to some extent, but Llama itself only allows you to use Llama outputs to train Llama models. If you wanted further flexibility, I think you'd probably want to use an MPT model. You can also consider Falcon, although I've noticed that some of the training datasets used for Falcon 40B and the largest Falcon 180B have some limitations for commercial use as well. So my recommendation: if you're training Llama models, which a lot of you probably are, you can of course use Llama to prepare and clean datasets for Llama; otherwise, if you want total flexibility, you might want to consider MPT.
Info
Channel: Trelis Research
Views: 6,835
Keywords: llama fine-tuning datset, llama 2 fine-tuning, llama-2 fine-tuning, llama 2 supervised fine-tuning, falcon fine-tuning, fine-tuning data, fine tuning dataset, fine-tuning dataset preparation, prepare fine-tuning datasets, open source llm, open source fine-tuning, llama 2 fine tuning tutorial
Id: JJ5mcdEIbj8
Length: 15min 22sec (922 seconds)
Published: Mon Sep 25 2023