How to Create Custom Datasets To Train Llama-2

Video Statistics and Information

Captions
How to create a dataset for fine-tuning a Llama 2 model: that's what we're going to be talking about in this video. In a previous video I showed you the single line of code with which you can fine-tune a Llama 2 model, but there we used a pre-existing dataset. In this video I want to show you how to create your own dataset and then how to format it to fine-tune a Llama 2 7-billion-parameter model.

The first thing we need to think about is what this dataset is going to do. In a previous video we explored how to use ChatGPT as a prompt generator for Midjourney: I was giving a huge prompt template to ChatGPT and it was supposed to generate prompts based on that template. In this video we are instead going to create a dataset where we provide a concept and the model is supposed to generate a prompt based on that concept, and we're going to build this dataset with the help of Code Interpreter. So the goal is that our model should take a concept as input and generate a prompt that we can use in Midjourney to create images. By the end of the video we will have a custom model that is specialized in this specific task.

Okay, before creating the dataset and training our model, let's see how a Llama 2 70-billion-parameter model behaves when we ask it to generate a Midjourney prompt for "a person walking in the rain". I'm assuming it has no idea what Midjourney is and will probably give us a very generic response. The 70-billion model was actually taking too long, so I switched to 13B, and here's the response it generated: "it looks like you're out for a walk in the rain today, how can I assist you". So it's definitely not fine-tuned for this specific task, and that's why the response is not great. Our goal now is to train, or fine-tune, the Llama model so that when we give it this prompt it generates a much better prompt for Midjourney to use.

For this video I asked GPT-4 with Code Interpreter to create a dataset that contains concept and prompt pairs: for each concept (I gave it an example like "a person walking in the rain") it should create a detailed description that can be used by an image generator to create that image. You will probably notice that the conversation shows GPT-3.5; this is a previous conversation in which I was using GPT-4 with Code Interpreter, and I think I ran into the usage limit, which is why it defaulted to 3.5. In this case it said something like "Certainly, creating a dataset that pairs a concept with a detailed description to be used by an image generator is an interesting task", and it gave me some examples: for the concept "a person walking in the rain" it came up with a description, then "a child flying a kite on the beach" with its own description, and they look pretty good. I actually tested some of them in Midjourney to make sure the output images are nice, and I was happy with them. So I asked it to create a DataFrame with concept and description columns, including 25 unique examples of concept and description pairs, then asked it to write that into a CSV file, and then asked it to repeat that process four different times. That's why we ended up with multiple CSV files; you could do it in a single CSV file, but I didn't want it to hit the limits, so I split it across multiple files.
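Just for illustration, here is a toy sketch in pandas of what one of those generated CSV files might contain. The two concepts come from the examples mentioned above, but the descriptions and the file name are invented for this sketch, not the actual ones GPT-4 produced.

```python
import pandas as pd

# Hypothetical concept/description pairs; in the video these were generated
# by GPT-4 with Code Interpreter, 25 per file.
pairs = [
    {
        "concept": "a person walking in the rain",
        "description": "A lone figure under a black umbrella on a rain-soaked city "
                       "street at dusk, neon reflections in the puddles, cinematic "
                       "lighting, highly detailed",
    },
    {
        "concept": "a child flying a kite on the beach",
        "description": "A child running along a sunny shoreline with a bright red "
                       "kite against a clear blue sky, golden hour, wide-angle shot",
    },
]

pd.DataFrame(pairs).to_csv("concepts_01.csv", index=False)  # one of several CSVs
```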
Okay, so here's an example of how the output looks: you have one column for the concept and a second column for the description. It's simple, straightforward stuff, nothing complex, but it is a good example dataset. Next I needed to combine all of these files and put them into a very specific format. Again I used Code Interpreter: I asked it to combine all the CSVs into a single DataFrame, and it did a pretty good job. In total we had close to 300 rows with two columns, and here is an example of how the data looks. Keep in mind this is a very simple dataset; the goal of this video is just to show you how to structure a dataset for fine-tuning the model, not this specific dataset.

Now, this is the most important bit of the whole video: you want to structure the data so that both your input and your output end up in a single column called "text". I'm doing this within Code Interpreter, but you can do it in Python as well. I asked it to create another column called "text" that follows a specific structure. We define a special token that marks the input from the human, so you have the human token, then on the next line the fixed instruction "generate a Midjourney prompt for", followed by the concept (it takes the concept from that row and puts it there), and then a special token for the assistant, followed by the response the model is supposed to generate. So the goal is: you provide a concept, and based on that concept the model generates the description, which is the prompt.

After doing that, the dataset has three columns instead of two: the concept, the description, and the "text" column that puts the concept and the description together in this specific format.

A couple of things to keep in mind. We are going to fine-tune a Llama 2 7-billion-parameter model, but notice that I'm not using the prompt template that ships with Llama 2. That's because I'm fine-tuning the base Llama 2 model, not the chat model, and when you fine-tune a model you can set the prompt template however you want. It doesn't have to look like this; it can be whatever you want, as long as you are consistent across all your examples and you define specific tokens for the human input and the assistant response. You could also use an instruction, an input, and a system response instead. Fine-tuning gives you the flexibility to set it up the way you want.

Once all this was done, I simply asked it to export the result to another CSV file. This is how the final dataset looks: the last column is the one we just added, where everything is put into this format, so the human token, then the input, then the assistant token, and then the response, which is basically the description column. The other change I made was to rename the file to train.csv.
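Here is a minimal sketch, in plain pandas, of the combine-and-format steps described above. The file-name pattern and the exact "### Human:" / "### Assistant:" markers are assumptions for illustration; use whatever template you settled on, as long as it is consistent across every row.

```python
import glob
import pandas as pd

# Combine all generated CSV files into one DataFrame.
# "concepts_*.csv" is a hypothetical naming pattern for the files Code Interpreter produced.
files = sorted(glob.glob("concepts_*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Put the input (concept) and output (description) into a single "text" column,
# using consistent human/assistant markers. The exact markers here are an assumption.
template = "### Human: Generate a midjourney prompt for {concept}\n### Assistant: {description}"
df["text"] = df.apply(
    lambda row: template.format(concept=row["concept"], description=row["description"]),
    axis=1,
)

# AutoTrain expects the training file to be named train.csv in the data folder.
df.to_csv("train.csv", index=False)
```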
Okay, so once the dataset is ready, the next step is to fine-tune the model. For that we are going to use the AutoTrain package from Hugging Face. I have covered this in another video, and I'll put a link to it; if you haven't watched it I highly recommend you do, because it explains everything that is going on here. The whole thing is a single command, but I've put it on multiple lines to make it easier to read (a hedged sketch of the full command appears at the end of this section).

You will need to install the AutoTrain package first (again, watch the other video for how to do that), and once it's installed, this is the only command you need to run. First we run autotrain; we want to fine-tune an LLM, so we pass the llm sub-command, and then the train flag, which indicates that we want to fine-tune this specific LLM. Then the project name: I'm using "mid-journey prompts" as the project name for this project. Then the model: I'm using a sharded version of Llama 2 7B, simply because the GPU I was running on does not have a lot of VRAM, so I wanted the model divided into multiple parts that can be loaded one at a time. Now, this is the most important bit: you need to define the path of your dataset. What I did was put the CSV file I created in the same folder I'm running this code from, and you need to make sure the file is actually called train.csv, otherwise it's not going to work. So simply put the train.csv file in the folder where you are running the command; you don't need to write the file name in the command, you just pass a dot as the data path. After that we use the PEFT package for fine-tuning, and this is going to use the 4-bit quantized version of the model. You set the learning rate, then the batch size (I'm using a batch size of four), and the number of epochs, which I've set to nine; you can play around with these. Then we want to use the SFT trainer, and everything gets logged to a training.log file. Again, make sure you watch the previous video if you're not familiar with AutoTrain.

Depending on how big your dataset is, the training is going to take a little while. This is how the file structure is supposed to look. Don't worry about the app.py file for now; it contains the code for running inference on our fine-tuned model, and I'll cover it in a bit. Initially you just need a train.csv file in a folder, and then you can run the command in a terminal from within that folder. If everything goes well, you will see a "mid journey prompts" folder, which contains the fine-tuned model, along with a training.log file that has all the logs from the training. If you look at it, I ran it for 9 epochs, the training loss is decreasing nicely, and the learning rate is decreasing as well. Ideally you want a validation dataset too; this was a relatively small dataset, and you would want to train your models on a much larger one, but this is just an example of what the output will look like. Once training completes it creates that folder, named exactly the same as your project name, and inside it you will see the model files and the different checkpoints it stored throughout training. So this is what you should expect as output after training completes.

Also keep in mind that I was running this on a local machine, not in Google Colab. One more thing: if you have seen my previous video, there I was pushing the fine-tuned model to Hugging Face. In this case I'm not doing that, but if you want to, you can; you just need to set up your Hugging Face token on your local machine.
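Putting those flags together, the command looked roughly like the sketch below. The flag names reflect the autotrain-advanced CLI as it was around mid-2023 and may have changed in newer releases, and the sharded model repo and project name are only examples; check `autotrain llm --help` for your installed version and substitute the model and project name you actually want to use.

```bash
# Hedged sketch of the AutoTrain command described above.
# --model points at an example sharded Llama 2 7B checkpoint (assumption);
# --data_path . expects a file named train.csv in the current folder.
autotrain llm --train \
  --project_name "mid-journey-prompts" \
  --model abhishek/llama-2-7b-hf-small-shards \
  --data_path . \
  --use_peft \
  --use_int4 \
  --learning_rate 2e-4 \
  --train_batch_size 4 \
  --num_train_epochs 9 \
  --trainer sft > training.log
```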
Now you have fine-tuned the model, but the question is: how do you actually run inference on the fine-tuned model? This is where that app.py file comes in, so let me show you what is happening in it (a minimal sketch of such a script is included at the end of this section). We use the transformers package, importing AutoModelForCausalLM as well as the tokenizer. If you have multiple GPUs and want to run this across them at the same time, you can use DataParallel from PyTorch. Then you simply provide the path of the folder that was created by AutoTrain, in this case the "mid journey prompts" folder, both for the tokenizer and for loading the model. After that we set the input context, which is basically the prompt template we used: in this case "generate a Midjourney prompt for a castle on the edge", so that's the concept, and the assistant is supposed to generate the response. We take that input, encode it using the tokenizer, generate a response from the model, convert that output back to text, and print it. That is exactly what happens inside the app.py file you saw.

In terms of parameters, I'm setting the max length to 85 (you can play around with it and make it longer or shorter) and the temperature is set to 0.3. Now, I'm fine-tuning this on close to 300 examples, so the results are not going to be amazing, and if you set the max length to, say, a thousand, the model starts repeating itself; I'll show you an example of what that looks like.

Okay, so for that specific input, here's the output we got: "A castle on the edge is a medieval fortress built on a cliff or a hilltop with steep slopes on all sides. It is a symbol of power and protection and is often associated with romance and adventure." That's a pretty decent output given that we only had around 300 examples. Here's another output, this one for "a boy playing in the rain": "A boy is playing in the rain. He's wearing a yellow raincoat and holding an umbrella. He's running through the puddles and laughing." That's actually a pretty good prompt, I would say. So I went ahead and generated these images: a boy running in the rain, which looks pretty amazing, and here's the one for the castle, which looks majestic. These look great, and as I said at the beginning, you can create more complex prompts based on the structure I showed you at the start of the video.

Now, just to show you the importance of data curation: in this case I set max new tokens to a thousand, and this is what happens as a result. The initial concept was "a boy playing in the rain", and it generated a pretty good response based on that, but since the maximum number of tokens was set to a thousand, it started repeating itself: it generated another concept for itself, created a response, generated another concept, and so on. So if you're creating a dataset, a couple of things to keep in mind: first and foremost, it needs to be large enough for the model to learn from, and you probably want some sort of end-of-sequence token, so that when the model encounters that token it stops generating new words, because essentially these are simple next-word-prediction models. That's why I also wanted to show you examples of where the model actually fails.
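For reference, here is a minimal sketch of what an inference script like the app.py described above might contain. It assumes the AutoTrain output folder is called "mid-journey-prompts" and that the same Human/Assistant template from train.csv is used; both names are assumptions, not the exact contents of the original file.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the folder AutoTrain created (same as the project name); assumed name.
model_path = "mid-journey-prompts"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
# With multiple GPUs you could instead wrap the model in torch.nn.DataParallel.

# Input context in the same template used for the "text" column during training.
prompt = "### Human: Generate a midjourney prompt for a castle on the edge\n### Assistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_length=85,     # as in the video; raise this and the model may start rambling
    temperature=0.3,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```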
These are some of the considerations you need to think about when you are curating datasets for training these LLMs. If you found this useful and would like to support my work, check out my Patreon. And if you want to learn how to fine-tune these models using a single command, check out this video. Thanks for watching, and see you in the next one.
Info
Channel: Prompt Engineering
Views: 13,781
Keywords: prompt engineering, Prompt Engineer, llama2, meta llama2, how to use llama 2, llama 2 meta, llama 2 tutorial, fine tune llama 2, how to fine tune llama 2, llama 2 fine tune, llama2 api, llama2 ai, how to run llama2, how to fine tunning a llama 2 7b with language dataset, llama 2 dataset, how to run llama 2 locally, how to run llama 2 on windows, how to run llama 2 locally in python, how to run llama 2, how to create my own dataset for llm, create dataset for llm
Id: z2QE12p3kMM
Length: 17min 11sec (1031 seconds)
Published: Wed Aug 09 2023