Fine-Tuning GPT-3.5 on Custom Dataset: A Step-by-Step Guide | Code

Video Statistics and Information

Captions
Hi everyone, in today's video we're going to see how to fine-tune the GPT-3.5 model. Let's first look at the benefits of fine-tuning. First of all, you generally get higher quality results compared to just prompting. You can also train on more examples: sometimes your use case has many scenarios and you can't explain or fit all of them inside the prompt, so you need another way to give those examples to the model. Fine-tuning is the way to provide the many examples and scenarios that won't fit in a prompt. Another benefit is that once you fine-tune the model, you don't need a long prompt explaining again and again what you want to achieve; you just give the input and the model produces the expected output. That saves tokens, because the prompt becomes shorter, and it lowers request latency: the longer the prompt, the longer it takes to get a response from OpenAI, so a shorter prompt means a quicker response. Fine-tuning has many advantages, so it's worth checking whether you can fine-tune your own GPT-3.5 model. OpenAI even says that a fine-tuned GPT-3.5 model can perform about as well as GPT-4 on a narrow task; you can read more in their blog post and detailed fine-tuning guide. So let's see step by step how to fine-tune GPT-3.5.

For the fine-tuning I generated a dataset using ChatGPT itself. I asked ChatGPT to give me a dataset; it failed to generate it in one go (I wanted at least 100 examples), so it produced it in parts and I combined everything into a single file. Let me explain the dataset. It is a bank support training dataset: we have support queries, and for each query two labels, a top category (the topic of the query, for example Fees and Charges) and a subcategory. This is the kind of labeling you would use to decide where to forward a query, so it's essentially intent or topic detection. We have around 100 examples generated with ChatGPT across many categories and subcategories. I haven't checked in detail how correct all of them are, but let's say we have those 100 examples and see what we need to do.

First of all we're going to need the OpenAI library, so let's install it. We've already seen the file, but if you want to look at it again, read it with pandas: we have three columns, the support query, which is our input, and the expected outputs, the top category and the subcategory. One challenging thing is that we have only 100 examples, while the top categories have, I guess, 10 unique values and the subcategories have more than 25, maybe 30 unique values.
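As a quick orientation, here is a minimal sketch of loading the CSV with pandas and counting examples per category. The file name and column names (bank_support_train.csv, support_query, top_category, sub_category) are assumptions for illustration; the notebook's actual names may differ.

    import pandas as pd

    # Load the ChatGPT-generated bank support dataset (assumed file name).
    df = pd.read_csv("bank_support_train.csv")
    print(df.head())

    # With ~100 rows, ~10 top categories, and 25-30 subcategories,
    # each subcategory only gets a handful of examples.
    print(df["top_category"].value_counts())
    print(df["sub_category"].value_counts())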
The thing is, in classification, the more unique categories you have, the more training examples you need, because the dataset should have at least a significant representation of each category. In our case we have roughly, I think exactly, 10 examples for each top category, but it's going to be very challenging for the subcategories, because for each subcategory we have hardly two or three examples. So we're going to see whether, with 100 examples, GPT is able to identify both the top category and the subcategory.

If you look at the fine-tuning guide, you can see the kind of dataset we need to prepare. This is a chat model, so our training data has to represent a conversation: each example is a set of messages, where you can have a system message, a user message, and then the expected output, which is the assistant message. We need to format our dataset in the same fashion. In our case the support query will go in as the user message, and the output will be the assistant message; I'll show you the code. One interesting thing I want to do: instead of outputting the categories as plain text, like category, colon, subcategory, I'd rather output them as JSON, so that they become easy to process in downstream tasks. That's also one of the advantages of fine-tuning: you can fine-tune the model so that it reliably produces JSON output, which is more dependable than plain text. We'll see whether 100 examples are sufficient for that.

So let's look at the code. We read the file and iterate through each row of our data frame. For each row we first convert the two label columns into the JSON format we want, so I simply create a JSON string where the top category becomes one key, with the value taken from that row, and the subcategory becomes another. Then we create our messages array, where the first message is the user message, which is the actual support query, and the expected output is the assistant message, which is the JSON response the model should produce. We are teaching GPT that when the user sends this message, the assistant should produce a JSON with the category and subcategory of that user query. This is how we generate our training data, so let me run it and look at what we get.
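Here is a rough sketch of that formatting step, continuing from the data frame loaded above; the helper name row_to_example is my own, so the exact code in the notebook may look different.

    import json

    def row_to_example(row):
        """Convert one data frame row into a chat-format training example."""
        # The assistant's target output is a JSON string with both labels.
        target = json.dumps({
            "top_category": row["top_category"],
            "sub_category": row["sub_category"],
        })
        return {
            "messages": [
                {"role": "user", "content": row["support_query"]},
                {"role": "assistant", "content": target},
            ]
        }

    training_data = [row_to_example(row) for _, row in df.iterrows()]
    print(training_data[0])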
In the output you can see the first example: the role is user, the user is asking whether I can explain the monthly maintenance fee of my account, and the assistant produces a JSON saying the top category is Fees and Charges and the subcategory is Understanding Fees. That's the training dataset we've created. We also want to make sure the output is really proper JSON, so I take that assistant value and load it with json.loads; since I get back a Python dictionary, it is valid JSON.

Once we have the dataset formatted the way OpenAI requires, let's create two sets, one for training and one for validation. I'm going to put 80% into training and 20% into validation, so we get around 80 training examples, and we make the split stratified so that at least the ratio of top categories is maintained. For testing I have another 20 queries that I kept aside separately; the 100 examples here are only for training and validation.

Now that we have the 80/20 training and validation data, we need to convert it. If we go back to the example format in the guide, the data has to be converted into JSONL, the JSON Lines format, which means each example sits on its own line. Back in our code, the training data and validation data are just lists of those message dictionaries, one dictionary per example; if you print the training data you can see each element of the list is one example. We want to write this as a JSON Lines file where each example goes on a new line, so I created a function, to_jsonl, which takes the list, opens a file, iterates through the entries, and for each entry uses json.dump (each entry is a Python dictionary, you can check its type) to write the JSON string, followed by a newline character so that every entry lands on its own line. That's exactly what the JSONL file format means.
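A sketch of the stratified split and the JSONL writer described above; the exact function names and file names in the notebook may differ, and scikit-learn's train_test_split is just one convenient way to do the stratified 80/20 split.

    import json
    from sklearn.model_selection import train_test_split

    # 80/20 split, stratified on the top category so its ratios are preserved.
    train_df, val_df = train_test_split(
        df, test_size=0.2, stratify=df["top_category"], random_state=42
    )

    train_examples = [row_to_example(row) for _, row in train_df.iterrows()]
    val_examples = [row_to_example(row) for _, row in val_df.iterrows()]

    def to_jsonl(entries, path):
        """Write one JSON object per line (the JSONL format OpenAI expects)."""
        with open(path, "w") as f:
            for entry in entries:
                json.dump(entry, f)
                f.write("\n")  # newline so each example sits on its own line

    to_jsonl(train_examples, "train.jsonl")
    to_jsonl(val_examples, "validation.jsonl")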
Let me run this so that we get two files, one for training and one for validation, and we can look at them: in the train file you can see each object is now on its own separate line, which is what the JSONL file format means.

Now that we have the files, we should upload them to OpenAI, because if you go back to the guide, the first step is to prepare and upload your training data. To upload, first let's create the OpenAI client object, let's call it client; this is the new syntax that came with openai 1.x, and here you can specify the API key. Ideally you should provide it as an environment variable; just for the tutorial I'm showing the key, but I'm definitely going to delete it once the tutorial is done. Now that I have the client, we can upload the files with client.files.create, so we are creating, or rather uploading, a file: we read our training file and specify the purpose as fine-tune. If you've seen my recent tutorial on the OpenAI Assistants API, we uploaded a file there too, but there the purpose was assistants; here the purpose is fine-tune because we're sending the file for fine-tuning. We read the training file and the validation file and upload both. Don't worry about remembering this code, it's available in the guide, and as always I'll share this notebook, so you can simply go and run it. Once a file is uploaded, the response is a file object and we note its ID, because we'll need that ID when we submit the fine-tuning job. So we have the training file ID and the validation file ID.
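A minimal sketch of creating the client and uploading both files, reading the key from an environment variable instead of hard-coding it; the file names follow the sketch above.

    import os
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # Upload the JSONL files with purpose "fine-tune" (the Assistants API uses a different purpose).
    training_file = client.files.create(
        file=open("train.jsonl", "rb"),
        purpose="fine-tune",
    )
    validation_file = client.files.create(
        file=open("validation.jsonl", "rb"),
        purpose="fine-tune",
    )

    print(training_file.id, validation_file.id)  # needed when submitting the job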
Now let's create our fine-tuning job. We'll give it a suffix; since I'm doing this for YouTube I might use something like a YouTube-tutorial suffix, so that if I have multiple models I can easily find the one related to this video. Just like we created the files, we create the fine-tuning job with client.fine_tuning.jobs.create: we specify our training file ID, the validation file ID, and which model we want to fine-tune; currently fine-tuning is available for GPT-3.5 Turbo, so we use that, and we pass the suffix so that an extra name gets attached to the model and you can identify why you trained it. Let's submit the job. Once we submit the fine-tuning job we get a fine-tuning job ID, and if you look at it right now there is no fine-tuned model yet, because the job isn't finished. The number of epochs will be chosen automatically, the batch size too, because we haven't given any of those hyperparameters; everything is on auto, and you can see which model it chose, gpt-3.5-turbo. That's the response we get.

Now we need to keep checking this job ID to see whether it has completed; when the job completes, the fine_tuned_model field gets filled in. How do we know whether it's done? First of all you can list all your fine-tuning jobs with client.fine_tuning.jobs.list, say the last 10 jobs, and we should see the one we just submitted; there it is. It chose 3 epochs, a batch size of 1, some learning-rate multiplier, and that model. Let's find its status: at first the status was validating_files, and now the status is running, so our fine-tuning job is running. It takes some time to complete, and you need to keep checking. Once you have the job ID you can also retrieve one specific job, so I copy the job ID, pass it to the retrieve call, and in the response I see the same job: it's running, with the file names, the status, and everything else.

We could wait until this job finishes, but before recording this tutorial I had run the fine-tuning one more time, so that I'd have a model ready the moment I started recording. So if I search the job list for the suffix I used, I can find one job that is already completed, the one I submitted before recording. Let me copy that job ID and paste it here; you can see I created it just before recording, and for this one we also get the fine-tuned model name.
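A sketch of submitting and monitoring the job; the suffix string is just an example, and the epochs and batch size are left on auto unless you set them explicitly.

    # Submit the fine-tuning job on top of GPT-3.5 Turbo.
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        validation_file=validation_file.id,
        model="gpt-3.5-turbo",
        suffix="yt-tutorial",  # example suffix so the model is easy to find later
    )
    print(job.id, job.status)

    # List recent jobs, or poll one job until it finishes.
    for j in client.fine_tuning.jobs.list(limit=10):
        print(j.id, j.status, j.fine_tuned_model)

    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status)            # e.g. validating_files, running, succeeded
    print(job.fine_tuned_model)  # None until the job succeeds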
If a job is still running, the fine_tuned_model field is empty; you can see the job we just submitted has a fine-tuned model name of None, but for the completed job I do get the fine-tuned model name. The name tells you it's a fine-tune, then gpt-3.5-turbo, the base model, then I think your organization or account name, then the suffix I had given, and then some random string that they generate. We want this fine-tuned model ID so that we can run predictions, because this is the model trained on our data. Once we have the response for that job we can print the fine-tuned model ID; this is the model I trained before recording. Let me check once more whether the job submitted during the recording has finished; it takes time, and no, it hasn't finished yet, so let's use the model I trained before recording for the predictions.

So let's test the fine-tuned model. To test it, we format the test queries the same way we formatted the training data: whatever user query we have, we pass it as a chat message with role user and the support query as the content, and we expect the model to produce the assistant response. So there's a small function that formats one row of the data frame, and another small function that does the prediction: it takes the formatted message and the fine-tuned model ID and calls the chat completions API, passing our fine-tuned model ID instead of the default gpt-3.5-turbo, the messages, and some default parameters like temperature and max tokens, and it returns the response. Pretty simple stuff if you've used GPT for predictions before. Then I wrote a function that uses both of them, the formatting and the prediction.

Let's see what's happening there. We read the test queries; they look similar, with the support query and the top category and subcategory we want to identify. We read the test data frame and call the function store_predictions. It takes the test data frame and the fine-tuned model ID we want to use, creates a column called prediction with all values set to None, and then iterates through the test data frame: each row is formatted for testing, that is, converted into a role-user message with the support query as content, we call the predict function we just saw with that message and the model ID, we get the prediction result, and we write the result into the prediction column at that row's index so that we can compare side by side. That's what this function does.
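A sketch of those prediction helpers, continuing from the client created earlier; the function names, the prediction column, and the test file name follow the narration but are assumptions about the actual notebook code.

    import pandas as pd

    def format_test_message(row):
        """Wrap the support query as a single user chat message."""
        return [{"role": "user", "content": row["support_query"]}]

    def predict(messages, model_id):
        """Call chat completions with the fine-tuned model instead of the base model."""
        response = client.chat.completions.create(
            model=model_id,  # e.g. "ft:gpt-3.5-turbo:my-org:yt-tutorial:xxxx"
            messages=messages,
            temperature=0,
            max_tokens=100,
        )
        return response.choices[0].message.content

    def store_predictions(test_df, model_id):
        """Add a 'prediction' column with the model's JSON output for every test query."""
        test_df["prediction"] = None
        for index, row in test_df.iterrows():
            test_df.at[index, "prediction"] = predict(format_test_message(row), model_id)
        return test_df

    fine_tuned_model_id = job.fine_tuned_model  # from the completed job above
    test_df = pd.read_csv("bank_support_test.csv")  # assumed file name for the 20 held-out queries
    test_df = store_predictions(test_df, fine_tuned_model_id)
    test_df.to_csv("predictions.csv", index=False)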
So let's run the predictions for those 20 test examples; once it's done we should see the predictions.csv file. It took 18 seconds for 20 predictions, which is pretty good, roughly one second per example; that's quite fast, since a normal GPT call can take a couple of seconds.

Let's look at the predictions now and see how good they are, because we know the limitation of the training data: it had only 100 examples. Let's see which categories it identifies and where it fails. The first question was related to Account Management and Account Maintenance, and the prediction says Customer Service and Contact Information, so it totally failed. For the second one it says Account Management; we could check this programmatically (there's a small sketch of such a check after this paragraph), but I just want to compare side by side: Account Maintenance versus Account Management, subcategory Account Closure. Here we can observe it identified the top category but failed on the subcategory; we know the subcategory is hard because there are hardly two or three examples per subcategory, but at least it got the top category. What about the next one: Transaction Services and Transaction Inquiry versus Transaction Services, so again the subcategory failed. You can see the pattern: it identifies the top category, because each top category has 10 examples, but the subcategory fails. Transaction Services again, it says Transaction Disputes, but that's wrong; the subcategories also overlap somewhat. Still, it gets the top category and it gives me proper JSON. What about Card Services and Stolen Card: this one is correct. Card Services, Stolen Card here: the subcategory failed again. Online and Mobile Banking with Technical Support: correct. Online Banking Security and Access: also correct. Loans and Mortgages, Loan Application: correct. What about the others: Savings and Investments, Account Types versus Savings and Investments, Investment Options, that one is wrong.

So one thing we can observe is that with 100 examples it has pretty good performance, considering how hard the task is: we had just 100 examples, 10 top categories, and almost 25 to 30 subcategories. I'm sure that if instead of these 100 examples we had 200 or 500 examples, with sufficient representation of each category, we would see better results, at least for the categories where it's currently failing. And this is how you should usually work: start with a small dataset, just 100 examples, train the model, and see how good it is, then collect more data; as a next step I could train it on 200 examples and see what happens. When you're working on your own problem, start with a small dataset and then decide whether you have sufficient data or actually need more to train on.
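A minimal sketch of checking the accuracy programmatically instead of eyeballing the CSV; it assumes the prediction column holds the JSON string produced by the model and the same column names as before.

    import json

    def parse_prediction(text):
        """Parse the model's JSON output, returning an empty dict if it isn't valid JSON."""
        try:
            return json.loads(text)
        except (json.JSONDecodeError, TypeError):
            return {}

    parsed = test_df["prediction"].apply(parse_prediction)
    top_acc = (parsed.apply(lambda p: p.get("top_category")) == test_df["top_category"]).mean()
    sub_acc = (parsed.apply(lambda p: p.get("sub_category")) == test_df["sub_category"]).mean()
    print(f"top-category accuracy: {top_acc:.0%}, sub-category accuracy: {sub_acc:.0%}")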
I hope you found this video useful. As I said, I'm going to attach this notebook when I upload the video, so you'll have all of this. If a lot of it feels overwhelming, that's perfectly fine; you can go and check the notebook, all the code is there, and I've just tried to explain those things. Thank you very much. If you want to give any suggestions, put them in the comments, and let me know if there's any video you'd like from me. Thank you.
Info
Channel: Pradip Nichite
Views: 8,423
Keywords: gpt 3.5 finetuning, gpt3.5, open ai model, llms
Id: YioMpE2jGIU
Length: 24min 47sec (1487 seconds)
Published: Mon Nov 13 2023