How To Fine-Tune the Alpaca Model For Any Language | ChatGPT Alternative

Video Statistics and Information

Captions
In this video I will show you how you can fine-tune the Alpaca model for any language. I did it with German, but I will show you how I did it so that you're able to apply it to any other language and fine-tune your own model — and the good thing is, it only costs three dollars. To understand how I arrived at this insight, make sure to watch the whole video, and now let's start fine-tuning the Alpaca model for any language. First, let me give you some background information on how both models were trained. For this we can look at the LLaMA paper and see the data distribution used for training the LLaMA model. One thing we can see here is the Wikipedia data set, which contains 20 different languages, so the model has seen 20 different languages during training, while overall the Wikipedia data only takes up 4.5 percent of the whole data distribution. So the model was primarily trained on English, but it has already seen some non-English text during training, and here we can see all the languages. The Alpaca model, on the other hand, was trained only on English, as you can see in the prompt used to generate the instruction-following tasks, where they say the instructions should be in English. So the Alpaca model right now is very specialized in English, and today I will show you how we can fine-tune the model so that it is also able to interact with us in other languages — for example German, French, Italian, whatever language you like. Okay, but how can we fine-tune the Alpaca model so that it's able to interact with us in our desired language? For this we obviously need some data in our target language, and how can we get this training data? Two options came to my mind: the first option is to just translate the data set that the Stanford researchers used for fine-tuning the Alpaca model, and the second approach is to use their generation technique to generate the instruction-following tasks, but adjust the prompt so that the instructions are generated in German instead of English. Since I know that the Stanford researchers spent less than six hundred dollars for fine-tuning the Alpaca model, I first wanted to roughly calculate how much both options would cost. Starting with the data generation approach, I found that the Stanford researchers spent less than 500 dollars on the 52,000 unique instructions using the OpenAI API, and further that the 52,000 instruction-following tasks were generated using the text-davinci-003 model. In the OpenAI documentation I found that using the recently released gpt-3.5-turbo model costs one tenth of using text-davinci-003, so we can cut our cost by a factor of ten. I know some of you might object that gpt-3.5-turbo is a chat-optimized model while the text-davinci model is more specialized for traditional completion tasks, but for this I could find a helpful piece of information where they state: although the chat format is designed to make multi-turn conversations easy, it's just as useful for single-turn tasks without any conversation, such as those previously served by instruction-following models like text-davinci-003.
So what does this mean? From my understanding, we can use the gpt-3.5-turbo model and get results as good as with text-davinci-003 for generating our instruction-following tasks. Overall I was assuming that generating 52,000 instruction-following tasks in German would cost around fifty dollars. But since the Alpaca model is already fine-tuned to follow instructions, there is theoretically no need to further fine-tune it with new instruction-following tasks, which is what we would be doing if we generated new tasks in German. In other words, by simply translating the existing data set to German we wouldn't introduce new tasks to the model, but just fine-tune the model on understanding and replying in German. Okay, and how do we translate 52,000 tasks? I guess we need an API for that, unless you have a lot of time. I know that Google Translate is the most famous translator, but personally I like and work with DeepL — and I'm not getting paid for saying this, it's just my personal experience: compared to Google Translate, I think the translation quality from English to German and German to English is much better. Because of that I looked up the pricing of the DeepL API, and I found that the translation of 500,000 characters per month is free; afterwards you have to switch to DeepL Pro, pay 5 euros per month as a base price, and on top of that you pay 20 euros per one million translated characters. I didn't know how many characters the data set contains, so I looked it up myself. Before I show you how I did it, I just want to point out that I didn't use the official data set — there is already a cleaned version where people have reviewed and made improvements; for example, listings and alphabetical orderings that were wrong in the official data set have all been fixed in this new data set, which you can see here. So in case you're also planning to fine-tune your model on this data set, I would suggest you also check out this repository. I even could contribute a little bit to it, but many thanks to gururise — I know he put a lot of effort into making this cleaned version of the data set. So basically I just went to this file and downloaded it, and then I switched to a Jupyter notebook. You can also run this code in a plain Python file, you don't need a notebook; I just like the interactivity you have with Jupyter notebooks, so that's why I used one. I first wanted to show you how I did it step by step, coding it live, but I also don't want to make this video too long, so I decided to just show you my final code to calculate the estimated price. What I did here: I loaded the cleaned version of the Alpaca data into a data frame, then I calculated the number of characters of each item in each of the columns, and then I calculated the sum over all three columns. So we have our total characters and our price per one million translated characters, which is 20 euros; we divide our total characters by one million, multiply by the price per one million characters, and after rounding we have our estimated price, which is 365 euros — and we also have to include the base price, which brings it to roughly 370 euros.
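Since the calculation itself is just a few lines, here is a minimal sketch of it; the file name alpaca_data_cleaned.json and the column names instruction, input and output are my assumptions about the cleaned data set:

```python
import pandas as pd

# Minimal sketch of the DeepL cost estimate; file and column names are assumptions.
df = pd.read_json("alpaca_data_cleaned.json")

# Count the characters of every instruction, input and output, then sum them up.
total_characters = int(
    sum(df[column].str.len().sum() for column in ["instruction", "input", "output"])
)

PRICE_PER_MILLION_CHARS = 20  # DeepL Pro: 20 euros per 1,000,000 translated characters
BASE_PRICE = 5                # plus roughly 5 euros per month as a base price

estimated_price = round(total_characters / 1_000_000 * PRICE_PER_MILLION_CHARS)
print(f"Total characters: {total_characters}")
print(f"Estimated DeepL price: ~{estimated_price} euros (+ {BASE_PRICE} euros base price)")
```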
That was a big bummer for me, because I was hoping to translate the whole data set, but that would be quite expensive — and that's when another idea came to my mind: maybe I could just use ChatGPT for translating the whole data set, because we already know that the gpt-3.5-turbo model is much cheaper. First I wanted to validate that my idea is not entirely stupid, and for that I found a study which investigates whether ChatGPT is a good translator. I don't want to make a deep dive into the paper, I just want to show you this result, where we can see the capability of translating English to German: Google and DeepL achieve superior results, while ChatGPT is still comparable — it's definitely worse than those two translators, but it's not too bad. I think the takeaway is that translating with ChatGPT will definitely lead to worse translation quality than using, for example, Google or DeepL, but it could be a viable and very cheap alternative. What I did next was look up the cost of using the gpt-3.5-turbo model, which is $0.002 per 1,000 tokens. And what is a token? I found this piece of information: for English text, one token is approximately four characters. Having all this information, we can now calculate how much translating the 52,000 instruction-following tasks would cost using the gpt-3.5-turbo model. For this I divided the total characters by four to get the total number of tokens, I defined the cost per token, and then I multiplied the total number of tokens by two — this is because OpenAI charges you for both input and output tokens — divided by 1,000 and multiplied by the cost per 1,000 tokens, and we end up with an estimated price of $18.29. If we compare this to our estimated price from before, this is way cheaper. Now imagine that the GPT model gets even better at translating languages — I think that could revolutionize the pricing of translation tools, because I looked it up: the pricing for Google is pretty similar to that of DeepL, slightly cheaper, but still also around 350 euros, and I think this margin is huge. If GPT gets even better at translating, I think that would force the translation providers to adjust their prices as well. Finally, I calculated how much the fine-tuning itself would actually cost. For this I found that fine-tuning a 7B LLaMA model took three hours on eight A100s, which costs less than $100 on most cloud compute providers. Since there is an implementation that uses low-rank adaptation (LoRA), which is a parameter-efficient way to fine-tune the model, we are able to run the fine-tuning within five hours on a single RTX 4090, and I could already find an RTX 4090 for around 60 cents per hour. That means five hours would cost us around three dollars, which I think is very cool — it's that affordable to fine-tune a large language model these days, just three dollars.
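To make the token arithmetic and the GPU cost concrete, here is a small sketch; the character total is just a placeholder that reproduces the numbers above, and the $0.60 per hour rate is simply the price I happened to find:

```python
# Minimal sketch of the GPT-3.5-turbo translation estimate and the fine-tuning cost.
total_characters = 18_290_000   # placeholder: reuse the total computed in the sketch above

CHARS_PER_TOKEN = 4             # OpenAI rule of thumb: ~4 characters per token for English
COST_PER_1K_TOKENS = 0.002      # gpt-3.5-turbo price in USD per 1,000 tokens

total_tokens = total_characters / CHARS_PER_TOKEN
# Multiply by two because OpenAI charges for both input and output tokens.
translation_cost = total_tokens * 2 / 1000 * COST_PER_1K_TOKENS
print(f"Estimated GPT-3.5-turbo translation cost: ~${translation_cost:.2f}")

# Fine-tuning with LoRA: roughly five hours on a single RTX 4090.
GPU_HOURLY_RATE = 0.60          # assumed cloud GPU price in USD per hour
TRAINING_HOURS = 5
print(f"Estimated fine-tuning cost: ~${GPU_HOURLY_RATE * TRAINING_HOURS:.2f}")
```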
Okay, but enough cost calculations for now. Now it's time to make a decision on how to generate the training data in German, or in your target language. We know that translating the data set with the GPT model is the cheapest option, while translating it with DeepL or Google Translate would be the most expensive. Interestingly, I was assuming that using DeepL or Google Translate to translate the data set would result in the highest data quality, while using the GPT model to generate new tasks in German would result in the worst data quality. Why is that? I know that the Stanford researchers implemented some heuristics to decide whether an instruction is valid or should be discarded, and since I would have had to adapt those heuristics to German as well — and we already know that even the original data set had some quality issues and had to be cleaned — I was assuming that generating new tasks would lead to the worst data quality. If we can already start from the cleaned data set for translation, then even translating it with the ChatGPT model should lead to better data quality. That's why, from then on, I focused only on how to translate the data set. Does this mean I spent close to 400 euros on translating the whole data set? No way. But I was wondering whether we even need all 52,000 tasks to fine-tune the model to another language, or whether it would already be enough to use a subset of those tasks in a different language and fine-tune the Alpaca model with that. At the same time I investigated the distribution of characters per task, which you can see here: one task consists of the instruction, the input and the output, as we can see here, and I summed them for each task, so we can see the distribution of how many characters one task consists of. Here is the overall distribution within our data set, and this is the mean, which is around 352 characters. This is not based on evidence or anything, it was just a gut feeling: I thought that a data set with 12,000 tasks could be sufficient to fine-tune the Alpaca model on German, and on the other hand I also thought that spending less than 25 euros would be nice. So what I then did: I calculated that for 25 euros we can translate 1,500,000 characters using the DeepL API, and as I said, I thought that 12,000 tasks would be a good number, so that would mean we would need a mean of 125 characters per task. By empirically testing I found that if we only consider tasks that have fewer than 200 characters — which is this part here — we get a mean of 124.
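A small sketch of how such a subset could be built — again assuming the file and column names from above; the random seed is arbitrary:

```python
import pandas as pd

# Minimal sketch of the subset selection; file and column names are assumptions.
df = pd.read_json("alpaca_data_cleaned.json")

# Total characters per task = instruction + input + output.
df["characters"] = (
    df["instruction"].str.len() + df["input"].str.len() + df["output"].str.len()
)
print(df["characters"].mean())  # ~352 characters per task over the whole data set

# Keep only the shorter tasks (< 200 characters), which brings the mean down to ~124,
# then sample 12,000 of them as the subset to translate.
subset = df[df["characters"] < 200].sample(n=12_000, random_state=42)
subset.drop(columns="characters").to_json(
    "tasks_to_translate.json", orient="records", force_ascii=False
)
```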
So basically this is our subset, and finally I sampled 12,000 tasks out of it and ended up with a subset data set which I then used for translation. As I said, this is not based on evidence — feel free to find your own way; this was my first approach, and I will show you the results later. Okay, now let's start translating our data set. For this I first signed up for DeepL and also for OpenAI, which I used later, and got an API key for each, so I can use the APIs for translation. Here you can choose either DeepL or OpenAI for translation, which is picked up later in the notebook. For DeepL I chose DE (German) as the target language; you can also choose a formality setting, whether it should be more or less formal, then you insert your authentication key or API key, and by running this cell you're basically set up to start translating with DeepL. What I did then: I loaded the 12,000 tasks into a data frame — and by the way, I will share the code, the data sets, of course also the translated data sets, and the fine-tuned models with you; the link will be in the description box. One thing I encountered in the data set is that some strings contain code, and if we pass code to a translator, certain English words sometimes get translated to German or the target language, which messes up the code. So I tried to create a filter that determines whether a string contains code or whether it's natural language that can be translated. Here you can see I use some keywords like the double ampersand (&&) or the double pipe (||) operator, HTML, a semicolon before a newline, or the SELECT of an SQL statement. I also tried to cover certain patterns, which are each described here — basically a function definition, for example in Python, a dot between two words, something that looks more like CSS syntax, and a closing HTML tag. So overall there is a method that figures out whether a string is translatable or not: if it's natural language, we translate it, and if not, we just leave it as it is. One more thing: the DeepL API allows us to send a whole list of strings, so we don't have to translate our data set string by string but can send a whole list. What I did is chunk the data set into different chunks; within those chunks there are still the three columns — instruction, input and output — and here we take one of those columns, convert it to a list and send the list to DeepL. Before that I created an index so that we remember which contents are not translatable; the non-translatable ones are replaced with an empty string, so DeepL doesn't translate them and we also don't get charged for sending empty strings. After the list has been translated — so all the natural-language strings are translated into our target language — we update the returned list and put the non-translatable content back in, which could, for example, be mathematical equations or Python code. Here you can see how it is done with DeepL: we just pass an array of strings to DeepL, get it returned and map the response, so that's pretty straightforward. As I mentioned earlier, this is the part where the chunking happens: you can define a chunk size and your output directory, and here the data frame gets chunked.
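Here is a rough sketch of that pipeline — the code-detection heuristic, the per-column DeepL call, and the per-chunk writes plus the final merge — assuming the official `deepl` Python package. The keywords, regex patterns and file names are simplified stand-ins for what I describe above, and instead of sending empty placeholder strings I simply skip the code entries and merge them back by index:

```python
import os
import re
import glob
import deepl
import pandas as pd

translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # official `deepl` package

# Simplified stand-ins for the keywords and patterns described above.
CODE_KEYWORDS = ["&&", "||", "<html", ";\n", "SELECT "]
CODE_PATTERNS = [r"def \w+\(", r"\w+\.\w+\(", r"</\w+>"]

def is_translatable(text: str) -> bool:
    """Heuristic: treat strings that look like code as non-translatable."""
    if any(keyword in text for keyword in CODE_KEYWORDS):
        return False
    return not any(re.search(pattern, text) for pattern in CODE_PATTERNS)

def translate_column(values):
    """Translate only the natural-language entries of a column, keep code untouched."""
    indices = [i for i, value in enumerate(values) if is_translatable(value)]
    translated = list(values)
    if indices:
        results = translator.translate_text(
            [values[i] for i in indices], target_lang="DE", formality="less"
        )
        for i, result in zip(indices, results):
            translated[i] = result.text
    return translated

df = pd.read_json("tasks_to_translate.json")
CHUNK_SIZE = 5
os.makedirs("chunks", exist_ok=True)
for start in range(0, len(df), CHUNK_SIZE):
    chunk = df.iloc[start:start + CHUNK_SIZE].copy()
    for column in ["instruction", "input", "output"]:
        chunk[column] = translate_column(chunk[column].tolist())
    # Write every chunk to disk so a network outage only loses one chunk of progress.
    chunk.to_json(f"chunks/chunk_{start}.json", orient="records", force_ascii=False)

# Afterwards, merge all chunk files back into one translated data set.
merged = pd.concat(pd.read_json(path) for path in sorted(glob.glob("chunks/chunk_*.json")))
merged.to_json("translated_tasks_de.json", orient="records", force_ascii=False)
```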
The individual columns are then translated and merged back into a data frame — I'm pretty sure that could be optimized, but it worked for me — and then I basically just wrote this data frame, consisting of five rows, to disk (you can also increase the chunk size). This way, in case there is a network outage or something, I would only lose the progress of one chunk instead of basically losing the whole data set, because otherwise you could potentially get charged for, or pay for, far more translation than you end up keeping. To run all this you basically just have to run this cell — be a little careful and make sure that everything is set up, because this could potentially cost some money. At the end, once all the chunks are written to disk, I just merge them together into one single file, and this is then my translated JSON file, which I can use for fine-tuning the model on my target language, which in my case was German. As I mentioned, this code will be shared with you — and I just noticed that this line was actually commented out, but you need this write-JSON-file call to actually write the merged JSON file, so your final translated data set; anyway, the code you get will be fixed. We then basically end up with our source tasks — I separated those out; they are the ones I didn't translate, basically the 52,000 minus 12,000, so 40,000 untranslated tasks — and these are the 12,000 I translated, so feel free to use them if you want to; they're already separated, and you can get similar results to mine. And here is the translated one — we can see what the 12,000 translated tasks look like, and you can see German has a lot of non-ASCII letters, so it looks a little wild, but that's the translated data set we will now use for fine-tuning our model. All right, and now it's finally time to fine-tune our Alpaca model. For this I used vast.ai to fine-tune the model on a cloud GPU instance — if you have your own GPU or prefer a different provider, feel free; the steps that I will show you work on all of them. But first I will quickly show you how I set up the vast.ai GPU instance, then we will connect to the instance, and from there on you can just follow the steps even if you don't want to use a vast.ai instance. I went to the console, and here you can see the image — you can edit the image, and I specifically used this one because I found it comes with CUDA 11.7; I had issues with the CUDA 12 version, which is the latest PyTorch Docker image template they provide here. So what you can do is just type here and see the details — maybe for you it won't be exactly this version, but you'll see pytorch:latest, and you can edit this string and select this version. I also chose to run an interactive shell server, so I didn't use a Jupyter notebook, just to let you know. Then I increased the disk space to 30 gigabytes — the first time I only set 10, and once I was downloading the model weights for the LLaMA model my machine ran out of disk, so make sure to use at least 30 gigabytes to avoid this error. Then I chose the RTX 4090 as the GPU — you can see all the options, go for whatever looks good to you; I'll go for the one in Sevilla right now, the upload and download speeds look comparatively good — and then you basically just press rent, and it's that easy: now our instance will be prepared.
Then we can connect to the instance and start fine-tuning the model. If you click here on Instances, we can see it's loading right now. The commands that I will use for fine-tuning the model I've already collected in a Medium article, which I will also link in the description box, so feel free to check out the article — that way you can just copy and paste the commands we will use to fine-tune the model and set everything up. Now we can connect to our GPU instance by clicking the Connect button — for me it took almost three minutes, so be a little patient with it. Here you can just copy this part, so at this point open a terminal, copy this and paste it, and et voilà, we are connected to our GPU instance. We can also quickly check that everything is running and that we actually have a GPU, and here we can see it. Now I will run the following three commands: first I clone the repository, then we change directory into the alpaca-lora folder we just cloned, and install the required modules — that's what we're doing now. All right, this is running. I then encountered an error with the bitsandbytes library and could resolve it by switching to CUDA 11.7 — that's why I pointed out that I used this CUDA version, or rather this PyTorch template. Okay, we will copy this one now and paste it, and now we can almost run the fine-tuning; the only thing left is to upload our data so we can actually train the model on our data set. For this you can either use a second terminal window, or press Ctrl+B and then D to detach — the session keeps running, we just disconnect from the GPU instance and are back on our local computer. I already prepared two commands for uploading the translated tasks JSON and the adjusted finetune.py — you can either make the adjustments with vi on the instance or upload an adjusted finetune.py file. For this we again copy the link; I have to switch to the directory where I have the file. Okay, I have to adjust this command a little because I renamed the file — for me, as you can see here, it's named something like translated_tasks_de_deepl_12k — and then we have to adjust the port, which we get from vast.ai, and also our instance IP, and by running this command we can upload our data set. Okay, perfect. I already mentioned that I had to adjust the finetune.py file, and why is that? The default config fine-tunes the LLaMA model so that we end up with something like the Alpaca model, but in our case we would like to take the Alpaca model and just fine-tune it on another language. So we don't take the LLaMA model and fine-tune it to follow instructions — the Alpaca model already does that — we just want to take the Alpaca model and fine-tune it further, and this way we hopefully need less translated data; that's the overall idea, at least my idea. As we can see here, we have to define our data path — our translated tasks — and our base model, and the only thing we really need to change is the part where the LoRA config gets configured, because here we will load the already pre-trained LoRA weights of the Alpaca model. So we basically initialize our LoRA weights from the Alpaca model and fine-tune from there for a specific language. To do this we can have a look at this file, which is the same one, just on my local computer: here we can remove this part, then we also remove this part here, and replace it with a statement that loads our pre-trained LoRA weights — basically the alpaca-lora pre-trained weights — and initializes the model that way.
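Roughly, the change in finetune.py looks like this: instead of wrapping the base model in a fresh LoraConfig via get_peft_model, we load the pre-trained Alpaca LoRA adapter and continue training from it. The repository names and the surrounding details are assumptions on my part, so treat this as a sketch rather than the exact diff:

```python
from transformers import LlamaForCausalLM
from peft import PeftModel  # PEFT = parameter-efficient fine-tuning

BASE_MODEL = "decapoda-research/llama-7b-hf"  # base LLaMA weights used by alpaca-lora at the time
LORA_WEIGHTS = "tloen/alpaca-lora-7b"         # assumed: the pre-trained Alpaca LoRA adapter

model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    device_map="auto",
)

# Instead of get_peft_model(model, LoraConfig(...)), initialize the LoRA weights from the
# already fine-tuned Alpaca adapter and continue training from there.
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)

# Depending on the peft version, you may need to mark the LoRA parameters as trainable
# again before handing the model to the Trainer.
for name, param in model.named_parameters():
    if "lora" in name:
        param.requires_grad = True
```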
So we need to define our LoRA weights, which we can do right here — these are the ones from the alpaca-lora repository, so that works — and we also have to import PeftModel, where PEFT stands for parameter-efficient fine-tuning, and add it to the peft import. All right, that looks good. Then I reduced the number of epochs to two, just to make the training faster, but I would recommend supervising the loss and how it converges and deciding whether you should train longer — in an earlier version of the finetune.py file there was a note that three epochs might even be too much, or not necessary, so I decided to use two epochs. Then I saved the file, adjusted this command to upload the finetune.py file, as you can see here, and this was successful. Now we can connect to the machine again and are finally ready to run our training: we copy this command one more time, paste it in here, and just run it. As you can see, the model weights will now get loaded, and this can take around an hour, I would say, so grab a cup of coffee or something, because it will take a while. For those of you who also use a vast.ai machine: you can just type Ctrl+B and then D to detach from the machine and do something else in the meantime, while your model gets fine-tuned in the background. All right, and once the fine-tuning is done, what I did is zip my LoRA weights and download them to my local computer so I can reuse them, because once I shut down my GPU instance I can't access them anymore — if you're fine-tuning the model on your local computer, that's of course nothing you have to worry about. For this I first installed zip, then zipped the LoRA weights, which are located in the lora-alpaca folder, then typed Ctrl+B and D to detach again, and ran the following command to download the fine-tuned Alpaca LoRA weights. Okay, and once this is done we can also run our model. Before we run it, I first made a minor adjustment to the generate.py file: since I'm on a GPU instance that is not running on my computer, I would like to access the Gradio app over the internet, and for this I had to add the parameter share=True to the launch of the Gradio app and save the file. This way we can now use this command to run our model on the GPU instance. All right, nice — now I got this public URL, and with it I can access the Gradio interface to interact with our fine-tuned model. Let's see: as you can see, we now have the web interface, and I will ask it in German how it is doing; let's see what our fine-tuned model answers — and it says, in German, that it's doing well, thanks. So yeah, the fine-tuning basically worked. And of course I did more than just ask the model how it's doing — I interacted with it for a longer time, tried several prompts, and it always answered in German. I'm confident to say that the fine-tuning worked: the model is able to understand German prompts and to reply in German, and it never replied to me in English, which is really cool. I think that's kind of a proof that fine-tuning done this way definitely works.
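As a side note on that generate.py tweak: the only change needed to reach the app from outside the GPU instance is launching Gradio with share=True. A minimal, self-contained illustration, with a dummy function standing in for the real evaluate function from generate.py:

```python
import gradio as gr

def evaluate(instruction: str, input_text: str = "") -> str:
    # Dummy stand-in; in generate.py this function runs the fine-tuned model.
    return f"(model response to: {instruction})"

gr.Interface(
    fn=evaluate,
    inputs=[gr.Textbox(label="Instruction"), gr.Textbox(label="Input (optional)")],
    outputs=gr.Textbox(label="Response"),
).launch(share=True)  # share=True creates a public URL, so the remote instance is reachable
```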
But on the other hand this also got me thinking: what if I had only translated 500,000 characters, which would have been free? Or what would have happened if I had used the GPT model for translating the data set — would that bring similar results? And since fine-tuning your model on a vast.ai GPU instance only costs three dollars, I thought: I want to know, I want to see what the impact of those two versions is, and that's why I tried it out. For this I first generated the data set that contains only 500,000 characters. How did I do it? I just loaded the data set containing the 12,000 tasks translated with DeepL and sampled 4,000 tasks from it — you can think of it like this: I translated about 1,500,000 characters, and divided by three that's 500,000, which would have been free to translate with DeepL, and one third of twelve thousand is four thousand. That's why I sampled this subset and saved it to a JSON file, and that's how I created the data set containing 500,000 characters. Then I moved on and adjusted the translation notebook to also support translating with the OpenAI GPT model. For this we first install the openai module, you insert your API key here, we will use the gpt-3.5-turbo model as already mentioned, and our target language will be German — but feel free to use whatever language you desire. The only thing I then needed to add is a counterpart to the DeepL translate-list function for OpenAI; it works the same way, you can just pass it the same list of texts. What's worth mentioning is that here we do a little bit of wrapping and mapping of the OpenAI response — you can see what one message looks like: we have the role "user" (theoretically you can adjust it, but I thought user would be the best role to use), so we have to define a role and then the content, which is our prompt: please provide the [target language, for example German] translation for these sentences. How did I come up with this prompt? I saw it in this paper: they first asked GPT what would be the best prompt to ask it to translate something, took the three candidates GPT recommended, and as you can see here they got the best results with the third prompt, which is the one I used. I did figure out, though, that in the data set there is sometimes just a single word in a string, not a whole sentence, and if you use this prompt and ask the GPT model to translate a "sentence" that is a single word, the model simply continues the word and makes a whole sentence out of it — which we don't want, because that's not really translation, that's basically making something up. Because of that I check whether there is a blank in the text, and if there isn't, I just ask the model to translate "the following word" instead. From here I just run this command one more time, which again uses the chunked approach. I have to say, though, that the OpenAI API is quite slow right now — it took me more than a day to actually translate the 12,000 tasks. I wanted to have the same number of OpenAI-translated tasks as DeepL-translated ones, to really compare how the data quality affects the fine-tuned model. And here are the three data sets I finally ended up with: this one is translated with DeepL but contains just 4,000 tasks, this is the first one I translated with DeepL, and here is the one translated with OpenAI. We can see this one is one third of that one, and it seems the OpenAI one is a little longer than the other one — but that's it for now.
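A rough sketch of that OpenAI counterpart, using the ChatCompletion endpoint that was current at the time — the prompt wording follows the paper, everything else (names, error handling) is simplified:

```python
import openai  # uses the pre-1.0 openai package with the ChatCompletion endpoint

openai.api_key = "YOUR_OPENAI_API_KEY"
TARGET_LANGUAGE = "German"

def translate_with_openai(text: str) -> str:
    # Single words would otherwise get "completed" into full sentences,
    # so use a different prompt when there is no blank in the text.
    if " " in text:
        prompt = f"Please provide the {TARGET_LANGUAGE} translation for these sentences: {text}"
    else:
        prompt = f"Please provide the {TARGET_LANGUAGE} translation for the following word: {text}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"].strip()
```

The list counterpart then simply applies this function to every translatable entry, exactly like the DeepL version sketched earlier.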
Then I fine-tuned the Alpaca model on those two data sets as well — for this I just repeated the fine-tuning process I already showed you, so I won't show it again. Once I had a fine-tuned version for all three data sets, I was thinking about how I could actually compare them, and for this I had a look at how the Stanford researchers did their evaluation of the model. As you can see here, they conducted a human evaluation on the inputs of the self-instruct evaluation set; this evaluation set was collected by the self-instruct authors and covers a diverse list of user-oriented instructions, including email writing, social media and productivity tools. We can find this evaluation set here — it's a list containing 252 tasks. Since I was a little too lazy to compare 252 tasks for three models, which is a lot of text, and since, as you can see, there are different motivation apps, I reduced this list to one task per motivation app instead of multiple tasks per motivation app, so I ended up with 71. And since that was still a bit much, I ended up investigating only the first 20 tasks. Here you can see the 71 evaluation tasks; I also translated them to German so I could use them — for this I used DeepL — and here you can see my evaluation results for the first twenty tasks of both of those lists. So I have one JSON file and one text file, which looks like this and is a bit more cleanly separated per task, and you can see the results for each model. To get those results I had to adjust the generate.py file a little, and let me quickly show you how. Here we see our generate.py file, and at the very end we can see that the Gradio app gets launched, which makes it easy to use a web interface to interact with the model. But in this case I didn't want to interact with the model, I just wanted to pass in 20 tasks and get the results, so I had to rewrite this a little. What I did is remove that whole part — as you can see, we are calling an evaluate function, this one here, to which we pass an instruction and optionally an input, which is perfect, because that's exactly what our evaluation tasks are. So what I ended up doing is replacing the whole Gradio app launch part with code that reads the evaluation tasks, in English or German respectively, depending on the model being evaluated. As you can see here, I just iterate over the data frame, and as I mentioned I only used the first 20 — but feel free to just remove that limit, like this, and then you will evaluate all 71 tasks, or even the whole 252 if you have some time to read. Then I pass the instruction and the input from the evaluation task file to the evaluate function, append the output to a list, convert it to a series, and finally save it in another JSON file, so that afterwards I can collect the JSON files of all the different models, merge them into one data frame and really analyze the different responses, the different outputs of all three models. For this I created a Medium article where I collected all the outputs for the different tasks in an easily readable way: as you can see, first there is the instruction in English, the input in English and the response of the Alpaca model, the English one, and then, separated by these three dots, the German instruction, the German input and the results of all three German models. I used this pattern for all the different tasks, and as you can see this produces a lot of text overall — that's also why I only investigated the first 20 tasks; feel free to evaluate even more in your case.
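The batch evaluation that replaces the Gradio launch looked roughly like this — it assumes it lives at the bottom of generate.py, where the evaluate function is already defined, and the file names are just illustrative:

```python
import pandas as pd

# Replaces the Gradio launch at the bottom of generate.py; `evaluate(instruction, input)`
# is the function already defined further up in that file. File names are illustrative.
tasks = pd.read_json("evaluation_tasks_de.json")

results = tasks.head(20).copy()  # drop .head(20) to evaluate all tasks
results["output"] = [
    evaluate(task["instruction"], task["input"]) for _, task in results.iterrows()
]
results.to_json("evaluation_results_de.json", orient="records", force_ascii=False)
```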
I don't want to go through all the results with you, I just want to give you the key takeaways I found from investigating those 20 tasks. One thing I found is that the OpenAI model sometimes still replies in English, as you can see here in task 13: the instruction is to find sentences from reliable sources such as The Guardian or Forbes that contain the exact match of the given sentence or phrase, and also to mention the source of each sentence, and the input is "There are great options" (in its German translation). Here we can see that the two models trained on DeepL-translated data produce German text, while the OpenAI model produces English text — and the quality of the result is also questionable. Maybe the words Forbes and Guardian trigger the model to reply in English, or maybe my heuristics didn't work all that well when using the GPT model for translation, so there might still be some English text in the training data — that's what I would assume, since the OpenAI model sometimes answers in English. Across all 20 evaluation tasks, but also while I was interacting with them, the models trained on data translated with DeepL never replied in English, so from that point of view the fine-tuning for a specific language definitely worked. Okay, and I'm very aware that you finally want to know which of the three models performs best, and I know this is quite disappointing, but for me there was no clear winner: there were tasks where the model trained on only 4,000 tasks performed best, there were tasks where the model fine-tuned on data translated with the GPT model performed best, and there were also tasks where the model fine-tuned on data translated with DeepL performed best. I also wish there had been a clear winner, but what does this mean for us? I actually think we need far fewer tasks than I initially assumed — it looks like even 4,000 tasks are enough, and since large language models have in-context learning capabilities, maybe it's even enough to use just 100 tasks, or even five. One of the insights of this video is that we can stay within the free plan of a translator, still translate a data set, and use that data set to fine-tune the Alpaca model to reply in our target language — I think that alone is very cool. So fine-tuning the Alpaca model for our target language basically just costs three dollars, which, one more time, is really cool. Finally, I would like to mention that, based on the evaluation tasks, the English model seems to perform slightly better overall than the German models, so expect the fine-tuned Alpaca model in your target language to be slightly worse than the English one. I think this also makes sense, because we saw how the LLaMA model was actually trained and that its data distribution primarily contains English text, so my guess would have been anyway that the English model would be a little better than the fine-tuned versions in other languages. But from the results I saw on the evaluation tasks, I would say the German models were only slightly worse, not significantly worse, than the English one. And that's it for today's video — I hope it helps you fine-tune the Alpaca model for your desired language, and in case you use fewer tasks for fine-tuning the model in your target language, let me know in the comments how that works. As always, I would appreciate it if you gave this video a thumbs up if you liked it and subscribed to my channel. Until then, have a great time and see you in the next video. Bye bye!
Info
Channel: Martin Thissen
Views: 22,554
Id: yTROqe8T_eA
Length: 37min 55sec (2275 seconds)
Published: Mon Mar 27 2023