Create Training Data for Finetuning LLMs

Captions
Hello and welcome to a new video. This is Mo from APC Mastery Path, and I'm glad to be sharing with you how you can prepare your data for fine-tuning large language models using Unsloth, together with the marker-pdf library. In this video we're going to showcase the main concept and how to make it work, then jump into the marker-pdf package and take a deeper dive into it, and lastly we'll briefly show how you can use the resulting training data to fine-tune your desired large language model using Unsloth. If you are new to our channel, do not forget to like this video, subscribe, and press the bell icon to get notified about all the new tech videos we provide. Now, without further ado, let's get started.

The main concept is to take text files, extract the text from them, and save it in a format that is understood by large language models, so that it becomes easy to fine-tune them. The example in this video uses a PDF document: I'll use a Python package called marker-pdf to extract the text from it, then we'll analyse the structure of the extracted text and use Python to convert it into a format that large language models understand. The final output is the fine-tuning dataset that will be used with Unsloth to fine-tune our desired large language model.

Having understood the main concept of the solution, it's now time to dig deeper into actually building it. We're going to use a package called marker-pdf; the repository is on GitHub, and I'll put the link in the description below the video. When you open it you're greeted with a page that includes all of the folders and files that make up the package, a brief description of the package, its limitations, how to make it work, accuracy measurements, how to install it, and other bits and pieces. We'll dig deeper into that in a bit.

The first thing I did was create a project folder; inside it is a data folder containing the PDF files I want to convert. I went to VS Code, clicked File, then Open Folder, and opened that folder; as you can see on the left, the data folder includes all of the files. It's now time to create a virtual environment. You can call it whatever you please; I named mine after the project. To do that, go to Terminal, then New Terminal; to make it easier on the eye I clicked on PowerShell, moved the terminal into the editor area, and dragged it to the right. Then type the command to build the virtual environment: python -m venv followed by the name of the environment. Once you run that, it creates a folder with that name, with files and folders beneath it. To activate the environment, type its name, then a backslash, then Scripts (which is that folder there), then another backslash, then activate. Press Enter and you'll see that the active environment is now the one you just created.
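For reference, here is a minimal sketch of that same environment-creation step done with Python's built-in venv module rather than by typing the command in the terminal; the environment name used here is only an illustrative choice, not the exact name from the video.

```python
# Create a virtual environment programmatically; equivalent to "python -m venv parse_pdfs".
# The name "parse_pdfs" is illustrative.
import venv

venv.create("parse_pdfs", with_pip=True)

# Activation is still done from the shell:
#   Windows:      parse_pdfs\Scripts\activate
#   Linux/macOS:  source parse_pdfs/bin/activate
```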
You can also do this through conda if you have Anaconda installed. In my case I do have Anaconda, but I prefer to create the environment using the venv method. If you're using Anaconda, it would be conda create, then the name of the environment, and you can also specify the preferred Python version you want, whether 3.10, 3.11 or the like. You can check the Python version installed on your system with python --version; for me it's 3.11.5, which is the same version allocated to the new environment.

You now need to install the main package, the hero of this story, which you can install with pip install marker-pdf. I've already installed it on my machine. I have noticed that by default it installs torch with CPU support only; in my case I have an NVIDIA GPU, so I'd like to have CUDA support. To do that, I went to the pytorch.org website and scrolled to the bottom, where you can choose the stable version, your OS (Windows in my case), pip as the installer, Python as the language, and CUDA 12.1. Copy the generated command with Ctrl+C, paste it into your terminal with Ctrl+V, and it will take some time because it downloads about 3 GB of data.

Having installed marker-pdf, it's now time to explore its capabilities. marker-pdf can convert either a single file or multiple files; let's understand the commands together. To convert a single file, you type marker_single, then the path of the PDF file, then the path of the output folder where the converted, extracted text will be saved. There's a batch multiplier argument (the more you dedicate, the more VRAM is required), an argument for the maximum number of pages to convert if you have that kind of limitation, and an argument listing the languages that are supported. In my case I kept it quite simple: marker_single, then the file (file 6.pdf) inside the data folder, then converted_pdfs as the output folder for the converted text.
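As a hedged sketch of that single-file conversion, here is roughly what the call looks like when driven from Python instead of the terminal; the paths are illustrative versions of the ones used in the video, and the optional flags (batch multiplier, maximum pages, languages) are described in the project README rather than shown here.

```python
# Run marker-pdf's marker_single console script on one PDF.
# "data/file_6.pdf" and "converted_pdfs" are illustrative paths.
import subprocess

subprocess.run(
    ["marker_single", "data/file_6.pdf", "converted_pdfs"],
    check=True,  # raise CalledProcessError if the conversion fails
)
```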
That's basically it. I pressed Enter, and I'll show you in a bit what happened. As you can see on the left there's a converted_pdfs folder; when I open it there's a folder for file 6, and it also shows a number of images that were extracted from the document. That's because of the nature of my PDF, so let me show you the original file. When you click on file 6.pdf, it's my submission for the RICS APC. As you can see there's a watermark, and that watermark was understood to be some sort of image; I also have some notes that I had put in the file for studying and the like. The structure of the file is that there are three categories of competencies (mandatory, optional and core), and below each category there are the individual competencies, for instance Diversity and Inclusion, Ethics, and so on. Beneath each competency there is a level: level one is knowing, level two is doing, and level three is advising. Below each level sits the summary of experience for that specific level. I'm explaining this because I'll use it later on, when we explore the code I drafted to convert the extracted text into a format understood by large language models.

Having understood the structure of the PDF, let's explore the converted text produced by the command we ran. If I go to converted_pdfs and then file 6, there's a file called "my submission" with the extension .md, which is Markdown. If you click on it, you'll find the file is a bit of a mess, but there is data that's easily extractable from it: the mandatory competencies, then the name of the competency, then the level, then the summary of experience. If you scroll through, you'll also find references to some images; in my case there aren't really any images, it's just the watermark. I did some basic cleansing to make the Markdown file easier on the eye: mandatory competencies, then the competency name, just pressing Enter here and there, nothing major. It took me a minute or two to do the whole file, and now it's in very good shape. I'll do further cleansing on it, which I'll showcase in a bit.

As we saw in the documentation (let me go back to it), there's also conversion of multiple files. It works in much the same way, with just a few differences, so let's look at the command. It's marker (because that's the name of the package), then the path to the input folder, then the path to the output folder. The workers argument is how many PDFs are converted at the same time, max is the maximum number of PDFs to convert, there's an option to save the metadata related to the PDFs as a JSON file, and a minimum-length option sets the minimum number of characters a document must contain for it to be converted. In my case I kept it quite simple: just marker, then the path to the input folder, then the path to the output folder; for me that was data and converted_pdfs, and that's all there is to it really.
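And as a similar hedged sketch, the batch conversion just described can be started the same way; the folder names are the illustrative ones from the walkthrough, and the optional flags (workers, maximum file count, metadata file, minimum length) are left at their defaults.

```python
# Batch-convert every PDF in the "data" folder with marker-pdf's "marker" console script.
import subprocess

subprocess.run(
    ["marker", "data", "converted_pdfs"],
    check=True,
)
```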
So I'm going to focus more on the conversion of a single file, but converting a single file works the same way as converting multiple files, so that isn't going to be a problem. Going back to the partially cleansed Markdown file, I created some Python code for it. First of all I imported the regular expression module (re), because that helps me delete some characters, split the text, and the like. I also imported json, because our final output is going to be in JSON format, or rather JSON Lines format. tkinter was optional; I prefer to make my code a little more flexible, and tkinter gives the end user a dialog box to choose files and perform actions from it. In my case I created a function to choose multiple input files to be converted into JSON, with the results saved in one output file in whichever location the user wants; that's why I imported tkinter. Then I did some cleansing, because phrases like "Draft" and "Not for assessment" were scattered all over the Markdown file.

Then I had the idea of using the architecture of the file. There are three levels I'm going to focus on: the first is the category of the competency, the second is the name of the competency, and the third is the actual level I want to write about. That's what I did in the next portion of the code: I created an array with all of the text and began separating it into levels one, two and three, where level one is the competency group or category, level two is the name of the competency, and level three is each level within the competency.

In the next portion of the code I had a template of the conversation in mind, and I wanted to make it as lively as possible. The human asks the bot to help them write about a specific competency; the bot answers that it can help and asks which level they are seeking; if the person isn't knowledgeable enough, they ask how many levels there are for this competency; the bot answers that, based on the data it possesses, the latest level for this competency is X; the person then asks for help with whichever level they need, one, two or three, and the bot provides the final answer required by the end user. Lastly, it's all about collating everything together in one big function that calls all of the other functions. Again, this code isn't going to be relevant to every single person; it was quite relevant in my case because of the structure of the data I had. I'll show you the end result now: I saved it, and it gave me a JSONL file that is quite decent, to say the least. The converted file is in JSON Lines format and is composed of 37 entries for this specific submission.
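The full script isn't shown in the captions, but here is a minimal, hypothetical sketch of the idea described above: regex cleanup, a conversation template, and JSON Lines output. The field names, regex patterns and file paths are illustrative only and are not taken from the video's actual code.

```python
# Hypothetical sketch: turn a cleaned Markdown submission into conversation-style JSONL.
import json
import re


def clean_markdown(text: str) -> str:
    # Strip watermark/draft phrases and leftover Markdown image references.
    text = re.sub(r"(?i)draft|not for assessment", "", text)
    text = re.sub(r"!\[.*?\]\(.*?\)", "", text)
    return text


def build_record(category: str, competency: str, level: int, summary: str) -> dict:
    # One multi-turn conversation per competency level, mirroring the template
    # described in the video (human asks, bot clarifies the level, bot answers).
    return {
        "conversations": [
            {"from": "human", "value": f"I want to write about the {competency} competency ({category})."},
            {"from": "bot", "value": "Sure, I can help with that. Which level are you seeking?"},
            {"from": "human", "value": f"Could you help me write level {level}?"},
            {"from": "bot", "value": f"Here's a proposed response for level {level}:\n{summary}"},
        ]
    }


def write_jsonl(records: list[dict], out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    sample = "## Ethics\n### Level 3\nDRAFT Not for assessment\nSummary of experience goes here."
    cleaned = clean_markdown(sample)
    # In the real script, `cleaned` would be split into (category, competency, level,
    # summary) tuples based on the heading structure described above.
    records = [build_record("Mandatory", "Ethics", 3, "Summary of experience goes here.")]
    write_jsonl(records, "apc_submission.jsonl")
```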
So, in the example entries, the human says they want to write about a specific competency, the bot answers, the bot says that based on the information it possesses the latest level is level one, the human replies "yes, could you help me write level one?", and the bot goes ahead. For me, the test was in an area like Ethics, where I have three levels, so could I make it work across three levels? It worked: in this competency the latest level is level three, and I have three examples, one for each level. Here's a proposed response for level one, here's a proposed response for level two, and here's a proposed response for level three, and so on and so forth. The end result is really good, as you can see, converted back into JSON Lines format, and I have 37 entries for the whole of my submission. It took me some time, but RICS APC submissions are quite systematic, so I'm not going to develop that code again; I'm literally going to run the same code for each person who has a RICS APC submission.

Having extracted and converted the data into JSON Lines format, it's now time to use this data to fine-tune your desired model using Unsloth. I have previously made a detailed video about how to use Unsloth locally on your machine to fine-tune models on customised datasets, so here I'll show only briefly how to do it. Unsloth allows you to fine-tune a variety of large language models, including Llama, Mistral, Phi and Qwen, and it's much faster than the older methods. You need a few things in place. First, you need Windows Subsystem for Linux, because Unsloth does not run on Windows directly, so you have to have Linux or WSL; you can open PowerShell and type wsl --install, and it will install Windows Subsystem for Linux version 2. The second thing you need is Anaconda, so that you can run all of the code in a conda environment; go to the Anaconda site (link in the description below), download whichever version you please, and install it using the commands I'm showing here, making sure you replace the word "version" with the actual version you downloaded, which in my case was 2024.02. The third thing you need is Unsloth itself. I followed the installation procedure provided by the Unsloth team on their GitHub repository: you create an environment called unsloth_env with Python 3.10, install packages such as PyTorch with CUDA enabled, the CUDA toolkit and xformers, activate the environment, and then install a number of packages from the Unsloth repository along with huggingface_hub, PEFT (parameter-efficient fine-tuning), accelerate, bitsandbytes, and Weights & Biases. I found that some problems can arise from this approach, so I downloaded torch with CUDA enabled from pytorch.org, as shown at the beginning of this video, and I also had to install the NVIDIA CUDA compiler (the NVIDIA toolkit). I searched online (link in the description below), and basically you copy those commands into your environment and it runs like a charm. Now let's run through the code quickly, because I've already explained it thoroughly in my other video, which I'll also link in the description below.
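As a small sanity check that the environment described above is ready (this check is not from the video, just a hedged convenience), you can try importing the key packages from inside it:

```python
# Verify that the main fine-tuning dependencies import cleanly inside the environment.
import importlib

for pkg in ["torch", "xformers", "transformers", "datasets",
            "peft", "accelerate", "bitsandbytes", "wandb", "unsloth"]:
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as exc:
        print(f"{pkg}: missing ({exc})")
```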
We import some necessary packages such as torch, the supervised fine-tuning (SFT) trainer, transformers and datasets, and we also import the main Unsloth package. The main thing that differs from my previous video is the format of the dataset. There's a function called load_dataset which loads datasets in a format understood by large language models: instead of the CSV file path I used before, it's now the JSONL file, so in load_dataset the "csv" becomes "json" and the data files argument points to the JSONL file path. You'll also want some sort of training and testing split, so instead of the delimiter argument there's a split argument, set to "train". The remaining part of my old code isn't used here; it was written for a CSV file that wasn't in good shape, whereas the JSONL is already in very good shape.

The next step is importing your desired model. Unsloth has a very rich repository on the Hugging Face website containing a lot of models, both instruct and base. In my case I imported the Unsloth Mistral instruct model, version 0.3; there are other options such as Llama, Qwen and Phi, in both base and instruct variants, and you can import whichever you please. Having done that, I wrote a small function to test the model before fine-tuning, so that I can compare the performance pre and post fine-tuning. The function is quite simple: tokenize the inputs, generate and decode the outputs, print the decoded outputs, then ask the model a very simple question and look at the response before fine-tuning.

Moving on, you have to initialise the low-rank adapters (LoRA) by targeting specific modules, and you'll have to play around a bit with the parameters according to your use case. To log the run you need a Weights & Biases key: go to wandb.ai, create an account, get a key, and paste it in after you log in. Having done that, you can adjust the training parameters to suit your needs. Here I pass the model, the training dataset (the dataset defined above, split by "train", whose text key is "text"), the tokenizer (the Mistral one), and the maximum sequence length of 2048 that was defined at the beginning of the code. Then you can adjust the training arguments, such as the maximum number of training steps, the warm-up steps the model needs to get into shape, which optimizer to use, and how often to log, which matters for how many checkpoints you want, if that's of any importance to you. Having done that, it's time to train: trainer.train(), and then you leave it for a while; it will take anywhere between 30 minutes and an hour, depending on the dataset and on your infrastructure, the GPU and the like. Then it's time to test the fine-tuned model: using the text-generation function we created above, we ask the same question to the fine-tuned model, and we compare and contrast the performance before and after fine-tuning.
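Here is a rough sketch of that flow, based on the standard Unsloth plus TRL recipe rather than the video's exact code; the JSONL path, model name and hyperparameters are illustrative and may differ from what is used on screen.

```python
# Rough sketch of the fine-tuning flow described above (illustrative values throughout).
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

max_seq_length = 2048

# Key change from a CSV workflow: use the "json" loader and point data_files
# at the JSON Lines file produced earlier, split by "train".
dataset = load_dataset("json", data_files="apc_submission.jsonl", split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.3-bnb-4bit",  # example Unsloth repo name
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach LoRA adapters to the usual attention/MLP projection modules.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",        # the dataset key mentioned in the video
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        warmup_steps=5,
        max_steps=60,
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)

trainer.train()
```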
Lastly, it's time to save the model, whether locally or on the Hugging Face website. If you want to save it locally, you can use model.save_pretrained, and that saves the LoRA adapters only. If you want to save the LoRA adapters merged on top of the fine-tuned model, you have to use save_pretrained_merged, passing the folder where you want to save it and making sure the save method is merged 16-bit. If you want to save your model on the Hugging Face website to be used elsewhere, the process is quite straightforward: you log in (you have to have an access token with write privileges), you identify the directory for saving the LoRA adapters, the model, or both, and then you push the model to the Hub. push_to_hub pushes only the LoRA adapters; if you want to push the fine-tuned model itself, you have to use push_to_hub_merged, again identifying the directory and merged 16-bit as the save method, and making sure you have an access token generated with write privileges. A short sketch of these saving calls follows at the end of this transcript.

That's all for today's video. Thanks a lot for watching; I hope you found it useful and insightful. Do not forget to watch our previous videos about fine-tuning large language models and creating customised chatbots using custom datasets, and do not forget to visit our website, www.apcmasterypath.co.uk. At APC Mastery Path we provide bespoke mentoring and teaching packages to RICS APC candidates. Thanks a lot for watching, catch you in the next one. Cheers, guys.
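As referenced above, here is a hedged sketch of those saving and pushing calls, continuing from the training sketch earlier (so the model and tokenizer objects are assumed to exist); the repo IDs, folder names and token are placeholders, not values from the video.

```python
# Save only the LoRA adapters locally.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Save the adapters merged into the base model as 16-bit weights.
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")

# Push to the Hugging Face Hub (requires an access token with write permissions).
model.push_to_hub("your-username/apc-mistral-lora", token="hf_...")  # LoRA adapters only
model.push_to_hub_merged("your-username/apc-mistral-merged", tokenizer,
                         save_method="merged_16bit", token="hf_...")  # merged model
```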
Info
Channel: APC Mastery Path
Views: 181
Id: v2GniOB2D_U
Length: 22min 28sec (1348 seconds)
Published: Tue Jul 02 2024