Fine-tuning LLM with QLoRA on Single GPU: Training Falcon-7b on ChatBot Support FAQ Dataset

Captions
In this video, you're going to learn how to fine-tune a version of the best open large language model, Falcon, on a custom dataset that contains about 80 examples from a chatbot FAQ dataset. Here you can see the pairs that we're going to use to train the model, and we're going to evaluate the performance of the model before and after the training.

Hey everyone, my name is Venelin, and in this video we're going to answer the question: can you fine-tune a large language model on a single GPU? It turns out that you can, and we're going to use the QLoRA technique on a custom dataset to fine-tune Falcon-7B. In our case, we're going to use a chat support FAQ bot: we're going to take some questions and answers, use them to create a dataset of about 80 examples, and fine-tune the Falcon-7B model. Then we're going to have a look at the outputs of the model and compare them before and after the training. Is this model going to be good? Let's find out.

There is a complete text and source code tutorial available for MLExpert Pro subscribers, and I'm going to link the tutorial in the comments, so if you want to check it out, please go and do that. There you can find a detailed explanation of the project and what we are doing, along with all of the source code that you need in order to replicate the results. You can also find a link to a Google Colab notebook in which you can replicate the results precisely. So if you want, please consider subscribing to MLExpert Pro. Thanks!

The Falcon large language model is provided by the Technology Innovation Institute, an organization based in Abu Dhabi, and it is an open-source large language model available for both research and commercial use, licensed under Apache 2.0. They are releasing two models: one is a 7-billion-parameter model and the other is a 40-billion-parameter model, and here you can find some details about the training. Currently there is no paper available with a detailed explanation of the methods, the dataset, etc., but they do share some information: in the training section they say they used 384 GPUs on AWS for two months to train the 40-billion-parameter model, and they talk about some of the pre-training data they used, the Common Crawl dataset, along with custom pipelines for deduplication and pre-processing. This is pretty much what they're providing thus far.

The 7-billion and 40-billion-parameter models are available on the Open LLM Leaderboard from Hugging Face, and you can see that currently the Falcon-40B instruct model is holding the first spot. So a model that is open source and available for commercial use is in first place on this leaderboard. The 7-billion-parameter model is also very strong; let me actually see where it is. Yeah, here is the Falcon 7-billion-parameter model. You can see that it's actually not first among the other 7-billion flavors, but it's still a very strong model, and here is the instruct version. Again, these models are available as open source, and you can find them on Hugging Face. This is the 7-billion-parameter model that we're actually going to fine-tune in this video, and the model card has a very simple tutorial on how you can load the model weights, initialize a tokenizer with a pipeline, and then do some text generation using the 7-billion-parameter model.
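That quickstart looks roughly like this; a minimal sketch along the lines of the model card's example (the prompt text here is just an illustrative placeholder):

```python
import torch
from transformers import AutoTokenizer, pipeline

model_name = "tiiuae/falcon-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Falcon ships its own modeling code
    device_map="auto",
)

sequences = generator(
    "Write a short poem about open-source AI.",  # placeholder prompt
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(seq["generated_text"])
```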
I have a Google Colab notebook that is already running, and you can see that I'm using a Tesla T4, which is available in the free tier of Google Colab, but I'm also using a high-RAM instance, since I'm going to need that RAM in order to fine-tune the model using QLoRA. So you'll probably also need Google Colab Pro in order to fine-tune this.

Next, I'm installing all the required libraries. In this case, I'm installing bitsandbytes (the latest version), PyTorch 2.0.1, and then the latest versions available on GitHub of the Transformers and Accelerate libraries by Hugging Face; these are still unreleased at this time, so I'm going to use these checkpoints from git. Next we have the datasets library, from Hugging Face again, then loralib, which essentially provides the implementation for LoRA, and then einops, the Einstein operations library. These are the imports that we're going to need; there are a lot of them, but we are going to need them throughout this tutorial.

Let's have a look at the dataset. The dataset is available on Kaggle, it's called the Ecommerce FAQ Chatbot Dataset, and it's provided by Muhammad Mahmud, if I'm pronouncing the name correctly, of course. The dataset is essentially 79 pairs of questions and answers, which are in this format: question, "How can I create an account?"; answer, "To create an account, click on the 'Sign Up' button", etc. So we have 79 of those, and this is the dataset that we're going to fine-tune on in this example. Of course, if you have more data, feel free to use it, since it's probably going to produce much better results.

Let's get back to the notebook. This is how I'm downloading the data (it's available on my Google Drive), and then we can open the JSON file. Let's have a look at the first question: "How can I create an account?", exactly what we've seen on Kaggle, and here are some more of the questions: "What payment methods do you accept?", "How can I track my order?", "What is your return policy?". You can see that the file is about 20 kilobytes, which is very small; again, just 79 examples. Next, I'm going to extract the JSON structure and reduce the nesting levels: originally each question-answer pair was wrapped in an additional level, essentially another JSON object, so I flattened that level in order to make this a bit easier to work with, and if I open the result, you'll see that we now have the question and answer at the root level. This is again a sample of the questions and answers as a pandas DataFrame.

Most of the fine-tuning code for Falcon was actually written by Daniel Furman, so I'm going to give him a shout-out. He has a complete Google Colab notebook for another dataset that is fine-tuned with bitsandbytes, but as far as I know he's using just LoRA, while we are going to use QLoRA to do the training. So thanks to him, we have code that is already working for fine-tuning Falcon-7B with LoRA.

In our case, we're going to use the BitsAndBytesConfig, and we're going to use Falcon-7B. You might want to try Falcon-7B-Instruct if you're going to build an actual chatbot (it might be a better starting point for you), or even the 40-billion-parameter models, of course. The first thing that we are doing here is to load the model in 4-bit. Then we're using double quantization, which is again provided by QLoRA; if you don't know about QLoRA, I'm going to leave a link in the description of this video to my QLoRA video, where I go over LoRA and QLoRA and what they are. Then we have the NormalFloat (NF4) 4-bit format, which is again introduced by QLoRA, and then we are computing in 16-bit precision.
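Put together, the quantized loading looks roughly like this; a sketch, where bfloat16 as the compute dtype is my reading of "16-bit precision":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "tiiuae/falcon-7b"

# 4-bit NF4 quantization with double quantization, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # pad with the end-of-sequence token
```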
Loading the model itself is quite simple and quite standard: we are just passing in the name, and then we're passing in the quantization config, the one that we created so far, and then we are creating a tokenizer. I'm going to run this, since we're going to have to download the model, of course. Then we are loading the tokenizer, which is again provided by Falcon-7B, and I'm setting the padding token to the end-of-sequence token as well. So this is how you load the model.

To continue, we're going to have a look at print_trainable_parameters. This is essentially a way to print the number of parameters that we are actually going to train using the QLoRA technique, since QLoRA (again, as a quick review) is just freezing the large language model that we are going to use and then fine-tuning, or training, just the two low-rank matrices that LoRA adds alongside the frozen weights.

Next, I'm going to continue with gradient_checkpointing_enable. This is essentially a trade-off between GPU memory usage and efficiency, and if you don't know about gradient checkpointing, you can read up on it online. Then we are using the peft library's preparation for k-bit training; in our case we're going to train in 4-bit, so this is essentially an adapter, or a wrapper, around the model for 4-bit training.

For the LoRA configuration, I'm going to go back to the blog post that is available on Hugging Face. This is a way to visualize how the LoRA technique works: on the left we have the pre-trained model, which has its weights frozen, and next to it is the LoRA part, where we have two matrices, A and B, with a rank parameter determining their inner dimension. In the LoraConfig we are providing the alpha value, and we're also providing the rank of the matrices. You can try to reduce the rank and see if you get even better results, or if your training is actually faster. Let's execute all of this. The get_peft_model call is actually applying the LoRA configuration on top of our model, so it's doing the wrapping for us, and you can see that we are only going to train about 0.13 percent of the available parameters.
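In code, the whole LoRA setup is roughly the following sketch; the rank, alpha, and dropout values are illustrative assumptions, and `query_key_value` is Falcon's fused attention projection:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model.gradient_checkpointing_enable()           # trade compute for GPU memory
model = prepare_model_for_kbit_training(model)  # wrap the 4-bit model for training

lora_config = LoraConfig(
    r=16,               # rank of the update matrices (assumed value)
    lora_alpha=32,      # scaling factor (assumed value)
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    lora_dropout=0.05,  # assumed value
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the trainable share, e.g. ~0.13%
```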
Okay, so let's try to do an inference before we train the model. This is the format that we're going to use for the prompt: "human:", then the query (the actual prompt), and then "assistant:", which is going to be the reply from the model. Next, I'm going to continue with the generation config: I'm going to take the generation config from the model and apply some settings, such as max new tokens, a temperature, and top-p for the sampling; we want to return just a single sequence, and then we're passing in the padding and EOS token IDs from the tokenizer.

Then comes the actual generation using the model. First, we're using the tokenizer to tokenize the prompt; we're returning PyTorch tensors, and I want the resulting tensors to be put on the CUDA device. I'm going to call model.generate within an inference_mode block from torch; I'm not sure if this is actually helping here, but hey, if it is, you can try it out. The generate method takes the inputs of the encoding, and I'm passing in the input IDs and the attention mask (this is essentially how every tokenizer from Hugging Face works), and then the generation config, the one we set up right here. You can see that the model takes over a minute to generate the response. We'll see how much time it takes to actually train the model, but I've seen that the generate method takes a while to produce its results. Now the generation is complete, and you can see it actually took almost four minutes. Here is the human prompt and the assistant response, and you can see that the agent actually goes into an infinite loop, asking for an email and a password over and over. Let's see what happens after we fine-tune the model.

Next, I'm going to prepare the dataset, and to do that we're going to use the load_dataset function from the datasets library. This is going to take the JSON file and convert it into features; we have question and answer as the two text features, and 79 rows, as we've already seen. Again, this is an example of a question and an answer, and here is the processing that we're going to do over this dataset. This is the prompt format that we're going to use (human and assistant), and for each example I'm going to take the question and the answer, build the prompt from them, and tokenize it, with padding and truncation left at their default values. I'm going to shuffle the training data, just in case there is something that actually depends on the order, and then I'm going to call generate_and_tokenize_prompt for each example using the map function. This should be fairly quick, and here is the output of all of this: you can see that we've added input IDs, token type IDs, and an attention mask to each example.
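A sketch of that preparation, reusing the tokenizer from before (the file name and the exact prompt template here are assumptions based on what's shown on screen):

```python
from datasets import load_dataset

# "dataset.json" is a placeholder name for the flattened FAQ file.
data = load_dataset("json", data_files="dataset.json")

def generate_prompt(example):
    return f"""
<human>: {example["question"]}
<assistant>: {example["answer"]}
""".strip()

def generate_and_tokenize_prompt(example):
    full_prompt = generate_prompt(example)
    # Padding and truncation are left at simple defaults here.
    return tokenizer(full_prompt, padding=True, truncation=True)

data = data["train"].shuffle().map(generate_and_tokenize_prompt)
```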
Next, we're going to start the training itself. I'm going to set an output directory, into which we're going to write the data produced during the training, and then I'm going to load TensorBoard and point it at this experiments directory. Here is the training; I'm going to just start it, since we'll have to wait a bit for it to run.

So what are the parameters that we're going to use? Some of the most important ones: we have a batch size of one (you might want to increase this if you have a larger GPU, or if you use a lower rank in your LoRA config); we're training for just one epoch; this is the learning rate, which is a value commonly used for fine-tuning large language models; and then we have a limit on the checkpoints, in this case three. Note that we're also training in 16-bit floating-point precision, at least according to the trainer. I'm passing in the output directory, and, just in case, I'm also passing in the max steps based on the number of samples that we have within our dataset. And here is something really interesting: we're using the paged AdamW optimizer in 8-bit, which is something new provided by the Hugging Face Transformers library (that's why we need the latest and greatest version). Again, this is a technique that comes from QLoRA, and it's integrated very nicely thanks to the Transformers library from Hugging Face; great job, you guys. Next, we're using a scheduler for the learning rate, which is going to follow a cosine function, and we have a small warm-up ratio, during which the learning rate ramps up over the first couple of examples.

These are the training arguments; there are a lot of them. The Trainer then takes just the model, the dataset that we have, the arguments that we've created here, and a collator, which is essentially a way to merge the training examples into batches: the DataCollatorForLanguageModeling. Essentially, we're going to predict the next token; this requires the tokenizer, and I'm not going to do any masked language modeling, which is why we're setting mlm to False. Next, it's recommended not to use the cache during the training, at least so it won't output any warnings while training, so this is why we set use_cache to False, and then we're just calling the train method.
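Putting the training pieces together, a sketch along these lines (the learning rate, accumulation steps, and warm-up ratio are assumed values; `paged_adamw_8bit` is the Transformers name for the paged 8-bit AdamW):

```python
import transformers

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # assumed; gradient accumulation is mentioned below
    num_train_epochs=1,
    learning_rate=2e-4,             # assumed; a common LLM fine-tuning value
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    max_steps=80,                   # roughly the number of training examples
    optim="paged_adamw_8bit",       # the paged 8-bit AdamW from QLoRA
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,              # assumed small warm-up
    output_dir="experiments",
    report_to="tensorboard",
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence warnings during training; re-enable for inference
trainer.train()
```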
If I go to TensorBoard, we already have some data right here, and this is essentially what we have: you can see that the learning rate has started to decrease already. Let's just zoom out; I'm going to remove the smoothing, and you can see that the loss is nicely decreasing. This is actually a live training for you guys. You can see that the epoch count is somewhat skewed, and I believe this is due to the gradient accumulation steps, but that doesn't matter; you can see that the loss is getting lower, so this is converging pretty nicely. And these are the metrics that I got while training the model: this is from my original run, on the model that is available on the Hugging Face Hub, and you can see that we have a very nice convergence of the whole training process, and the cosine scheduler has nicely decayed to a learning rate very close to zero. The training loss has converged very nicely; even if I remove the smoothing here, you can see that we did a great job, and this is with just 79 examples. It took roughly seven and a half minutes to train this.

After you finish the training, you can do two very common things: the first is to save the model using save_pretrained on your local device, or you can push it to the Hugging Face Hub. Here I'm using my account to push the model, so this is the model that is available on Hugging Face, and you can see that the model file is actually just 19 megabytes, which is just the LoRA adapter, and then we have this JSON config file. You can see that all the settings are provided within the config: we have the LoRA alpha, the rank of the matrices, and the target modules that we fine-tuned, also the task, and a link, or path, to the original model, the large language model that we actually fine-tuned.

In order to load this, I'm going to specify my repository and run it. What this does is fetch the PeftConfig, so it's actually going to load the bin and JSON files from the adapter repository. Then we are going to use the base model name, so this is going to load Falcon-7B, and I want the same quantization config as well. Then there's the device map (we're going to place this model onto our GPU device); I'm also going to trust the remote code for this one, since we're in a Google Colab notebook; and I want this to return a dictionary as a response. I'm also going to load the tokenizer, again from the base model, so in this case from the Falcon-7B model. Then, from all of this, I'm going to take the model, get the QLoRA adapter files, and apply the adapter on top of the model again. This is essentially going to load both the large language model, in this case Falcon-7B, and the QLoRA adapter that we actually fine-tuned on our dataset. This still takes some time, since you have to load the large language model onto the GPU, but once that is done, we apply the LoRA adapter; you can see that we're downloading it, and then the model is ready.

I'm going to use exactly the same generation config that we had before, and I'm going to specify a CUDA device. This is the question that I'm going to ask, pretty much the same thing we did at inference time before the training, and I'm going to run this. So this is the prompt, "How can I create an account?", and we're running it again through the tokenizer, within inference mode, and then I'm calling the generate method on the model with the input IDs, attention mask, and the generation config. Finally, I'm decoding the response using tokenizer.decode, and I'm skipping the special tokens, since we don't want to look at those.
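A sketch of loading the adapter and generating, end to end (the repository id is a hypothetical placeholder, and the sampling values are assumptions):

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

PEFT_MODEL = "your-account/falcon-7b-qlora-chat-support-bot-faq"  # hypothetical repo id

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,  # points back to tiiuae/falcon-7b
    return_dict=True,
    quantization_config=bnb_config,  # the same 4-bit config used for training
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL)  # apply the fine-tuned adapter

generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7  # assumed sampling values
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

prompt = """
<human>: How can I create an account?
<assistant>:
""".strip()

encoding = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```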
You can see that the inference is quite a bit slower, or at least the generate method is quite a bit slower, compared to what we had during the training; recall that during training we're just calling the forward method. One hypothesis here is that when you're generating, you're doing many forward passes, and then a lot of searching and sampling over the tokens, since you're probably using beam search or some other type of search, so it takes some time to generate the tokens. This time it took roughly a minute, almost a minute.

So this is the question, and then we have the response. Recall that originally the response to this one was a loop: enter your name, enter your email, then password, over and over again; that is what we got before fine-tuning the model. Now, after just a single pass over the data, you can see that the response is: "To create an account, please visit our sign-up page and enter your email address. Once you have completed the registration process, you will receive a confirmation email," etc. Let's see the original response from the dataset: "To create an account, click on the 'Sign Up' button" at the top. So the response here is a bit different: "To create an account, please sign up or visit our sign-up page and enter your email address." The response is somewhat different.

Let's see what responses we get for other queries. I'm essentially wrapping all of this logic within a function, and I'm going to run through these three. The first question is "Can I return a product if it was a clearance or final sale item?"; then I'm going to reword this as "What happens when I return a clearance item?"; and the last one is "How do I know when I'll receive my order?". These are similar to some of the questions we had in the dataset: "How can I track my order?", "What is your return policy?". Based on those, let's have a look at the original answers: "You can track your order by logging into your account," etc., and "Our return policy allows you to return products within 30 days... in their original condition and packaging." Okay, so these are some of the answers from the original dataset.

Let's see what we get here. "Can I return a product if it was a clearance or final sale item?": "Clearance and final sale items are typically non-returnable and non-refundable. Please review the product description or contact our customer support team... we'll be happy to assist you with the process." Okay, so a very good response; yeah, very good. Then let's see what it does with the rewording of this question, and the final prompt we'll see in a bit; you can see that it actually takes a lot of time to generate the responses. "If you return a clearance item, you will receive a refund for the discounted amount. Please note that clearance items are final sale and cannot be returned after the return deadline... If you have questions about our returns policy, please contact our customer support." Okay, so every time it outputs that you'll have to contact the support team if you have any questions, which is a very valid response, at least from what I've seen from support teams. So this is again a very good response, and I would say it understands very well what questions and responses the dataset contains.

Okay, so "How do I know when I'll receive my order?": "Once your order is placed, you will receive a confirmation email with tracking information. Please allow up to 24 hours for the tracking information to become available," and then, if you don't receive this information, contact the support team. Let's see what we had in the dataset: "You can track your order by logging into your account and navigating to the order history; there you will find tracking information for your shipment." Okay, so again a good response. I'm not sure if this confirmation email is mentioned within the dataset, but hopefully it is; you can try it for yourself.

So this is it for this video. We took a large language model that is available for commercial use, known as the Falcon large language model, and we fine-tuned it on a custom dataset, a very small one, and you can try to fine-tune your own models using the same technique. I'm going to leave a link to the full tutorial and a link to the Google Colab notebook within the comments of this video. Thanks for watching, guys! Please like, share, and subscribe, and also join the Discord channel that I'm going to link down in the description of this video. I'll see you in the next one. Bye!
Info
Channel: Venelin Valkov
Views: 44,449
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: DcBC4yGHV4Q
Length: 29min 32sec (1772 seconds)
Published: Sat Jun 03 2023