Make LLM Fine Tuning 5x Faster with Unsloth

Captions
Hello everyone, welcome to the AI Anytime channel. In this video we are going to explore a lightweight library named Unsloth. Unsloth is a fairly recent library that helps you fine-tune large language models: it speeds up your LLM fine-tuning while also consuming less memory. They claim a 4x to 5x speedup and 60% less memory; that's the tagline on GitHub. And Unsloth really is powerful: I have used it in some of my fine-tuning experiments and it performs well. Today I will show you in this video how you can work with Unsloth for a fine-tuning task in a Colab notebook.

Unsloth supports different fine-tuning techniques. You can use it with TRL (Transformer Reinforcement Learning) and its SFT trainer, and we'll do the Unsloth plus TRL combination in this video; you can use the SFT trainer, the PPO trainer, the DPO trainer, and whatnot. It supports almost all the NVIDIA GPUs with CUDA kernels, both the newest and older generations: on the newer side it supports the RTX 3090 and similar GPUs, and it also supports the A100, V100, T4, and so on when it comes to GPUs.

The good thing about Unsloth is that the authors have worked really hard on the optimization techniques, and that's the underlying idea of Unsloth: reducing memory consumption and requirements while speeding up your fine-tuning. They basically rewrite all the PyTorch modules as Triton kernels, and that's one of the reasons the framework is so powerful at optimization. They also derive the backpropagation steps manually, which is another feature of Unsloth, and they say there is 0% loss in accuracy compared to the regular QLoRA method. Those are some of the features Unsloth provides, and you can combine Unsloth with different libraries as well. So we'll see how to use TRL: we'll use the SFT trainer through TRL and combine it with the Unsloth library to fine-tune on the IMDb dataset. We'll take a text generation task, see if the model can generate some text, see how fast Unsloth lets us fine-tune and run inference, and I will also show you how you can push the result to Hugging Face. The notebook is available on my GitHub; if you don't want to watch the video you can just go look at the notebook, but I recommend watching to understand a few things while I'm coding.

Without further delay, let's jump into the Google Colab notebook and see how we can use a Mistral or Llama model, any one of those two, because those are the supported architectures, and start our journey with Unsloth for fine-tuning tasks.

All right, so we're going to experiment with Unsloth in Google Colab. I'm going to use the T4 GPU, as you can see. I have Colab Pro, but I'm not using the A100 or the V100; I'm going to rely on the T4. Currently Unsloth supports the Llama and Mistral architectures, and the different model variants within those two architectures, for example TinyLlama, CodeLlama, DeepSeek, and Llamafied models like Qwen. If you go to their GitHub repositories, both Unsloth and TRL (Transformer Reinforcement Learning) are there. Unsloth has more than 60 Colab notebooks where you can reproduce all the examples and the different techniques like PPO, DPO, and so on. They also have evaluation benchmarks defined there, plus all the installation documentation and some examples.

Let's see how we can experiment here in Google Colab, and whether we're able to do it on a T4 GPU or not. If you are using a T4, the installation takes one route (I'll put the link in the description); the A100 and V100 GPUs, and even the newer-generation RTX GPUs, each have their own way of installing Unsloth. The install will take a bit of time to build, then it pulls in Transformers, and after that we also need TRL, so let me add a pip install trl as well.
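As a rough sketch, the Colab cell looks like the following. Treat the extras tag and git URL as assumptions: the recommended install command changes between Unsloth releases and differs by GPU generation, so check the Unsloth README for the command matching your hardware.

```python
# Rough sketch of the Colab install on a T4; the exact extras tag is an
# assumption -- check the Unsloth README for the command for your GPU.
!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
# TRL gives us SFTTrainer (plus DPO/PPO trainers) on top of transformers.
!pip install trl
```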
While that installs, this is the TRL repository on GitHub, Transformer Reinforcement Learning. It supports the reward trainer, the PPO trainer, and similar techniques to fine-tune or pre-train large language models. As for Unsloth, they say it gives you 4x to 5x faster model downloads, reduces memory fragmentation, which is a big problem with AI models, by about 500 MB, and can fit larger batches as well, so it's very helpful. It helps you load pre-quantized 4-bit models, and they provide several: Llama 2 7B 4-bit, 13B 4-bit, Mistral 7B 4-bit, and CodeLlama 34B 4-bit. Unsloth also performs RoPE scaling internally, so larger maximum sequence lengths are supported automatically, which is very helpful.

Now, how do you load a model with it? There's a class for that; let me show that first, and then we'll get into TRL and the rest. You write from unsloth import FastLanguageModel, so you import that FastLanguageModel class. Earlier, with plain Transformers, you would get the model via AutoModelForCausalLM.from_pretrained and such; in this case you write model, tokenizer = FastLanguageModel.from_pretrained(...) and pass the model name, max sequence length, load_in_4bit, and all the other parameters. Because it supports RoPE scaling, you can choose any max sequence length; it's supported automatically. Once the model is loaded, we can use the FastLanguageModel class again, its get_peft_model method, to attach the adapter and perform QLoRA fine-tuning; I'll show that in a bit.

Let's have a look at the TRL integration with Unsloth. To use Unsloth with TRL you simply pass the Unsloth model to the SFTTrainer; and even if you want to use the DPOTrainer, once you have your DPO setup ready (datasets, reward model, etc.) you can get that done too. The trained model is fully compatible with the Hugging Face ecosystem, so once you train it you'll be able to use it through Hugging Face as well.
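As a minimal sketch of that loading pattern (the model name and sequence length here are simply the values we'll use later in the walkthrough):

```python
from unsloth import FastLanguageModel

# One call returns both the patched model and its tokenizer, replacing the
# usual AutoModelForCausalLM.from_pretrained + AutoTokenizer pair.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # Llama checkpoints work too
    max_seq_length=2048,  # RoPE scaling means any max length is fine
    load_in_4bit=True,
)
```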
So let's try it out briefly here. I'm going to import torch, and from TRL import the SFTTrainer. Then from Transformers I'm going to import TrainingArguments, and from Unsloth we already have FastLanguageModel, so we don't need anything else there. We also have to get datasets, so from datasets import load_dataset; we're going to use a Hugging Face dataset, but you can use your own dataset as well, that's fine. Now our imports are done.

Let's define a max sequence length: I'm going to set max_seq_length = 2048, keeping it on the higher side. Now let's get the dataset, so I write dataset = load_dataset(...). If you search for "IMDb Hugging Face" you'll find the IMDb dataset that Hugging Face provides, and that's the dataset we're going to use, so just pass "imdb" and then split="train". Once you inspect the dataset it will show you the features, text and label, and 25,000 rows; on the dataset page you can see the train split has 25k rows and two classes. It's a text classification (sentiment classification) dataset, so if you wanted, you could fine-tune for sentiment analysis of products or movie reviews and so on using Unsloth, which is really fast and helps you do faster inference too.

Now let's load the model and tokenizer: model, tokenizer = FastLanguageModel.from_pretrained(...), which is what it's suggesting me, and that's correct. For the model name, let's use the Mistral from Unsloth. If you search for "unsloth mistral hf" it will take you to their Hugging Face page, where they have all the models available for you to use. If you scroll down you can see the 4-bit models they provide, so let's use a 4-bit quantized model: the name is unsloth/mistral-7b-bnb-4bit, where "bnb" stands for bitsandbytes and then "4bit". They say these give you the 4x faster downloading they support, so let's see if that's true. (It supports Llama too, so you can do the same with a Llama model if you want.) Then set max_seq_length to our max sequence length, keep dtype=None (make sure you keep dtype None here), and set load_in_4bit=True, which automatically sets you up to fine-tune with QLoRA.

Now let's run this and see. It warns that a quantization_config was passed to from_pretrained but the model being loaded already has a quantization_config attribute, which is fine. You can see the Unsloth fast Mistral patching release, the GPU (Tesla T4), and a verbose description of the hardware configuration you're currently using, CUDA capabilities and so on, and then it starts downloading. You can see it pulls the model safetensors and everything else extremely fast; that's fantastic.
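Collected in one place, the setup so far looks like this, a sketch of what I typed in the notebook:

```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
from unsloth import FastLanguageModel

max_seq_length = 2048

# IMDb from the Hugging Face Hub: 25k training rows with "text" and "label".
dataset = load_dataset("imdb", split="train")

# Unsloth's pre-quantized 4-bit Mistral (bnb = bitsandbytes).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,          # auto-detect: float16 on a T4, bfloat16 on an A100
    load_in_4bit=True,   # 4-bit base weights, i.e. QLoRA-style fine-tuning
)
```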
Now let's use the get_peft_model method from FastLanguageModel; that's what does the model patching and adds the fast LoRA weights. So I'm going to write model = FastLanguageModel.get_peft_model(...), not from_pretrained, pass the model, and then set the rank. If you are using a very small model, your rank should be higher; I have a very detailed video on this topic, please watch that, I'll put the link in the description. Next, the target modules: for this model, this is what they suggest in the documentation, and I've explained in a previous video what the Q projection, K projection, V projection, and output projection are, so have a look at that if you want to understand what these target modules mean. So we have q_proj, k_proj, v_proj, and o_proj, and what else do we have? Gate projection, so gate_proj, then up_proj and down_proj. That looks good; this should do it.

Now let's set an alpha value for LoRA, lora_alpha (I'm not explaining each of these terms, as I've explained them in earlier videos), and then a LoRA dropout. I'll keep lora_dropout at 0; the docs suggest 0.1 here and any value is supported, but 0 is the optimized one. Then bias="none", gradient checkpointing on (use_gradient_checkpointing=True), and a random state, something like 3407. Finally max_seq_length set to our max sequence length. I think this is fine: this will do the model patching and also add the fast LoRA weights, and I hope this makes sense.

So let's run it and see what happens. I hope the target modules work: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj, which is what my notes show. And it says Unsloth patched 32 layers, with 32 QKV layers, 32 O layers, and 32 MLP layers. Fantastic, pretty fast.

Now let's define a trainer, the supervised fine-tuning trainer. Pass train_dataset=dataset, and for dataset_text_field I want "text", because if you look at the dataset, that's the field our text lives in. Then max_seq_length, the tokenizer, and then args=TrainingArguments(...). Inside TrainingArguments we set the batch size, per_device_train_batch_size=2, then gradient_accumulation_steps=4 (I hope I spelled it right), warmup_steps=10, and max_steps=60; let's keep it very small, just for testing purposes, so you can understand how to use it. For floating point 16 I'm going to put in a condition: fp16 = not torch.cuda.is_bf16_supported(), and then bf16 = torch.cuda.is_bf16_supported(). This is the better way of defining things, because bf16 is not always supported outside A100-class GPUs. Then logging_steps=1, and let's set an output directory; I'll call it something like "unsloth-test", whatever you like is fine. Now let's pick an optimizer: I'm going to use 8-bit AdamW, so optim="adamw_8bit", and then give a seed for reproducibility, the same value as the random state we set. If you want to deploy this, say on AWS SageMaker where the config ends up in a JSON dump, pinning the seed is the best practice.
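Here's the patching and trainer setup as one sketch. The rank and lora_alpha values are assumptions (the video doesn't pin them down); everything else follows the walkthrough above.

```python
# Patch the model and attach the fast LoRA weights (QLoRA fine-tuning).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank -- assumption; smaller models want higher ranks
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,       # assumption; scaling factor for the adapter updates
    lora_dropout=0,      # 0 is the optimized setting
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
    max_seq_length=max_seq_length,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",       # IMDb stores the review under "text"
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=60,                             # tiny run, just for testing
        fp16=not torch.cuda.is_bf16_supported(),  # fall back to fp16 on a T4
        bf16=torch.cuda.is_bf16_supported(),      # bf16 on A100-class GPUs
        logging_steps=1,
        output_dir="unsloth-test",
        optim="adamw_8bit",
        seed=3407,                                # same seed as random_state
    ),
)
```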
Now let's call trainer.train(). It will take a bit of time, guys, so we may need to pause the video, but let's wait a few seconds and see what happens.

All right, you can see it took around 12 minutes to complete 60 steps, and we got a training loss of about 2.72, which is fairly good given we've only run 60 steps; that's not even one epoch (the epoch counter reads 0.02). I think you'd have to increase the step count, and that depends on what kind of data and compute power you have.

That's fine for now, so let's try to run inference. What I'm going to do is define the inputs with our tokenizer, passing a Python list. We could use the Alpaca format here, where you define an instruction, an input, and then wait for the output, and that helps with generation; Unsloth supports different types of prompt structure, so you can define whichever prompt structure you like and work with it. But let's just give a plain prompt, ask a question, and try to generate from that. Now that we have fine-tuned, we have model weights we can work with. So let me write a prompt here: "I really like the movie because it shows emotions and talks about humanity", something like that. I'll pass that as the input, add return_tensors="pt", and then call .to("cuda") on it to move it to the GPU. If I print the inputs, we get tensor values.

Now for the outputs I'm going to call model.generate; that's what you do, right? So model.generate, give it the inputs, and set max_new_tokens; it can stay quite small because we're working with a sentiment-classification dataset here, so let's say 128 or so, and use_cache=True. So that's model.generate with the inputs, max_new_tokens, and use_cache=True, and you can see it took 39 seconds, which is really good on this GPU. Printing the outputs gives you tensor values, so let's have the tokenizer decode them: we'll use batch_decode, because there are many tensors, and pass it the outputs (you can also pass skip_special_tokens=True or similar when decoding). And you can see it says "I really like the movie because it shows emotions and talks about humanity" and it generates further: "it's a very good movie and I recommend it to everyone. It is a very good movie and I recommend it to everyone." There's some repetition, as you can see, but that's fine; it's able to generate responses, which is what you want to see.
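The inference cell, as a sketch (the prompt is the one from the walkthrough):

```python
# Tokenize a single prompt and move the tensors to the GPU.
inputs = tokenizer(
    ["I really like the movie because it shows emotions and talks about humanity"],
    return_tensors="pt",
).to("cuda")

# use_cache=True reuses the KV cache for faster autoregressive decoding.
outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)

# batch_decode turns the batch of output token ids back into strings.
print(tokenizer.batch_decode(outputs))
```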
What you can also do now is save the model. You just call model.save_pretrained and give it a name, let's say "lora_model". In fact it has already been saving during training: if you look at the runtime's file browser on the left-hand side you'll see the unsloth-test directory with all the run files, and if you click on lora_model you'll get your adapter_config and safetensors. So now you have your model saved as a LoRA adapter.

You can also put it on Hugging Face. To push, you need your API key: go to Settings, then Access Tokens, click on the write token, and copy it from there. Then you need huggingface_hub installed and notebook_login to push it to the Hub. If I try it right now it will ask me for my API key, because I haven't logged in with Hugging Face here. You push using your Hugging Face username followed by the model name; we can just call it "lora_model" for testing purposes. If you do that without logging in, it will give you an error that a token is required but no token was found, and interestingly I also got a "UTF-8 locale is required" error, so you have to pip install huggingface_hub, sort out the login, and then you can push your LoRA to Hugging Face as well.

You can also merge the adapters if you want, and you can also convert to GGUF; there's a notebook for that on the Unsloth GitHub repository showing how to merge, and they have built merge-and-unload style utilities that let you merge the LoRA adapters into the 4-bit model you have fine-tuned. Either way, you can see how fast it all is: downloading the model, patching all the layers with FastLanguageModel and get_peft_model, and then fine-tuning.
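Saving and pushing, sketched with a placeholder username (swap in your own; pushing assumes huggingface_hub is installed and you have a write token):

```python
# Save just the LoRA adapter locally (adapter_config.json + safetensors).
model.save_pretrained("lora_model")

# Pushing needs huggingface_hub and a *write* token from Settings -> Access Tokens.
from huggingface_hub import notebook_login

notebook_login()  # paste the write token when prompted
model.push_to_hub("your-username/lora_model")  # "your-username" is a placeholder
```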
So this is what I wanted to cover quickly, guys, in a fairly short video: how you can use Unsloth with TRL, the Transformer Reinforcement Learning library. I hope you now understand how to use Unsloth to fine-tune a model and also run inference at a very fast speed. This notebook will be available on my GitHub repository, and that concludes our experimentation with Unsloth.

That's all for this video, guys. I hope you now have a good understanding of Unsloth and how you can use this lightweight library for your fine-tuning tasks: it really helps you speed up your fine-tuning process, reduces memory consumption, supports different techniques like the SFT, DPO, and PPO trainers and whatnot, and supports models like Mistral and the Llamas. The notebook will be available on my GitHub repository; please find the link in the description. If you have any thoughts, questions, or feedback, please let me know in the comment box. If you like the content, please hit the like icon, and if you haven't subscribed to the channel yet, please do subscribe and share the video and the channel with your friends and peers. Thank you so much for watching; see you in the next one.
Info
Channel: AI Anytime
Views: 4,350
Keywords: ai anytime, AI Anytime, generative ai, gen ai, LLM, RAG, AI chatbot, chatbots, python, openai, tech, coding, machine learning, ML, NLP, deep learning, computer vision, chatgpt, gemini, google, meta ai, langchain, llama index, vector database, unsloth, unsloth llm, unsloth llm fine tuning, llm fine tuning, mistral fine tuning, LLM fine tuning, Fine tune LLM, Fine tuning crash course, DPO, PPO, DPO fine tuning, llama fine tuning, llama 2 fine tuning, qlora, lora, qlora fine tuning, gpt4
Id: sIFokbuATX4
Length: 27min 53sec (1673 seconds)
Published: Fri Jan 12 2024