How to fine-tune a model using LoRA (step by step)

Video Statistics and Information

Captions
Today we are going to fine-tune a large model using two separate datasets, and I'm going to show you how to use LoRA to make the fine-tuning process extremely efficient. Here's what's really cool about LoRA: not only does it make fine-tuning way faster, it also generates small adapters that you can plug into your model to get it to solve specific tasks.

Think about this: Apple is about to release Apple Intelligence, and what they describe on their blog is exactly what I'm going to show you how to code today. On Apple's machine learning research blog there's an article called "Introducing Apple Foundation Models" about the big models they're going to be shipping inside the phone in your pocket. It has a section called "Model adaptation" that I think is great. They say their foundation models are fine-tuned for users' everyday activities: they take a foundation model and fine-tune it for the different tasks they want to use it for, like email summarization, email replies, or query handling. And the models "can dynamically specialize themselves on the fly for the task at hand." That sentence is like magic: they have foundation models, they fine-tune them, and the models can specialize on the fly. They say they utilize adapters, small neural network modules that can be plugged into various layers of the pre-trained model. So they have a pre-trained foundation model, the big, heavy model, and these small adapters that they plug into it to get it to act differently, fine-tuning it for specific tasks. That's basically how it works, that's what they're doing, and that is what I'm going to show you how to do.

Here is my code; you have access to the whole notebook through a link in the description. I'm using Lightning Studio, which is basically Visual Studio Code in the cloud and is great if you need GPUs, which we obviously do to fine-tune a large model. I'm going to walk through this explanation on CPUs because I don't want to spend credits, but switching to a GPU is one click away. I fine-tuned both models here using an L4; it took about five minutes, maybe a little longer, at 70 cents per hour, so not a big deal.

Before I get into the actual code, I want to show you a few slides I put together so you understand how awesome this is. Why do we care about fine-tuning, and why do we care about LoRA? The idea is to take a foundation model, which is large and heavy, and fine-tune it to specialize it. These are some of the tasks Apple wanted to specialize their foundation model on: mail replies, summarization, proofreading, query handling, and many others. You take your foundation model and make it better at those tasks through a fine-tuning process. That's regular fine-tuning as we know it.

Here's the thing, though: Apple is telling us they're not creating copies of the foundation model specialized on different tasks, which is what you would get from a regular fine-tuning process. Instead, they're talking about adapters: small neural networks that can be plugged into different layers of the foundation model. Their users run the same foundation model plus small adapters that make it act differently, and this is huge if true. In the LoRA world (Apple did not invent this; they're just using the technique), these small adapters help the foundation model specialize on different things, and, more important than anything, they can be loaded and plugged in dynamically as the user needs to perform something. You keep the foundation model loaded in memory and, depending on the task, plug and play one of the adapters. That's super beneficial compared with regular fine-tuning, where you actually have multiple copies of a huge model: to do summarization you load one huge model into memory, and to switch to proofreading you load a second one. Remember, these models are really big, and a phone does not have that much memory.
The idea of an adapter, something like a chip you snap onto the model so that it suddenly does a task much better, is genius. Now, how does this work? Here is where we look a little behind the scenes. Start with the foundation model's original weights. A large model has multiple layers with multiple parameters per layer, but to simplify the idea, imagine it had only one big layer: one huge matrix of parameters. Those numbers instruct the model on how to do general tasks; that's what the foundation model does. Now I add a new matrix of the same size that contains changes. That matrix is the adapter I was talking about: it tells the original weights how they need to change to give us the fine-tuned weights. If I have two different tasks, I'll have two different adapters, and each adapter holds positive or negative numbers that, added to the original matrix, produce a new matrix that knows how to do that task better. That's the idea of an adapter: something I can add dynamically to get a completely new model that acts differently.

But there's a problem: this matrix of LoRA weight changes is the same size as the original one, and that's not what we wanted to build. We wanted a small adapter, and this looks like a copy of the whole model. So what's going on? This is where the genius part comes in. The creators of the LoRA paper realized that we can represent a big matrix as the product of two smaller matrices. For example, a 5x5 matrix of weight changes has 25 values, but it can be represented as the product of a column vector with five values and a row vector with another five values: 5 + 5 = 10 values that, multiplied together, give us the 25 values. Do you see where I'm going with this? The adapter is not the huge matrix; the adapter is those two small matrices. For a model with 25 parameters, that's a 60% reduction, because 15 of those parameters I don't need to store or represent in any way; I only need the 10 values. At run time, when I'm ready to load the adapter, I load the foundation model, load the adapter, do the multiplication (a really fast operation) to get the weight changes, and add those changes to the original weights to get the fine-tuned weights. That happens extremely fast in memory, and we get a model acting differently without keeping different copies of it. That is genius.

Now I want you to see something else that shows how impressively effective this is: a few examples of different model sizes so you can see the savings. For 25 parameters, which is just for illustration, the savings are 60%, because we only need 10 values to represent 25. For a model with 1 million parameters, the savings are 99.80%: just huge. And for 2 billion parameters, which today is considered a very small model, the savings are about 99.995%, because the weight matrix would be roughly 44,721 by 44,721, and a rank-1 adapter for it needs only about 89,000 parameters in total.
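The decomposition described above can be sketched in a few lines of numpy. This is just the arithmetic of the idea, not the PEFT library's implementation; the sizes match the 5x5 toy example from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 "weight change" matrix for a toy 5x5 layer: instead of
# storing all 25 values, store a 5x1 column vector B and a 1x5 row
# vector A whose product reconstructs the full matrix of changes.
B = rng.normal(size=(5, 1))  # 5 stored values
A = rng.normal(size=(1, 5))  # 5 stored values
delta_w = B @ A              # 25 values, but only rank 1

W = rng.normal(size=(5, 5))  # "original" foundation-model weights
W_finetuned = W + delta_w    # plug the adapter in at run time

print(B.size + A.size)       # 10 values stored instead of 25
print(delta_w.shape)         # (5, 5)
```

The multiplication and addition are cheap, which is why the adapter can be merged into the base weights on the fly.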
Keep in mind a couple of things. First, this is for illustration purposes only: a model has multiple layers, not one huge matrix, so we have to do this multiple times. A LoRA adapter will not be just two matrices; it will be two matrices for each layer where we want to apply LoRA. I'm simplifying it to a single big matrix because that's easier to explain, but the same principles apply, and the savings are humongous. Second, it's hopefully very clear just by looking at this that the savings in memory and disk space are huge, but that's not all of it. The authors of the paper also realized that during the fine-tuning process, instead of trying to fine-tune every single parameter individually, we can focus on fine-tuning just these two small matrices: finding values that, multiplied together, give us a list of changes that, added to the original model, get us closer to our objective. So the fine-tuning process also gets a huge boost, and we can fine-tune a huge model very quickly because we're only changing these few parameters. It's a double win: we're having our cake and eating it too, and that is super cool. One final thing before I go to the code: LoRA is not even the most efficient way of doing this. There is a different version called QLoRA, which does LoRA with quantization. I'm not going to talk about it right now, but look it up if you want an even more efficient way of doing this.
Now let's see the code, because that's where things get really interesting. To run it you have two options. Option one: download the notebook to your computer. You're going to need a GPU, or the fine-tuning process is going to take a long time; the model is not huge, but still. All the libraries you need are specified right at the top: transformers, accelerate, evaluate, datasets, and peft (parameter-efficient fine-tuning), so just running the notebook gives you everything you need. Option two: create an account in Lightning Studio, where you can use GPUs and you have a free quota, which is great.

So what am I going to be fine-tuning here? A Vision Transformer. I have a large Vision Transformer that I want to specialize in two different ways: first, to make it better at recognizing food items, and second, to make it good at recognizing cats versus dogs. Not a huge deal, just two good examples. Of course this is not a large language model; it's a Vision Transformer, so we're going to be doing this with images.

First I create a few helper functions that I'll use throughout the notebook; they're not a big deal. The first one prints the size of a model in megabytes on disk, so I can show you the savings, which are going to be really big. By the way, the original Vision Transformer I'm using is 346 megabytes on disk; keep that in mind, let's call it 350. This function will help us print the size of the fine-tuned versions, or rather the LoRA adapters, so you can see how big the savings are. The next function shows how many parameters we need to train; I use it to compare how many parameters the original Vision Transformer has versus how many we actually have to fine-tune when using LoRA, so you can see the savings in parameter count too. There's a function to split a dataset, setting 10% of it aside for testing, and a function that creates the label mappings, which I'll explain in just a second.

After defining those functions, I load the two datasets from Hugging Face (there are multiple datasets there you could do this with): a food dataset as dataset number one, and Microsoft's cats-versus-dogs as dataset number two. Loading them is the same process; the only difference is that in the second dataset the label column is called "labels", with an S. Because I'm going to use the exact same process to fine-tune my model on both datasets, I rename "labels" to "label" so both datasets have the same column names. After loading the datasets I split them, so I get dataset-one train, dataset-one test, dataset-two train, and dataset-two test. Nothing fancy there.

Next I create the mappings, which are just two dictionaries. One maps label to ID: it goes from the name of the label, say "pizza", to the ID 3, and from "chicken wings" to the ID 4. The second mapping goes the other way around: 3 maps to "pizza", 4 maps to "chicken wings". (I don't know if those are the actual values; it's just to illustrate the idea.) We need these because when we load these models to solve classification problems, the loading and preparation process has to create classification heads, and it uses the mappings to do that. There's a link in the notebook to the PretrainedConfig documentation that tells you exactly where to go if you want to understand more about how the mappings are used.

Then I create a configuration dictionary that helps me train, or fine-tune, both models using one section for model one and one section for model two: it says, when I'm going to fine-tune model one, use dataset one for train and dataset one for test, and so on. The differences between the two sections are the number of epochs (five epochs for the first fine-tuning process; for dogs versus cats, one epoch is enough, because I get high enough accuracy at that point and don't need to keep fine-tuning) and the path where I'll store each adapter, because when the fine-tuning finishes I save the two adapters to two different folders.
After this I need to pre-process the images, so I load an image processor from the base model, the Vision Transformer, and define the specific steps I want to use for pre-processing on my side. Keep in mind this is not optimized: my goal here is not to get the best result for either of the two datasets, and I'm even using the same pre-processing steps for both cats-versus-dogs and the food items. My goal is to keep the code to a minimum and show you the point I want to show you, which is LoRA and how you load the adapters. If you wanted to do this a little better, you could find better pre-processing steps. What I have is a pipeline that resizes the images (notice that I'm using the size specified by the image processor we just loaded, which is 224), center-crops the image, turns it into a tensor, and finally normalizes it using the mean and standard deviation specified by the original processor. That's my pipeline. Then I have a preprocess function, the one we're going to use at the time of pre-processing the images before sending them through the model: it takes a batch of values, converts every image in the batch to RGB, takes it through the pipeline, and adds it to the batch under the "pixel_values" key. Very simple. Finally, I go through both models in my config (remember, the config has two keys, one for each model I want to fine-tune), grab the train data, and specify that the transformation for the train data is the preprocess function I just created; I do the same for the test data.
Because this is inside a loop that goes over both models, those two steps happen twice: once for model one and dataset one, once for model two and dataset two. That gets my data ready for both datasets, so let's do the fine-tuning process now. First, I have a few functions that I'll use during fine-tuning. The first is the collate function, which gets a batch of data ready to be processed by my model: it prepares a batch of examples from a list of elements of the train or test datasets, using torch to stack the values in a way the model knows how to process. Not a big deal; it's sort of a plumbing function I need for the processing. Then there's compute_metrics, which computes the accuracy of a batch of predictions; I'm using accuracy to understand whether my fine-tuning process is working or not. From the parameter it receives, you can see how I get the predictions and how I compute the accuracy. The get_base_model function loads the model from the model checkpoint, the original Google Vision Transformer, into memory. Notice how I pass the mappings I created: because this is going to be a model for image classification, I need to prepare it with a head to do classification, and to do that I have to specify the mappings. Not complicated to understand: it loads the original model.
I'll use that function first to fine-tune the model, and later on, when we're ready to do inference, we'll use it again to load the model once into memory. Finally, there's the function that builds the LoRA model, which is what we're actually going to fine-tune. It works like this: first I load the base model, the big model, into memory. Then I print the number of parameters, just so you see in the log how many parameters the base model has and how many of them we'd have to fine-tune. Then I create the LoRA configuration. Remember that in the slides I only showed you rank-1 matrices: when we decomposed the big matrix, I showed you two vectors whose product gives you the big matrix. Those use rank one: a single column vector and a single row. Of course, the lower the rank, the bigger the trade-off, because the harder it is to be accurate in the fine-tuning process. If you want more accuracy, you can increase the rank: rank two, for example, gives you a matrix with five rows and two columns and a matrix with two rows and five columns, a total of 20 values to represent the 25, so you can be way more precise with a higher rank. In the paper they found that somewhere around rank 8, and in the range between roughly 8 and 256, it really doesn't matter much; you get very good approximations, so you don't need a huge rank for a very good fine-tuning process. For this example I'm using rank 16; if you go back to the function, you can see the R parameter, which is the rank. I could go lower, to 8, and we'd get similar results.

Then come a bunch of other hyperparameters. The alpha controls how much the adapter matters when you add it to the original weights: the adapter is multiplied by a scaling factor, and depending on how big or small alpha is, that's how much the original weights change; you can read more about it in the paper. Then there are the modules we're going to modify, the dropout, the bias, and the modules we're going to save, which is the classifier. All of that specifies how I want to do LoRA, and now I can call the PEFT function get_peft_model, passing the original model and the LoRA configuration I just defined, and it gives us the actual model we're going to fine-tune. Right after that line I print how many parameters this model has; you'll see the results as soon as I call those functions, but there are going to be way fewer parameters, because we're using rank 16 for the adapters, so those are smaller matrices.

The next lines specify a batch size and all of the parameters of the training process, which in this case is just fine-tuning. You can go through them; they're really not that big of a deal. Worth mentioning: the metric we're going to use is accuracy, the learning rate is right there, the batch size is 128, and we're training with float16, which requires a GPU. That defines the arguments; next comes the fine-tuning loop, which goes over both models one after the other.
For each model, I first set the number of epochs I want (remember, five epochs for the first dataset, one for the second). Then I configure a Trainer. The first parameter of the Trainer is the model you're going to fine-tune, and there I pass the result of the build-LoRA-model function, which gives me the LoRA model to fine-tune. I pass the arguments, specify the train data to use for fine-tuning, the evaluation data to use for evaluating the fine-tuning process, the tokenizer (in this case, the image processor I loaded), how to compute the metrics (the compute_metrics function I created before), and how to prepare the data going into the model (the collate function I created before). Again, it's just putting together the Lego pieces I built earlier. Then I train the model, which runs the fine-tuning process. When it's done, I evaluate the model and print the accuracy. After training and evaluating, I save the model and print the size on disk of that specific saved model. Remember, this saves just the LoRA adapter, not the huge model.

Here are the results, which of course I'm not going to run on camera because they would take forever. I want you to read this specifically: the base model's trainable parameters. When we loaded the original Vision Transformer, it had 85 million parameters, and all of them are trainable: 85 million out of 85 million, 100%. So with regular fine-tuning we would have to train, or fine-tune, 85 million parameters. Instead, we're doing LoRA, and the LoRA trainable parameters are only about 667,000, which is 0.77% of the original model. That's nothing; that's why we can fine-tune these models so fast. That's the first run: it goes through all the epochs, and you can see the accuracy going up to 94%; the evaluation accuracy is 0.946. And the size: remember, the original model is 346 megabytes, while the size of this adapter on disk is only 2.7 megabytes, which is nothing compared to 346. I do the same for the second dataset and get all of its results.

Now for the interesting part, because I have the small adapters and the big model, and I don't have to load the big model multiple times; I just need to plug and play the adapter on it. So this is what I'm doing: I have a function called build_inference_model, and I pass it the mappings I need to configure the classification heads, plus the path where the adapter is. It loads the original model using get_base_model, and then builds a parameter-efficient fine-tuned model: using the PeftModel class and from_pretrained, I pass my model and the adapter, and it returns an already-modified, fine-tuned model that specializes in solving one specific task. And that is just amazing: I did not have to modify the original big model at all; it stays unchanged. I only had to create the adapter and then plug and play that adapter at the time of using it.

Then there's a predict function. I pass it an image, the model I'm going to use, and the image processor, because obviously I have to pre-process the image before I send it to the model. Internally, I process the image, compute the outputs, get the logits, and find the highest-scoring item among them; this is a classification head, so using the id2label mapping I can get the name of the class for that ID. A very simple function.

Now let's see what happens next. I add some parameters to the configuration: the inference model each model is going to use, and the image processor each model should use. By the way, both use the same image processor; I kept this in case I wanted to make modifications later, but right now both datasets use the same one. Then I have a few samples: the first is a picture of chicken wings, the second is a little cat, the third is a dog, and the final one is a pizza. Four samples, and notice that for each I specify the image and the model I want to use to classify it: for the chicken wings and the pizza, model one, the one we fine-tuned on food items; for the cat and the dog, model two, the one we fine-tuned on cats versus dogs. Then a loop goes through all the samples and does the following: open the image from its URL, and grab the inference model and the image processor for the model the sample specifies.
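The predict helper can be sketched as below. To keep the sketch self-contained and runnable without downloading the actual Vision Transformer, the model and processor here are hypothetical stubs; in the notebook they'd be the PEFT inference model and the Hugging Face image processor:

```python
import numpy as np

def predict(image, model, processor, id2label):
    """Pre-process an image, run it through the model, take the
    highest-scoring logit, and map its index to a class name."""
    inputs = processor(images=image, return_tensors="np")
    outputs = model(**inputs)
    pred_id = int(np.argmax(outputs["logits"]))
    return id2label[pred_id]

# Stand-in processor and model so the sketch runs on its own
# (both are hypothetical stubs, not real transformers objects).
def stub_processor(images, return_tensors):
    return {"pixel_values": np.asarray(images, dtype=float)}

def stub_model(pixel_values):
    return {"logits": np.array([0.1, 2.0, 0.3])}  # class 1 scores highest

id2label = {0: "pizza", 1: "chicken_wings", 2: "cat"}
print(predict([[0.0]], stub_model, stub_processor, id2label))  # chicken_wings
```

Swapping the stubs for the real fine-tuned model and processor is what lets the same function answer "pizza" or "cat" depending on which adapter is plugged in.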
Then I just compute the prediction, given the image, the inference model, and the image processor, and print it. Here's what I get at the end: chicken wings, cat, dog, and pizza. And that is amazing, because of course anyone can build a classification model, but I want you to think about what's happening here: I can take one big model, run the fine-tuning process for multiple smaller specialized models, save those small adapters, and load the adapters dynamically into memory as I need them, instead of having multiple copies of humongous models sitting there. That is really powerful. Now think about what's about to happen with this tech as phones get more powerful and hardware keeps improving to catch up with where we are right now: we're going to be able to do even more personalization with techniques like LoRA and QLoRA, or even newer techniques once we come up with them, where models are not just specialized on specific tasks but specialized for one specific user. I can have a model fine-tuned on my data, not somebody else's; a model fine-tuned on the way I write my emails, or edit my images, or whatever AI helps me with. Where we're moving with this is a world with a bazillion small adapters that we can plug and play with a big model to help us perform multiple tasks. Again, you can find the source code in the description below. Hopefully this explanation made sense. Leave me a comment if you have a question, if there's something else I can explain, or to tell me what you want the next video to be, and I'll be right there with you. I'll see you in the next one, bye-bye.
Info
Channel: Underfitted
Views: 5,475
Keywords: Machine learning, artificial intelligence, data science, software engineering, mlops, software, development, ML, AI
Id: 8N9L-XK1eEU
Length: 38min 2sec (2282 seconds)
Published: Mon Jul 08 2024