How To Fine-Tune Llama 3

Captions
Here's how to fine-tune the Llama 3 8B model directly from your laptop. Hi, I'm Carter, a founding engineer at Brev.dev, and today I'll be using our platform to do exactly that. If you follow the link below, it takes you straight to a notebook where we fine-tune Llama 3 using a method called direct preference optimization, or DPO. Before you follow along and click Deploy Now, you'll need to create an account, and I also suggest going to Hugging Face and making sure you have access to the Meta Llama 3 8B Instruct model. You have to request access, and they will grant it, but do this before beginning the tutorial; otherwise you'll be waiting around with a GPU running while you're trying to fine-tune. This is the new, very popular Meta Llama 3 model; we'll be working with the 8 billion parameter version, specifically the Instruct variant.

Once you're ready, all you have to do is click Deploy Now. Behind the scenes we go to our many cloud providers and find you a GPU powerful enough for the fine-tuning process, in this case an NVIDIA A100. I say you can do this all from your laptop because Brev.dev sources the GPU for you, installs all the relevant software, and then lets you access this notebook directly on that A100.

A little about this tutorial: the notebook explains what DPO is, and then we fine-tune Llama 3 with it. If you're not familiar, there are several ways to fine-tune. SFT, supervised fine-tuning, means you have input/output pairs and you train the model toward the behavior you want. The problem with supervised fine-tuning is that there's no feedback mechanism; the model has no signal about whether it's actually improving, it just imitates those input/output pairs. Another method that improves on this is RLHF, reinforcement learning from human feedback. It's a great way to fine-tune as well, but it requires a reward model: a whole separate model that has to be trained so the system knows whether it's getting closer to the behavior you want. DPO, direct preference optimization, instead uses pairs of chosen and rejected answers for a given prompt. We'll use a dataset of these chosen and rejected answers, and we'll look at what they actually look like once we get into the notebook; a small sketch of one such pair follows below.

As you can see, we're still installing software on the instance, and then we'll be able to begin the fine-tuning process. In total this should take you about an hour, and A100s are not super expensive, so running through this fine-tune is probably only a couple of dollars. If you have any questions about the tutorial, please join our Discord or ask us on X; we want to keep improving these tutorials. This is one of many notebooks in our library, where you can try out a ton of different fine-tunes with guides, and it's one of the ways we find a lot of our users.
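To make the idea of preference pairs concrete, here is a minimal sketch of what a single DPO example might look like. The field names mirror the Orca-style dataset used later in the guide, but the text itself is invented for illustration.

```python
# Hypothetical DPO preference example: one prompt with a dispreferred
# ("rejected") and a preferred ("chosen") completion. The contents are
# made up for illustration, not taken from the actual dataset.
preference_example = {
    "question": "Create a set of triples describing the following sentence: ...",
    "rejected": "Sure! I'd be happy to help. Here is a set of triples that describes the sentence: ...",
    "chosen": "(coffee shop, located_in, Riverside)\n(coffee shop, near, The Portland Arms)",
}
```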
So if you're enjoying the tutorial, make sure to like, subscribe, and so on. In the meantime, I'm just going to wait for the software to finish installing, and then we'll get into the notebook.

All righty, as you can see, all the software has been installed on the instance and we're ready to access the notebook. All I have to do is click Access Notebook, and it takes me to the same notebook we were just viewing on the other page, except this time it's actually running on the A100 GPU. We're just going to walk through the cells. I'm not going to read every single word in detail; I encourage you to read it yourself, and I additionally encourage you to follow along with me. But essentially, DPO improves on RLHF, reinforcement learning from human feedback, because you don't need that separate reward model. Additionally, RLHF is very computationally expensive and often relatively unstable. DPO improves on this because it essentially treats the tuning as a classification problem: given the chosen and rejected answers we want the model to be more like, we want the probability the model assigns to the chosen answer to be higher than the probability it assigns to the rejected answer; the objective that formalizes this is written out below.

If you're familiar with Jupyter notebooks, then you know exactly what this looks like. We're going to start running these cells; you can press Shift+Enter to run a cell, the little star means the cell is running, and once that turns into a number we know we're ready to move on to the next cell. We have to install some relevant software, but this is only the software specific to this notebook; at Brev.dev we pre-install things like Python and CUDA so you don't have to worry about any of that setup.
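For reference, this "classification over preference pairs" intuition corresponds to the objective from the original DPO paper. Here π_θ is the model being fine-tuned, π_ref the frozen reference model, y_w and y_l the chosen and rejected answers for prompt x, σ the logistic sigmoid, and β a hyperparameter controlling how far the tuned model may drift from the reference:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss pushes up the likelihood of the chosen answer relative to the reference model and pushes down the likelihood of the rejected one, which is exactly the behavior described above.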
All righty, as you can see, the cell has turned into a number, which means we're ready to move on to the next cell. In that cell we installed the software libraries we'll be using as part of this guide. The next cell logs into Hugging Face; this is what determines whether you have access to the model itself, so that we can pull it down and begin fine-tuning. Make sure you put your token in here; I'm going to do that right now.

The next cell sets up the pieces we need, including LoRA. LoRA stands for low-rank adaptation, and it means we only have to fine-tune a small fraction of the model's parameters rather than the whole thing. Llama 3 8B has 8 billion parameters; we don't want to fine-tune all 8 billion, because that would take far too long, and you can get really good fine-tuning results by training only the low-rank adapters. This is also where we decide that we want the Instruct model and set up our bitsandbytes config: we quantize the model so that it fits comfortably on a single A100 GPU. All quantization means is that you reduce the precision of each weight, for example from 16-bit floating point down to 8-bit or 4-bit.

This is also where we specify the reference model. What's important for DPO is that it compares the reference model against the model being trained, so we can see whether we're actually increasing the probability of producing our chosen answers and reducing the probability of producing the rejected ones relative to that reference. That comparison is how we get the feedback loop: are we actually moving toward the behavior we want, and is the loss going down? That's why DPO is so powerful. You don't strictly have to specify the reference model yourself, but we do it here to make what's happening really clear.

I think this guide would be great as a school project, or if you're just interested in getting your hands dirty with fine-tuning Llama 3. There are many different reasons you might want to fine-tune Llama 3 8B; it could be a business use case. In this particular guide we're fine-tuning the Instruct model to get richer, more direct information out of its answers. As you'll see when we look at the chosen and rejected answers, the model can be very verbose: ask it a question and it might say "Sure, I'd be happy to help, the first step is..." when you really just want bullet points and the meat of the answer. That's what we'll be doing as part of this guide. We set our LoRA config here.
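Here is a minimal sketch of what the model-loading and LoRA setup described above might look like. It is not the notebook's exact code; the model ID is the public Meta Llama 3 8B Instruct checkpoint, and the quantization settings, LoRA rank, and target modules are illustrative assumptions.

```python
# Sketch: load Llama 3 8B Instruct in 4-bit and attach a LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Quantize the weights to 4-bit so the 8B model fits comfortably on one A100.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA: train small low-rank adapter matrices on the attention projections
# instead of all 8 billion parameters (rank and modules are assumptions).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # shows how small the trainable fraction is
```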
Then we're going to load and format the dataset. The DPO dataset we'll be using is from Intel; it's called Orca DPO Pairs, and we're only going to use 150 samples. The seed makes it so you can replicate the exact set of samples I have in this guide, but feel free to change it depending on what you want to do. So this is the dataset we have, and now we can start looking at some of the DPO examples and see what they look like. Here's the question, which is essentially the prompt, and then we can see the rejected and the chosen answers. You'll notice they're fairly similar, but the chosen answer is a bit richer in information; not necessarily more concise, but richer. If I put in a random index, say 37, we can see what another pair looks like: "Create a set of triples that describe the content of the following sentence," about a coffee shop near the Portland Arms in the Riverside area. The rejected answer starts with "Sure, here is a set of triples that describes the content of the given sentence," while the chosen answer jumps straight to the triples; it isn't as conversational, it's just the information we actually want. Let's grab one more example to really drive home what this is doing. That one's a bit too long, so let's do a different one; this one's in German: here are some facts, and based on the bullet points, write a short biography describing the life of Jane Cavendish. The rejected answer opens with "Sure, here's a short biography," and the chosen answer drops that preamble and just gives the biography. When we actually perform the fine-tuning, we want to increase the probability of getting something close to the chosen answer and decrease the probability of getting the rejected one.

The next cell is a formatter that prepares the dataset in the format Llama 3 expects: it renames the original columns, maps them to the fields we want, and specifies things like the end-of-sequence token. When we look at the dataset afterwards, you'll see the new formatting that Llama 3 expects (a rough code sketch of this dataset step appears below).

Now we're going to log into Weights & Biases. Weights & Biases is an industry-standard way to view your training run and understand things like GPU utilization and whether your loss is actually going down. I'm going to go grab my API key.

Now we're ready to begin the fine-tuning process. These are the training arguments; all I changed was the number of max steps, from 200 down to 20, and the number of warm-up steps, from 100 down to 5. That's just to make it run a lot quicker for this demo, which means we probably won't get all the way to the behavior we'd like, and the loss will still be somewhat high compared to the default of 200 steps with 100 warm-up steps. Warm-up steps gradually ramp the learning rate up from a small value at the start of training, so the early updates aren't too unstable, before the schedule levels off.
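Here is a sketch of the dataset loading and trimming described above. The column names match the public Intel/orca_dpo_pairs dataset, but the prompt formatting shown is a simplified assumption rather than the notebook's exact chat template.

```python
# Sketch: load the Orca DPO Pairs preference dataset and map it into the
# prompt/chosen/rejected columns that a DPO trainer expects.
from datasets import load_dataset

dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
dataset = dataset.shuffle(seed=42).select(range(150))  # small sample, fixed seed

def to_dpo_format(row):
    # Simplified prompt formatting; the real notebook applies Llama 3's
    # chat template and end-of-sequence token here.
    return {
        "prompt": f"{row['system']}\n\n{row['question']}",
        "chosen": row["chosen"],
        "rejected": row["rejected"],
    }

dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)
print(dataset[0]["prompt"][:200])  # peek at one formatted prompt
```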
We set up the DPO trainer here, and I just ran nvidia-smi; we can see that about 15 GB of the 40 GB is used up before we actually begin training. Then all we have to do is run this cell, which begins training. It uses standard backpropagation, just like regular fine-tuning, and the goal is that, compared to the reference model, our new model assigns a higher probability to the chosen answer and a lower probability to the rejected answer. Here's an example run of the full 200-step version: the rejected answers are in red and the chosen answers are in blue, and they've started to diverge, so the probability of producing the rejected answer is much lower than the probability of producing the chosen answer. As you can see, our training is now under way; the loss starts off fairly high and should get decently low by step 20, but again, if you want a more robust, fuller fine-tune, I'd run it for more than 20 steps. I'm going to let this finish up; it looks like it expects to take around four minutes, which means 200 steps would probably take around an hour. I'll get back to you once we're done fine-tuning.

All righty, the training has now finished. As you can see, we went from a training loss of about 0.69 all the way down to roughly 0.006. Now we can go analyze this run on Weights & Biases, and we can see, for example, that by step 20 we got a higher probability of choosing the chosen answer and a lower probability of choosing the rejected answer (here the chosen is in red and the rejected in blue). This is exactly what we wanted: after training, compared to the reference model, we're more likely to get behavior that's closer to our chosen answers. It looks similar to the 200-step run; with 200 steps it would be even closer.

Now we can go through and begin testing this model. We save the final checkpoint (I'm just running the cells with Shift+Enter, by the way), reload the base Instruct model with from_pretrained, and merge the base model with the adapter. This is what LoRA really comes down to: the adapter weights we trained get merged back into the original weights of the base model, so we end up with our new model, which you can see on the left here as brev DPO Llama 3 8B, and then we save it. We create a pipeline to run inference, and now we can test the new model. Here's where we set the prompt and the question we want to ask. The default one is "What are GPUs and why would I use them for machine learning tasks?" Let's see what it comes up with. There we go: the response says GPUs are specialized for massive parallel processing, which makes them efficient for certain tasks including machine learning, and it gives us the rest of the answer. Because this model was trained on the DPO dataset, it gives a more specific, concise kind of answer; before fine-tuning it might have started with something like "Sure, I'd love to help!" A rough sketch of the training, merge, and inference steps in code follows below.
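Here is a sketch of the training, merge, and inference steps just described, continuing from the earlier sketches (it reuses the `model`, `tokenizer`, and `dataset` objects defined there). It assumes a TRL release around 0.8.x, where DPOTrainer accepts TrainingArguments and a `beta` argument directly; newer TRL versions move these into DPOConfig. The batch size, learning rate, and output paths are illustrative, and unlike the notebook, which constructs the reference model explicitly, this sketch lets TRL derive the reference from the PEFT base.

```python
# Sketch: run DPO for 20 steps, merge the LoRA adapter, and test with a pipeline.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments, pipeline
from trl import DPOTrainer
from peft import PeftModel

training_args = TrainingArguments(
    output_dir="./llama3-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    max_steps=20,        # 200 for a fuller fine-tune
    warmup_steps=5,      # 100 in the default configuration
    logging_steps=1,
    bf16=True,
    report_to="wandb",
)

trainer = DPOTrainer(
    model,                 # the LoRA-wrapped, quantized model from above
    ref_model=None,        # with a PEFT model, TRL uses the frozen base as reference
    args=training_args,
    beta=0.1,              # assumed value; controls drift from the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_prompt_length=512,
    max_length=1024,
)
trainer.train()
trainer.save_model("./llama3-dpo/final_checkpoint")

# Merge the trained LoRA adapter back into full-precision base weights.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./llama3-dpo/final_checkpoint").merge_and_unload()

# Quick inference check on the merged model.
generator = pipeline("text-generation", model=merged, tokenizer=tokenizer)
print(generator("What are GPUs and why would I use them for machine learning tasks?",
                max_new_tokens=256)[0]["generated_text"])
```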
If you recall some of the chosen and rejected answers, the rejected answers were a little more verbose. We can try another prompt, like "I want to start a coding YouTube channel, what are the steps I should take?" and see what it gives us this time. There we go: it responds with a step-by-step guide. I imagine that if we had run more steps in the DPO training process it probably would have cut out this introductory block, but because we only did 20 steps it still has a little bit of fluff. Still, it says define your niche, create your channel, and so on, and gives us a step-by-step plan.

So we're really done. Today we fine-tuned Llama 3's 8 billion parameter Instruct model using DPO, and we now have a new model based on the DPO dataset we chose. We only did 20 steps, so if you want to fine-tune it for your particular use case, you could create your own DPO dataset or use an existing one and probably run more steps, but this is a way you can fine-tune the 8 billion parameter model directly from your laptop for only a couple of dollars. If you're interested in more guides like this, I highly suggest checking out the Notebooks tab on Brev.dev; we have many different notebooks. I previously created a guide on fine-tuning the multimodal LLaVA model, which generates text from images, and I highly suggest you check that out if you haven't already. If you have any questions, leave them in the comments below, and if you aren't already part of our Discord, please join. Also, please subscribe and like the video, because this is one way we find a lot of our users and it's an area we're continuing to invest in, so it would really mean a lot. We put a lot of work into this particular notebook, and most of our notebooks were written by our very talented ML engineer Ishan Dhanani. As always, thank you so much for watching, subscribe for more content, and we'll see you next time.
Info
Channel: Brev
Views: 6,368
Keywords: ai, machinelearning, artificialintelligence, LLM, Llama3, Brev, Brev.Dev
Id: 8iY-jrd-4fg
Length: 17min 36sec (1056 seconds)
Published: Mon May 27 2024