Fine-Tuning Mistral 7B

Captions
So this is what we came for today: we came to talk about Mistral 7B. We had an event earlier with an up-and-coming startup called Gradient, and one of the questions we asked was which large language model they were going to put into their hosting software and platform next, and the answer was, well, we're really trying to get Mistral 7B up on the platform. Why is everybody talking about Mistral 7B? Let's take a closer look. Mistral AI is a company with what's been called the "wildest deal in tech right now": they're trying to raise $400 million at a $2 billion valuation, and they're six months old. What are they doing that's so awesome? On their website you'll notice they have Mistral 7B, which we'll talk about today, their very first foundation model, powerful yet small, and they've got bigger and better coming our way soon. What do they have exactly, and where did they start? Well, they started back in June with an early fundraising round, a seed round of $113 million. Now, a $113 million seed round is completely off the charts and frankly absurd, but they have kind of delivered, and they've done so with their 7B model. They've shown that this model is actually better than Llama 2 13B on all benchmarks. Of course, we're generally focusing on the benchmarks that people care about the most, one of which we'll talk about briefly, but that's pretty good. It's got great coding abilities, and it's Apache 2.0 licensed, which basically means free to use, free to make money with, free to do as you please. They've got some data to back this up; they've done some benchmarking, and you can see the little llamas and the mistral (a mistral, by the way, is a wind; the winds are coming to replace the camelids). You'll notice they've benchmarked it not just against Llama 2 13B but also Llama 1 34B, since Llama 2 34B apparently never got released publicly. In the results you'll see it smashes Llama 2 7B on a lot of this stuff, especially on the MMLU benchmark, a very important one for many folks in the industry, and it actually does a lot better in some other areas too; the math benchmark is particularly strong. I like their chart because it's drawn so that the 7B models sit right at the zero line, so they look instantly very good while the other models take a long time to get there. The top of the chart says MMLU, which stands for Massive Multitask Language Understanding. This is the big benchmark people care about. What is MMLU? Let's take a quick look, just to give you a feel for the kind of general knowledge Mistral is able to convey through the techniques it uses. This is from the MMLU paper published in 2020, and the idea is that although a lot of the benchmarks from 2018 and 2019, like GLUE and then SuperGLUE, were getting smashed by these big models, the authors wanted to create a benchmark that wouldn't get smashed so easily, one more comprehensive than simple language-understanding tasks. They wanted to cover essentially every interesting facet of language understanding, and the way they went about it was to say: consider physics, mathematics, law, morality, the things LLMs are generally not good at, and push future models to get better at them. The breadth and depth they tried to cover spans 57 tasks, or domains, and it has held up well over time.
So well, in fact, that the top two models on it are OpenAI's GPT-4 and Google's PaLM 2. These are the models fighting to be the best, and they're fighting to be the best on the biggest benchmark, so it's no wonder we see Mistral coming in hot saying, you know what, we're pretty good too. Keep in mind, Mistral is comparing itself to Llama 2 and the open-source models, not GPT-4, not PaLM 2. There's just lots of stuff in MMLU, every single school subject, and we're not going to spend more time on it, just give you a quick perusal: it's got questions for all of them, from abstract algebra to anatomy and beyond. How is Mistral doing this? Well, it uses a technique called sliding window attention, which is pretty cool, where it, quote, "exploits the stacked layers of a transformer to attend in the past beyond the window size." There's a lot packed into that which we don't have time for today, but we've got a lot of events where you can dig into it. A 2x speed improvement sounds good; I can chunk that up and think about it. It's also been instruction-tuned as Mistral 7B Instruct v0.1, and that's the model we're going to be building with today; we've talked about instruction tuning elsewhere as well. Another thing to note, though we won't talk much about it, is Zephyr, a series of language models built on Mistral. Zephyr is also a wind; check out the little kite. We've got winds blowing into generative AI and we should pay attention to them. Today we're going to fine-tune Mistral 7B, and we're going to fine-tune it a little backwards, as Chris will now show you. Chris, show them how to do it. Okay, so I'm going to share my screen; you should see a screen with Colab on it, and I'll zoom in. So we're going to talk about fine-tuning Mistral 7B. We're going to be fine-tuning the instruct version of the model, and we're going to give it an instruct task, but we're going to pull the old Uno-reverse-card on it: we're actually going to train the model to produce instructions for responses. This is a way you could use the model to generate, say, an instruction-tuned dataset, if you wanted to look at it that way. In reality it's just a super fun example of how straightforward it is to tune these models, and how we can do it even with fairly limited hardware. We are going to use Colab to do this; you will need Colab Pro, and you will need an A100. You'll notice that peak memory usage was up around 30-ish gigabytes, so this is a beefy, beefy model. We're going to be fine-tuning it in 4-bit, with double quantization and LoRA, so we're going to do everything we can to cram this model onto this GPU, and we're just going to walk through it. The idea is that we're going to use the MosaicML instruct dataset, which is a very good, well-put-together instruction-tuning dataset. It's an aggregate dataset, so it's filled with a bunch of others, but we only care about the dolly_hhrlhf subset, which is about 34,000 of its 60,000 rows. The reason we're doing this is that we want that really clean response/input pair, and some of the other sets in the base MosaicML instruct v3 dataset use a chain-of-thought prompt style, so they might muddy our generations if we used them; we're not going to use them. Once we have this instruct-tuning dataset pared down, we're going to pare it down even further, to 5,000 samples.
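In code, the dataset wrangling Chris describes might look roughly like the sketch below. The Hub id "mosaicml/instruct-v3", the "source" column, and the small eval subset are assumptions based on the description, not read off the notebook, so check them against the real thing:

```python
from datasets import load_dataset

# Load the MosaicML instruct dataset (assumed here to be "mosaicml/instruct-v3" on the Hub).
instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

# Keep only the dolly_hhrlhf rows; the other sources lean on chain-of-thought style
# prompts that could muddy our generations. (The "source" column name is an assumption.)
instruct_tune_dataset = instruct_tune_dataset.filter(
    lambda example: example["source"] == "dolly_hhrlhf"
)

# Pare the training split down so a short, step-based run finishes in session time.
# (The 200-example eval subset is an illustrative choice, not from the video.)
instruct_tune_dataset["train"] = instruct_tune_dataset["train"].select(range(5_000))
instruct_tune_dataset["test"] = instruct_tune_dataset["test"].select(range(200))
```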
The 5,000-sample subset is there if you want to do epoch-based training; today we're just going to train for a fixed number of steps as an example, so it can complete in the time we're together, but if you wanted to just let this thing rip, you absolutely could. The process for tuning these models is the same whether we train on a subset or the full dataset. The thing we want to do is convert our actual data into a prompt, and we do have to do it backwards, right? So we're going to have this instruction: "Use the provided input to create an instruction that could have been used to generate the response with an LLM." The idea is that we give the model an input, which is going to be the response from our dataset, and we ask it to generate the instruction that was used to get that response. We're just doing it in reverse, and this is an effective way to showcase how fast and how effectively these models learn. More importantly, these are exactly the kinds of tools you would use to generate an instruction-tuned dataset from your model, so it's a useful pattern to learn. We're going to create this prompt sequence the old-fashioned way and really explicitly show what's going on. Our full prompt is going to have our beginning-of-sequence token, then our instruction header, then our system message, which is the phrase that asks the model to create the instruction, then our input, which again is the response from our dataset, then our response header, and then a response, which is actually the original instruction from the dataset, and finally our end-of-sequence token, just to make sure all of our tokens are aligned. You can see here that we get the instruction "Use the provided input to create an instruction that could have been used to generate the response with an LLM," the input "There are more than 12,000 species of grass..." and so on, and then the response "What are the different types of grass?" So the idea is that we have an input that describes a bunch of grasses, and our response, or instruction, is "What are the different types of grass?"
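A minimal sketch of that reversed prompt template, assuming the dataset exposes "prompt" and "response" columns and using illustrative header strings (the notebook's exact template may differ):

```python
def create_prompt(sample):
    """Build the reversed prompt for one dataset row: the dataset's *response*
    becomes the input, and the dataset's original *prompt* becomes the target."""
    bos_token = "<s>"
    eos_token = "</s>"
    system_message = (
        "Use the provided input to create an instruction that could have been "
        "used to generate the response with an LLM."
    )

    full_prompt = bos_token
    full_prompt += "### Instruction:\n" + system_message + "\n\n"
    full_prompt += "### Input:\n" + sample["response"] + "\n\n"
    full_prompt += "### Response:\n" + sample["prompt"]
    full_prompt += eos_token
    return full_prompt
```

For the grass example, sample["response"] is the paragraph about the 12,000 species of grass and sample["prompt"] is "What are the different types of grass?", so the model learns to emit the question given the answer.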
So that's great; where do we go from there? Well, we've got to load the model. At this point this is hopefully fairly boilerplate for those of you joining us from our current cohort, and we'll be delving into the details of what's going on here in quite significant fashion; for those of you just joining from outside, we have a number of events planned to walk you through it in more detail. For now, think of it like this: the model is huge, and it would take up way too much space on our GPU. Even though we're using the Pro version of Colab and have 40 gigabytes of GPU RAM, that's nothing compared to what we would actually need to load and train this model at full precision. So we're going to quantize our model from the 32 bits it normally lives in all the way down to 4 bits, a massive reduction in how we store the model. Now, obviously we can't, or shouldn't, I should say, train with only 4 bits, so we're going to use bfloat16, a 16-bit or half-precision format, as our compute dtype, while the storage dtype stays 4-bit. We'll hold all of our weights in 4-bit, but when we train with them we upcast to 16 bits, which is an effective way to train while keeping the space we gained from the 4-bit quantization. We're also, thanks to Tim Dettmers, who is a mad lad and powers a lot of this stuff, going to quantize our quantization constants, which saves us even more space. Then we can just load this thing up: we use device_map="auto" to make sure it lands on the right device, we grab our tokenizer, and we make just a couple of changes to the tokenizer to ensure it's set up correctly for training.
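Roughly, that 4-bit, double-quantized load can be expressed through transformers and bitsandbytes like this; the nf4 quant type and the specific pad-token tweak are common QLoRA-style defaults assumed here rather than taken from the notebook:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

# Store the weights in 4-bit, quantize the quantization constants (double quantization),
# and upcast to bfloat16 whenever we actually compute with them.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",          # assumed default; the video doesn't name the quant type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                  # let Accelerate place the layers on the GPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token   # Mistral's tokenizer has no pad token by default
tokenizer.padding_side = "right"            # typical setting for causal-LM fine-tuning
```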
Let's see how the model in its base form does at our task. We give it the instruction, "Use the provided input to create an instruction...", with the input about there being more than 12,000 species of grass, and the response it gives us is "When it comes to grass, there are many different varieties to choose from, the most..." I mean, this is definitely not an instruction, I hope we can all agree; it just goes on and on about grass, which is dope, and it's cool the model can do that, but it's absolutely not what we want. So how do we get it to do what we want? Ah, of course: fine-tuning. In order to fine-tune the model, beyond the fact that we've turned it into a 4-bit representation, right, we've crammed it down, taken the original model and slashed it to 25% of its original space, that's still not nearly enough, so we have to use something called LoRA. We're using PEFT LoRA, a parameter-efficient fine-tuning method; LoRA stands for low-rank adaptation. The idea is that training our model involves big matrices, and we have many of them. LoRA lets us use much smaller matrices to represent each larger matrix, exploiting the fact that there's a lot of redundant information in the big matrix when it comes to our task. If you imagine that the full weight matrix holds every single task the model can do, but our task only takes up a small portion of that matrix, then we shouldn't need the full representation of the matrix to train on a very specific task. That's the basic intuition behind LoRA. We're going to replace our big thousands-by-thousands-of-dimensions matrix with a 64-by-n and an m-by-64 pair of sub-matrices, which we factor together to reproduce the big matrix, and we'll use that to do our parameter updates. If you're not sure what I'm talking about, no worries, we'll spend lots of time going into it, and I've got a YouTube video that goes pretty deep on it, but for the most part the idea is that we use these small matrices to represent our big matrices, and that saves us even more space. So not only are we slashing the weights down to a quarter of their original size, we're also going to use a very, very small number of actual trainable weights. You might ask, well, how does that save us memory? The idea is that the model itself has a certain footprint: when we do inference, it takes up a certain amount of space, and you can see that starting point on the chart; that's how much GPU it takes just to load the model and infer with it. But when we train the model, we need to hold all of the optimizer states for those weights in memory, and that takes insane amounts of memory. So LoRA cuts way down on the optimizer states we care about, which massively reduces memory. The old example that never gets old: if you have a 100-by-100 matrix, that's a lot of parameters to keep track of, but a 5-by-100 and a 100-by-5 matrix together are far fewer parameters (10,000 versus 1,000), and that's the whole idea of LoRA. Enough about LoRA, though. We need to call the prepare_model_for_kbit_training and get_peft_model helper functions from the PEFT library. If you don't do this, the model won't be loaded with these configs, it will have a bad time, and you will pull your hair out over OOMs, so make sure you call these helpers. Then we just set up some hyperparameters, which are straightforward enough: we're going to train for only 100 steps; you can train by epochs if you wish, just uncomment one line and comment out the other. We're using a batch size of four, which is fairly small, because we're also going to do evaluation every 20 steps, so five evals throughout this training. Our model is probably going to overfit, as all these models will, but we'll keep an eye on it. Then we use the SFTTrainer, the supervised fine-tuning trainer, to automatically produce, in the same way we're used to, a supervised fine-tuning set from our original unlabeled dataset, and it's going to be great. Even though Mistral has a maximum context length of 8K, we're only going to use 2,048 today, since we're just generating those instructions. Then we call train. You'll notice the model trains; we train for 100 steps, and you can see the training loss comes down some and the validation loss also comes down some, so it doesn't look like we're badly overfitting, and of course if we let this continue to train for a long time we'd hope to see those losses come down to a reasonable level. We save our model locally in case we want to use it later, and we save and push to the Hub; note that we have to push the adapters to the Hub, we can't push the full merged-and-unloaded model. To be clear about that: when we train with LoRA, we're left with an artifact called an adapter, and an adapter is something we apply to the base model to give it the ability we fine-tuned in. You can think of it like this: the model is a drill and the adapter is a specialized bit, so we can put a special bit on our model; that's the adapter. There's also a process of merging and unloading the model where, instead of using a drill with a bit, it's like we get a specialized tool that just does the one thing that bit was for. But we can't push a 4-bit model to the Hub yet. However, the same man we talked about earlier, Tim Dettmers, is working on a PR right now; as recently as four hours ago they were working on it, so we should be able to merge and unload our model and push it to the Hub soon, which will be helpful for running it with other inference optimization strategies. For now we're just pushing our adapters, and that's fine: we can load the base model, apply our adapters, and everything's good.
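Putting the LoRA setup and the training run together, a sketch might look like the following. The rank of 64, the 100 steps, batch size of four, eval every 20 steps, and 2,048-token sequence length come from the walkthrough; the target modules, learning rate, alpha/dropout, and repo names are illustrative assumptions, and the prompt is wired in here by pre-mapping a "text" column rather than however the actual notebook does it:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer

# LoRA: learn small rank-64 update matrices instead of updating the full weight matrices.
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,                      # assumed
    lora_dropout=0.1,                   # assumed
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

# Without these helpers the quantized model won't train properly (hello, OOMs).
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

# Apply the reversed prompt template ahead of time so the trainer sees plain text.
train_ds = instruct_tune_dataset["train"].map(lambda s: {"text": create_prompt(s)})
eval_ds = instruct_tune_dataset["test"].map(lambda s: {"text": create_prompt(s)})

training_args = TrainingArguments(
    output_dir="mistral7b-instruct-generator",   # hypothetical name
    max_steps=100,                      # or set num_train_epochs for epoch-based training
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="steps",
    eval_steps=20,                      # five evals across the 100-step run
    logging_steps=20,
    learning_rate=2e-4,                 # assumed; a common QLoRA-style default
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    dataset_text_field="text",
    max_seq_length=2048,                # well under Mistral's 8K context, enough for this task
    tokenizer=tokenizer,
)

trainer.train()

# Save the LoRA adapter locally and push just the adapter (not a merged 4-bit model) to the Hub.
trainer.save_model("mistral7b-instruct-generator")
trainer.model.push_to_hub("your-username/mistral7b-instruct-generator")  # hypothetical repo id
```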
So let's look at a response from our fine-tuned model, and keep in mind we only fine-tuned for 100 steps with a batch size of four. When we ask the same question we asked before, "Use the provided input to create an instruction..." with that input about the many different kinds of grass, our response is "Identify the most common species of grass and provide a brief description of its properties," which is much better, right? It's much more in line with what we wanted: it doesn't just ramble on about grass, it gives us an instruction that could have led to that big block of text. And that is fine-tuning Mistral 7B. There you go.
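To close the loop, reloading the quantized base model and applying the saved adapter for inference might look like this; it reuses bnb_config and tokenizer from above, the adapter path mirrors the hypothetical name used earlier, and the input text is only a stand-in for the full grass paragraph:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the 4-bit base model and attach the LoRA adapter we just trained.
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
ft_model = PeftModel.from_pretrained(base_model, "mistral7b-instruct-generator")

# Same reversed prompt as before, with the response left blank for the model to fill in.
prompt = (
    "### Instruction:\n"
    "Use the provided input to create an instruction that could have been used "
    "to generate the response with an LLM.\n\n"
    "### Input:\n"
    "There are more than 12,000 species of grass. ...\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = ft_model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```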
Info
Channel: AI Makerspace
Views: 7,800
Id: Id4COsCrIms
Length: 21min 58sec (1318 seconds)
Published: Thu Nov 09 2023