Fine-Tuning with LoRA (Low-Rank Adaptation)

Captions
Hey Chris, is it true that we can use PEFT LoRA to train less than 1% of the trainable parameters in LLMs and still get great results? That's absolutely right, Greg. So how much data do we need to make that happen? A lot less than you'd think. So you're saying that with a little bit of data and a little bit of fine-tuning, we can modify the behavior and performance of LLMs to be efficient for our tasks, our applications, and our businesses? That's absolutely correct. That's so cool. I'm pumped to dig into this today. We'll see you back in a bit, dude, for the demos.

Hey everybody, my name is Greg Loughnane and I'm the founder and CEO of AI Makerspace. Thanks for taking the time to join us today. As we kick off the new year, we want to demystify one of the most mature aspects of building LLM applications for you: fine-tuning, parameter-efficient fine-tuning, and low-rank adaptation, or LoRA. Hopefully this helps you get started on the right foot in 2024. We'll be collecting questions today via Slido, so please drop your questions into the link in the chat and upvote your favorites. Chris, my friend, CTO of AI Makerspace and the LLM wizard, will be back to lead our code demos soon.

So let's get right into it: efficient fine-tuning of LLMs with low-rank adaptation, or LoRA. As always, we want to align our aims for the session. Here's what you'll get out of today: what is fine-tuning, what is PEFT, what is LoRA, and how can we actually fine-tune these LLMs? We'll see exactly how it's done with the latest and greatest tools from Hugging Face, including one of the latest and greatest open-source models. To contextualize this, we first want to talk about why this idea of fine-tuning is important in the first place as we build with LLMs. Then we'll talk about fine-tuning proper, and then dig into PEFT, the library and the idea of doing this a little more efficiently. Finally we'll talk LoRA, low-rank adaptation of LLMs, before Chris comes back to lead us through some demos, and of course we'd love to answer any questions you have.

When we prototype LLM applications in 2024, we generally think: first, prompt-engineer as far as we can and get the thing up and running in a minimally sufficient one-shot or few-shot way. Then maybe we want to ground our LLM application in some of our own data, our own context; this is where retrieval augmented generation comes in, along with the general idea of a question-answering system. And after some prompt engineering and some RAG, we're often looking at fine-tuning with a keen eye, trying to understand when fine-tuning becomes useful to us. A lot of people ask, "Do I do RAG or fine-tuning?" and the answer is: it depends. It's not always linear. As OpenAI has put out there, there are actually two different things we're optimizing when we move from prompt engineering into RAG or fine-tuning. The fine-tuning piece in particular is about the way we want the model to behave: the kind of inputs we want it to get and the kind of outputs we want it to provide. General large models can take general inputs and give general outputs, but more specific, task-oriented models, the ones we'll need in our businesses, are going to be a little more dialed in, a little more fine-tuned.
Often, as you build these applications, you'll go from prompt engineering to RAG to fine-tuning and back again to get to a human-level performance you're happy with before you first deploy to production. And once you're in production, continuing to dial in the RAG system and to fine-tune the LLM, as well as the embedding model, is going to be very important as you try to serve more and more users, stakeholders, and customers better and better.

It's also worth noting that as we head into the new year, one idea gaining a lot of traction is small language models, and here we see this word "efficiency" that we're going to see a lot today. Microsoft's work here was noted in the State of AI report last year in a couple of papers. One was called TinyStories, and it was interesting because it compared 125-million-parameter-class models like GPT-Neo or the small versions of GPT-2, which really don't do a very good job of generating coherent text out of the box, to a much smaller model of just 10 million parameters, trained on synthetic data made up of stories using very simple language, the kind of language you could use with a three- or four-year-old. The TinyStories dataset showed it's possible to build much smaller language models that still give pretty solid English generation, whether you're training 10 million parameters or you have a few more parameters but just one Transformer block. The same was shown for coding specifically in another paper, Textbooks Are All You Need: instead of training a huge model and then trying to make it good at coding, it took a relatively small model of 1.3 billion parameters (still very large in the context of all the models we've ever seen, but small next to today's leading models of at least seven billion or so) and fine-tuned it on "textbook quality" coding exercises, some taken from textbooks and some synthetically generated, and it even produced a very small, relatively speaking, 350-million-parameter model with great results. These are worth highlighting because whether we're talking about training from scratch or fine-tuning, we're kind of talking about the same thing: dialing these models in for our use cases, and that's exactly what fine-tuning is all about at the end of the day.

So when we talk about fine-tuning, as in the latter case of Textbooks Are All You Need, we're talking about taking a language model that was trained using unsupervised pre-training, probably with some combination of instruction tuning and other alignment techniques as well, although not in all cases. We take that base model (it could be a chat model or an instruct-tuned model), and then we take our specific instruction examples, which might be very specific, maybe generated from our particular product line, and we update our language model on them to produce a supervised fine-tuned model that's really dialed in to the kinds of inputs, outputs, and specific instructions we want the model to be good at, because out of the box
these models can be good at so many different things. As I alluded to a moment ago, whether we're talking about fine-tuning or training, we're talking about the same thing: updating the weights, or parameters, within our large language models. This isn't new to anybody who's been around the game for a while. Training a simple deep-learning neural network is not fundamentally different from fine-tuning a large language model, although there are some key differences we'll highlight today. The mechanics are the same: we have inputs that we pass through the weights, and those weights are fundamentally the things that get updated. They can be updated through backpropagation, the classic technique, and we're also seeing some really interesting work on fine-tuning without backpropagation at all, so expect a lot more innovation in this space, especially as small models become more and more popular. Chris did a great walkthrough of LoRA at an even slightly deeper level in a longer-form video on his YouTube channel, which I highly recommend if you want to dig further into the specific visualization of the weight matrices. Today, though, we're going to go from the mechanics of fine-tuning into this idea of PEFT.

There's a great meme here because it's so true: this is the new "AWS mistake," right? Do I really need to fine-tune this? Do I really need to spend this much compute messing with things? This is where we get into being a little more efficient. PEFT is parameter-efficient fine-tuning, and the big problem it solves is this: when you fine-tune on downstream tasks specified by datasets of specific instructions and output responses, you can get huge performance gains compared to off-the-shelf LLMs used zero-shot. Zero-shot means not providing a ton of specific instructions or examples, and you can get pretty far with prompt engineering alone, so when fine-tuning comes into play is going to be a judgment call for you as you continue to build and prototype your LLM applications. As for the "why" of PEFT, we've kind of covered it already: LLMs are so big. Billions and billions of parameters are hard to deal with; they're hard to load into memory, hard to run inference on, hard to manage. Because they're so big, we need a way to leverage the best parts of them without making it impossible on the machines or cloud instances we have access to. And this problem isn't going away. Although models will continue to get smaller (Mistral and others are pushing those boundaries on the Open LLM Leaderboard all the time), we're also going to see models get bigger still as we approach and move towards AGI. Fundamentally we want to see what bigger models are capable of, so the biggest labs will keep generating bigger models, while some of the startups out there, like Mistral, generate smaller ones. Either way, we're talking about the weights of the trained model, the parameters. When we download a model, we're downloading the weights; we're downloading the parameters. Same thing.
Those parameters are always represented by a certain number of bits, and if we download in full precision, we're dealing with a lot of memory. This gives way to the idea of quantization, which we'll cover in an upcoming event next week, and we'll see a little quantization today in Chris's code demo. The important thing to understand is that we're just dealing with these weights, these parameters, and we want ways to be more efficient when dealing with them.

When we think about training those neural networks, training those parameters, it's the same as fine-tuning, but not exactly the same, because with a pre-trained network the weights are already pretty solid. We're not going to change every single weight by a whole lot when we fine-tune. In a simple example like sentiment analysis, if you were fine-tuning for a specific domain, you'd imagine many of the weights in the network wouldn't change very much, although some might, and that's how we want to think about fine-tuning: we're not changing everything, we're simply fine-tuning. There are two challenges with full fine-tuning. Even though we don't necessarily need to change every weight by much, we are changing all of the weights, which is very hard to do on our own hardware; and when we do inference and try to store and deploy these things efficiently for low cost, that's also very hard. We need smaller models. The solution is to leverage a smaller version of the large model by dealing with only a smaller number of trainable weights when we fine-tune, and that's exactly the approach we'll take today. This gives us models that are better not only on out-of-dataset information we want to run inference on (because fundamentally they stay more general) but it also lets us do a whole lot with not very much data, which is very cool.
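To make that memory pressure concrete, here's a rough back-of-the-envelope calculation for a 7-billion-parameter model at different precisions. This is weights only; real usage also needs activations, optimizer state, and so on.

```python
# Approximate weight-only memory footprint of a 7B-parameter model at various precisions.
params = 7_000_000_000
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "4-bit": 0.5}

for precision, nbytes in bytes_per_param.items():
    print(f"{precision:>9}: ~{params * nbytes / 1e9:.1f} GB")
# fp32 ~28 GB, fp16/bf16 ~14 GB, int8 ~7 GB, 4-bit ~3.5 GB
```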
So, from PEFT to LoRA. This image probably doesn't mean a whole lot to you if you're not familiar with LoRA yet, but after this brief discussion I think it will. For complete newcomers, LoRA is simply the number-one parameter-efficient fine-tuning method to go out and learn today; it's the first one to start with, and from there you can start to dig deeper. So what's really going on in LoRA, and what's the intuition behind why it works? The intuition is similar to the intuition behind why fine-tuning works at all. We have these pre-trained models, and not everything within them matters the same amount; they have a low "intrinsic dimension" (intrinsic simply means essential), and that natural dimension of the things that matter for the behaviors we want the model to exhibit is what we're really focused on. We don't need to change every single parameter by a whole lot; we can change just the key parameters. Identifying those key parameters is a science unto itself, one where we expect a lot more progress in 2024, but the key intuition is that we don't need to change all of them, a whole lot, all the time.

When we talk about LoRA, we move from the idea of intrinsic dimension to the idea of intrinsic rank, and LoRA is really focused on specific matrices within the Transformer architecture. The hypothesis the authors put forth is that the updates to the weights also have a low "intrinsic rank" during "adaptation." So we have intrinsic rank and we have adaptation; let's talk about rank first. Harken back to your linear algebra days: rank is the number of linearly independent columns in a matrix. On its own that's not very helpful, but if we think of it as a weight or parameter matrix, those linearly independent dimensions are the ones that provide unique information about the problem we're solving, the behavior we want the LLM to exhibit, the task we want completed properly. As an analogy, here's a screenshot of the word2vec embedding model shown in three dimensions through principal component analysis; there's a 200-dimension underlying representation at play positioning each of these words relative to one another, and each of those 200 dimensions matters. The question is how many dimensions matter in your language model, for your task, and that's what we ultimately want to drive towards in this small-language-model domain.

Beyond matrix rank, LoRA works in two steps. Step one is to freeze all the pre-trained weights. Then we create adapters: injectable, trainable rank decomposition matrices that we can plug into the pre-trained weights of the LLM we downloaded. These plug-and-play adapters are made up of rank decomposition matrices, and again, going back to linear algebra, decomposing matrices is about chunking big problems into smaller ones to make things more computationally efficient. If you ever used MATLAB back in the day and tried to invert a million-by-million matrix real quick, you can definitely make MATLAB go brrr, but there are better ways. We don't try to invert million-by-million or billion-by-billion matrices, and we don't try to do extremely computationally intensive things with matrices when we're dealing with LLMs either. This is where LoRA comes into play.

Finally, LoRA is going to be used in a Transformer context; it's not simply a feed-forward neural network, although that image is helpful as we get a handle on it. We did a really long-form deep dive on Transformers previously, and we'll probably do another one soon, so check that out if you're interested. The important point here is that LoRA isn't changing everything within the Transformer. In this setup it's associated only with the attention layer in each of the Transformer blocks, not the feed-forward layers: just the scaled dot-product attention happening within each block. And as you'll see, it's not even all of that; it's just the queries and the values, the Q and the V, that we actually apply the LoRA adapters to. If you're familiar with Transformers, you'll recognize the encoder block and the decoder block in this diagram; many of the common LLMs you'll deal with today are decoder-only style, which is worth noting as you start building with decoder-only LLMs and leveraging LoRA for fine-tuning.
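To make the rank-decomposition idea concrete, here's a minimal numeric sketch. The 4096 dimension matches the attention projections we'll see in the demo, the Gaussian/zeros initialization and the alpha-over-r scaling follow the LoRA paper, and the rest of the numbers are purely illustrative.

```python
import torch

d, r, alpha = 4096, 64, 128          # hidden size, LoRA rank, LoRA alpha (scaling)

W = torch.randn(d, d)                # frozen pre-trained weight (never updated)
A = torch.randn(r, d) * 0.01         # LoRA A: small random Gaussian init
B = torch.zeros(d, r)                # LoRA B: zeros, so the update starts as a no-op

x = torch.randn(d)
h = W @ x + (alpha / r) * (B @ (A @ x))   # frozen path plus scaled low-rank update

# Only A and B are trained: ~524K values instead of ~16.8M for the full d x d matrix.
print(W.numel(), A.numel() + B.numel())   # 16777216 524288
```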
So LoRA provides a ton of advantages for us. It gives a more efficient way to train fewer parameters. Those adapters are very plug-and-play: if we keep the initial weights and train a bunch of adapters for different tasks, we can use the same base model for a whole bunch of different things, so you can start to get curious about how this dovetails into hosting, inference, and doing more efficient work on the ops side as you deploy more and more of these applications for your company. You can combine LoRA with other methods and get very good results. And a key point: you can merge the adapter weights with the base model, and (this is an awesome Hugging Face library feature) in doing so you ensure there's no additional inference latency versus the base model itself.

To recap before the demo: we talked about fine-tuning, which is simply modifying LLM behavior by updating weights or parameters. We talked about PEFT, which is fine-tuning with fewer weights or parameters. And we talked about LoRA, low-rank adaptation, which is fine-tuning with factorized, or decomposed, matrices. Put it all together and PEFT-LoRA fine-tuning is just modifying LLM behavior by updating fewer parameters using decomposed, factorized matrices. In a diagram, you can think about PEFT-LoRA fine-tuning versus regular fine-tuning a little bit like this.

Today's build is a classic application: the Uno reverse card. Given a response, an LLM output, we're going to predict the instruction, the prompt, the LLM input. We'll pick up one of the best off-the-shelf models that works well with this fine-tuning paradigm, Mistral 7B Instruct v0.2, definitely one of the best models you can pick up today. For fine-tuning data we'll use the Alpaca GPT-4 dataset, where we simply switch the output and the instruction; you can read more about this dataset in the paper linked here. Without further ado, it's time for the demo: fine-tuning with PEFT LoRA. Chris, the LLM wizard, is back. Let's see him in action.

Thanks, Greg. Okay, so the idea is pretty simple: we want to do exactly as Greg outlined. This is a method to fine-tune a model on much smaller hardware than you'd normally be able to get away with. You'll still want a larger card, on the order of 40 GB of GPU RAM for this specific example, because memory usage peaks pretty high during training; if we look at our GPU RAM, we peaked around 15 GB, so a T4 might not be big enough for you. You can use a very small batch size, of course, and that will help you cheat around some of the space requirements, but for the most part we're going to focus on using a larger card. We're going to use three main libraries: PEFT, Transformers, and bitsandbytes. We'll mostly gloss over the bitsandbytes step, though it's extremely important (we'll talk about it more next week when we talk about quantization), and for right now focus on PEFT and Transformers.
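If you want to check which card Colab gave you and how high memory actually peaks while you train, a quick PyTorch check looks like this (illustrative, not taken from the notebook):

```python
import torch

# Which GPU did we land on, and how much memory does it have?
print(torch.cuda.get_device_name(0))
total = torch.cuda.get_device_properties(0).total_memory
print(f"total: {total / 1e9:.1f} GB")

# After training, the peak allocation tells you whether a 16 GB T4 would have been enough.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```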
The idea is that we have this model, Mistral 7B Instruct, but we want to make it better at a specific task. We're not trying to teach the model anything new; we're just trying to get it to a place where it performs the behavior we desire a little bit better. We're going to use Mistral 7B Instruct v0.2 from Hugging Face, along with the Alpaca GPT-4 dataset; you can find both the model and the dataset on the Hub.

So what is PEFT? PEFT is parameter-efficient fine-tuning. Like Greg said, all that really means is that we're fine-tuning with fewer parameters, or leveraging our parameters more efficiently. LoRA is a way to do that using a very clever trick: we know our weight matrices have inherently, or intrinsically, low rank (this is discussed in the papers you saw Greg talk about), so we fine-tune our model taking advantage of that. The setup is pretty straightforward. It might sound complicated on the concept side, but on the code side, Hugging Face's Transformers library has us ready to rock. We grab our dependencies and then load our model; you'll notice we're loading mistralai/Mistral-7B-Instruct-v0.2, as discussed. Basically, what the loading configuration does (we'll deep-dive again next week) is load the model in a very small fashion: we massively reduce the number of bits each of our weights, each of our parameters, takes up in memory, and that helps us load it onto a smaller card. That's why we can train this whole model and only peak at 15.4 gigabytes of GPU memory, which is pretty awesome. After that, we load the tokenizer, which is normal stuff, but we do have to do some tokenizer processing. The reason is that Mistral uses the Llama tokenizer as its root (the whole Mistral suite of models is kind of a redux of the Llama models), so we want to make sure we do this preprocessing so our training goes smoothly. That's the only reason it's there.

Now, looking at the model architecture, you'll notice we have 32 decoder layer blocks. Each has a self-attention layer with q, k, and v projections plus an output projection, as well as rotary embeddings; there's also an MLP with gate, up, and down projections and an activation function, plus some layer norms. The big thing to focus on is the attention layers. The original LoRA paper talked about targeting, say, Q and V as a rule; some more recent results suggest it might be prudent, especially when using things like QLoRA, to target all of the non-layer-norm layers equally. For this example, though, we're just going to use the automatic PEFT defaults, meaning we let PEFT make that decision for us. You could see better results if you targeted more layers, or more modules to be specific, but we're just going to focus on Q and V today.
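Here's a minimal sketch of the loading step just described: the 4-bit quantized base model plus the Llama-style tokenizer fix-up. The exact bitsandbytes flags and the pad-token handling are assumptions about what the notebook does rather than a copy of it; check the versions of transformers and bitsandbytes you have installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # shrink each weight to 4 bits on load
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the actual math in bf16
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token   # Llama-style tokenizers ship without a pad token
tokenizer.padding_side = "right"            # common choice for causal-LM fine-tuning

print(model)   # shows 32 decoder blocks with q/k/v/o_proj attention and gate/up/down_proj MLPs
```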
Now that we've got our model loaded and we've decided what we want to focus on, we need to turn the model into a format that's compatible with LoRA, and the way we do that is pretty straightforward. We'll also use a print-trainable-parameters helper function to really see how few parameters we need to train.

The LoRA config has a number of options. The first is r, or rank; when we say low-rank adaptation, this is the rank we're talking about. Notice that even though it's a low rank, we can go pretty high with it; we just won't go as high as our base dimensions. In this example we'll use 64, for no particular reason: r is a hyperparameter, and you'd want to do some kind of hyperparameter search to determine the best value, but for right now 64 works as an example. The idea behind the rank is that we factor the base-dimension-by-base-dimension matrix into a base-dimension-by-r submatrix and an r-by-base-dimension submatrix. We build these smaller submatrices, and they're what we update, because (as we'll see in the architecture) what we really want to compute is the delta weight, delta W, and the efficient way to do that is to only care about this factorized version of the full weight matrix.

In the code, we pass in r=64, and then we pass in our LoRA alpha, which is 128. The conventional wisdom is that alpha should just be twice r. There's an incredible blog post from the Lightning AI folks where they ran many different experiments; it's not always true, but it's a great place to start. You might want to explore different ratios, but for the most part set alpha to twice r. Alpha is just a scaler. For target modules, we again have a decision: we can let PEFT take the wheel if the model already exists in the PEFT library's defaults, or we can target specific modules. For this instance we let PEFT take the wheel, and it chooses Q and V, omitting K; we could manually add K as well as whatever MLP layers we wish, totally up to you. We also have a dropout (dropout is important to prevent overfitting), we're not using a bias, and our task type is causal LM, since we're doing a causal language modeling task.

Next we need to prep our model. First we call prepare-model-for-k-bit-training, which we want to do, and then we get our PEFT model, which we also want to do. There's a lot going on under the hood, but it boils down to moving the model into a format that's compatible with our desired training. Then we print our trainable parameters just to see: we go from a lot of parameters down to roughly 27 million, which represents 0.72% of the total trainable parameters. So we're only training 0.72% of the parameters we could train using this method; that's where the "parameter efficient" comes in, and despite training less than a percent of the total trainable parameters, we're able to get great results.
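In code, the configuration and wrapping step looks roughly like this. The dropout value is an assumption (the talk doesn't give the exact number), and leaving target_modules unset relies on PEFT's built-in defaults for this architecture, which pick the q and v projections:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=64,                   # rank of the low-rank update matrices
    lora_alpha=128,         # scaling factor; the usual starting point is 2 * r
    lora_dropout=0.05,      # assumed value; tune alongside the other hyperparameters
    bias="none",
    task_type="CAUSAL_LM",
    # target_modules not set: PEFT falls back to its defaults (q_proj and v_proj here);
    # you could list k_proj, o_proj, gate/up/down_proj, etc. to target more modules.
)

model = prepare_model_for_kbit_training(model)   # make the 4-bit model safe to train
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# roughly: trainable params ~27M || all params ~3.8B || trainable% ~0.72
```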
Let's look at the model architecture to see exactly what's happened. Our Q projection, which was previously just a linear layer, has now been replaced by a LoRA linear: it has the base layer (that's the frozen W), then the LoRA dropout, then lora_A and lora_B. lora_A maps 4096 down to 64 and lora_B maps 64 back up to 4096; if we combine these two matrices we get a 4096-by-4096 matrix, which is what we want, since the input and output are 4096 by 4096. The idea is that we're just breaking our delta W down into these two submatrices, and every time we calculate what our weight changes should be, we update lora_A and lora_B, combine them, and add them (literally add them) to the base layer to keep track of our changes so we can keep learning. That's it. One of the two is initialized as a random Gaussian and the other is initialized as zeros; there's some intuition behind why, but for the most part that's the rule. You can do more tactical initializations, but the base method just uses the Gaussian and the zeros. You'll notice the K projection is still just a linear layer, while V also has LoRA layers; that's because PEFT only targeted Q and V. And that's all LoRA does: every time we update our weights, what we really update are these LoRA submatrices, which then get combined and added to the base layer to get the final result.

So how do we train this on data? It's great to have a better understanding of how it might work, but how do we actually train on data? Well, it's the same as every other time: we grab some data and create a dataset. It has 52,000 rows, and we don't need all 52,000 for this sample, so we'll select a subset of 5,000 rows. Each row has an instruction, an input, and an output. We then convert each row into a format along the lines of "Generate a simple instruction an LLM could use to generate the provided context," followed by the context and then the response. The idea is that we show the LLM the response and have it create the instruction, which could help us synthetically generate an instruction dataset. The prompt-generation function is what the library we're about to see will use to build each training prompt: we feed it a row of the dataset and it generates a prompt in this format. It gets mapped over our entire dataset, and the result looks like: "Create a simple instruction that could result in the provided context," then the [INST] tag, then a bunch of context, then the [/INST] tag, and then our response.
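A sketch of the data side might look like the following. The Hub dataset ID, the subset selection, and the prompt wording and tags are reconstructions of what the walkthrough describes rather than the notebook verbatim; the demo passes its prompt function to the trainer directly, while this sketch pre-builds a `text` column, which amounts to the same thing.

```python
from datasets import load_dataset

# Assumed Hub ID for the Alpaca-GPT4 data; swap in whichever copy you actually use.
dataset = load_dataset("vicgalle/alpaca-gpt4", split="train")
dataset = dataset.shuffle(seed=42).select(range(5000))   # 5,000 of the ~52,000 rows

def generate_prompt(row):
    # Uno reverse: show the model the output and train it to recover the instruction.
    return (
        "Create a simple instruction that could result in the provided context.\n"
        f"[INST]\n{row['output']}\n[/INST]\n"
        f"{row['instruction']}"
    )

dataset = dataset.map(lambda row: {"text": generate_prompt(row)})
print(dataset[0]["text"])
```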
Once we have that set up, we can move on to creating our training arguments. It's a lot, but it's all boilerplate: we set our output directory, train for 100 epochs, and use a batch size of four. If you're doing this on a T4 instance or the free Colab version, you might want to reduce that to two or one to prevent out-of-memory issues. We use gradient accumulation steps, which is a way to cheat out a higher effective batch size, along with gradient checkpointing and the paged AdamW 32-bit optimizer straight from the QLoRA paper, which is just a good optimizer to use here. We have a fairly aggressive learning rate, and of course we set our dtypes; we'll talk about this more next week, but the idea is we need to make sure we're computing things in the correct data type. The rest is remaining boilerplate. These are all hyperparameters you'll want to fiddle with if you're producing a model for production use, to ensure you get the best possible result.

Then we use TRL's SFTTrainer; SFT just means supervised fine-tuning. We pass in our model, our dataset, and our LoRA config, plus a max sequence length of 2048 (you can set it to whatever you'd like, as long as it's less than or equal to your model's max sequence length). We pass in our tokenizer and the generate-prompt function we built above so it can be mapped over our dataset, and finally the training arguments we just set up. After that, all we have to do is everyone's favorite thing: call .train(). You'll notice the loss slowly goes down over the course of training, down to around 0.8 and then into the double zeroes, which is great. It does plateau fairly quickly and then we just keep training, so this is likely overfit. We'll go ahead and save this model.

Now, here's the interesting part: you can't really run inference with just this. This model is a LoRA model, so we need to set up an Auto PEFT model that leverages the adapter, those weight matrices we wound up training. You can't run inference on the newly trained weights alone; you have to convert them into a format suitable for inference. You can either merge the weights you trained (take that delta W, the combination of those two smaller matrices, and plop it on top of the base layers), or you can do that process during inference. That last option, doing it at inference, is what makes LoRA an absolutely insane approach to use in production in terms of flexibility. You can fine-tune the same model on a thousand different tasks; that's a little hyperbolic, but the idea is you can fine-tune it on any number of tasks and then decide at inference time which adapter to use. You only ever host the base model and choose which combined matrices to apply at time of inference, which is what makes this such an incredible tool for a production environment and for hosting models. All you have to do is use the auto model for causal LM, and it loads the model with the additional PEFT adapters. We use the word "adapter" loosely here, because there is a separate method called Adapters; this is distinct from that, but it's the same idea. It's the entity we keep track of, the thing we actually trained: the same way you download model weights, you can download your LoRA weights.
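Putting the training pieces together, a sketch of the arguments and trainer might look like this. Argument names match trl around version 0.7 (newer releases move several of them into SFTConfig), and the learning rate, accumulation steps, and step budget are illustrative guesses at the "boilerplate" described rather than the notebook's exact values. The talk says "100 epochs," but a short step budget is more consistent with how quickly it trains, so treat that knob as something to set yourself.

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="mistral-7b-uno-reverse",
    max_steps=100,                      # illustrative budget; see the note above
    per_device_train_batch_size=4,      # drop to 1-2 on a T4 / free Colab to avoid OOM
    gradient_accumulation_steps=4,      # cheat out a larger effective batch size
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",          # paged AdamW optimizer popularized by the QLoRA paper
    learning_rate=2e-4,                 # "fairly aggressive"; tune for your task
    bf16=True,                          # compute dtype; match what your GPU supports
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    dataset_text_field="text",          # the column built by generate_prompt above
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()
trainer.save_model("mistral-7b-uno-reverse-final")   # saves the LoRA adapter, not a full model
```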
So let's look at how it did. We asked it to generate a simple instruction that could result in the provided context, where the context is "the odd one out is Telegram; Twitter and Instagram are social media platforms," and so on. It answers: "Identify the social media platform that is not an instant messaging service," and then lists the three options. That's more verbose, but it is absolutely the kind of instruction we were looking for. Now, how did the base model do? It's great that the fine-tuned version did well, but the base model does pretty poorly: it says "Identify the platform that's primarily used for instant messaging and voice over IP instead of sharing information," with the answer being Telegram. It's not a terrible job, but it's definitely worse; it doesn't provide the list of options, so how could you identify the odd one out? In any case, the point is this: even with only 100 epochs, in Colab, using a maximum of 15 gigabytes of VRAM, we were able to fine-tune a seven-billion-parameter model to be better at a task. The whole process took very little time, and we can choose which of these fine-tunings to use at time of inference, which cannot be overstated in how powerful it is. That's all for me, so I'll pass it back to Greg, and I'll see you in a bit for some Q&A.

Man, Chris, thanks so much; that was deep and super relevant to anybody building production LLM applications this year. I love to see how much of this carries one-to-one from prototyping into production. In case you missed it, we just trained only about 27 million of a possible 3.8 billion total trainable parameters, which is less than 1%: just 0.72%. To put this together in your matrix-algebra, linear-algebra minds: remember, we're training only the attention layers within the Transformer, and only Q and V. Q has the same dimensions as the attention layer, 4096 by 4096, so we had lora_A and lora_B of those sizes, with Chris choosing rank 64 as in the paper; V has dimension 4096 by 1024, so its lora_A and lora_B were slightly different sizes. That's how it all comes together, and again, those adapters can be plugged right in, even at inference, which is really cool. So, in conclusion: we got some great results, and we learned that PEFT-LoRA fine-tuning is all about modifying LLM behavior by updating fewer parameters with those factorized matrices. Fine-tuning is going to be a heck of an important skill in 2024, as it helps us leverage only the essential, intrinsic dimensions for our downstream tasks, and as the age of small language models comes upon us, it's going to be one of those things we're asked to do more and more. As mentioned, next week we'll cover quantization and QLoRA; we'll talk about things like bitsandbytes and quantization in general, and the trend towards smaller and smaller models will continue, so our discussion of that isn't going anywhere in 2024.
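For the inference side Chris described, loading the adapter on top of the base model and optionally merging it so there's no extra latency, a minimal sketch might look like this. Paths, the prompt, and the generation settings are placeholders.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "mistral-7b-uno-reverse-final"      # wherever the trainer saved the adapter

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Either keep the adapter separate (and swap adapters per task at serve time),
# or fold the LoRA weights into the base model so inference pays no extra cost:
model = model.merge_and_unload()

prompt = (
    "Create a simple instruction that could result in the provided context.\n"
    "[INST]\nThe odd one out is Telegram; Twitter and Instagram are social media platforms.\n[/INST]\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```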
So with that, we'd love to open it up for questions. Chris, come on back up and let's get it rocking and rolling. Feel free to drop your questions in the Slido or in the YouTube chat and we'll do our best to get to them. Chris, that was awesome, man; I think they liked it too. I love LoRA, it really does the thing. It's so interesting: it's one of the few things that stuck around for the vast majority of last year and didn't go anywhere. Nobody budged it except the same people who came out with it, who kept coming out with better stuff, and that's what we'll talk about next week.

So, Deb asks: could one use LoRA to fine-tune a very large image recognition model so that it recognizes a few cases of interest on which it has not been trained? Yeah. LoRA isn't only applicable to language models; it sees a lot of use in models like Stable Diffusion, so in image generation models. I would assume you could do this. I haven't, so I don't want to say super concretely how effective it would be, but you could apply LoRA to basically anything. It isn't specific to any architecture; that's why we can apply it both to the attention weights and to the MLP weights with little issue. So that idea of finding new stuff it hasn't been trained on: maybe, if it's already sort of the infinite object detector, why not? Right. One of the things you can do is fine-tune object detection models to recognize certain classes better or more robustly, and I'm certain you could use LoRA to achieve that. Okay, so try it out, Deb, and let us know.

David W. asks us to define "great results" with PEFT: as one would assume, if you're only using a subset, is there some sort of loss in output quality? Yes, but also no. For the most part LoRA comes without a specific performance hit. It will compromise your model's ability to do all of the tasks it was originally able to do a little bit, but in terms of the base functionality of a language model, which is generating text, there's no real performance hit. If you had a model that was good at math and prose and technical writing, say, and you used LoRA to make it really, really good at technical writing, there's a possibility it might be a bit worse at the math, but it's not going to be much worse. That's what was found, and again, that's because these behaviors exist in a low-dimensional space we can manipulate readily and easily. Nice.

Okay, Ali asks: should we always use a fine-tuned model instead of prompting for production-ready applications, considering that accuracy plays a crucial role? Should we always be fine-tuning, Chris, what do you think? No. There's no hard-and-fast rule for this, and this is my opinion, but there are lots of times you do not need to fine-tune. When prompting is enough, there's no real reason to fine-tune: if you get very good evaluations on whatever you're building without any fine-tuning, if prompting achieves your benchmark targets, then there's no reason to do anything other than prompt the model. There are definitely lots of cases where you will need to fine-tune, but if you don't have to, why spend the cash? A hundred percent. We covered this idea of how to prototype: you start with prompt engineering, maybe move to a RAG system, and then maybe look at fine-tuning. There are some special cases in between, but fine-tuning is, let's say, for when you're mostly there.
You can get most of the way there through some other methods and then just fine-tune that last little bit, but it's definitely something that will come up if you're working on production-ready applications. So, Ali: for students to experiment with these fine-tunings and this learning, what would be your suggested stack to play with, Chris? It's all in the dependencies of the notebook; that's all you need to get started. In terms of the compute solution, Colab is just good enough for experimentation or as a playground, especially if you pay for Pro or pay for compute units, depending on which is more feasible for you. Because what we're talking about is training these big models using much smaller resources, you can even use the free version with a batch size of one to fine-tune Mistral 7B. So Colab is a great place to get started: it's really easy, you don't have to manage the environment, it's all done in the cloud, you don't have to do any weird sign-ups or request quotas, and you don't have to worry about cost while the thing isn't running. That said, you can run these things locally on a consumer-grade GPU, like a 3090, or a 2080 with 16 gigs of VRAM, or even a Mac; Apple just released libraries that make it even easier to run these models on a Mac. You can do all of this locally if you really want to, but if you don't want to fight with environment setup, Colab is great. It's so good. Yeah, and maybe the real question here was whether you should be using AWS or one of the big cloud providers, and which one, and the answer is: not really, not to get started. You don't need to worry about that yet.

Okay, I'm going to go to a question from the chat, from Viala, because I think it's super interesting and relevant to what's happening today: is it possible to LoRA fine-tune with the DPO method? We've got some lines we can help discern here, LoRA versus DPO. Any hot takes on DPO, Chris? DPO is good. Is it like better RLHF, in a way? Not really, but it is a great solution; it's a great thing to do instead of doing, say, PPO. I would say for the most part these are compatible systems: you can do PPO with LoRA, you can do DPO with LoRA. Again, LoRA isn't specific to a technology; it's specific to the fact that there's a weight matrix we know has intrinsically low rank, and any time that's true you can just plug LoRA in. In terms of DPO versus LoRA, I wouldn't think of them as versus, or as two separate camps; I'd think of them as compatible, synergistic technologies. They're not two distinct things you need to choose between. They synergize, and that's great.

Nice. Okay, a quick question about input and output embeddings in the Transformer, the encoding and decoding blocks: I said the common models we use today are decoder-only. I did, and that's based on the idea that the encoder-only style is only used for a few applications today.
Generally, when we see the GPT models, Llama, Mistral, and all the latest and greatest new ones, these are decoder-only style. We've done a long-form discussion of this in the past that we'll drop in the chat for you now, so you can dig into Transformers more deeply, but that's enough on that question for right now. I want to get to maybe one or two more.

Somebody says: Chris, I'm quite new here. I'm in finance, and a sample use case is portfolio management; how would I tune a model to that domain, finance and portfolio management? Well, there are lots of different ways, but the idea is to start with a model that is already good at that. This is kind of the wisdom of our friends at Arcee AI: use a domain-specific model and then adapt that model. So you might want to use something like BloombergGPT, or whatever open-source equivalent you can get your hands on, then find your dataset or build it synthetically, and use LoRA from there to fine-tune. The idea is you'll get better results if you start with a model that's closer to your actual desired behavior. Greg said it during the presentation and it's so true: we're fine-tuning, so it only works if we're already close to the target. If we try to fine-tune from here to way over there, it's not going to happen; we want to start very close to our target and then fine-tune that last few millimeters. So I'd start with that, and then you can use LoRA. There you go. And I'd add that "portfolio management" as a sample use case sounds a little too generic to me. If you can really dial in which aspect of managing portfolios you mean, a specific task associated with portfolio management, maybe the most annoying task people have to do, maybe the selection of stocks within a particular domain of the market, then the clearer you can be on the task, the easier it's going to be to curate the data, the easier it's going to be to do the fine-tuning, and the easier it's going to be to get most of the way there through prompt engineering, RAG, and these other techniques.

Great questions, everybody. Thank you so much for joining us to kick off the new year. Chris, thank you so much for dropping the knowledge on us; we'll see you next time, man. Thank you everybody for joining us live today. That brings us to the end for today, until next week when we talk quantization and QLoRA. Also next week, on Tuesday, January 9th, we'll be launching cohort 2 of our LLM Engineering course. It teaches everything you need to know to build and train your very own LLMs from scratch, including the Transformer, attention, all the different aspects and types of fine-tuning, prompt engineering, and RLHF; you'll be doing it all and building your very own LLM from scratch as a capstone. Check it out if you're into that kind of thing, or reach out to any of us personally on LinkedIn or wherever you can find us: Greg at AI Makerspace, Chris at AI Makerspace. Other than that, please share any feedback you have on today's event that might help us bring even more value to you in the future. Until next time, keep building, shipping, and sharing, and we'll do the same.
Everybody, see you soon.
Info
Channel: AI Makerspace
Views: 2,460
Id: kV8yXIUC5_4
Length: 61min 15sec (3675 seconds)
Published: Thu Jan 04 2024