Quantization of LLMs and Fine-Tuning with QLoRA

Captions
[Music] Hey Chris, is it true that we can improve on our PEFT LoRA approach with this quantization thing? It sure is, Greg. And is quantization really as good and as dope as everybody's saying? Emphatically, yes. Man, I cannot wait to see exactly what's going on inside. You're going to show us how to do this today, right? Sure am. All right, let's get right into it, man. We'll see you back in just a little bit.

Today we're going to talk quantization. I'm Greg, that's Chris, we're from AI Makerspace. This is a bit of an add-on to last week's event, which covered parameter-efficient fine-tuning and low-rank adaptation. Today we're taking it to the next level with quantization: we'll demystify the idea, and we'll also talk about how to leverage the latest in low-rank adaptation, a quantized version called QLoRA. As always, we'll be collecting questions with Slido, so go ahead and submit your questions throughout the day at that link and we'll answer as many as we can after the demo. Of course, we'll have Chris back to wizard his way through the demo on quantization soon, but for now let's cover what we need to know so it all makes sense.

We're going to talk quantization of LLMs today, and we're going to talk fine-tuning with LoRA. The main goal is to really grok QLoRA and then see how we can implement it. We got a little bit of insight into quantization last time when we were loading the model, but now we want to look at how it can be used to fine-tune, along with the background and intuition for why this works and what the industry has learned about the precision of the numbers inside our LLMs. So we'll talk fine-tuning, quantization, and QLoRA, and then we'll do it.

To contextualize this, as last time: fine-tuning often comes into play after we've done prompt engineering, and often after we've set up a retrieval-augmented generation system. At that point we want to optimize our large language model, in other words, dial in how we want the model to act and make the input and output schema a little more constrained, a little less large, a little more small. That's the trend we're noticing as 2024 arrives: a bigger and bigger interest in smaller, more performant language models, and fine-tuning is a key part of how we get there.

So let's remind ourselves what we mean when we talk about fine-tuning with PEFT LoRA and why we need it. LLMs are super big; they have billions or tens of billions of parameters, and it's likely we'll see models with hundreds of billions of parameters before too long. Not all models are getting bigger, but some are, and the reason is that if we keep adding more text and more parameters, we're pretty confident next-word prediction will keep improving. But as we build larger and larger models, we have to deal with more and more compute to handle them, whether that's loading them, training them, fine-tuning them, or performing inference on them and serving them, and we're abstracting all of this away from the regular developer, the regular individual out there who doesn't have access to a giant cluster of GPUs to even play with these things.
And this is the core problem: when we want to do full fine-tuning on many billions of parameters, it becomes a huge pain for anybody trying to use consumer hardware, or any small business trying to get by with the laptops they have and maybe a few resources on the cloud. This is as true for fine-tuning as it is for loading and storing, and certainly for deploying these models: it just costs too much. The solution for dealing with the fine-tuning, the storing, and the deploying is roughly the same, but today we're focusing on fine-tuning, and fine-tuning using fewer parameters. It's all about using fewer parameters; we don't need all of them, as we started to build intuition for last time. And the parameters we do keep, we're going to make smaller in a computational sense. That's the essence of quantization. So while quantization isn't necessarily about fewer parameters (although it often is when we talk about fine-tuning), we're trying to move these big, big models toward smaller packages, through fewer parameters and through more efficient representation of those parameters.

We saw last time that LoRA is the number one PEFT method you should know. It's called low-rank adaptation, and the big idea of LoRA, as we discussed, was to fine-tune using factorized matrices. Again, we didn't fine-tune absolutely everything; we used fewer parameters, which was great because it was more efficient, and we found out that we can actually leverage LoRA adapters for many tasks. You can have one big model and a ton of different LoRA adapters, and deploy each of those adapters to production, because inference is when the adapter actually comes into play. So it's a very flexible, very good technique, especially for larger companies in industry that want many adapters and one very powerful model; we'll probably see this emerge as an approach to AI development in the enterprise. And it's really comparable to full fine-tuning.

In essence: fine-tuning is all about modifying the behavior of LLMs by updating parameters. Parameter-efficient fine-tuning is fine-tuning with fewer parameters. Low-rank adaptation is fine-tuning using factorized matrices. So parameter-efficient fine-tuning through low-rank adaptation is all about modifying behavior by updating fewer parameters using factorized matrices. It all flows together, and it leads us directly to our new friend, quantization. This meme is so good I had to put it twice, because quantization is such an oft-misunderstood idea; it certainly took me a long time personally to really grok it, so let's see if we can break it down in a way that makes sense to all of you.

First off, the weights of our LLM: when we talk about weights, it's the same thing as when we talk about parameters. Those parameters are simply numbers, and specifically they're floating point numbers, also known as floats. It's important to understand a little bit of the detail here, because this is the essence of what we're doing in quantization.
When we talk about floats, you may hearken back to your days in school, maybe chemistry, where you learned about significant figures, sig figs, everybody's favorite. Then, if you're like me, you become an engineer and never care again (I was a mechanical engineer). If you're a computer scientist or computer engineer, maybe you keep going deeper, and these days in AI, if you're a developer, you need to go a little deeper, because this idea of a float, essentially an integer paired with a fixed precision, matters. For instance, we can represent 12.345 as 12345 x 10^-3, and we can do this using a specific number of bits in our computer.

When we talk about this fixed precision, there are a number of different types of precision. What we generally use for default computations is called full precision, which means I have 32 bits to represent my floating point number; those bits are broken up into a couple of different pieces, but the big idea is that there are 32 of them. The question is: is that the right amount when we want to deal with 70-billion-parameter models and the like? It turns out that in machine learning we found over time, through experiments, that if we didn't use 32-bit precision and instead used 16-bit precision, essentially half precision, to represent the decimal numbers inside the neural network, the numbers that represent each of the weights, then we can get almost identical inference outcomes from our LLM. Remember, we just want the words that come out at the end, the ideas, the outputs; we don't necessarily care about the precision of the stuff inside the black box. We put in, we get out. A lot of researchers were seeing this with large language models: if we just leverage half precision, we get very good outcomes, and doing so effectively halves the entire model size. So we can get essentially the same thing coming out even if we represent each of the model weights with half as much information, because really, how many sig figs do we need? Another way to describe moving from a 32-bit down to a 16-bit representation is to say we are quantizing: we quantize the 32-bit weights down to 16-bit, hence quantization.
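To make that idea concrete, here is a minimal PyTorch sketch (not from the event's notebook) showing that casting a tensor of weights from 32-bit floats down to bfloat16 halves the memory per parameter while leaving the values nearly unchanged:

```python
import torch

w = torch.randn(4096, 4096)    # toy "weights" in full precision (float32, 4 bytes each)
w_half = w.to(torch.bfloat16)  # half precision (bfloat16, 2 bytes each)

print(w.element_size(), "bytes/param vs", w_half.element_size(), "bytes/param")
print("max absolute difference:", (w - w_half.float()).abs().max().item())
```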
Now, when it comes to quantization, there are many different approaches to quantizing model weights. This is important: we're not going to cover every single approach, because that isn't necessary for today, but there are many ways to quantize model weights, and we hope to bring you more content in the future on approaches that differ in their implementation and nuances. For today, we're going to use this QLoRA idea as a focusing lens.

The QLoRA story begins with a paper called "8-bit Optimizers via Block-wise Quantization," which came out of the University of Washington. Tim Dettmers was the lead author; he's been quite a superstar in the field, kind of the quantization guy. In that paper they showed that you can use 8-bit representations and maintain the performance we see at full 32-bit precision. So in this early paper they're saying: look, experimentally, if we reduce the precision we can still get great results, and this is not reducing to half precision, it's reducing to quarter precision, 32 bits down to 8. This bitsandbytes paper turned into what became the bitsandbytes library, which has since evolved; it's something you'll see Chris use today, and something that gets used all the time now. As for the name, recall that one byte is equal to eight bits. We'll keep the discussion in bits today, but you'll see many papers and discussions that talk in bytes as well, so it's pretty easy to see why the library was named bitsandbytes.

Again, this is one approach, and there are trade-offs, as with any approach. For instance, with the bitsandbytes approach to quantization we don't really get any additional benefit to inference latency; we're not speeding up inference much with this particular approach. What we are doing is leveraging a tool that gives us very flexible use of those LoRA adapters. So for the enterprise, if you're thinking about how to have one big model and a bunch of adapters, this is going to be your friend, and that's why we chose to focus on it today. This bitsandbytes library forms the basis for what comes next: the QLoRA idea, efficient fine-tuning using quantization.

The big takeaway from the QLoRA paper is that it's super great, even though it's eight times less precise. What we actually have going on in QLoRA is not an 8-bit representation but a 4-bit representation. With it, we can handle fine-tuning that, even at half precision on the big LLaMA model, would require 780 GB of GPU memory, which is completely insane, and instead fit all of it on a single 48 GB GPU. That's just kind of incredible; it's mind-blowing that we can do this. So the QLoRA paper is essentially saying: hey, we can do fine-tuning using a 4-bit approach versus even a half-precision approach and get amazing results. That's the essence of what's going on with QLoRA.
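As rough back-of-the-envelope arithmetic (weights only, ignoring the gradients, activations, and optimizer states that make full fine-tuning so much more expensive), the storage side of those numbers looks like this:

```python
params = 7e9  # a 7B-parameter model like Mistral 7B

for name, bytes_per_param in [("fp32 (full precision)", 4),
                              ("fp16/bf16 (half precision)", 2),
                              ("int8", 1),
                              ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:28s} ~{gb:5.1f} GB just to hold the weights")
```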
So if we go back to this idea of PEFT LoRA fine-tuning, where we modify behavior by updating fewer parameters using factorized matrices, and we add this idea of quantization, where quantization is simply representing high-precision numbers with low precision, then we arrive at PEFT QLoRA fine-tuning, where we modify behavior by updating fewer quantized parameters using factorized matrices.

The process as outlined in the QLoRA paper, and the process you'll see today, goes something like this. We download the model weights; anytime you download model weights from Hugging Face, they're in full 32-bit precision. Then we load our parameter-efficient fine-tuning model into GPU memory; anytime we load into GPU memory for inference or training, we load using that parameter-efficient fine-tuning method. Then we initialize our low-rank adaptation, or LoRA, configuration. And finally, and this is the key to the whole thing: during training we take that full-precision 32-bit model and actually load a 4-bit model; we quantize 32-bit down to 4-bit for training. Then, as we flow through the network during training, each time we have to do a computation we dequantize that 4-bit representation back up to a 16-bit half-precision representation, do the calculation, and re-quantize back down. At each step of training or fine-tuning we quantize, dequantize, and move on, so we're never holding the half-precision model fully in GPU memory; we're simply using half precision to do the calculations. That's the magic of what's going on behind the scenes, and it turns out it works incredibly well. Again, the intuition behind the 16-bit piece is that for inference you can go from 32-bit down to 16-bit and get very good results; we've seen this experimentally over a long time, not just in papers from the University of Washington but from many other researchers. The QLoRA approach, fundamentally, is to load those full-precision weights into GPU memory as quantized 4-bit weights and then only dequantize up to 16-bit during calculation, going back down as it moves through.

All right, so that's the core approach you're going to see today. You'll see things like the bitsandbytes configuration, where you'll notice we want to load in 4-bit, and you'll also see a data type called NF4. Chris will talk more about it; it's essential to the QLoRA approach.
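For reference, that configuration looks roughly like the sketch below. It follows the standard Hugging Face transformers / bitsandbytes API rather than quoting the exact notebook cell:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights on the GPU in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # the NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 when computing
)
```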
And that's it for the big ideas. Now we need to see how this build can be taken to the next level. What we want to do is take the same build we've already looked at, the old Uno reverse card build: given the response, predict the instruction. We'll use the same model we saw last week, because it's still one of the best out there, Mistral 7B Instruct v0.2, and we'll use the same data for fine-tuning to keep everything simple, the Alpaca GPT-4 dataset. So again: given the output response, predict the input instruction. With that, we're ready to kick it back over to Chris the wizard to show us how to do fine-tuning with PEFT QLoRA and fill in some additional details. Over to you, man.

Oh yeah, thanks Greg, really appreciate it. Guys, I'm excited, because quantization is definitely one of my favorite topics; it's one of the best things we can do right now. As you can see, we only used around 20 gigabytes of GPU RAM to train this 7-billion-parameter model, which is quite impressive in my lens, and that includes fine-tuning. In any case, let's get right into it. First of all, we're going to be using Mistral 7B Instruct v0.2, Mistral's most recent instruct-tuned model; I love it. We're now moving on from PEFT, which we discussed last week, into the Q in QLoRA. We discussed how to reduce the number of parameters that we train; now, how do we reduce the size of the parameters that we train?

First of all, what is quantization? Greg already talked us through it; I'll give a brief overview of what's happening under the hood, and then we'll get into how to implement it in code. Spoiler alert: it's super easy, thanks to bitsandbytes. Let's look at quantization from this perspective: quantization is a process of discretizing an input, going from a representation that holds more information to a representation that holds less information. That sounds crazy: the idea is that we want to express more information with less information. So how do we actually do that?

In Tim Dettmers' QLoRA paper, they rely on a process called block-wise k-bit quantization, which sounds scary but isn't so bad. It relies on two very important things. One, it relies on the fact that in neural networks the model weights are mostly normally distributed. If you're coming from a stats background, as soon as you hear the words "normal distribution" your eyes should light up, because we're going to be able to use a lot of clever tricks to do whatever we're trying to do. Two, it relies on the NF4 format, a number format or data type created by Tim Dettmers and team, which is information-theoretically optimal. Not literally (that wasn't proven), but empirically, for all intents and purposes, NF4 is very, very efficient, which is excellent.

So how does this work behind the scenes? Okay, we get it: model weights are normally distributed, great. What we're going to do is essentially put a pin in the number line near the mean of our desired numbers, which follow a roughly normal distribution. We use that mean as a zero point, and we use this NF4 data type, a zero-centered number format, to represent the numbers that appear around that specific point on the number line. There's a step that has to happen here: we normalize all of our numbers to fall within a specific range of minus one to one, and then we have this saved place on the number line with a range around it that we understand. That's really about it. It's a bit simplified, and you can look at the paper for the math, but the idea is that we drop a pin on the number line, and we have this NF4 number format that represents a range around that point, and that's what builds up the buckets, or bins, we use to represent our numbers. The reason this works so well is, again, that model weights are normally distributed and that this is an information-theoretically optimal data type for that minus-one-to-one range for normally distributed data. In other words, the only reason this works is that first fact.

Now, beyond that, QLoRA does an extra step. You might have thought to yourself, when I said "drop a pin in the number line": okay, but doesn't that mean we're keeping a kind of high-precision number? Maybe it doesn't have to be as high precision, but it's definitely still high precision. And that's true: that pin we drop is high precision.
But that one pin can be used to represent many numbers, in this case 64 numbers in the QLoRA paper; each pin is associated with a block of 64 weights. Tim Dettmers and crew said that's not good enough: that's going to give us 0.5 bits per parameter of overhead, so we need to go bigger. So what they did is take all of those quantization constants (that's the technical term for the pin we're dropping) and quantize those as well. We represent our quantization constants in an 8-bit format, and we do 256 of them for every 32-bit precision number. So we have one 32-bit quantization constant that sits on top of 256 8-bit quantization constants, and each of those sits on top of 64 4-bit weights. You can see the savings in terms of memory here are insane: we're able to represent so much of our data in that 4-bit representation, and we're able to do it in a way that retains a ton of information, and that is key.

I saw some questions in the YouTube chat about the trade-offs here and the performance gains. There definitely are some when it comes to latency, and we'll discuss them as we move through the rest of the notebook, but in terms of the actual effectiveness of the model, the performance hit can be very small. It is not zero, there is a performance hit, but it's incredibly small, which makes this a very effective technique, especially when applied the way we'll see it applied today.

So that's basically what we're talking about with QLoRA: we drop a pin on the number line and represent numbers around it, and then we do that one more step abstracted, which is harder to visualize, but there it is.
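Here is a small, self-contained sketch of that "one pin per block" idea. It uses a plain 16-level grid instead of the actual NF4 quantile levels, so treat it as an illustration of block-wise absmax quantization rather than the exact QLoRA scheme:

```python
import torch

torch.manual_seed(0)
w = torch.randn(4 * 64)      # toy weights, roughly normal like real model weights
blocks = w.view(-1, 64)      # QLoRA quantizes in blocks of 64 weights

# one quantization constant (the "pin") per block: that block's absolute maximum
absmax = blocks.abs().max(dim=1, keepdim=True).values

# normalize each block into [-1, 1], then snap each value to 16 levels (4 bits)
levels = torch.linspace(-1.0, 1.0, 16)   # real NF4 spaces its 16 levels by normal quantiles
codes = (blocks / absmax).unsqueeze(-1).sub(levels).abs().argmin(dim=-1)

# dequantize: look up each 4-bit code and rescale by the block's constant
recon = levels[codes] * absmax
print("max reconstruction error:", (recon - blocks).abs().max().item())

# double quantization (not shown): the 32-bit absmax constants are themselves
# quantized to 8-bit in groups of 256, shrinking the ~0.5 bits/param overhead further
```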
Okay, so how do we do it in code? First of all, we load our familiar usual suspects: bitsandbytes, datasets, accelerate, the loralib library, transformers, and PEFT. These are the staple libraries we use with these QLoRA tools. Then we grab our model, and the model we're going to grab is Mistral AI's Mistral 7B Instruct v0.2, the most recent instruct model for Mistral, and a great one.

Then this is where the magic happens: the bitsandbytes config, from the bitsandbytes library. You'll see that we load in 4-bit, which means that when we move our model from the saved weights on our hard drive into our GPU, we load them in that 4-bit quantized state: that collection of numbers, their quantization constants, and then their quantization constants' quantization constants, because we're using double quantization. If we omitted the use-double-quant flag, we would only do one step and save less effective memory. We're also using the NF4 quant type I talked about, the number type created by Tim Dettmers and crew that's information-theoretically optimal (again, not literally true, but close enough that we'll keep saying it). And then we have this idea of a compute dtype, which is going to be torch.bfloat16.

This is very important: storing numbers in 4-bit is awesome, but computing with them is actually quite bad. If you think about multiplying two numbers together, especially small ones, we usually wind up with a number that needs more precision to represent accurately; when we divide 100 by 1,000 we wind up with a very small number, and we need more precision to represent it. So what we do with the QLoRA approach is dequantize whenever we need to compute with our weights. This is done at a per-tensor level, so we never have the full model dequantized in memory, just one tensor at a time. That saves us a ton of space, and it also lets us compute as if we had the model in that higher-precision bfloat16 format, which is huge. We save tons of space, and because we dequantize for computation, we retain much of the compute precision, and that's what lets this method really shine: we dequantize for computation and store in 4-bit. Without that, this would be a less powerful method; with it, it's amazing. You can choose up to full precision here; that comes with a small memory overhead, since you have to upcast a tensor to full precision, but it's negligible compared to the size of the model. And, critically, it does come with some inference and training latency overhead: the fact that we dequantize and re-quantize, dequantize and re-quantize, means we're performing an additional operation per computation, and that impacts inference. Tim and team have written great kernels for this, so it's not very slow, but it is going to be slower than if we weren't doing that extra operation. This is one of the key trade-offs: the bitsandbytes approach is extraordinarily flexible, it's very powerful, and it works very well with PEFT adapter methods like LoRA and others, but it costs us a little inference latency and training time. That's important to keep in mind.

Once we have our bitsandbytes config, all we have to do is load our model like we normally would: AutoModelForCausalLM.from_pretrained, passing in the Mistral model and our quantization config, turning off the cache, and setting the device map to auto, which shoves as much as it can into our GPU. In this case, because the loaded model only takes up about 15 gigabytes of GPU memory, it's all squeezed into the GPU, which is great. We do some preprocessing on our tokenizer to make sure it's set up in the right format for training, and then we can look at our model architecture. You'll notice we have these 4-bit layers; that's where bitsandbytes comes in. We have 4-bit layers on our q, k, and v projections as well as our MLP, so it's 4-bit all the way down. That's the idea: we don't want to quantize just some of the model, we're going to quantize as much of it as we can.
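A sketch of that loading step, assuming the standard transformers API and reusing the bnb_config from above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # the BitsAndBytesConfig sketched earlier
    device_map="auto",               # put as much of the model on the GPU as possible
)
model.config.use_cache = False       # no KV cache needed during training

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # one common way to set up padding for training

print(model)  # the attention and MLP projections show up as 4-bit (Linear4bit) layers
```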
However, you will notice that we omit some of the layers, specifically the layer norms. The reason is that layer norms tend toward very small numbers, near zero, and we'll run into training instability issues if we use lower precision to represent those layers. So we actually keep those in full precision. They're very small compared to their weight-matrix counterparts, but we do want to keep those layer norms at higher precision to avoid training instability: if those numbers diverge and cause a ruckus, we're not going to be able to train very well, and that's why you don't see 4-bit layers there.

Now that we have our model loaded, and we can see it's in 4-bit, it's time to PEFT-ify it. We talked about PEFT last week, so we won't spend too much time on it today, but the idea is fairly straightforward. We use our LoRA config to set our rank, which is 64 in this case; we set our alpha, which by conventional wisdom should be about twice your rank (though it's always worth doing hyperparameter searches to make sure you have the most optimal hyperparameters); your LoRA dropout, a pretty consistent value; bias is none; and task type is causal LM, because that's what we're doing. You'll also notice that we target the q, k, and v projection modules. With QLoRA we want to target as many modules as we can; the QLoRA paper's wisdom is that we should target all possible layers with LoRA. In this case we'll leave it to PEFT to simplify things a bit for us.

For our base model, all we have to do is prepare the model for k-bit training, which makes sure we can train and that all trainable layers, and any frozen layers, are set appropriately. Then we get our PEFT model, which gives us those LoRA layers. You'll notice we have only 2.7 million trainable parameters out of a possible several billion. And the key thing about the Q in QLoRA is that, while that's already great, when we make each of those parameters roughly one-eighth the size, we're effectively reducing the footprint by another factor of about eight (not strictly eight, since it doesn't interact with all layers, but about eight). We were already at a fraction of a percent, and then we reduce the actual work we have to do even further, which is great. Then we can see that our LoRA layers are also 4-bit, alongside the regular layers that were converted to 4-bit.
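The PEFT side looks roughly like the sketch below. The rank and alpha follow the "alpha about twice the rank" convention mentioned above, and the exact target module list and dropout are assumptions rather than quotes from the notebook:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=64,                   # rank of the factorized update matrices
    lora_alpha=128,         # roughly 2x the rank, per the conventional wisdom above
    lora_dropout=0.05,      # assumed value
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; QLoRA suggests all linear layers
)

model = prepare_model_for_kbit_training(model)  # freeze base weights, fix dtypes for k-bit training
model = get_peft_model(model, lora_config)      # wrap the 4-bit base model with LoRA adapters
model.print_trainable_parameters()              # a few million trainable params out of billions
```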
After that, we load some data. We grab the Alpaca GPT-4 data and do this Uno reverse card training: it's a fun one, kind of a classic now, it's what you're going to see whenever you do an instruction tune, and it really proves the point that the process works. We ask the model to take an input and generate an instruction, so we're creating a model that's good at generating instructions. We use this generate_prompt helper function to create the prompts our model will be trained on, and then we set up our trainer.

The trainer is all boilerplate. The other big insight from the QLoRA paper is the paged AdamW 32-bit optimizer. I won't go too far into it here, but the idea of using paged memory is really effective; it helps us train very stably and very efficiently, with very little cost to us other than setting the flag. The rest is all boilerplate (good boilerplate, but boilerplate), and we make sure we have bf16=True, which keeps our compute dtype compatible when we upcast, which is necessary.

Someone asks: it says CUDA, but would a Mac suffice to fine-tune the model in 4-bit? I would recommend an NVIDIA GPU for sure; the kernels are written for it. I believe you can use 4-bit on other devices, but it's not necessarily going to be as efficient or as fast, since the optimization of the kernels really added speed to this process. I'll get back to you about whether you can do this on a Mac, even if it's slightly less efficient, after a little digging.

We're going to use the SFTTrainer from TRL to train our model, with a max sequence length of 2048 for Mistral, and then we can train using trainer.train().
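The trainer setup, sketched against the TRL and transformers APIs of the time. The dataset is assumed to already be formatted (for example into a "text" column) with the walkthrough's generate_prompt helper, and the specific hyperparameters here are placeholders:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="mistral7b-instruct-uno-reverse",  # hypothetical output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=100,
    logging_steps=10,
    optim="paged_adamw_32bit",  # the paged optimizer insight from the QLoRA paper
    bf16=True,                  # matches the bf16 compute dtype when we upcast
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # Alpaca-GPT4 examples formatted by generate_prompt
    dataset_text_field="text",    # assumes the formatted prompt lives in a "text" column
    peft_config=lora_config,
    tokenizer=tokenizer,
    max_seq_length=2048,
)
trainer.train()
```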
At the end of the day we reload our model (just a quirk of PEFT): we reload it, make sure we load it in 4-bit, and set our torch dtype to the compute dtype again. Then we look at the model. We see an instruction like "Identify the odd one out among Twitter, Instagram, and Telegram." That's great: it's an instruction that would plausibly result in this "the odd one is Telegram" response, and you can see the ground truth is "Identify the odd one out." If we look at the base model, its generated instruction is much less good; it doesn't even mention Telegram, so it's not a very good instruction. But that's it for me and the code demo, so with that I'll pass you back to Greg to wrap us up.

Yeah, thanks Chris, that was awesome as usual, and I love that deep-dive explanation of exactly what's happening inside the quantization method in the QLoRA paper. So today, building off the PEFT LoRA approach, we saw that PEFT QLoRA fine-tuning is really about modifying behavior by updating fewer quantized parameters using factorized matrices. The idea of using fewer parameters, with the LoRA factorized-matrix approach, gets us from 3.8 billion down to 2.7 million parameters, less than 1%. Then we come in with quantization, technically block-wise k-bit quantization, effectively allowing us to express more information with less. The key to the QLoRA method is that, from that 2.7-million-parameter level, we quantize down to 4-bit before we begin training; during training we dequantize when we have to do computations and re-quantize to continue the training process. Next week we'll complete the trifecta (LoRA, QLoRA) and talk about efficient serving and inference with vLLM, not fine-tuning and loading, so we hope you can join us for that one. But for today, let's get started with the Q&A. I'd love to invite Chris back up to the stage, and it looks like Manny is crushing it in the Slido right now, so shout out to Manny as usual. If you have questions, throw them in the Slido; we'll also try to get to questions in the YouTube live chat. Chris, let's jump right into it.

First question: is the reason we don't get an inference latency benefit with QLoRA that the model weights are retained as 32-bit during inference?

To be more specific about the phrasing, I'd say the model weights are dequantized to a higher precision during inference. So yes, that's exactly why we don't see a benefit to inference; in fact we see a penalty. It's not a big penalty, but there is one.

Okay, nice. Excellent question, an astute one. Then the first one from Manny: when we're talking about parameters, are we referring to additional features, like the x's in the equation y = predict(x1, x2, ..., xn)? Are x1 through xn considered parameters? What are we talking about when we say parameters?

Parameters, features, weights: we have so many different names for similar kinds of objects, but they're all just numbers. I'd think of parameters more specifically as the entities that fill up the weight matrices we use to compute when we're doing that matrix multiplication. Essentially, a parameter is any node in the model architecture. So this is not something you'd use with your XGBoost or your traditional ML methods; it's not a random-forest-applicable technique. It's specific to deep neural architectures, and right now mostly to the Transformer architecture, though there's no reason it has to be; it's just most explored in that space. Hopefully that answers the question, Manny.

Yeah, we'll flow through some of these other questions and pop back to Manny's as well. I think this one's super relevant to everybody: if I don't have a powerful laptop, where can I practice these techniques?

It's Colab: get yourself into a Colab. Colab makes it so easy, and the whole benefit of this kind of approach is that we can load these very large models with very few resources. Oftentimes you can load a three-billion or six-billion-parameter model in a free instance of Colab using the free-tier GPU, the T4. I think that's a great way to start if you don't have a powerful laptop. As you get more embroiled in the space you might look at other cloud hosting solutions, Lambda or AWS or whatever you want, but for the getting-started beginner, Colab is your best friend. If you want, you can pay for compute to get slightly beefier GPUs, but stick to the free tier and to your three-to-six-billion-parameter models and you're going to have a great time.

Yeah, stick to the three to six billion, and quantize, quantize, quantize. We teach entire courses in Colab and do a ton of fine-tuning throughout, so just try to be as efficient as possible: don't sit there tuning for days at a time if that's not really what you're interested in, use small data, make the model as small as possible by picking a small size on Hugging Face, and then quantize, for sure.
But yeah, there should be nothing stopping you if you're a beginner; you don't have to get AWS, you don't have to get all these things.

Okay, Islam, we've got a question that's getting upvoted: can I do this fine-tuning with llama.cpp, and is this fine-tuning possible to plug into end-to-end fine-tuning within a RAG framework?

End-to-end fine-tuning within a RAG framework: yes, 100%. Arcee, who we've done an event with, has their DALM framework on GitHub; we'll get a link dropped into the chat for you. That's 100% a great tool that can leverage LoRA as well as quantized methods. In terms of llama.cpp, I'd have to double-check; I don't know off the top of my head, but I'll dig in, and we can include that information in a comment if I can't find it before the end of our time together today.

Okay. Back to Manny's next question: we say "weights and biases" when we talk about ML models or neural network models, so if weights are parameters, are biases also parameters in the LLM world?

At the end of the day, the thing we care about is the weights. We want to update the weights, a.k.a. the parameters.

Okay, good stuff. Then the last Manny question: can you speak about examples of LoRA adapters, what they are and what they're created for?

Sure. Let's look at a LoRA adapter from a task or tool perspective. Say we create a LoRA adapter that's very good at translating natural language to SQL. Then we create another LoRA adapter that's been fine-tuned to translate natural language to Python. Then we create another adapter, and so on. The idea is that whenever we do inference, we can choose whichever of those adapters, those LoRA layers, to flow information through, and that will make our output consistent with what we fine-tuned it to do. You can think of them as little hats you can put on your model that change its behavior; they don't modify (or don't have to modify) the base model at all, they just sit on top of it and get it to do a different job. And we can choose those hats as we like; even at inference time we can choose which hat we want it to wear.

Yeah, and this is the thing for businesses too: these adapters are plug-and-play. If you want the LLM to do something super specific, and prompt engineering has only gotten you so far, and you just can't get exactly what you need in and out in specific ways with your customer or your user, if you want to really constrain what your user can put in and what comes out, then this fine-tuning piece, this LoRA adapter piece, is going to be your friend.
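To make the "hats" idea concrete, here is a sketch of swapping LoRA adapters on one base model with the PEFT API; the adapter paths and names are hypothetical:

```python
from peft import PeftModel

# attach a first adapter ("hat") to an already-loaded, possibly quantized, base model
model = PeftModel.from_pretrained(base_model, "adapters/nl-to-sql", adapter_name="sql")

# load a second adapter onto the same base weights
model.load_adapter("adapters/nl-to-python", adapter_name="python")

# choose which hat the model wears at inference time
model.set_adapter("sql")     # behaves like the natural-language-to-SQL fine-tune
# ... generate ...
model.set_adapter("python")  # same base model, different behavior
```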
We had a great meme that we posted on LinkedIn recently that was basically: if you're doing fine-tuning, you're kind of doing LoRA. So examples of LoRA adapters would be anything that you've fine-tuned.

Okay, we've got a couple of minutes left. Shout out for the kind note in the questions; we appreciate it a lot. It looks like George is struggling with a specific error; he's put it into Slido, so maybe we can comment on that after the event. I guess the last question, and it's a big one, so take maybe two minutes, Chris: what are the trade-offs of using dimensionality-reduction techniques like LoRA and QLoRA on LLMs, in terms of training, inference, and fine-tuning? When you think of trade-offs, and maybe best practices, what do you think of?

The big one is quality, how good the output is. There is a trade-off there, and it's really small; beyond being really small, it's really small. This is the way I think about trade-offs when it comes to LoRA and the crew: I can fine-tune a LoRA model to be, let's say, 98% as effective as full fine-tuning, but I can do it in a tenth of the time with a thousandth of the resources. That is a trade-off, you're losing 2%, but it doesn't feel like a real trade-off, and especially in terms of business value it's not a real trade-off these days, particularly if you use a high enough r, or rank, in your LoRA. If you're using a rank of 128, you're still getting a massive reduction in compute while retaining so much of the performance that it truly doesn't feel like a trade-off. To be clear, there is always technically a trade-off, but it lets you do things you wouldn't otherwise be able to do. For small companies, you can fine-tune a model that does a whole new novel thing that fuels your business, that is your business, that you simply couldn't build without these methods; in that case there is no trade-off, it's enabling something previously impossible, so it's only an advantage.

When it comes to inference specifically, both QLoRA (or any quantized method using bitsandbytes) and LoRA, if we're talking about non-merged LoRA adapters, do impart a small inference latency penalty. It's very small. At scale it can maybe be felt: if you're really getting into hundreds of thousands of requests per second, then compared to a very efficient model you might want to re-quantize to another format and serve that model directly instead of keeping it as part of your LoRA stack. But those are problems that come with scale, and that scale also helps you fund the solution. Outside of that, you're not going to feel these issues until you're into six figures or more of requests per second for your LLM stack. So there are trade-offs, but when you're getting started, they really don't show up as trade-offs.

All right, so: use PEFT QLoRA unless you've got a million requests per second. Sounds like a plan, dude. Cool, let's go ahead and wrap it up. Thanks, Chris, and can't wait till next time. Thanks, everybody, for joining us today. Next week we'll be back to talk inference and serving and how to do it efficiently with vLLM, one of the hottest open-source tools out there for doing that.
We'll tell you a little bit about the tool and its background. If you liked this session, you might also really like cohort 4 of LLM Ops: LLMs in Production, launching February 13. In that course, for which we'll soon be announcing an expanded curriculum, you'll learn to prototype and scale production LLM systems, including RAG techniques, fine-tuning, and so much more; check it out at the link. And lastly, please share any feedback you have on today: you can drop it in the chat or in the feedback form we're sending out now. That's it for today. Until next time, keep building, shipping, and sharing, and we'll be doing the same. See y'all next week!
Info
Channel: AI Makerspace
Views: 773
Id: XOb-djcw6hs
Length: 61min 50sec (3710 seconds)
Published: Thu Jan 11 2024