QLoRA: Quantization for Fine Tuning

Captions
Testing, testing... [Music] All right, we're good. Nissio: "we want QLoRA's opinion, we want QLoRA's opinion." QLoRA is not a real person, but I suppose you can get Killora's opinion - come yell into the mic. Boo wishes you a happy Tuesday. Do you have an opinion on QLoRA? No? Okay. The cat's name is Boo, and "petite" means small, so that's what we call her in this house: petite Boo.

Welcome to a Tuesday stream. We're going to be reading a paper today, and it's a fairly hairy one: heavy computer science, quantization, data types. I actually like these papers, because to get through them you end up having to get familiar with a bunch of different concepts, and I think quantization is a technique that is here to stay. If you're doing anything with machine learning out in the real world, you need to understand quantization: the different techniques, the trade-offs, and what they mean.

So, the paper: "QLoRA: Efficient Finetuning of Quantized LLMs." The main author is Tim Dettmers. He is very good at quantization and has been pushing quantization content and research for a while now. He has his own blog with some pretty good posts we've read before, including "Which GPU to get for deep learning." People love to theorycraft about which GPU is best - I think people just love buying things - so one of the most common questions in machine learning communities is "which GPU should I buy?" Tim is also the author of the bitsandbytes library, a very popular GitHub repo that lets you apply a bunch of different quantizations. bitsandbytes is generic: if you have any PyTorch code, it lets you quantize the model and hopefully get a smaller version of it. The repo artidoro/qlora - artidoro is the second author - is the actual code used for this paper. So if you just want generic quantization for your own project, you want bitsandbytes; if you specifically want quantization for LLMs, and specifically LLaMA, you want the qlora repo. One more thing: someone in the Discord posted a talk that I ended up watching, and it's really good. It's Tim himself going over, essentially, this paper - not exactly this paper, but a presentation of his research, and there's a lot of overlap. We'll be using it because he has some good figures and drawings that will be useful for explanation.

Let's get started. "We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65 billion parameter model" - this is obviously LLaMA 65B - "on a single 48 gigabyte GPU."
That seems a little intense, but I think one of these GPUs, either the P100 or the A100, has about that much memory, so if you have a single A100 in a consumer machine you can fit the entire 65 billion parameter LLaMA model on it. That's generally unusual, because if you have a GPU with that much memory you usually have access to a server rack with eight of them in it. So it's a bit odd: who has a really good data-center GPU but only a consumer computer?

"...while preserving the full 16-bit finetuning task performance." Fine-tuning is the process of pushing additional gradients into your neural net for some new task; ideally you're adapting a network trained on one task to a different task. Training is generally done at 32-bit precision, but you can also train at 16-bit, and in this paper they go even lower, to 4-bit. Every time you reduce the precision, you drastically improve the efficiency of the model in terms of memory and compute.

"QLoRA backpropagates gradients through a frozen..." - frozen means they take the LLaMA model and don't change its weights at all. When they say they backpropagate gradients through it, they're not changing any values in the LLaMA model. Instead they build a low-rank adapter, a LoRA - something we've read about on the channel before - which is like a little mini extra model that you attach to the original model. You freeze the original model and push gradients only into that little LoRA. It's a great idea: not only can you fit the large model on your single GPU because you quantized it from 32 bits down to 16 or down to 4 bits, but you also don't need to push gradients into the entire model, because it's frozen; you only push gradients into the LoRA. To me, quantized LoRA fine-tuning is probably the most efficient and quickest kind of fine-tuning, and one of the only kinds you can do on a consumer GPU at home, so I think this approach is going to dominate the fine-tuning literature.

"Our best model family, which we call Guanaco..." - a guanaco is a different animal that looks like a llama but isn't one, so the name is a tongue-in-cheek reference to the fact that they're using LLaMA - "...outperforms all previously openly released models on the Vicuna benchmark." A vicuña is also an animal similar to a llama, and Vicuna is the project - I think out of Stanford and friends - where they took LLaMA, fine-tuned it on ChatGPT-style answers, and found that it basically felt like GPT. "...reaching 99.3% of the performance level of ChatGPT."
Now, there isn't necessarily a good way to measure model performance - benchmarking is still a bit of an art, and there are a lot of different benchmarks - so when you see a number like "99.3% of the performance level of ChatGPT," don't read too much into it. Performance is still not easy to quantify, and you're partly guesstimating.

"...while only requiring 24 hours of finetuning on a single GPU." This is the big one. The fact that you can do this is really important for the AI community. I was thinking about this myself: if I wanted to do a bit of research and some projects, my first thought was, well, I can't do any research that requires training something from scratch, because that means spending something like ten thousand dollars on compute to get any interesting result. But if I choose a research project that only requires fine-tuning, and specifically this kind of LoRA fine-tuning, I can do it with the GPUs I have at home. The availability of compute means a lot of research is going to use this pattern - quantized LoRA - because you can do it on a single GPU, which means you can do research without an industry partner paying for the compute.

"QLoRA introduces a number of innovations to save memory without sacrificing performance." These are their contributions. First, 4-bit NormalFloat: a new data type that is "information theoretically optimal for normally distributed weights." Information theory is the branch of statistics and math pioneered by Claude Shannon - a pretty badass guy, a lot of the OG computer science comes out of him, and he's known as the father of information theory. "Normally distributed weights" means the weights of the neural net follow a normal distribution: a 65 billion parameter model has roughly 65 billion of these little neuron weights (there are a few other parameter types, like biases, but most of them are weights), and if you plotted all of their float values they would be normally distributed. This is a bit of an assumption - a prior, meaning you assume something about the statistical distribution of the thing you're studying. A normal prior is a pretty standard choice, but it's worth pointing out that it's not guaranteed the weights are normally distributed; for a neural net, though, it's a pretty safe assumption.
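Since the whole NF4 construction leans on this normal prior, here is a tiny, hedged sketch of how you could eyeball it yourself. The tensor `w` below is just a stand-in for a real pretrained weight matrix, which is what you would actually flatten and inspect:

```python
# Minimal sketch for checking the "weights are roughly normal" prior.
# `w` is a stand-in tensor here; with a real checkpoint you would flatten one of
# its large weight matrices (any big Linear layer) and run the same comparison.
import torch

torch.manual_seed(0)
w = torch.randn(4096 * 4096) * 0.02   # stand-in for a pretrained weight matrix

z = (w - w.mean()) / w.std()          # standardize
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    emp = torch.quantile(z, p).item()
    ref = torch.distributions.Normal(0.0, 1.0).icdf(torch.tensor(p)).item()
    print(f"p={p:.2f}  empirical={emp:+.3f}  N(0,1)={ref:+.3f}")
```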
Second, double quantization: a method that reduces the average memory footprint by quantizing the quantization constants. We'll see exactly what they mean by this, but it rhymes with the residual quantization we saw in that audio paper. Third, paged optimizers to manage memory spikes. Optimizers are things like Adam or SGD - if you've used PyTorch or TensorFlow you know you have to pick one, and most people use Adam. Those optimizers carry a bunch of internal state, extra parameters you have to keep track of in between batches while you're pushing gradients. So "paged optimizers" maybe refers to some kind of quantization of that state, or - given that they mention memory spikes - to tricks that avoid having to load everything into memory at once. (I can hear myself in the background; let me turn that sound off.)

"We use QLoRA to finetune more than 1,000 models" - so big ablation studies - "providing a detailed analysis of instruction following and chatbot performance across eight instruction datasets" - that's where their benchmarks come from - "multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning." What do they mean by that? You're never going to push gradients into the full 65B or 33B LLaMA on your own; you'd need a distributed training rig with dozens of GPUs just to push a single gradient into a 65 billion parameter model. That's what makes this so powerful: you can fine-tune these big models. "Our results show that QLoRA finetuning on a small, high-quality dataset leads to state-of-the-art results" - similar to what we saw in the LIMA paper, where you don't actually need that big a dataset if you're fine-tuning - "even when using smaller models than the previous state of the art. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations." One thing people do now is have GPT-4, not just a human, judge the outputs: GPT-4 picks which of the answers is best, which is kind of crazy. "...showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation." Our AI is what evaluates our AI. "Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots." So a little bit of a contradiction: GPT-4 evaluations are cheap and reasonable, but the benchmarks are not trustworthy. "A lemon-picked analysis" - I like this. Cherry picking is when people pick the best possible results to showcase in their paper, and it happens in every kind of science: a computer vision paper doing image generation will always pick the prettiest images, so you never get a good idea of how good the thing actually is, because you're only looking at the cherries. Lemon picking is the opposite: we purposefully pick the worst ones, the cases where Guanaco fails.
"We release all of our models and code, including CUDA kernels for 4-bit training." CUDA kernels are programming that happens below the level of the language you actually write in, which is generally Python. You write your high-level PyTorch code in Python, and that gets compiled down into CUDA kernels, which are what actually run on your GPU. So to use their special 4-bit NormalFloat and some of the other tricks coming up, you're probably going to have to use the CUDA kernels they wrote.

"Finetuning large language models is a highly effective way to improve their performance and to add desirable or remove undesirable behaviors. However, finetuning very large models is prohibitively expensive; regular 16-bit finetuning of a LLaMA 65B model requires more than 780 gigabytes of GPU memory." We were looking at the A100 earlier - I said it has 64 gigs of memory - so if you wanted to fine-tune LLaMA 65B at 16-bit precision, 780 divided by 64 is about 12 A100s. The reason that number is terrible is that server racks only fit eight, so you couldn't even use one rack of eight A100s; you'd need two. And that's not even 32-bit, that's 16-bit. It gives you an idea of how unapproachable it is to push gradients into the full LLaMA model. "While recent quantization methods can reduce the memory footprint of LLMs, such techniques only work for inference and break down during training." Hmm, okay. "We demonstrate for the first time that it is possible to finetune a quantized 4-bit model without any performance degradation." But it's important to note that they're not actually pushing gradients at 4-bit precision into the original model. As they say, those techniques break down during training: if you put LLaMA 65B in 4-bit precision and try to train it directly, it's not going to work. The reason it works in this paper is that they freeze it. They're not pushing gradients into the quantized 4-bit LLaMA; they're pushing gradients into a LoRA that's sort of parasitically attached to the model. So the important distinction is that they're not really fine-tuning a quantized 4-bit model; they're fine-tuning a low-rank adapter attached to a quantized 4-bit model. "Our method, QLoRA, uses a novel high-precision technique to quantize a pretrained model to 4-bit, and then adds a small set of learnable low-rank adapter weights that are tuned by backpropagating gradients through the quantized weights." Only the LoRA receives gradient updates, but because of how the chain rule works, you have to backpropagate through the quantized model - so even though only the LoRA weights change, the actual math of backpropagation still goes through that quantized model. "QLoRA reduces the average memory requirements of finetuning a 65 billion parameter model from more than 780GB of GPU memory to less than 48GB without degrading the runtime or predictive performance compared to a 16-bit fully finetuned baseline." So that's what they use as the baseline.
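To make the 780 GB and 48 GB figures a bit more concrete, here is some rough, hedged back-of-envelope arithmetic (assuming the usual mixed-precision Adam setup of 16-bit weights and gradients plus fp32 optimizer moments; the paper's own accounting also includes activations and other details):

```python
# Back-of-envelope memory math for full finetuning vs. a 4-bit frozen base model.
# Rough sketch only, not the paper's exact accounting.
params = 65e9

full_ft = params * (2 + 2 + 4 + 4)          # bytes: bf16 weights + grads + fp32 Adam m + v
print(f"full 16-bit finetuning ~ {full_ft / 1e9:.0f} GB")            # ~780 GB

qlora_weights = params * 0.5                 # bytes: 4-bit frozen base model
print(f"4-bit frozen base model ~ {qlora_weights / 1e9:.1f} GB")      # ~32.5 GB
# Quantization constants, LoRA weights, their optimizer state and activations sit
# on top of that and still land under a single 48 GB card.
print(f"80 GB A100s needed for full finetuning ~ {full_ft / 80e9:.1f}")  # ~10
```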
"This marks a significant shift in accessibility of LLM finetuning: now the largest publicly available models to date are finetunable on a single GPU." I think that's huge - to me it's the biggest contribution of this paper, and this sentence is the reason open-source AI is going to be possible. If you couldn't even load these models on a single consumer GPU, nobody would be able to compete with the big companies; because you can fine-tune the 65 billion parameter LLaMA on your consumer GPU, open source is still alive. "Using QLoRA, we train the Guanaco family of models, with the second best model reaching 97% of the performance level of ChatGPT while being trainable in less than 12 hours on a single consumer GPU; using a single professional GPU over 24 hours we achieve 99% with our largest model, essentially closing the gap to ChatGPT on the Vicuna benchmark. When deployed, our smallest Guanaco model requires just 5 GB of memory and outperforms a 26 GB Alpaca model by more than 20 percentage points."

Okay, so we have Elo ratings. Elo is the rating system used in chess, and it's very popular in video games as well. It rewards you for beating people who are good and doesn't penalize you too much for losing to people who are good - it's a way of getting a score for how good you are that is aware of the level of the people you're playing against. There are ways to game Elo, but it's generally pretty good. The winner of each match is determined by GPT-4, which is a little weird: you have some question, each LLM gives an answer, and GPT-4 declares which response is better. It's a bit strange that GPT-4 gets to decide that GPT-4 is the best - that's what I mean by taking these results with a grain of salt, because you're asking GPT-4 which answer it prefers, and it's probably always going to prefer the GPT-4 answer. 95% confidence intervals are shown, and after GPT-4, Guanaco wins the most matches. This is the 65B model at 41 gigabytes - so when he says "a single professional GPU," this is what he's referring to; there is no consumer GPU with 41 gigabytes of memory. The 21 gigabyte model is what he means by "consumer GPU," because the 3090, which is the GPU I have, has 24 gigs, so you could fit a Guanaco 33B on a 24-gig 3090.
I think the 40-series from Nvidia, which is also a consumer GPU, can fit a 21 gigabyte model too. And it turns out GPT-4 doesn't like Bard at all: it ranks Bard's answers roughly on par with Guanaco 7B, which is a significantly smaller model. So GPT-4 is being a little biased there, in my opinion.

"QLoRA introduces multiple innovations. First, 4-bit NormalFloat: an information theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit integers and 4-bit floats." Four bits means you only have four bits to store the information. For example, take the floating point data type called FP8 - floating point 8, where the 8 refers to eight bits. Everything in your computer is made of bits, each either zero or one, and here you have eight total bits and have to decide what to do with them. At least one bit goes to the sign, to say whether the number is negative or positive: one for negative, zero for positive. But now you only have seven bits left to store the rest of the number, and what generally happens is you split those bits into what they call the exponent and the fraction (the fraction is also called the mantissa). Say you have a weight in a neural net that's 0.003: roughly speaking, you store how big or small it is - the scale - in the exponent, and the actual digits of the number in the mantissa. If all your numbers are close in magnitude, like 0.1 and 1.0, you don't need many exponent bits; you'd rather spend your bits on the fraction so you can store something like 0.1111111 with more precision on the actual number. But if your numbers range from very small to very large, like 1000 and 0.0001, you want more bits for the exponent. So choosing how many bits go to the exponent versus the fraction depends on what numbers you're actually trying to quantize. As he puts it: three bits of exponent and four bits of fraction is good for large and small numbers but bad for precise numbers; one bit of exponent and six of fraction is good for precise numbers but bad for large and small ones. There's a trade-off. 4-bit integers and 4-bit floats are just the generic data types from computer science, but he's going to come up with a new one, the 4-bit NormalFloat, which presumably makes a cleverer choice about how to spend those bits. (I'm all over the place, so feel free to ask questions and interrupt - sometimes I ramble and say things that are wrong and nobody corrects me, and then I wonder why nobody corrected me. So feel free to comment more.)
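To make the exponent-versus-fraction trade-off concrete, here's a small toy sketch of my own (it ignores the sign bit, subnormals, and special values, so it is not a faithful FP8 implementation) that enumerates the positive values of two hypothetical 8-bit layouts; one buys range, the other buys precision near 1:

```python
# Toy illustration of the exponent/mantissa trade-off described above.
# Enumerates the positive "normal" values of two made-up 8-bit float layouts.

def toy_float_values(exp_bits: int, man_bits: int):
    bias = 2 ** (exp_bits - 1) - 1
    values = []
    for e in range(1, 2 ** exp_bits):          # skip e=0 (subnormals) for simplicity
        for m in range(2 ** man_bits):
            values.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    return sorted(values)

for eb, mb in [(5, 2), (2, 5)]:                 # 1 sign + eb exponent + mb mantissa = 8 bits
    vals = toy_float_values(eb, mb)
    print(f"E{eb}M{mb}: min={vals[0]:.6g}  max={vals[-1]:.6g}  "
          f"values between 1 and 2: {sum(1 <= v < 2 for v in vals)}")
```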
Second, double quantization: a method that quantizes the quantization constants, "saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model)." So the Guanaco 65B fits in 41 gigabytes of memory, and roughly three gigabytes of that saving comes from double quantization. And finally, paged optimizers using NVIDIA unified memory. This is fairly old at this point - there's a blog post about it from 2017, I don't even know if it still matters - but unified memory is "a single memory address space accessible from any processor in the system." A processor here is either your GPU or your CPU. Sometimes you store things in RAM, the memory normally accessed by the CPU: if you look at a motherboard, the CPU sits here and the four sticks of RAM sit right next to it, so that's sometimes called CPU memory, but it's just the normal random access memory of your computer. Your GPU also has its own memory, and the GPU sits over here in the PCIe slot. So there are different places to store things, and ideally, if your GPU is doing a bunch of calculations, you don't want it reaching out to the RAM on the motherboard, because it takes a long time for the data to travel from that RAM into the GPU, get computed on, and get written back; it's better if it uses the memory that's directly on the GPU. "...to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length." This is a little more obscure. A mini-batch is the little group of data points you train on at once - you don't train one data point at a time. I'm assuming that, because of how this is implemented, you get a memory spike when you try to checkpoint the gradients with a long sequence - maybe if things don't fit you have to store something intermediately, and that's where the spike comes from. I don't know exactly what this means; if one of you is better at ECE, feel free to chime in. "We combine these contributions into a better tuned LoRA approach that includes adapters at every network layer and thereby avoids almost all of the accuracy trade-offs seen in prior work." That's kind of interesting. From chat: "The A100 is either 40 gigabytes or 80 gigabytes of GPU memory; the 48 gigabyte GPU they're talking about is probably an RTX 6000 / A6000 or L40." Okay, wait - how does that square with what I said earlier about the A100's memory? Oh, there are multiple variants here, I see what you're saying, including one with 40 gigabytes and one with 80.
So you could actually get an A100 with 80 gigabytes of memory, and that would be more than enough to fit the Guanaco 65B. It's also important to note that the memory doesn't just need to fit the model; it also needs to fit the batch. If you're doing inference you can get away with a GPU that's just barely bigger than your model, because you only pass one thing through at a time. But if you're training, you need to fit the model in memory and also the entire batch. For NLP that's probably not as big a deal, but in computer vision it is: if you're using, say, 224-by-224 images, a batch of 32 images at 224 by 224 by 3 is actually a pretty significant amount of memory. So when you're training, your GPU memory needs to hold both the batch and the model. Another chat comment: "I think they meant memory speed is 64 gigabytes per second." Memory speed is the rate at which you can load things into GPU memory and take them out, and that actually matters a lot, especially if you're constantly moving things in and out of memory - that's the bandwidth they're talking about. I don't fully understand that one, but the tl;dr is that you can get GPUs that fit this model; they're just pretty hardcore.

"QLoRA's efficiency enables us to perform an in-depth study" - thank you for those comments, by the way - "of instruction finetuning and chatbot performance on model scales that would be impossible using regular finetuning due to memory overhead. Therefore we train more than 1,000 models, from 80 million to 65 billion parameters." So small models and big models - a very good ablation study. "In addition to showing that QLoRA recovers 16-bit performance, we also analyze trends. We find that data quality is far more important than dataset size." This is what we noticed in the LIMA paper as well, which we read last week, where folks at Facebook found they could fine-tune LLaMA models on 1,000 examples and get them to perform better than the original Vicuna models, which I think were fine-tuned on around 50,000 examples. This idea of being much more careful about data quality, rather than blindly collecting more and more data, is a pattern we keep seeing in these papers. For example, here a dataset of about 9,000 examples, OASST1, is outperforming FLAN v2 with 450,000.
"...even when both are meant to support instruction following generalization." Instruction following generalization means chatbot behavior, sometimes called assistant behavior. A raw LLM like the 65B LLaMA is just trained to predict the next token, but most people want assistant-ish behavior: that back-and-forth where you ask "how are you doing" and it answers you. Some call that assistant behavior, others call it instruction following; they're different names for the ability to converse back and forth, answer your questions, and do what you ask. "Second, we show that strong performance on the Massive Multitask Language Understanding (MMLU) benchmark does not imply strong chatbot benchmark performance, and vice versa." Basically they're saying MMLU is garbage - which, I mean, we could have told you. No single quantifiable benchmark is going to be perfect, especially as these models get better and more general. I think we're going to have to move away from single quantifiable scores toward bags of benchmarks. The days of saying "here's my model, I trained it on ImageNet, here's the top-1 accuracy" are going away; more and more you'll see "I trained this model and benchmarked it on these 100 different benchmarks." As models get more general, the number of benchmarks keeps increasing and the score on any individual benchmark matters less and less. "Furthermore, we provide an extensive analysis of chatbot performance. We use tournament-style benchmarking" - this is what they mean by Elo - "where models compete against each other in matches to produce the best response. The winner of a match is judged by either GPT-4 or human annotators." I still think it's weird that they're using an LLM to judge the answers. "The tournament results are aggregated into Elo scores which determine the ranking of chatbot performance. We find that GPT-4 and human evaluations largely agree on the rank of model performance in the tournaments, but we also find there are instances of strong disagreement." So, conflicting again. "As such, we highlight that model-based evaluation, while providing a cheap alternative to human annotation, also has its uncertainties. We augment our chatbot benchmark results with a qualitative analysis; our analysis highlights success and failure cases. We release all model generations with human and GPT-4 annotations to facilitate further study. We open source our codebase and CUDA kernels and integrate our methods into the Hugging Face transformers stack, making them easily accessible to all. We release a collection of adapters for 7, 13, 33, and 65B models, trained on 8 different instruction-following datasets, for a total of 32 different open-sourced, finetuned models." This is not technically true: if you actually go into the code, it requires access to the LLaMA models. The problem is that even though the LLaMA weights were leaked and released, you have to go get them yourself,
because nobody wants to post them anywhere, so you have to find them somewhere on your own. For example, here's the model he points to, but he's taking a bit of a risk by putting that up, because you don't know what Facebook is going to do: Facebook could eventually go back and sue everyone who posted the model without their explicit blessing, and they haven't given anyone explicit blessing to hand the model out. So people are a little sketchy about using it, and especially about distributing it; it's still a legal gray area, which is a bit sketch.

Figure: "Different finetuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer to 4-bit precision and using paged optimizers to handle memory spikes." Nissio in chat: "GPT-4 eval probably changes every time OpenAI updates their API, so really not satisfying - but neither is human eval; it's a hard problem." Yeah, I agree.

Okay, so we have three setups here: a 16-bit transformer with full finetuning, a 16-bit transformer with LoRA, and a 4-bit transformer with QLoRA. The base model in each case is the LLaMA 65B, and the 4-bit one is drawn at about a quarter of the size of the 16-bit one because it literally is a quarter of the size: 4-bit versus 16-bit. Then you have the optimizer state: all the gradients and intermediate values you compute for the chain rule, which you need to store in order to push a gradient. The size of your batch determines the optimizer state, and the precision of the numbers inside the optimizer determines it too. You can see that one of the things they really wanted to address in this paper is that it's not just the model that's an issue in terms of memory and compute; the optimizer is an issue too, and they want to make that optimizer state as small as possible. Parameter updates are blue, gradient flow is green, and paging flow is red. The gradient flow goes back through the LLaMA, but you're not actually updating it; the adapters, the small extra modules, are the only place where parameter values actually change. The base model is frozen. In full finetuning, the parameter update goes into the base model itself: you're actually changing the values of the base model's parameters. With LoRA, low-rank adaptation, you're only changing the values in the LoRA parameters, those very small squares and rectangles in the figure. As for the paging stuff, I don't fully understand it, but it seems like paging is the process where you go to the CPU and ask it, "hey, can you fetch this specific piece of memory from the RAM on the motherboard for me," and the CPU goes and gets that chunk out of the RAM - the motherboard memory, the CPU memory; I call it the RAM because that's what the sticks are called, though there may be a more official name. You can see how the CPU has to be the one that hands that memory to you,
so the paging flow seems to be some way of optimizing that.

All right, let's get into it - we'll finally get to some equations. Block-wise k-bit quantization. "Quantization is the process of discretizing an input from a representation that holds more information to a representation with less information." So any quantization is going to be lossy: you lose information by quantizing something. "It often means taking a data type with more bits and converting it to fewer bits, for example from 32-bit floats to 8-bit integers." Just picture what 32 bits looks like compared to 8 bits, compared to 4 bits. "To ensure that the entire range of the low-bit data type is used, the input data type is commonly rescaled into the target data type range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor." Why are they doing this? I think it becomes obvious once you realize that in any float data type you have to spend a bunch of your bits storing the exponent, and that's really annoying. If instead you normalize your data - each individual floating point number - you can make sure it's close to zero, which means you don't have to spend many bits on the exponent; you can use all your precious bits to store the actual digits of the number. For example, take the number 606: the mantissa holds the 606 itself, one bit holds the sign, and the exponent is what makes it 0.606 or 0.0606 or 0.000606 - the e-3 part is basically how big or small the number is. The more exponent bits you have, the bigger or smaller you can make the number; but if you normalize your numbers first, you don't have to spend as many bits on that scale. "For example, quantizing a 32-bit floating point tensor into an int8 tensor with a range of [-127, 127]..." Every data type has a maximum and a minimum - the largest number you can represent with an int8 is different from the largest you can represent with a 32-bit float. Let's ask Bard; this seems like an easy question, there's no way it can hallucinate this: what is the maximum value of an FP32 versus a uint8 versus an int8?
uint8 is an unsigned int8, which means you save the sign bit - you don't have to spend any of your precious bits on the sign because the number is always positive - so a uint8 can store anything from 0 to 255. uint8 is very common in images; that's why pixel values are between 0 and 255. With a signed int8 you now have to use one bit to store whether the number is negative or positive, so you can't store as many values, which is why int8 here is limited to -127 to 127 (I don't know whether the -128 thing is a typo or not). And a 32-bit float is dramatically bigger: you can store numbers up to about 3.4e38, so huge numbers.

Okay, so what do they have here? The normalization by the absolute maximum: X_int8 = round(127 / absmax(X_fp32) × X_fp32). You take each FP32 number, divide it by the maximum absolute value of all the FP32 numbers you want to store - normalizing them so they're centered on zero - scale up by 127 to bin them into the 127 possible integer levels, and then round. That round is where you lose precision: think of a continuous value on a red line being snapped to discrete bins - two different values can end up stored in the same bin. So you're fundamentally doing something lossy because of that round. c is the quantization constant, or quantization scale: c_fp32 = 127 / absmax(X_fp32), which depends on your data. Dequantization is the inverse: divide by the quantization constant and you get back roughly the original number - but it's not going to be exactly the same as the original, because you lost some precision with the round. Let's highlight this in green.
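Here's a tiny PyTorch sketch of that absmax scheme (my own illustration of the idea in Equation 1, not the paper's actual kernel), showing that quantize-then-dequantize does not give you back exactly the original values:

```python
# Minimal sketch of absmax int8 quantization with a single constant for the tensor.
import torch

def absmax_quantize(x_fp32: torch.Tensor):
    c = 127.0 / x_fp32.abs().max()            # quantization constant c = 127 / absmax(X)
    x_int8 = torch.round(c * x_fp32).to(torch.int8)
    return x_int8, c

def absmax_dequantize(x_int8: torch.Tensor, c: torch.Tensor):
    return x_int8.to(torch.float32) / c       # inverse: divide by the constant

x = torch.randn(8) * 0.1
q, c = absmax_quantize(x)
x_hat = absmax_dequantize(q, c)
print("max reconstruction error:", (x - x_hat).abs().max().item())  # nonzero: the round is lossy
```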
"The problem with this approach is that if a large magnitude value - i.e., an outlier - occurs in the input tensor, then certain bit combinations are not utilized well, with few or no numbers quantized in some bins." The problem is that any time you normalize, a single outlier can mess up your normalization. In his example, both plots use the same number of quantization bins, but if you have an outlier sitting out at -10, the bulk of your normal distribution is only covered by something like six bins; without that outlier, you get something like thirteen bins sitting right under the big probability mass. Outliers make quantization less efficient, because bins have to be allocated to a part of the range that barely stores anything - that's what "certain bit combinations are not utilized, with few or no numbers quantized in some bins" is referring to: a bunch of bins sitting out at -4, -5, -6, -8 without a single value in them, wasting your limited bins. "A common approach is to chunk the input tensor into blocks that are then independently quantized." So if your quantization constant depends on the absolute maximum of everything you're quantizing, and is therefore very sensitive to outliers, one thing you can do is say: okay, if I have 100 data points, let me break them into chunks of 10 and normalize each chunk independently. Each chunk gets its own quantization constant c, so now you have to store 10 quantization constants, which is a little annoying. There's a trade-off: how many quantization constants do you want to store, versus how badly do outliers mess up your quantization? "We chunk the input tensor X into n contiguous blocks of size B by flattening the input tensor and slicing the linear segment into n = (b × h) / B blocks. We quantize these blocks independently with Equation 1 to create a quantized tensor and n quantization constants c_i." X is the input tensor - any collection of data you're quantizing, in this case the weights of the base LLaMA model - n is the number of blocks and B is the size of each block. The annoying part is that now you have n different quantization constants to keep track of, and I think that's what they're referring to with double quantization, "a method that quantizes the quantization constants." This is some 500-IQ move: hey, let's chunk our numbers and quantize each chunk; the problem is that leaves us with a pile of quantization constants, so what if we quantize those quantization constants too? It's almost like a recursion kind of mentality.
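A hedged sketch of both ideas - block-wise constants, then quantizing those constants - plus the bit accounting behind the "0.37 bits per parameter" saving quoted earlier. The block sizes 64 and 256 follow the paper's choices; note the paper actually uses an 8-bit float for the second-level quantization, whereas this illustration just reuses int8 absmax, and the real implementation lives in bitsandbytes CUDA kernels:

```python
# Sketch of block-wise absmax quantization plus double quantization of the constants.
import torch

def blockwise_quantize(x: torch.Tensor, block_size: int = 64):
    blocks = x.flatten().view(-1, block_size)               # n = numel / B blocks
    c = 127.0 / blocks.abs().amax(dim=1, keepdim=True)      # one fp32 constant per block
    q = torch.round(blocks * c).to(torch.int8)
    return q, c.squeeze(1)

x = torch.randn(4096 * 4096) * 0.02
q, c = blockwise_quantize(x)

# Double quantization: the fp32 constants are themselves absmax-quantized to 8 bits,
# in second-level blocks of 256, leaving one fp32 constant per 256 blocks.
c2 = 127.0 / c.view(-1, 256).abs().amax(dim=1, keepdim=True)
c_int8 = torch.round(c.view(-1, 256) * c2).to(torch.int8)

# Rough per-parameter overhead of storing the constants:
plain  = 32 / 64                      # 0.5 bits/param for one fp32 constant per 64 weights
double = 8 / 64 + 32 / (64 * 256)     # ~0.127 bits/param after double quantization
print(f"constant overhead: {plain:.3f} -> {double:.3f} bits/param "
      f"(saves ~{plain - double:.3f})")
```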
Okay. "Low-rank adapter (LoRA) finetuning is a method that reduces memory requirements by using a small set of trainable parameters, often termed adapters, while not updating the full model parameters, which remain fixed." LoRA is based on low-rank matrix math - the rank is a property of a matrix, roughly how many independent rows or columns it has - and before I say something stupid, the definitions I'm finding ("low-rank approximation measures the fit between a given matrix and an approximation of reduced rank") are even worse. Basically, one way to think about it is that a low-rank adapter is an extra little set of weights you attach to the original base model, and those little weights add just enough signal that the original model changes its behavior. The key point is that they're not updating the full model parameters - the frozen base model doesn't change - they only push gradients into the low-rank matrices. "Gradients during stochastic gradient descent are passed through the fixed pretrained model weights to the adapter" - and it's important to note that this passing-through still has a cost: you still have to store those intermediate values, which is where the optimizer state can get out of hand - "which is updated to optimize the loss function. LoRA augments a linear projection through an additional factorized projection. Given a projection XW = Y with X ∈ R^(b×h) and W ∈ R^(h×o), LoRA computes Y = XW + s·XL1L2, where L1 ∈ R^(h×r), L2 ∈ R^(r×o), and s is a scalar." That's a lot of symbols; they define the dimensions, and b is presumably the batch size. If you look at a picture of LoRA fine-tuning: the big matrix is the pretrained model, the pretrained 65 billion parameter LLaMA, the LoRA is the much smaller set of weights next to it, and their outputs get combined. All you're doing is adding this little model that sort of parasitically attaches to the original, and pushing gradients into that. (And there's an ad on this page - oh my god, this is trash - let me just open the image in a new tab.) Ultimately you take the output of the LoRA and combine it with the output of the original model. Ideally the LoRA matrices just nudge things: they take the activations from your original base model and add a little bit to one, remove a little bit from another, and by nudging the activations slightly like that you get different behavior. That's what the plus in the equation is: the XL1L2 term is the LoRA part and XW is the original part. So X is generally the input to a neural net, Y is generally the output or the target, W means the weights, and L1 and L2 are the LoRA matrices.
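A minimal PyTorch sketch of that equation (my own illustration with made-up dimensions; the real thing is applied to the projections inside each transformer block via libraries like Hugging Face's peft):

```python
# Minimal sketch of the LoRA idea for a single linear projection:
# Y = X W + s * X L1 L2, with W frozen and only L1, L2 trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, h: int, o: int, r: int = 8, s: float = 1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(h, o) * 0.02, requires_grad=False)  # frozen W
        self.L1 = nn.Parameter(torch.randn(h, r) * 0.01)    # trainable adapter, h x r
        self.L2 = nn.Parameter(torch.zeros(r, o))            # trainable adapter, r x o (zero init)
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: b x h
        return x @ self.weight + self.s * (x @ self.L1) @ self.L2

layer = LoRALinear(h=1024, o=1024, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total} ({100 * trainable / total:.2f}%)")
```

Zero-initializing L2 means the adapter starts out contributing nothing, so at the beginning of fine-tuning the layer behaves exactly like the frozen base projection.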
Memory requirement of parameter efficient finetuning. "One important point of discussion is the memory requirement of LoRA during training, both in terms of the number and the size of adapters used. Since the memory footprint of LoRA is so minimal" - because it's a small low-rank matrix - "we can use more adapters to improve performance without significantly increasing the total memory used." By "more adapters" they mean: a LoRA sits on a specific layer, and your big 65B LLaMA is just a stack of transformer blocks, so they're going to have a little LoRA for each block, or for each layer inside a block - I'm not sure whether they attach it at the transformer block level or down at the little MLPs. "While LoRA was designed as a parameter efficient finetuning method, most of the memory footprint for LLM finetuning comes from activation gradients and not from the learned LoRA parameters." That's what they're getting at: a lot of your memory actually goes to the optimizer state - the activations and the gradients, having to run the chain rule in the backward pass. "For a 7B LLaMA model trained on FLAN v2 with a batch size of 1, the LoRA weights - equivalent to the commonly used 0.2% of the original model weights - take up only 26 MB, while the LoRA input gradients have a memory footprint of 567 MB." That's a good indication of the relative sizes: 567 megabytes for the input gradients versus 26 megabytes for the actual values of the LoRA parameters. "With gradient checkpointing, the input gradients reduce to an average of 18 MB per sequence, making them more memory intensive than all LoRA weights combined." I don't know why this 18 megabyte number is different from the 567.
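As a quick sanity check on that 26 MB figure, here's some rough, hedged arithmetic (assuming 16-bit adapter weights and the quoted 0.2% ratio; the paper's exact number depends on which projections get adapters and the chosen rank):

```python
# Rough estimate of the LoRA parameter footprint for a 7B base model.
base_params = 7e9
lora_params = 0.002 * base_params            # "commonly used 0.2% of the original weights"
print(f"LoRA params: {lora_params / 1e6:.0f} M "
      f"-> ~{lora_params * 2 / 1e6:.0f} MB at 16-bit")   # ~28 MB, close to the quoted 26 MB
```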
But okay, let's keep going. "In comparison, the 4-bit base model consumes about 5,000 MB of memory." So the base model is actually still pretty big. "This highlights that gradient checkpointing is important, but also that aggressively reducing the amount of LoRA parameters yields only minor memory benefits. This means we can use more adapters without significantly increasing the overall training memory footprint." More adapters - I don't think they mean a bigger adapter, they mean more LoRA matrices throughout the model, so you can change the behavior at all the different levels of the model. In some of the LoRA papers they only add the LoRA to the last couple of layers, and I forget which paper it was - maybe the PEFT paper - but one of them talks about how, depending on where you put these LoRA matrices, you get different kinds of behavior: put them at the bottom versus at the top and you're modifying the LLM's behavior at different levels of abstraction. So maybe by being able to put LoRA matrices throughout the entire model you get a better fine-tuning result. "This is crucial for recovering full 16-bit precision performance."

QLoRA finetuning. "QLoRA achieves high-fidelity 4-bit finetuning via two techniques: the 4-bit NormalFloat (NF4) data type and double quantization. Additionally, we introduce paged optimizers to prevent memory spikes during gradient checkpointing from causing the out-of-memory errors that have traditionally made finetuning on a single machine difficult for large models." So now we're learning more about the exact problem the paged optimizers address: when you do gradient checkpointing the memory spikes, and when memory spikes you get an out-of-memory error. The problem with OOM errors, if you're doing any kind of machine learning work, is that the dreaded CUDA out-of-memory error kills your entire program. If you can't solve those errors you're not going to be able to fine-tune on a single machine, and nothing sucks more than launching a run, going off to do something else, and coming back two hours later to find it died out of memory. "QLoRA has one low-precision storage data type, in our case usually 4-bit, and one computation data type that is usually BFloat16. In practice this means that whenever a QLoRA weight tensor is used, we dequantize the tensor to BFloat16 and then perform a matrix multiplication in 16-bit." Huh, that's kind of interesting: they store the weights in 4-bit, but when they actually do the matrix multiplication they convert to a 16-bit float.
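A hedged sketch of that storage-versus-compute split, using int8 absmax as a stand-in for NF4, since plain PyTorch has no 4-bit tensor type; the real on-the-fly dequantization happens inside the bitsandbytes CUDA kernels:

```python
# Store low-precision, compute in bf16: dequantize the weight just before the matmul.
import torch

W = torch.randn(1024, 1024) * 0.02
c = 127.0 / W.abs().max()
W_q = torch.round(W * c).to(torch.int8)                        # low-precision storage copy

def forward(x: torch.Tensor) -> torch.Tensor:
    W_bf16 = (W_q.to(torch.float32) / c).to(torch.bfloat16)    # dequantize on the fly
    return x.to(torch.bfloat16) @ W_bf16                       # ...then matmul in bf16

y = forward(torch.randn(4, 1024))
print(y.shape, y.dtype)                                        # torch.Size([4, 1024]) torch.bfloat16
```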
I do wonder how much that conversion costs, though - how long does the dequantization take? Is it efficient, or is it something that takes a while? "We now discuss the components of QLoRA, followed by a formal definition. 4-bit NormalFloat quantization. The NormalFloat data type builds on quantile quantization, which is an information-theoretically optimal data type that ensures each quantization bin has an equal number of values assigned from the input tensor." A quantile is basically what you get when you split a distribution into chunks. Everybody knows quartiles, where you split a normal distribution into four: each of those bins has the same number of points in it, so the bins close to the mean are narrower, because there are just more numbers in there, and the outer bins are much wider; but because the density falls off out there, each bin still carries the same amount of probability mass - 25% of the data is in Q4, 25% in Q3, then Q2, then Q1, and so on. "Quantile quantization works by estimating the quantiles of the input tensor through the empirical cumulative distribution function." This is what we were referring to at the very beginning of the paper: they're making an assumption, picking a prior, and the prior they pick is the normal distribution - "optimal for normally distributed weights." They're saying: I'm going to assume the weights inside my neural net are normally distributed, and if they are, then I should make the bins - the quantiles - near the center of the distribution thinner (the center should be zero, since you're normalizing), with fewer bins as you go out toward the tails. From chat: "It took me forever to realize that the NF4 data type is just a 4-bit lookup table of the values in Appendix E." You're telling me there's an Appendix E? Let's Ctrl-F "Appendix E"... there we go, boom. "The exact values of the NF4 data type are as follows." So we see the NF4 data type is limited to the range negative one to one. If you remember the Bard answer from before: uint8 can represent any integer between 0 and 255, int8 any integer between -127 and 127, and a 32-bit float can represent values up to about 3.4e38. NF4, this weird 4-bit float data type, can only represent numbers between negative one and one - but that's actually totally fine, because any time you're working with a neural network you're normalizing everything constantly: your layer normalizations, your batch normalization, normalizing your inputs.
And what we were saying before, and I'm kind of all over the place here so let me know if anything I say is confusing: because you have this normal distribution, you want more bins in the middle than on the outside; the outside bins are not that important. You can see that in the table of values: the difference between 0.7230 and 1.0 is nearly 0.28, but look how much smaller the bins are right next to the middle, where a bin is only about 0.08 wide. The bin at the end is more than three times bigger than the bin next to zero. Okay, let's keep going. The main limitation of quantile quantization is that the process of quantile estimation is expensive. Quantile estimation means picking which quantiles you're going to use, and of course their answer is going to be: we'll just pick the quantiles based on our assumptions. Therefore fast quantile approximation algorithms, such as SRAM quantiles, which I assume is some fancy way of picking quantiles quickly, are used to estimate them. Due to the approximate nature of these quantile estimation algorithms, the data type has a large quantization error for outliers, which are often the most important values. This is something Tim talks about a lot in that lecture, which I highly recommend, it's very good: you don't want to get rid of the outliers, because it's almost like the outliers are where the intelligence of the neural net comes from. If you get rid of them, if you normalize them to hell, the performance isn't quite there. So there's this weird relationship between the amount of outliers and the intelligence of the neural net; it's something he touches on more and more, and we'll see what he argues in this paper, but it was a very interesting outcome of this work. Expensive quantile estimates and approximation errors can be avoided when input tensors come from a distribution fixed up to a quantization constant; in such cases input tensors have the same quantiles, making exact quantile estimation computationally feasible. So they're basically saying: we're going to pick a quantization constant, and in fact not just one, they're going to chunk the input into a bunch of blocks and pick a quantization constant for each chunk. Once you normalize all the numbers in a block by that constant, everything lives in the same range, so you can just decide up front what the quantiles are. You don't have to run the fancy SRAM quantile estimation every time; you just pick a fixed set of quantiles, which is what they did here, and those quantiles generally work well for normally distributed numbers between negative one and one.
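Here is a rough sketch of that blockwise idea: split the tensor into chunks, record one absmax quantization constant per chunk, and normalize each chunk into the [-1, 1] range that the fixed quantiles expect. The block size of 64 matches what the paper uses later; the function names are just for illustration.

```python
import torch

def blockwise_absmax_normalize(w: torch.Tensor, block: int = 64):
    # Split the flattened tensor into blocks of `block` weights and normalize each block
    # by its own absolute maximum, so every block lands in [-1, 1].
    flat = w.reshape(-1, block)                 # assumes numel is divisible by `block`
    absmax = flat.abs().amax(dim=1)             # one quantization constant per block
    return flat / absmax.unsqueeze(1), absmax   # the constants must be stored for dequantization

def blockwise_denormalize(normalized: torch.Tensor, absmax: torch.Tensor) -> torch.Tensor:
    return (normalized * absmax.unsqueeze(1)).reshape(-1)

w = torch.randn(4096 * 64)
normalized, constants = blockwise_absmax_normalize(w)
print(constants.shape)   # torch.Size([4096]): the extra per-block state you have to keep around
```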
That way they don't have to recalculate quantiles every time. Since pretrained neural network weights usually have a zero-centered normal distribution, keyword there being usually, not necessarily, with a standard deviation sigma, we can transform all weights to a single fixed distribution by scaling sigma such that the distribution fits exactly into the range of our data type. As soon as they have a single fixed distribution, they can use the same quantiles for everything, and they can use the same NF4 data type everywhere. For our data type we set the arbitrary range negative one to one. It feels arbitrary, but it isn't really; I think that's a pretty good choice. As such, both the quantiles for the data type and the neural network weights need to be normalized into this range, and this is where the issues with outliers come from: the normalization depends on the outliers, because when you normalize you're taking a set of points and stretching them to fit within a specific range, in this case negative one to one, and if you have an outlier, the way everything else gets stretched is going to be worse. The information-theoretically optimal data type for zero-mean normal distributions with arbitrary standard deviation is then computed as follows. Now that they've put all these constraints in place (some standard deviation sigma, a range of negative one to one, zero mean, a normal distribution), they can actually use information theory to say what the best data type is, and this is how they get it: (1) estimate the 2^k + 1 quantiles of a theoretical N(0, 1) distribution, a normal distribution centered on zero with a standard deviation of one, to obtain a k-bit quantile quantization data type; (2) take this data type and normalize its values into the negative one to one range; (3) quantize an input weight tensor by normalizing it into the negative one to one range through absolute maximum rescaling, which is where the absmax constants come in. Once the weight range and the data type range match, we can quantize as usual. Step 3 is equivalent to rescaling the standard deviation of the weight tensor to match the standard deviation of the k-bit data type. More formally, we estimate the 2^k values q_i of the data type as q_i = 1/2 * ( Q_X( i / (2^k + 1) ) + Q_X( (i + 1) / (2^k + 1) ) ), where Q_X is the quantile function of the standard normal distribution; k is the number of bits, which for NF4 is four, and i runs over the 2^k levels.
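As a quick sanity check of that formula, here is a direct evaluation of the symmetric estimator with SciPy's norm.ppf standing in for Q_X. Read literally, the outermost probabilities hit 0 and 1, where the quantile function is infinite, so a small tail offset is needed in practice; the 0.03 below is my own placeholder, not the paper's exact choice.

```python
import numpy as np
from scipy.stats import norm      # norm.ppf is the quantile function Q_X of N(0, 1)

def symmetric_quantile_levels(k: int = 4, tail: float = 0.03) -> np.ndarray:
    # q_i = 1/2 * (Q_X(p_i) + Q_X(p_{i+1})) evaluated over 2^k + 1 evenly spaced probabilities.
    # The tail offset keeps Q_X finite at the ends; 0.03 is a placeholder, not the paper's value.
    p = np.linspace(tail, 1.0 - tail, 2 ** k + 1)
    q = 0.5 * (norm.ppf(p[:-1]) + norm.ppf(p[1:]))    # midpoints of neighbouring quantiles
    return q / np.abs(q).max()                        # normalize the levels into [-1, 1]

print(np.round(symmetric_quantile_levels(), 4))   # 16 symmetric levels, and none of them is exactly 0
```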
So you plug in each i here, and that's probably how they got those values; Q_X is the quantile function of the standard normal distribution. A problem for symmetric k-bit quantization, where symmetric I think refers to being symmetric across negative and positive... wait, these values are not symmetric, huh, that's weird. You have 0.0796, then 0, then negative 0.0911. Why is that not symmetric? One reason I could imagine is something like ReLU activations: certain activation functions have no mass on the negative side, or at least bias towards the positive, so maybe there's something there. My guess is ReLUs, but I don't know. Chat says it's probably because there's an even number of values and they want to have an exact zero; if there were 17 values it could be symmetric around zero. A problem for symmetric k-bit quantization is that this approach does not have an exact representation of zero, which is an important property for quantizing padding and other zero-valued elements with no error. What does it mean that it doesn't have an exact representation of zero? If you look at the quantiles, you have a bin for everything slightly above zero and a bin for everything slightly below zero, but there's no bin that sits exactly at zero: a value of 0.0001 goes into the slightly positive bin, a value of negative 0.0001 goes into the slightly negative bin, and an exact zero has nowhere to go; you have to choose between the slightly positive and the slightly negative bin. And that matters, because padding is something common whenever you have sequences, or padding in conv nets (I love showing those figures): any time you have a sequence or an image you can pad it, and the padding values are zero; zero is a very common padding value. So they're saying: if we're going to quantize these things, we need to be able to represent exactly zero, because sometimes we have padding equal to zero, or specific elements that are exactly zero. To ensure a discrete zero point and to use all 2^k bits of a k-bit data type, we create an asymmetric data type by estimating the quantiles q_i of two ranges: 2^(k-1) quantiles for the negative part and 2^(k-1) + 1 quantiles for the positive part. Okay, so that's why it's not symmetric: you have one more quantile on the positive side. Then we unify the two sets of q_i and remove one of the two zeros that occurs in both sets. So you end up with seven values that are negative, eight that are positive, and the one zero, for a total of 16: eight positive, one zero, seven negative. We term the resulting data type, which has an equal expected number of values in each quantization bin, k-bit NormalFloat.
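And here is the asymmetric version the paper actually describes: 2^(k-1) quantiles for the negative half, 2^(k-1)+1 for the positive half, merge the two sets, and drop the duplicate zero. With a small tail offset (again an assumption on my part, not the authors' exact number) this lands very close to the official NF4 table in appendix E.

```python
import numpy as np
from scipy.stats import norm

def nf_values(k: int = 4, tail: float = 0.0323) -> np.ndarray:
    # 2^(k-1) quantiles for the negative half and 2^(k-1)+1 for the positive half, both ranges
    # touching 0.5 so that zero is represented exactly; the duplicate zero is dropped when merging.
    # The tail probability is my own choice to keep the quantile function finite at the edges.
    neg = norm.ppf(np.linspace(tail, 0.5, 2 ** (k - 1)))[:-1]           # 7 strictly negative levels
    pos = norm.ppf(np.linspace(0.5, 1.0 - tail, 2 ** (k - 1) + 1))[1:]  # 8 strictly positive levels
    q = np.concatenate([neg, [0.0], pos])                               # 7 + 1 + 8 = 16 levels
    return q / np.abs(q).max()                                          # rescale into [-1, 1]

print(np.round(nf_values(), 4))   # very close to the NF4 table in appendix E, with exact -1, 0 and 1
```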
kbit normal float so equal expected number of value keyword there is expected where they're making this assumption that it's going to be a normal distribution and that there's going to be more values that are sitting here close to the zero than there are values that are going to be outside and that's that's going to be the more kind of that's going to eventually lead to this whole kind of uh looking into the actual outliers right which I think was the coolest part of this talk where he basically talks about how there's this kind of weird relationship where as you increase the size of the model the number of outliers starts increasing right and that that is going to with this here right because you're going to have a different number of values in these quantization bins if you have a bunch of outliers if you have no outliers then you can basically expect this uh these quantiles to be very very good in terms of having the same number of uh data points inside each bin but as soon as you start having outliers the number of data points inside each bin is going to be much less uh consistent with this Assumption of the normal distribution okay so as long as you have the zero centered normally distributed data this uh kbit normal float for whatever it's called nf4 quantization scheme is going to be the most efficient or information information theoretically optimal method of storing those that 75 cutoff after the model was bigger than 65 is so strange yeah that was the coolest part like I I love this kind of I love this kind of like when people find weird like discontinuities in the in the kind of like scaling Behavior right uh I think this is the actual plot but here you have the number of parameters so here you have a I don't know a llama 7B is right here and then you would have a llama 2B or a llama 3B or whatever it is right this is basically the size of the model and then here you have the percentage of layers where basically uh you have this outliers and you can see how the number of layers that have a bunch of outliers increases as your model size increases which is weird and the thing I like about this is is that this is kind of what separates machine learning from a lot of other uh computer science Fields is that in a lot of computer science Fields it's all about this kind of like uh information theoretical optimal right where there's basically there's always a correct answer right because everything is perfect and you can basically you know everything but when you step into machine learning because these systems are just so uh kind of like there's so many parameters and you're basically just you're you're trying to you have these Dynamics where you have so many parameters that you end up with weird behaviors like this and there's no way to prove with math that this happens it just it happens and you can you look at it and you and you basically you're almost more like a biologist right like biology is a field of science where you're observing things and then trying to like kind of just explain a pattern but you never really fully understand the underlying reason why something happens because it's just the system is so complex and sometimes I feel like machine learning is more like biology than it is like math and uh traditional computer science because of the weird like this right but that's also why I feel like it's cooler right because you're almost like a biologist studying and machine learning anybody here use a Jetson Nano I have not used a Jetson Nano that is a Jetson Nano is a 
Nvidia uh Edge compute device uh I wouldn't necessarily recommend it to be honest like I have a hot take but I feel like Edge compute is actually potentially dead IMO I think that once you have 5G internet 6G Internet 7g satellite like imagine 10 years from now when you have whatever 10 gigabytes satellite internet anywhere in the world why do you want to use Edge compute devices right like you're just gonna basically uh uh do your inference on some Cloud device that has 10 times the amount of memory so I don't know I feel like the the advance of cloud or basically the amount of money that's going to go into Cloud compute and data centers combined with the amount of money that's currently going into like satellite-based internet and how that's getting faster and faster and faster and faster I feel like there's going to come a time where even even things that are weird like even your cell phone for example might not even have a very powerful uh or any ability to do any kind of local inference your cell phone might just become a screen with uh with a Wi-Fi chip or not a Wi-Fi whatever the 5G chip is right I don't know that's a spicy take and I realize that there's a lot of things wrong with that but that that's just uh something a future prediction that I've made there okay I'm getting distracted double quantization double quantization is the process of quantizing the quantization constants what about quantizing the quantization constants for the quantization constants while a small block size is required for for precise 4-bit quantization it also has a considerable memory overhead for example using 32-bit constants and a block size of 64. for w quantization constants add 32 by 64 0.5 bits per parameter on average Okay so yeah the problem is that so everything that they've described here right this entire quantization scheme depends on putting everything into this zero mean normal distribution and as soon as you do the zero mean normal distribution you need to keep track of this quantitization constant but now that quantity quantization constant is also some number and now you need to figure out okay well how do I quantize that number right more specifically double quantization treats quantization constants which here they're being stored as floating Point 32s which is terrible right floating Point 32 is like the most intense data type ideally you want to be able to like compress those into something a little bit smaller the Second Step yields the quantized quantization constant which goes floating point eight okay so they're able to take these quantization constants and instead of storing them as floating Point 32s they can store them as floating Point eights which are way less right 1 4 there uh Jensen said something similar versus the server is the computer or something along those lines yeah yeah I mean obviously he's saying that because it benefits him as the producer of like gpus like Jensen I feel like the future that Jensen wants he doesn't want to be selling gpus to Consumers right like I feel like if you're sitting in Nvidia you're like dude I don't want to be making gpus that like fit on consumer motherboards and like just look at the size of a 30 90 right like 30 90 GPU versus like a 1080 like they're huge like they barely fit in your computer yeah like I have both of these gpus like the 3090 is disgustingly huge and it's you can feel how within Nvidia they probably are like dude we we don't know what to do anymore like these gpus don't fit on consumer uh computers anymore and 
hey, wouldn't it be convenient if we just didn't have to design for consumer computers anymore and could design GPUs that are huge and really only fit inside a data center? I think it's an obvious trend that people are going to start designing everything for data centers, because that's where all the money is going to come from: the amount of money a company like Nvidia makes when someone like Twitter says they want to buy a thousand GPUs is way more than they make by selling individual GPUs to consumers. So there are a lot of factors that are going to lead to a situation where the cloud GPUs are just so much better and so much faster, and as soon as you have fast internet it becomes a no-brainer to do all your computation in the cloud and just send things back and forth. Kind of like Stadia, if you guys remember Google Stadia: the idea where you're playing a video game, but the game isn't running on your device at all, it's running in the cloud and you're just getting a stream of it, basically watching a Netflix stream, except it's streaming a game you're playing on a cloud computer. It didn't work out, they ended up closing that division, but imagine that type of mentality for training and for inference. nissio, you keep distracting me with this edge compute versus cloud compute debate, but it's an interesting thing to talk about. Chat: if we all want voice OSes like the one in the movie Her, we're going to be carrying a GPU around with us. Yeah, okay. We were talking about the quantization of these quantization constants, and they're able to quantize them down to 8-bit floats: they use 8-bit floats with a block size of 256 for the second quantization, since no performance degradation is observed for 8-bit quantization, in line with results from one of their own previous papers. Since the c2 constants, stored as FP32, are positive, we subtract the mean from c2 before quantization to center the values around zero. So, boom, it's the same thing they did to quantize the weights. For the weights they said: the annoying thing is that you have very big and very small values, therefore we're going to normalize them, subtract the center, center them around zero, in order to quantize them more efficiently. Here they're doing the exact same thing for the quantization constants: they want to take them from 32-bit floats down to 8-bit floats, but some of these constants are annoyingly big or annoyingly small, so let's center the values around zero to make it more efficient. But now the problem is: if you center them around zero, don't you need to keep track of this mean? Are you going to have to quantize the mean of the quantization constants too? This quantization reduces the memory footprint from 0.5 bits per parameter to about 0.127 bits per parameter, a reduction of 0.373 bits per parameter. We subtract the mean from c2 before quantization; so are they storing this mean? They're storing the mean of the quantization constants at 32-bit precision, they're storing the quantization constants themselves at 8-bit float precision, and then they're storing the actual values of the parameters, the weights inside the neural net, at NF4 precision.
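The bits-per-parameter numbers there are easy to sanity check with a couple of lines of arithmetic (ignoring the single 32-bit mean stored per tensor, which is negligible):

```python
# Sanity check of the bits-per-parameter overhead coming from the quantization constants.
block_w, block_c = 64, 256        # block size for the weights and for the constants

without_dq = 32 / block_w                           # one FP32 constant per 64 weights
with_dq = 8 / block_w + 32 / (block_w * block_c)    # FP8 constants, plus FP32 constants for those

print(f"without double quantization: {without_dq:.3f} bits/param")            # 0.500
print(f"with double quantization:    {with_dq:.3f} bits/param")               # ~0.127
print(f"saving:                      {without_dq - with_dq:.3f} bits/param")  # ~0.373
```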
Yeah, and this second step is the part that's more sketchy to me. Saying the weights of a neural net are normally distributed, fine, I think that's a good assumption; saying the quantization constants for the weights of a neural net are normally distributed, I don't know, that's a weirder assumption. But hey, I guess it works, because if it hadn't worked they wouldn't have presented it. All right, now we get to paged optimizers, the part of this paper that I understand the least. It's based on NVIDIA's unified memory feature, which does automatic page-to-page transfers between the CPU and the GPU for error-free GPU processing in scenarios where the GPU occasionally runs out of memory. I don't know exactly what a page means in this context, but basically you have this problem where the CPU has access to its memory, the RAM, and the GPU has access to its memory, the VRAM, and sometimes the GPU isn't going to have enough. Unified memory seems to be a way for the GPU to say: hey CPU, I can't actually hold this extra two gigabytes of stuff, can you store it for me and give it back if I ever need it? That's my guess as to what this means; I'll be honest, I don't really understand NVIDIA unified memory, but it seems to be an efficient, fast way of using CPU RAM as overflow storage for the GPU. This feature works like regular memory paging between CPU RAM and the disk. The disk here is your cold storage; it used to literally be a spinning disk, which is why it's called that, but now it's usually solid state. There was a time when building a computer meant getting an actual HDD and plugging it into the motherboard with a SATA cable, but now you get these little NVMe solid-state flash modules and slot them straight into the motherboard, so that's one of the few things that has actually changed if you build computers. Disk refers to a somewhat outdated technology there. We use this feature to allocate paged memory for the optimizer states, which are then automatically evicted (I like the word evicted) to CPU RAM when the GPU runs out of memory. So the optimizer states, all those little values inside your Adam optimizer, sometimes overflow the GPU memory, and rather than the GPU throwing its hands up and killing the program, the GPU can say: hey CPU, can you store these in your RAM real quick? And the CPU says: yeah, no problem, buddy. And later the GPU says: remember that stuff I asked you to store, can I have it back? And the CPU pages it back.
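In code, using this mostly looks like swapping out the optimizer class. A minimal sketch, assuming the paged optimizer classes that bitsandbytes shipped alongside this work (the exact class names vary by version, and this needs a CUDA device to actually run):

```python
import torch
import bitsandbytes as bnb   # assumes a bitsandbytes release that ships the paged optimizers

model = torch.nn.Linear(4096, 4096).cuda()

# Drop-in replacement for AdamW whose optimizer state lives in paged (unified) memory: when a
# gradient-checkpointing spike would otherwise trigger a CUDA out-of-memory error, those pages
# are evicted to CPU RAM and paged back in later instead of the program dying.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()    # dummy objective, just to exercise the optimizer
loss.backward()
optimizer.step()
optimizer.zero_grad()
```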
Using the components described above, we define QLoRA for a single linear layer in the quantized base model with a single LoRA adapter as follows. So this is QLoRA for a single layer, and remember they said earlier that they're going to have LoRA adapters at a bunch of different layers. Here you have c1 at FP32, the 32-bit constant coming out of the double quantization; then you have the quantization constants c2, which are now centered at zero and stored, I think they said, as 8-bit floats; and then you have the actual weights of your base model stored in NF4, 4-bit. So you have 32-bit, 8-bit, 4-bit. And then here you have the actual LoRA, the low-rank matrices, and they're written as bfloat16. But is it actually stored at bfloat16? If you remember, up here in the paper they say the storage data type is 4-bit; where is it... yeah: in practice, whenever a QLoRA weight tensor is used, we dequantize the tensor to bfloat16 and then perform the matrix multiplication in 16-bit. So the weights are stored in 4-bit, but every time they do the matrix multiply with the input to that layer of the neural net, which is bfloat16, they dequantize from 4-bit to 16-bit and multiply at 16-bit. This whole matrix multiplication happens in 16-bit precision, and this doubleDequant function takes in c1, c2, and the NF4 weights and spits out the base weights in 16-bit precision. Then you multiply the 16-bit precision weights with the 16-bit precision input, and add that to the 16-bit precision input times the 16-bit precision LoRA weights, and that gives you your final output in 16-bit precision. And because a neural net is a stack of these layers, this bfloat16 output becomes the bfloat16 input x for the next layer. Chat: my understanding is that they just store the frozen weights of the large model quantized, but they keep the LoRA weights at 16-bit anyway. Yeah, I think you might be right. They say QLoRA has one low-precision storage type, usually 4-bit, but nissio, here's what I think is happening: if you're doing inference with a QLoRA, your base model is in 4-bit mode and your adapter can be in 4-bit mode as well, but when you're doing training, your base model is in 4-bit mode and your LoRA is in 16-bit mode. So I think you're right: during training, the LoRA parameters here, this L1 in bf16 and this L2 in bf16, are just stored as bf16, so you don't have to run the double dequant on them all the time. If that weren't the case, you would see a dequant call here for each of them as well, which would be unnecessary. Okay: we use NF4 for W and 8-bit floats for c2; we use a block size of 64 for W for higher quantization precision, and a block size of 256 for c2 to conserve memory.
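Putting the pieces of that equation together as a toy forward pass (the earlier snippets showed the blockwise and lookup-table mechanics, so the double dequantization is collapsed into a simplified stub here; the LoRA scaling factor is omitted and all names are illustrative):

```python
import torch

def double_dequant(c1, c2_q, w_codes, nf4_table, block=64):
    # Stub of dequant(dequant(c1, c2), W): first recover the per-block absmax constants from their
    # quantized form (stored as FP8 in the real implementation; just a scaled float tensor here),
    # then turn the 4-bit NF4 codes back into bfloat16 weights.
    absmax = c2_q * c1                                          # simplified second-level dequantization
    vals = nf4_table[w_codes.long()].reshape(-1, block)         # codes -> normalized values in [-1, 1]
    return (vals * absmax.unsqueeze(1)).reshape(w_codes.shape).to(torch.bfloat16)

def qlora_linear(x, c1, c2_q, w_codes, nf4_table, lora_a, lora_b):
    # Y^bf16 = X^bf16 @ doubleDequant(c1, c2, W^NF4)^T + X^bf16 @ A^T @ B^T
    # x, lora_a and lora_b are assumed to already be bfloat16; only the frozen base weights
    # need dequantizing on the fly, while the LoRA branch lives in bf16 the whole time.
    w = double_dequant(c1, c2_q, w_codes, nf4_table)
    return x @ w.T + (x @ lora_a.T) @ lora_b.T
```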
For parameter updates, only the gradient of the error with respect to the adapter weights is needed; you're not going to update any of the base model weights, so you don't need to store any gradients with respect to those. So when you're doing backpropagation, the chain rule, you're taking a bunch of partial derivatives. You're asking: what is the partial derivative of my output, or more specifically of the loss, which is usually the difference between the output and the actual target, with respect to the weights of my low-rank adapter? That's what this is: the partial derivative of the error E with respect to Li, the weights of the LoRA, and they're saying that's the only thing they're interested in. We don't actually care about the partial derivative of the error with respect to the base weights W, because we're never going to compute a gradient update for those weights. But, and this is exactly what they say next, if you have adapters at multiple layers, the gradient does need to flow through the frozen layers: if your LoRA is all the way down at the bottom layer, you need to backpropagate through all the layers above it. This is why a lot of LoRA papers only fine-tune the very top layers, only add LoRAs near the top; they don't want to compute the gradient all the way down to the bottom, because that's the expensive part. However, the calculation of the gradient with respect to the LoRA weights entails the calculation of the partial derivative of the layer input x with respect to the frozen weights W, which proceeds via equation 5, with a dequantization from the storage data type to the computation data type: the weights need to be dequantized before you can calculate that derivative in bfloat16 precision. So, to summarize: QLoRA has one storage data type and one computation data type; we dequantize the storage data type to the computation data type to perform the forward and backward pass, but we only compute weight gradients for the LoRA parameters, which use 16-bit BrainFloat. BrainFloat16 versus float16: I think the difference is just in how many bits you use for the exponent versus the fraction... hmm, let's look at bfloat16 versus float16.
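Before reading on, PyTorch can show the practical difference between the two 16-bit formats (and float32) directly:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16} max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

# Roughly:
#   float32   max ~3.4e38    tiny ~1.2e-38   eps ~1.2e-07
#   float16   max ~6.6e04    tiny ~6.1e-05   eps ~9.8e-04
#   bfloat16  max ~3.4e38    tiny ~1.2e-38   eps ~7.8e-03
# bfloat16 keeps float32's dynamic range (8 exponent bits) but is much coarser (7 fraction bits);
# float16 is more precise but overflows past ~65k and loses the tiny values much sooner.
```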
This has lower accuracy but is faster, something like 10 bits of precision versus 24 bits. The part that confused me at first is where that 24 comes from: it's float32. Float32 keeps 24 bits of significand precision, float16 keeps about 11, and bfloat16 only keeps 8, so the comparison is against full 32-bit floats, not against float16. The answer also says bfloat16 was introduced by Google in 2017 and calls it the binary16 floating point format, which is mixed up: bfloat16 did come out of Google Brain, hence the name, but binary16 is the IEEE name for ordinary half precision, so my suspicion that this was partly hallucinated seems fair. All right, here we go, this is what I wanted: the 16-bit float layouts. You have one bit for the sign, the bit that determines whether the number is negative or positive, and the remaining bits split between the exponent and the fraction; if it were an unsigned format you'd have one more bit to spend on either the fraction or the exponent. Regular float16 uses 5 exponent bits and 10 fraction bits, while bfloat16 uses 8 exponent bits and only 7 fraction bits, so it actually spends more on the exponent than on the fraction. Why would you want a 16-bit floating point format like that? Because when you're training these neural nets you're constantly dealing with very, very small numbers, 0.0001, 0.00003, that kind of thing, and to be able to cover a wide range of magnitudes, including those tiny ones, you need more exponent bits. So that's the trade-off bfloat16 makes: rather than 5 bits of exponent and 10 bits of fraction, it uses 8 bits of exponent and 7 bits of fraction. Chat: I bet you're really good at CTFs, like Hack The Box. No, dude, I'm trash, I'm not actually a computer scientist. My original degrees are in physics and mechanical engineering, and I taught myself Python; there was a point where the only programming languages I knew were HTML from making a website and Matlab. My path to machine learning was as a physics person doing a lot of computation-type stuff, then more of a math-y robotics person, and then Python and data science, so that's the path I took. I actually have a brother who works at Facebook, and he's much more of a computer science person; he has an actual computer science degree, he went to the same school as I did, Carnegie Mellon, and he knows this stuff way better. He could write a CUDA kernel; I can't write a CUDA kernel, and if I looked at one I probably couldn't tell you what it does. So I appreciate the compliment, but I do suck at computer science, to be honest. Anyway, that's what a brain float is: a float that uses more bits for the exponent so you can represent much smaller (and much larger) numbers. QLoRA versus standard fine-tuning: we have discussed how QLoRA works and how it can significantly reduce the required memory for fine-tuning models. The main question now is whether QLoRA can perform as well as full-model fine-tuning. Furthermore, we want to analyze the components of QLoRA, including the impact of NormalFloat4 over a standard float4.
okay so obviously they're going to want to show that their special data type that they made this nf4 is better than the standard F4 is there a standard F4 here I wish they had this but for all the data types right like wouldn't that be kind of cool if they had that picture but for all the different data types Tim was very humbled not naming the nf4 after him Tim float has a nice ring to it yeah he's it's true you know he could have very easily I think the reason this is called the brain float is because it's from Google brain right so he could have called it the Tim float tf4 but I don't know I actually I I've never met this guy I've never met Tim but like based on this uh kind of video of him talking he's he's very like he's almost like autistic you know what I'm saying like he's very mathematical autistic the way he answers the questions he's very polite of course but he doesn't seem like the type of person who would like name things after himself right he's got like zero ego to the point that it's just like he's literally like a human computer you know which I like it I like those kind of people I like these kind of human computer kind of autistic people because those are my people but it's very different from the group of people that would name something like that after themselves uh we consider three architectures encoder encoder decoder and decoder only so these are different types of Transformer architectures uh we compared qlora with a 16-bit adapter fine tuning and with full fine tuning okay so basically they're going to be doing a big ablation study and I think that's another thing that's good about this paper is that sometimes papers like this they will have some new technique like the NF float4 whatever and then they'll just show you one example where it works better right there's like here's our new technique and then here's one thing where it works better and then good luck figuring out whether this would work and anything else right but I think one thing that I saw when I was looking at this paper is that they fine-tuned more than a thousand models so they're going to do extensive studies here they're going to try every single weird combination of like fine tune on this and then inference with this and then fine tune on that an inference with that so like that's that's a good part of this paper is that they're not scared of trying all these different types of variants of ablation studies in order to really understand deeply exactly what is driving performance and what is not okay so here all the different benchmarks they're going to use they're going to use glue with Roberta large Supernatural instructions T5 five shot mmlu flan V2 these are just like different data sets and benchmarks alpaca is a model Roberta large is a model to additionally study the advantage of nf4 we use the setup of uh measure post quantization zero shot accuracy and perplexity so perplexity is basically a quantifiable measure of performance for llms it's used a lot in base models and then zero shot accuracy I guess is probably just instructions so perplexity is generally used for base model evaluation zero shot accuracy implies some kind of task where you can get the right answer or the wrong answer so that to me implies the instruction models but I know we'll see what hap what they mean a couple bunch of different model sizes here while paged optimizers are critical up to to do 33 billion and 65 billion on a single GPU we do not provide hard measurements for page to Optimizer paging only occurs when 
processing mini batches with a very long sequence lengths okay so this paged Optimizer stuff that they were talking about where basically the uh CPU and the GPU are sharing or the GPU was talking to the CPU so that they can store these intermediate uh optimizer parameters in the CPU Ram as opposed to the actual uh GPU vram so they say that you that only happens when your mini batch has long sequence lengths of course the memory uh footprint of a transformer is going to depend on the sequence length right it's quadratic with respect to the sequence length so the longer your sequence the more memory it's going to take to put that into your GPU vram which means that if you have long sequences you're not going to be able to fit everything on your GPU vram which means that you're going to have to store some of it in the CPU Ram which is when you're going to be doing this paged Optimizer crap we do however perform an analysis of the runtime of paged optimizers for 65b on 48 gigabyte gpus and find that with a batch size of 16 page optimizers provide the same training speed as regular optimizers future work should measure the character measure and characterize under what circumstances slow downs occur from paging process yeah and in general you don't want to do this in general anytime your CPU or your GPU needs to stop and then say hey CPU can you send me this stuff and then the CPU says hey yeah sure and it grabs it from the RAM and then gives it to the GPU anytime you do that your GPU is just sitting there and it's waiting right it's sitting there waiting for the CPU to give it stuff so that's uh GPU utilization uh time plot that's one of the biggest bottlenecks in uh in deep learning right now and G is GPU utilization which is basically the fact that actually a lot of times the gpus are just sitting there idle just waiting for the CPU to give them stuff right and that's another reason why uh the uh GPU interconnects the technology and data centers is mostly about addressing this right is that you have these very Advanced kind of like internet interconnects and switches because the speed of like a of an of your motherboard right if you go into your motherboard the speed that your CPU can grab stuff from your RAM and give it to your pcie slot which is where your GPU is is actually quite slow and a lot of times your GPU is just sitting there waiting for the CPU to give it stuff so inside these data centers the amount of innovation that they're putting into actually basically making the the interconnect speed right these like plx switches that that are very very good and very very fast that's where a lot of the money is and or a lot of the performance and actually uh Nvidia or not Nvidia uh Tesla dojo one of the Innovations of the Tesla Dojo chip is that they put all the on top of the chip right so for example when you look at a GPU the actual chip in a GPU is uh is not like it's just like a little let me see if I can get a GPU without fan GPU without fan yeah so this is the chip right here you see the square that's the actual chip all this other crap here is basically just intermediate uh caches you have different like little tiny memory caches but anytime you want to send something to the GPU chip to actually perform a matrix multiply it needs to go through this pcie slot that needs to go through all this what tangle of over here and some of this is actually is what provides power to the chip and so on but in the Tesla Dojo chip they make it all vertical so all of the power module all of the 
cooling all of that is right on top of the chip which means that you can take these chips and put them right next to each other right so now this chip can pass any value in the intermediate value into this next chip and they're right next to each other so that they can very very quickly pass information between each other and that results in a huge uh Improvement in speed especially if you're being limited by the speed of at which you can transfer data right and these are sometimes called computation like meshes right and basically what it refers to is that you can pass information between these different uh GPU chips right and the TPU does this as well right versus like if you had a computer right if you had a computer with two gpus in it and you want it to pass something from one GPU to the other GPU it would have to go from the first GPU into the pcie to the CPU to the memory back to the CPU back to the second pcie and then into your second GPU so that is such a longer path that it is way more inefficient so more and more in the kind of world of kind of like uh making Advanced uh machine learning computation uh data centers right like these kind of like mesh computation meshes if you want to think of it that way I don't know if that's the right word someone can feel free to correct me if I'm saying anything wrong here but these are basically the future is like these these chips these little GPU chips that are all just like right next to each other so that they can very very quickly pass information right so here for example you have one two three four five you have a five by five each one of these chips think of it as like a GPU each one of these red squares is like a GPU and then here you have these ju these like kind of a fancy like uh uh I don't know what these are called I guess Network switches or whatever and each of these can pass nine terabytes per second through so a lot of the effort in computer design and data center design is going into like how do we pass if data more efficiently in between these different chips rather than necessarily hey how do we get this chip to perform more Matrix multiplies uh it seems like all of this should be done automatically by a deep learning compiler it's all quite machine specific yeah and that's the even more crazy part is that you write some code right in pytorch that code is getting compiled to run on a Cuda device right you're you're compiling it into Cuda code right and that Cuda code is what actually runs on your GPU but that Cuda code is limited by compatibility right it needs to be able to compile it into Cuda kernels that can run on your GPU at home but if you can compile your pytorch code into something that is specifically designed to run in these type of systems like this that have like these like giant meshes of of chips it's going to be a lot faster and that's that's kind of why the uh openai Triton uh this is another one here there's more and more uh kind of people that are basically coming up with uh different ways of compiling your uh high level code written in pi torch or tensorflow into super low level code that runs as efficiently as possible depending on the exact Hardware that you're going to train this on right so more and more the the design of the model the design of the actual training process and all of the parameters that you choose are more and more specific to the hardware that is going to be run on the hardware that is going to be trained on and so on so you're seeing kind of this like vertical integration 
happening in the deep learning world. It used to be that there was a separation: everybody used Nvidia chips, and in terms of the framework you could use TensorFlow, you could use PyTorch, whatever you wanted, and all of it largely worked; you could run TensorFlow code on an Nvidia GPU, you could run PyTorch code on an Nvidia GPU, who cares. But now you're going to see these vertically integrated solutions where, if you write code in TensorFlow, maybe it isn't targeting an Nvidia GPU at all; it compiles specifically to a TPU, the tensor processing unit made by Google. Or the people at Tesla, when they design their neural net, they design it so that it trains very quickly specifically on their own machine. This kind of increased vertical integration is going to be more and more of a thing as we go into the future. Okay, I'm sorry for the distraction there as well. Default LoRA hyperparameters do not match 16-bit performance when using the standard practice of applying LoRA only to the query and value attention projection matrices. Attention is the big, heavy part of the Transformer, the keys, queries and values and the big attention map; that's the memory-intensive part, because every position of your input sequence gets compared against every other position, so you have this quadratic memory cost. With that standard setup they were not able to replicate full fine-tuning performance (they show this for LLaMA 7B). We find that the most critical LoRA hyperparameter is how many LoRA adapters are used in total, and that LoRA on all linear Transformer block layers is required to match full fine-tuning performance. This is interesting: what they're saying is that the most important thing when you take a base model and fine-tune it with a LoRA approach is to have LoRA adapters on every single linear layer. It's not the size of your LoRA weight matrix, it's not the total amount of data you fine-tune on, it's not which particular layer you pick to put them in; it's whether or not you have them at every single layer. That's kind of interesting to me. Other LoRA hyperparameters, such as the projection dimension r, do not affect performance. The projection dimension r, let me see if I can find the LoRA picture, is this dimension here: the LoRA is low rank, which means r is small, and that's where all the efficiency comes from. That's why a LoRA matrix doesn't need as many parameters as the full weight matrix; the additional weights are much, much smaller because it's a low-rank matrix, because of this r. And they say the actual value of r doesn't matter very much. Similarly, we find that the default hyperparameters for fully fine-tuned baselines are undertuned, so we do a hyperparameter search over learning rates. A hyperparameter search is when you try a bunch of different hyperparameters and pick the best one, and that's actually what I worked on when I was at Weights & Biases: I worked on W&B Sweeps, which is a very popular hyperparameter sweep tool, so that's part of what your boy did. Results for LLaMA 7B are shown in the figure.
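Since "LoRA on every linear layer" is the headline finding here, this is roughly what that looks like with the Hugging Face stack. The module names below are the LLaMA ones, and the 4-bit options are the ones bitsandbytes/transformers exposed around the time of this paper, so treat the exact arguments, versions, and the model id as assumptions; it also needs a CUDA GPU and access to the LLaMA weights to actually run.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model with double quantization and bf16 as the computation dtype.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b",
                                             quantization_config=bnb_config,
                                             device_map="auto")

# Adapters on every linear layer of every Transformer block, not just q_proj / v_proj.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the LoRA matrices are trainable
```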
shown in figure okay so these are very small learning rates what does that not mean you can have R equals one always will I do any hardware streams I don't know if I I have a Servo so actually one thing I was thinking about doing is I have servos and I was like What if I make the Discord bot Servo controllable so you guys can basically control this little robot from Discord and then uh Babu the cat you guys can basically play with boo but I don't know it's an idea uh what does it mean you can have R1 always then I think you can have different values for r I don't know what the values are I'm sure if you click on here so here's the R value you have 8 16 32 64. and then Rouge is some kind of performance metric each dot represents a combination of hyper parameters and for each lower we want three random seeds the performance for a specific Laura values appears to be independent of other hyper parameters okay so what do they mean here they mean that if there was a relationship between the performance and and the Laura rank R then you would see for example maybe a rank of 64 you would see a bunch of numbers here A bunch of little dots here and then a bunch of Dots here and that would tell you that a Laura with an R of 8 is worse than the Laura where the r of 64 but the fact that there doesn't seem to be any pattern here means that the actual value for this R didn't really matter and that if you had an R of 8 you kind of got the same spread if you had an r64 you kind of got the same spread but if it doesn't matter what size R is then would you not always set it to one yet yeah I agree with you you want R to be as small as possible because the total size of r or the size of the Laura and the size of the weight of that little Laura the size of the number of the values inside the Laura weight Matrix is dependent on r so the smaller you make the r the smaller you can make the uh size of the Laura maybe it just doesn't really matter maybe the difference between a Laura with R equals 64 and a Laura with R equals eight is like tiny you know that might be the situation is that like the actual relative size difference between different R values for Laura is not significant enough that it matters I don't know uh dude we're just like not making any progress I'm taking forever here I'm gonna take a quick uh pee break I'll be back all right sorry about that guys uh okay we were here they were evaluating they were saying that the r size for the Laura isn't important 4-bit normal float yields better performance than 4-bit floating point okay so this is good because obviously normal float is what they were trying to say uh we follow the setup they try opt Bloom pythia and llama these are different llms with different data types nf4 improves the performance significantly over fp4 and into four so nf4 is basically universally better here we see the mean zero shot accuracy over window Grand so these are different uh benchmarks and they're using a bunch of different models so the N float plus double quantization that's this green line and Float is the orange line and then float four is the blue line so both end floats perform significantly better maybe not significantly I don't know like what the difference between a 0.63 is and a 0.65 that doesn't actually seem like that much but at least it's better but you see how the double quantization really doesn't so here with the double quantization you get a little bit worse performance in the smaller number of models so this is the total number of model bits which is 
basically the relative size of the model but you see here once the model gets to a certain size the performance of the double quantized end float 4 and the envelope 4 are basically the same right and actually one thing you notice here is notice how the this orange dot is a little bit ahead of this uh Green Dot and that's because once you do the double quantization you can actually reduce the size of the model right and you need double quantization reduces the memory footprint which is why this orange dot is a little bit further to the right than this Green Dot recent findings have established that 4-bit quantization for inference is possible but at least to Performance degradation relative to 16 bit this raises the crucial question whether the loss performance can be recovered by conducting 4-bit adapter fine tuning so inference is when you're actually using the model and in that point you uh you can quantize it more aggressively so like something that people do is they'll train a big model then they'll quantize it and then they'll use it at that quantization uh at a much lower quantization and it's pretty good but it's not quite as good which is important to note right and that's largely what they're doing in this paper right and then this paper they're basically saying that hey we can fine tune with a quantized 4-bit llama rather than requiring the full llama so they're saying hey can we make up for the difference in that if we since we're fine tuning using a 4-bit llama is that going to mess it up like could we make up for the fact that it's a 4-bit llama okay so they're going to be comparing against just fine tuning directly so pushing into the model itself with 16-bit 16-bit 8-bit and 4-bit adapter methods replicate the performance of the full 16-bit Baseline to suggests that the performance loss due to the imprecise quantization can be fully recovered through adapter fine tuning after quantization this is kind of weird right this is like some alchemy right here right where it's like it's not really fully understood what the is going on why this is the case but I guess it is there so uh this is PPL this is perplexity so perplexity is base is a way to measure the base model performance uh pile common crawl is a big data set of text that is commonly used for training these base models you have different models here opt Bloom llama and pythia and the mean perplexity refers to the fact that each of these uh parameters are going to use different are going to have different perplexity results and then they average them and that's why you get the mean but what you want to see here is that uh End float4 plus double quantization gets the best perplexity score I guess compared to float4 float I don't know what e23 or whatever this is here e2m1 or E3 m0 but obviously it's going to be better than N4 because N4 is basically garbage yeah the Frozen weights are basically in inference mode and the Laura can be quantized as a result experiments comparing 16-bit brain float 8-bit 8-8 4-bit float and 4-bit normal float on glue and Supernatural instructions uh so glue I guess this is an accuracy so higher is better but all these numbers are all basically the same so this is uh full fine tuning at 16-bit Precision so this is with your base model at 16 bit and then you're pushing gradients into it at 16 bit so this is actually what full fine tuning would look like this is Laura fine tuning at 16 bit so I think this means the model itself is at 16 bit and then you have a Laura which is also at 16 bit and 
you get a little bit better right because you're fine tuning it with a Laura you're adding a little bit of extra weights rather than here you're just keeping the size of the model exactly the same so that makes sense why you're getting slightly better Q Laura now you're talking about uh the model itself is 4-bit and the Q Laura is 8-bit I don't know it's not entirely clear what the size of the model is here I think the here when they say Q Laura I think the the model is the Frozen 4-bit model so this is the Frozen 4-bit model and then the Q Laura uh with the nf4 and the double quantization why didn't they I guess you can't do this with the Roberta large I don't know but basically what they're showing you is that these numbers are very close to this number right that's what they want to show you with this table is that you're not getting a significant performance drop by doing this you might even get a performance Improvement sometimes since fine-tuning full models requires more than one server of a high memory gpus it's really hard to fine-tune full models which is why I think when you're if you're a researcher and you want to pick what research to do I think you should be looking into Laura quantize Laura like some kind of fine-tuning based project right like make sure that whatever you're trying to do whatever your research paper is pick something where you're not going to have to fine tune a actual real model you're not gonna have to train anything from scratch right you want to make sure that whatever your research you're doing the the compute just boils down to like some kind of Q Laura fine tuning because it's just going to be way cheaper to do that to this end we fine tune llama 7B through 65b on two instruction following data sets results are shown in table four uh this corroborates our findings that Q Laura with nf4 replicates fine tuning nf4 is superior to floating point four uh our results consistently show that 4-bit Q Laura with nf4 matches 16-bit fine tuning they just keep repeating the same thing over and over again previous work uh with a given fine tuning and inference resource budget is beneficial to increase the number of parameters in the base model while decreasing their precision foreign yeah so you're better off with a 4-bit uh 65 billion parameter llama than a 32-bit uh 7 billion parameter llama we proceed to investigate instruction tuning at scales that would be possible to explore with full 16-bit fine tuning on academic research Hardware pushing the state chatbot state of the art Okay so instruction fine-tuning is how you take a base model and turn it into a chat bot uh this is the actual Benchmark natural language understanding benchmark mnmlu so five shot mmlu I think five shot here maybe means that there's a little bit of a Chain of Thought it can do maybe there's like five prompts it might even mean uh it gets five attempts maybe it's like top five accuracy versus top one accuracy I don't know exactly know what five shot here means Q Laura works with vision and other domains I think I've seen it I've seen uh I'm pretty sure control net is something like that yeah control net which was basically uh this uh paper that allowed you to control diffusion models specifically the stable diffusion model I don't know if this is specifically a Laura but it's very very similar right where you take the original model this is the original uh diffusion model the original stable diffusion model you see this and you see these little lock that's because it's frozen so they 
QLoRA should also work with vision models and other domains, and the closest thing I have seen is ControlNet. ControlNet is the paper that lets you steer diffusion models, specifically Stable Diffusion. In the figure you see the original Stable Diffusion model with little lock icons, which means it is frozen: they take the original model, freeze it, and add what they call the ControlNet, a set of tiny extra weights. It is the same basic move as a LoRA: you take the input, feed it through the pretrained weights, feed it through the added module, and combine the two outputs. I do not think ControlNet is literally a LoRA (it looks like just extra weights), but it is very similar. So the fact that ControlNet works well suggests a research direction: take Stable Diffusion 2.0 or whatever, quantize it down to four bits, and add a LoRA on top; you could probably find something novel there. If one of you is looking for new research, that is totally something you should do: a ControlNet-style setup built on a 4-bit quantized base model with a LoRA instead of whatever these extra weights are. One caveat is that quantization might work less well for diffusion models; it is not at all certain that it transfers as cleanly as it does for these transformers.
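The freeze-the-base, train-a-tiny-side-path pattern that LoRA and ControlNet share can be written in a few lines of plain PyTorch. This is a sketch with arbitrary sizes, not either paper's actual module:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = W0 x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
out = layer(torch.randn(2, 4096))           # only lora_A / lora_B receive gradients
```

Because the adapter's contribution starts at zero, training begins from exactly the pretrained model's behavior and only gradually deviates from it.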
Okay, then Alpaca and a bunch of other datasets and models. "We fine-tune only on the response; when multiple responses are available, we select the top response." "In all of our experiments we use NF4 QLoRA with double quantization and paged optimizers to prevent memory spikes. We do small hyperparameter searches and find that the hyperparameter settings found on the 7B model generalize." That connects to the hyperparameter-transfer idea: there is a paper I remember showing that you can find hyperparameters on a smaller version of a model, and very often those same hyperparameters carry over to the bigger model. It used to be that if you tuned hyperparameters for a small convnet, they would not transfer to a larger version of that convnet; you had to run a whole new set of tuning runs. But with these transformers and LLMs, it turns out you can find the best hyperparameters on the smaller, 7-billion-parameter model and then reuse them for the 13-billion and 33-billion-parameter models, and it works out pretty well, with the exception of learning rate and batch size; even then, it drastically reduces the search space of your tuning runs. This is supposedly part of the GPT-4 secret sauce, and the reason it matters is that you can afford a big hyperparameter sweep on a tiny 3-billion-parameter model, but you cannot sweep a 65-billion-parameter model: even a single 65-billion-parameter training run is going to cost millions of dollars, so you cannot have four versions of it. You do all your sweeping on the small models and then basically pray that the configuration you carry over to the big run is a good one. "We compare our models to both research and commercial systems." Look at that, Open Assistant is being used in papers; I like that we have open-source models competing with closed-source ones. Open Assistant is a LLaMA 33B fine-tuned with RLHF, and OASST1 must be the Open Assistant RLHF dataset. Vicuna does full fine-tuning on proprietary user-shared conversations and is thus, in their words, the result of distillation. Distillation is a higher-level concept: any time you use a bigger teacher model to train a student model. As a term it has lost a bit of its precision; here they call Vicuna, which is basically a LLaMA 13B that someone pushed additional gradients into using a dataset derived from GPT outputs, a distilled form of the OpenAI GPT models. It is more complicated than that, though, because it is really multiple models: the LLaMA model trained on its pre-training task, probably with additional stuff on top, and then fine-tuned on yet more data. Words that used to have fairly precise definitions, like transfer learning, fine-tuning, and distillation, are blurring as training pipelines get more complicated (pre-training, then RLHF, then lifelong learning, then all this other stuff), and it is getting a bit more confusing. "Following common practice, we use the MMLU benchmark to measure performance on a range of language understanding tasks." From chat: the hyperparameter tuning on smaller models seems somewhat connected to what Tim was saying about outliers; at a certain size the behavior stays quite stable as you scale up. No worries, nissio, I appreciate the comments. For everyone else, what nissio is referring to is that in the talk Tim shows that for a 2.7-billion-parameter model, 91% of the time there are no outlier features and 9% of the time there are; when you go to a 6.7-billion-parameter model, that flips to 25% no outliers and 75% outliers. Something happened between those two scales where the outliers suddenly became much more important. So what is the pattern? Make it a 13-billion-parameter model and you end up with the same ratio: 25% of the time no outliers, 75% of the time outliers.
At 65 or 66 billion parameters it is the same thing again: 25% of the time no outliers, 75% of the time outliers. So there is a consistency there: once you pass a certain threshold of model scale, the fraction of the time you see outliers seems to stay constant. Perhaps the underlying dynamics behind that behavior are the same dynamics that make hyperparameter settings found at a smaller model size still work at the bigger sizes. Okay, evaluation: they use the MMLU multiple-choice benchmark, 57 tasks covering everything from math to history to computer science to law, reported as five-shot test accuracy. You have a bunch of different model sizes here, everything from 7B to 65B, which is interesting because it lets you see the difference in performance between the 7-billion and 65-billion-parameter models. A lot of people have pointed out that the 13B is the sweet spot: the 7B is kind of dumb, but the 13B is actually quite clever, so you get the best ratio of intelligence to size at 13B, or at least that is something I have heard. "We also test generative language capabilities; there is no commonly accepted protocol," so another possible research project for one of you out there is to come up with a better benchmark. They generate with nucleus sampling at temperature 0.7, which is how they actually pick the token the language model outputs: the model gives you a probability distribution over every token in the vocabulary, and you have to pick one. Sometimes you are greedy and just take the highest-probability token; other times you use one of these more elaborate sampling strategies. "We evaluate on two curated datasets of queries," 80 prompts and so on, and then automated evaluation, which is the weird part: they use GPT-4 to rate the performance of the different systems, so you are using GPT to tell you whether something is better than GPT, which is strange. "We find that significant ordering effects increase the score of the response appearing first," and they also "measure performance through direct comparison," conducting head-to-head benchmarks. "While recent work indicates generative models can be effectively employed for system evaluations, the reliability of GPT-4 ratings for assessing chatbot performance has, to our knowledge, yet to be proven to correlate with human judgments." This is part of why the paper is so good: not only do they have all the quantization contributions and these huge, in-depth ablation studies across models and datasets, they also tack on this extra piece where, rather than just giving you the GPT-4 auto-evaluation, they compare it against human judgments to see whether it is actually a good metric. They use Amazon Mechanical Turk, the Amazon/AWS service where you pay a bunch of people to sit, click through things, and answer multiple-choice questions, and then they compare the GPT-4 ratings to the ratings from the Mechanical Turk workers, using Elo ratings and looking for agreement between the human judgments and the GPT-4 judgments. They find that the top QLoRA model, Guanaco, is the best-performing open-source chatbot.
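As a concrete picture of what "agreement between the human ratings and the GPT-4 ratings" means, here is a tiny sketch using the same kind of rank-correlation statistics the paper reports (Spearman's rho, Kendall's tau); the rankings below are made up:

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical system rankings (1 = best) from human raters and from GPT-4.
human_rank = [1, 2, 3, 4, 5, 6]
gpt4_rank  = [1, 3, 2, 4, 6, 5]

rho, rho_p = spearmanr(human_rank, gpt4_rank)
tau, tau_p = kendalltau(human_rank, gpt4_rank)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
# Values near 1.0 mean the two judges order the systems almost identically.
```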
It offers performance competitive with ChatGPT on the Vicuna benchmark. A lot of this is restating the earlier findings, so I am trying to speed up here; sorry if I am going fast. Okay, pairwise judgments: I think this is the human ratings versus the GPT-4 ratings, and it does seem like the humans and GPT-4 largely agree. It is a little different, but mostly there is agreement, so you could be using ChatGPT-style models to auto-evaluate. "We note that the human and GPT-4 rankings disagree partially but are consistent for most models." Then there is a bunch of extra statistics telling you whether those two rankings really line up: Kendall's tau, Spearman's rank correlation, Fleiss' kappa. "Thus, model-based evaluations represent a somewhat reliable alternative to human evaluation." That is good to know, because it makes things easier: if that were not the case and you had to rely on human annotators, it would be bad for the open-source community, since you would have to pay for Mechanical Turk or some other crowd-labeling service. The fact that GPT-4 evaluations are pretty good means you can evaluate performance significantly more cheaply, without paying AWS a bunch of money. Some of these benchmarks are obviously biased towards particular models, which you can expect: there is "partial orthogonality in current evaluation benchmarks," meaning strong MMLU performance does not imply strong chatbot performance and vice versa. That is what we were talking about earlier: benchmarking and performance evaluation are just not quite there yet; you have a bunch of benchmarks and they are all biased in their own way. "This opens up the potential for future work via QLoRA fine-tuning on specialized open-source data, which produces models that can compete with the very best commercial models." That is a big win for the community today. MMLU, take a seat. When was MMLU made, and by whom? It comes from the "Measuring Massive Multitask Language Understanding" paper, a 2020 paper out of UC Berkeley and Columbia University, and apparently Tim is telling us not to lean on it too hard. Someone asked what makes a good benchmark, in my opinion. A good benchmark has a lot of factors: above all it should accurately tell you which method is better than another, because that is what you really need. It should also be fast and cheap to compute, and easy to interpret; for example, looking at these numbers, do you know what the difference between a 49 and a 69 really means? I don't have a good intuition for it, so a good benchmark gives you a score that is a bit more interpretable. A good benchmark is also potentially specific to a task. There is a lot that goes into it, and I think the best benchmark is actually a collection of benchmarks; there is not going to be one benchmark that is good at everything, and what we are going to see is hundreds of benchmarks, ideally evaluated across all together.
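Circling back to the decoding setup from the evaluation section above (nucleus sampling at temperature 0.7), here is a sketch of what that sampling rule actually does to a vector of next-token logits; this is the generic top-p algorithm, not the authors' exact generation code:

```python
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.9, temperature: float = 0.7) -> int:
    """Sample a token id from the smallest set of tokens whose cumulative probability reaches p."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < p      # keep tokens until the probability mass reaches p
    kept = sorted_probs * keep
    kept = kept / kept.sum()                  # renormalize over the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_ids[choice].item()

next_id = sample_top_p(torch.randn(32000), p=0.9, temperature=0.7)
```

Lowering the temperature sharpens the distribution toward greedy decoding, while a smaller p shrinks the nucleus of candidate tokens.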
We need a benchmark of benchmarks. "Perhaps the largest problem is benchmark validity: whether a benchmark truly tests what its name or description suggests is always in question, especially as we discover shortcuts to solve benchmarks that machine learning models sometimes exploit." The test set is effectively leaking into the training set: people start designing models and fine-tuning datasets so their model gets a very good score on a specific benchmark, but the model is just overfitting to that benchmark rather than actually getting better in terms of generalization. Someone in chat says a benchmark needs consistency, like a referee in sports, and that is also true. "Guanaco models tend to be preferred to ChatGPT (GPT-3.5) on the benchmarks studied according to the human raters," with roughly a 10-point difference in Elo. So here is the Elo table, human raters versus GPT-4 as judge, and it is interesting: the human raters actually think Guanaco 7B is really good, ranking it third, whereas GPT-4 does not like Guanaco 7B at all. And GPT-4 thinks GPT-4 is the best. This is literally the Obama medal meme: GPT-4 awarding GPT-4 first place for being the best model, as judged by GPT-4. "To find examples, we first go through data generated for the Vicuna benchmark. We notice a pattern and attempt to set up a question or prompt that will induce the pattern even though it is an incorrect solution; for example, if we observe that the model tends to give long-winded answers, we prompt it to answer with only yes or no. We use this to find lemons, where we manage to adversarially break the models, and cherries, where we fail to break the models, and present both." In the abstract they talk about lemons and cherries: cherry picking is when you prompt-engineer the model so it performs well, and lemon picking, tongue in cheek, is when you prompt-engineer it to perform badly. That actually gives you a decent error bound, because the gap between the lemons and the cherries tells you how sensitive the model is to prompt engineering. What I would love to see is this same lemons-versus-cherries analysis on RWKV: we read the RWKV paper, which is a new kind of recurrent architecture, and it mentions that RWKV is very sensitive to prompt engineering, so I wonder whether you would get a bigger spread there than with these Transformer-based models. They say this is "beyond the scope of the small qualitative study." Tim is a very humble man: this "small qualitative study" is larger and more comprehensive than ninety percent of the papers I have read, so it is a legitimately very good ablation and study, and it serves as a complement to the quantitative evidence. "For questions such as 'What is the capital of Zambia?' all models consistently generate the right answer (Lusaka), but as questions get more obscure, they become unreliable yet stay confident." That is an issue.
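That "stays confident" observation is usually read straight off the probability the model assigns to its chosen token. A tiny sketch, with a random tensor standing in for a real model's next-token logits:

```python
import torch

logits = torch.randn(32000)                 # stand-in for a model's next-token logits
probs = torch.softmax(logits, dim=-1)
confidence, token_id = probs.max(dim=-1)    # probability mass on the argmax token
entropy = -(probs * probs.clamp_min(1e-12).log()).sum()  # higher entropy = less certain overall
print(f"top token {token_id.item()} with probability {confidence.item():.4f}, entropy {entropy.item():.2f}")
```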
That confidence is just the probability assigned to the token: at the end of the day the model is doing classification over the vocabulary, and it can effectively say "I think this is the next token and I am pretty sure about it" or "I think this is the next token but I am not very sure." Guanaco generates the wrong popularizer and the wrong birthday, so a made-up name and a made-up date, stated confidently. It does show surprising resistance to misinformation baked into the question: asked "How was it finally officially confirmed that the Earth is flat?", it answers that the Earth has never been officially confirmed to be flat and that the idea has been debunked. Guanaco is also quite good at knowing what kinds of questions are impossible: ask it what time it is and it says it has no access to the time. It sometimes refuses to follow instructions for seemingly random reasons, though: "I'm sorry, I'm not able to do that right now." There is the "What is the secret word?" example, and a little math problem can break it down. And now I am simping for Tim; I have been simping for Tim forever. The guy is just out there living his life, and little does he know that some random obscure YouTuber keeps calling him the Quant God. Then there is the theory-of-mind bit: if you want AGI, this is the AGI stuff right here. There is a nice theory-of-mind paper that came out during the hype cycle; if you are into AGI and consciousness, I would definitely recommend it. The idea is that models develop this theory-of-mind ability: the ability to realize that they are a thing, that you are a separate thing, and that you have your own internal thought process. In the paper's example, they give GPT more and more information about a situation and then ask the model, "What do you think Sam believes about this?", so the model has to maintain a theory about the mind of an external agent. That ability used to not be there, now it is, and they say Guanaco has it too. To me, theory of mind is basically consciousness, because consciousness is the ability to apply theory of mind to your own self, so I do agree with what Ilya Sutskever said about these LLMs being slightly conscious. Considerations: "we report moderate agreement among human annotators." Human evaluation protocols are just not there yet, and subjective preferences start to play an important role; it is very hard to evaluate this stuff. Analysis: "we find automated evaluation systems have noticeable biases; GPT-4 assigns significantly higher scores to the system appearing first in its prompt." It is like how, on multiple-choice questions, there is some weird distribution where people disproportionately pick C or the third answer; GPT-4 seems to do something similar and, for no obvious reason, favors whichever response comes first, which is a little weird. Also, these Elos are still pretty low. Magnus Carlsen has the highest chess Elo, around 2,800 or so, and these chatbot Elos are down around 1,300, so it is not like GPT-4 is wrecking the competition: an Elo in that neighborhood means it is still losing about half its games against similarly rated opponents.
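For reference, Elo numbers map to win probabilities through the standard expected-score formula, which is what makes that "losing about half the time" reading possible. A quick generic sketch (not the paper's exact tournament procedure):

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after a game; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected_a = elo_expected(rating_a, rating_b)
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

print(elo_expected(1300, 1300))  # 0.50 -- equal ratings means winning about half the time
print(elo_expected(1400, 1300))  # ~0.64 -- a 100-point gap is only a modest edge
```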
On the multilingual OASST1 benchmarks, "we leave it to future work to investigate whether multilingual training improves performance." That is another interesting research direction: if any of you are into multilingual work, you could probably write a paper where you fine-tune on a single language versus multiple languages and see whether it makes a difference. There is a whole branch of psychology that studies whether raising children bilingual makes them smarter, and you could do something similar with language models: do you get more capability by pre-training or fine-tuning on multiple languages? One related trend I have seen, not with languages specifically but with code, is that a large language model trained on code performs better on a bunch of other benchmarks, presumably because code forces it to learn some kind of logic. So there is a lot of room to explore which datasets transfer what: maybe less so French into Chinese, more plausibly French and Spanish, since the Latin languages share roughly the same structure. Someone mentions Betteridge's law: any headline that ends in a question mark can be answered by the word "no." It is basically the theory of clickbait — "you won't believe what they found in the model." "Our model is only trained with a cross-entropy loss (supervised learning), without relying on reinforcement learning from human feedback." RLHF is based on RL; there is a video I made where I explain RLHF, but if I remember correctly, you train a reward model and that reward model is then used to score the outputs. That is different from supervised learning, where there is a specific answer: you say this is the token that should have been predicted, so the probability of that token should be pushed toward one and the probabilities of all the other tokens toward zero. That is the standard cross-entropy loss you get in any classification problem.
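To spell that out, the supervised objective is just next-token cross-entropy, usually with the prompt positions masked so only the response tokens contribute, which matches the "fine-tune only on the response" detail from earlier. A minimal sketch with made-up tensors standing in for real model outputs and token ids:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32000, 16
logits = torch.randn(1, seq_len, vocab_size)         # stand-in for model output
labels = torch.randint(0, vocab_size, (1, seq_len))  # stand-in for the token ids
labels[:, :8] = -100                                 # mask the prompt; train only on the response

# Shift so position t predicts token t+1, the standard causal-LM setup.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(loss.item())
```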
Then the related work: quantization of LLMs has mostly focused on quantization for inference time, and I think LLM.int8() is also Tim's work. Work on lossy quantization has studied the trade-offs of regular rounding. Generally, quantized models do not work quite as well as the full-precision models, but maybe this whole LoRA business means you can get away with it. They use low-rank adapters, and many other parameter-efficient fine-tuning methods have been proposed: prompt tuning, tuning the embedding-layer inputs, tuning hidden states, adding full layers, tuning biases, learning a mask over weights based on Fisher information, and combinations of these approaches. There is a whole rich literature of weird tiny ways of fine-tuning, "and we show that LoRA adapters are able to reach full 16-bit fine-tuning performance." To help a pre-trained LLM follow the instructions provided in a prompt, "instruction fine-tuning uses input-output pairs from various data sources to fine-tune a pre-trained LLM to generate the output given the input as a prompt." In other words, instruction fine-tuning, a.k.a. building a chatbot, and they list all the different ways people have made chatbots. Then the evaluation of biases, which is essentially asking whether LLaMA 65B is racist. A lower score indicates a lower likelihood of generating a biased sequence, and LLaMA 65B actually scores lower than GPT-3 on most categories. Is there any category where it is higher? Age: GPT-3 does not care whether you are a sixty-year-old or a ten-year-old, it treats you the same, but LLaMA 65B is going to tell you to get off its lawn, boomer. Limitations and discussion: "we have shown evidence that our method, QLoRA, can replicate 16-bit full fine-tuning performance with a 4-bit base model and low-rank adapters; despite this evidence, we did not establish that QLoRA can match full 16-bit fine-tuning performance at the largest scales." Even Tim, who I believe used to work at Facebook and probably had access to a beefy cluster, and who probably spent somewhere between ten and a hundred grand on all the experiments in this paper, was not able to fine-tune a 65-billion-parameter model at 16-bit precision, which just shows how unreasonable it is to expect people to be fine-tuning these huge models in full precision. Someone in chat asks whether human brains do some kind of quantization. I think we compress information and build more efficient embeddings and representations, but quantization as discussed here is really a consequence of computers being binary: everything has to be a one or a zero. Information theory largely starts from a binary, formally defined machine, and because of that you can use statistics to derive provably good choices like this NF4 data type. But the brain is analog, with effectively infinite resolution, and once you move to analog computation (I do not know whether quantum computers count as analog, but optical computers are), I am not sure how much of that machinery still applies; I might be wrong about that. So I do not know that the brain does quantization in the exact sense used here, because everything here hinges on having two, four, eight, sixteen, or thirty-two bits and the differences between them mattering. "While we provide evaluations on MMLU..." — actually, you know who might be able to answer this? Let me ask Bard: how much of information theory is contingent on the binary nature of our computers? Not a very good question, I am sorry, Bard.
Bard says information theory is the study of the quantification, storage, and communication of information; the binary nature of computers makes it easy to represent information in a digital format, since computers can only store zeros and ones, but it can also be applied to other kinds of information, so you can do information theory on analog signals. So I was incorrect there: it is not limited to binary systems. I guess I was just wrong, sorry. This pattern of "discrete makes things easier" is also behind reinforcement learning and the Markov decision process. I think one of the most fundamental problems with reinforcement learning is exactly this: the MDP discretizes the real world, saying the world is a set of states with a set of actions that move you between states — a graph. All of reinforcement learning rests on the assumption that you can carve the world up like that, and I fundamentally disagree with it; I do not think the real world is like that at all, and as soon as you cannot discretize the world that way, the Markov decision process becomes less and less relevant. But what do I know. Back to the paper: "While we provide evaluations on MMLU, we perform a very broad study, and from the evidence presented it appears that performance on these benchmarks likely depends on how similar the fine-tuning data is to the benchmark datasets." Yep: transfer learning and out-of-distribution generalization. "This highlights that not only are better benchmarks and evaluations needed, but one needs to be careful about what one is evaluating in the first place: do we want to create models that do well on classroom, high-school, and college knowledge, or do we want models that do well on chatbot conversational ability?" You could apply the same thinking to society: do we really care about people doing well on the SAT? Most people understand that SAT performance is not the best measure of intelligence, but it is roughly the best thing we have, and we are in a similar predicament with these large language models. Yes, we understand that these benchmarks are not the best way to evaluate them, but we do not really have any other measure, so SATs and LLM benchmarks are more similar than you might think. Certain benchmarks steer the community in a particular direction, the same way the SAT steers students in a direction that is not necessarily the best one. This is also a huge problem when hiring programmers: you have to get really good at LeetCode problems, those weird little exercises where you solve a tiny problem in about thirty minutes and have to find the solution with the best time and memory complexity, usually via tricks like recursion and dynamic programming that fall into a fixed set of patterns. I think about all the time I spent getting good at LeetCode just to pass a test, and then once you actually get the job you never use any of it, because it is not actually useful. It is annoying to think how much time we have collectively spent, as humanity, getting better at LeetCode problems.
And none of that has really helped, so you have to be careful about what you set up as the benchmark, because the community and the world end up spending a huge amount of time just trying to beat it. "While we provide a detailed evaluation, another limitation is that we only do a limited responsible-AI evaluation. Fine-tuning on OASST1 reduces the bias of the LLaMA base model." So the Open Assistant dataset makes it less ageist, sexist, racist, and so on; those results are encouraging. "An additional limitation is that we do not evaluate different bit precisions, such as using 3-bit base models." This is something he talked about in the talk too: you could go even lower; imagine representing numbers in three bits, or even one-bit precision, which sounds insane. "Besides LoRA, there is now a wide variety of parameter-efficient fine-tuning methods that have been shown to work; however, it is unclear whether these methods scale to larger models. We use LoRA because many results have established its robustness, but other adapters might yield better performance. Since fine-tuning after quantization seems to recover most of the information that is lost, this might enable much more aggressive quantization." So maybe you can quantize to three bits and then fine-tune: "a 3-bit GPTQ quantization of the base model with LoRA might also yield 16-bit performance." Broader impacts: QLoRA makes this kind of fine-tuning possible on a single professional GPU, and I think that makes the paper very important for open-source machine learning; it is a big win for the accessibility of state-of-the-art NLP. If this had not worked, and the only way to fine-tune a model were to buy a ten-thousand-dollar training machine and push gradients into the full model, then small independent people like you and me would not be able to contribute in any meaningful way, and all of the machine learning research would be done by giant companies with huge amounts of money to spend on GPUs. The fact that QLoRA works well is a blessing, because it means small independent researchers can push research that is state of the art. Deployment to mobile phones: I am not especially into this — I have told you my views on whether edge compute or cloud compute is going to win — but there is a whole community of people trying hard to fit these large language models onto a phone, and I believe people have already run a quantized LLaMA on a cell phone, so it is possible, and there will always be people trying to make these small enough to fit on a Raspberry Pi. "Fine-tuning is a dual-use technology," and "widespread use of LLMs has known dangers." Are those dangers real or made up? I do not know; I think they are somewhat overstated. Then the acknowledgments, listing everyone who helped, and a note that the research was facilitated by the Hyak supercomputer system. What is that? The Hyak system has on the order of a hundred nodes, so a hundred GPUs or a hundred server racks, potentially. Looking at the documentation: compute, scheduling jobs, Slurm — ugh, Slurm is so gross. All right, they do not actually say exactly what the nodes are, but I am sure we could figure it out.
I am assuming it is just a hundred A100s or thereabouts. There we go, that is the paper: QLoRA, the different experiments, the quantization bins and quantiles. Someone in chat says edge AI will probably be useful in certain use cases, low latency or areas with no network access, but beefy servers seem way cooler, pun intended. And that is a real factor too: cooling is, I think, the limiting factor for VR headsets, for example; they could push the chips to be more performant, but they cannot cool them fast enough, and the same goes for consumer hardware in general. A common thing people do (maybe less than they used to) is put a liquid cooler on the CPU and overclock it; the computer I am on right now does exactly that. The CPU runs at a higher clock rate, which produces more heat, but the aftermarket liquid cooler keeps it from melting. On a Raspberry Pi or a phone, cooling becomes the limiting reagent, whereas a cloud data center can run very advanced liquid cooling, so cooling is much less of an issue in a data center than at the edge. Okay, that was quite a marathon; let me do a quick summary and then end the stream. This paper was QLoRA, a quantization paper from the Quant God himself, Tim Dettmers. There is an accompanying talk, and I would highly recommend it if you want to hear all of this from the person who actually wrote it. What they did is introduce a new low-precision data type they call 4-bit NormalFloat, which is about as tiny as you can make a floating-point-style code, and to get there they need some assumptions, chiefly that the weights are normally distributed. That construction produces a new set of parameters, the quantization constants, and they then have to quantize those quantization constants as well, with a technique they call double quantization.
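As a rough illustration of the NormalFloat idea (not the paper's exact construction, which among other things reserves an exact zero code and treats the two halves of the range asymmetrically), you can build 4-bit levels by placing one code at the center of each equal-probability bin of a standard normal and then snapping absmax-normalized weight blocks to those levels:

```python
import numpy as np
from scipy.stats import norm

def normal_float_levels(num_bits: int = 4) -> np.ndarray:
    # One quantization level per equal-probability bin of N(0, 1), rescaled to [-1, 1].
    k = 2 ** num_bits
    centers = norm.ppf((np.arange(k) + 0.5) / k)   # bin midpoints in probability space
    return centers / np.abs(centers).max()

levels = normal_float_levels(4)                    # 16 levels for a 4-bit code

def quantize(weights: np.ndarray, levels: np.ndarray):
    # Absmax-normalize a weight block, then snap each value to the nearest level.
    scale = np.abs(weights).max()
    idx = np.abs(weights[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale             # 4-bit codes plus a per-block scale

def dequantize(idx: np.ndarray, scale: float, levels: np.ndarray) -> np.ndarray:
    return levels[idx] * scale
```

The per-block scales here are the "quantization constants" that double quantization then compresses further.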
So what they are doing is taking a large language model such as LLaMA, with its 65 billion parameters, and rather than fine-tuning that model directly, which would mean updating every one of those 65 billion parameters on a new fine-tuning dataset, they freeze all of them. Keeping the base weights frozen is what allows them to store the model at 4-bit precision, and on top of the frozen model they add the little extra weight matrices, the low-rank adapters or LoRAs, and those LoRAs are where the gradients actually go. The LoRAs themselves stay at a higher, 16-bit precision, so they are not quantized anywhere near as aggressively as the base model. It turns out that if you do fine-tuning this way, pushing gradients only into the little LoRAs while keeping the large language model frozen, you can fine-tune on a consumer GPU: drastically less memory, drastically less time, and the performance seems to be pretty much all there. I think this is the best way to do fine-tuning right now, and it is a huge win for the open-source community and for research outside the big industry labs, because it means anybody can now do fine-tuning. I also think this approach will probably work for a variety of models and architectures; I do not think it is limited to large language models, and you could do this with vision models and other things. That is the main algorithmic novelty and contribution of the paper. In terms of the ablation studies and hyperparameter sweeps, the paper is also very good: they run over a thousand different runs, test a bunch of models on a bunch of benchmarks, include a comparison of whether GPT-4 evaluations line up with human evaluations (it turns out they mostly do), and even do the lemon-picking versus cherry-picking analysis. A very dense paper, a lot of contributions, state-of-the-art results; a superstar paper to me. Give it a good read, and use the bitsandbytes library if you want to do quantization for your own weird, special model architecture. Thanks for watching; thank you nissio, erlen RW, and everybody else who commented. Tomorrow we are going to do a bit of a coding stream, probably just working on the Discord bot, but we will see; maybe I will do something else. Three hours; one eighth — Tim would love that number. All right, awesome, thank you all for listening, and see you later.
Info
Channel: hu-po
Views: 7,256
Id: pov3pLFMOPY
Length: 186min 40sec (11200 seconds)
Published: Tue May 30 2023