Efficient Fine-Tuning for Llama-v2-7b on a Single GPU

Captions
Hi everyone, my name is Diana Chen Morgan and I'm part of the DeepLearning.AI team, bringing you all together for all things AI, community, and events. Today we are very lucky to have a workshop with some special speakers from Predibase. In this hands-on workshop we'll discuss the unique challenges in fine-tuning LLMs and show how you can tackle these challenges with open-source tools through a demo. By the end of this session, all attendees will understand how to fine-tune LLMs like Llama 2 on a single GPU, how techniques like parameter-efficient tuning and quantization can help, how to deploy tuned models like Llama 2 to production with continued training with RLHF, and how to use RAG to do question answering with trained LLMs. This workshop will be recorded and will remain live on our YouTube channel, and you'll be able to access the notebook in the description of the video and in the chat. Everything covered in the workshop is presented as continued education from our existing AI short courses, and we'll be dropping the link to access all of them in the chat.

To start, I want to introduce our first speaker, Piero Molino. Piero is the co-founder and CEO of Predibase. He was one of the founding members of Uber AI Labs, where he worked on several deployed ML systems, including an NLP model for customer support and the Uber Eats recommender system, with graph learning and collision detection. Later he became a staff research scientist at Stanford University working on machine learning systems. He is the author of Ludwig (ludwig.ai), an open-source declarative deep learning framework with 8,900 stars on GitHub. In 2021 he co-founded Predibase, the low-code declarative machine learning platform built on top of Ludwig. Hey, Piero!

Hey Diana, thank you so much for having us.

Yeah, we're so excited to have you here. And for our second speaker we have Travis Addair. Travis is a co-founder and CTO of Predibase, a low-code platform for predictive and generative AI. Within the Linux Foundation he serves as lead maintainer for the Horovod distributed deep learning framework and as a co-maintainer of the Ludwig declarative deep learning framework. In the past he led Uber's deep learning training team as part of the Michelangelo machine learning platform. We are so excited to have them here today and we can't wait to see what they have in store for us. So Piero, take it away.

Thank you so much, Diana. I'm also super excited, and I've been part of this community from the other side of DeepLearning.AI, contributing to it, so this is really great for me. Let's start talking about efficient fine-tuning for Llama 2 7 billion on a single GPU, because it's not that easy to make it run on a single GPU. Diana already introduced us, so I'll skip this slide, but let's talk about what we'll cover. First, I want to give you a little bit of an overview of how I think about fine-tuning and why you would want to fine-tune an LLM to begin with. I will introduce Ludwig, the low-code framework for building custom AI models that I developed when I was at Uber, which is now the foundation of our technology at Predibase. Then we'll look at the challenges of fine-tuning, in particular the memory bottleneck, and how to overcome it through different techniques that squeeze the 7 billion parameters into something that fits into the 16 gigabytes of VRAM of commodity GPUs and hardware. We will cover half precision, quantized training, low-rank adaptation (LoRA), and QLoRA, its quantized version.
We'll also show you a little demo of how to make this work on your own.

So why would you want to fine-tune an LLM to begin with? This is a graphic I created that shows a distribution of AI tasks within organizations with respect to the availability of data. On the left-hand side you can see that if you have a lot of data in your organization, you can go as far as training your models from scratch; it will be pretty expensive, but it's certainly doable. If you have a reasonable amount of data and there are already pre-trained models you can use, you can fine-tune them on your tasks to get really good performance. If you don't have a lot of data, the best choice right now is in-context learning, for instance retrieval-augmented generation using general models. What we'll be focusing on is the fine-tuning part, in particular because it has a really great trade-off in accuracy versus performance and speed with respect to the bigger, more general models: you can take a smaller model like Llama 2 7 billion and fine-tune it on your data to make it perform as if it were a 70 billion parameter model, or even better.

First, let me tell you a little bit about Ludwig, the tool we're going to use to showcase these capabilities. My experience at Uber working on different machine learning applications, for instance the intent classification model for customer support, the fraud prediction one, and the product recommender for Uber Eats that Diana mentioned, was that in all of these cases there was a lot of code that needed to be written, and the development process took a long time, not really for writing the code but for iterating and experimenting; then the deployment of these models also took a long time. I thought there could be a better way of doing this, and that's why I came up with Ludwig, which is an open-source declarative machine learning framework, tailored in particular to deep learning use cases.

What does it mean to be a declarative machine learning framework? It means that you can specify your deep learning pipelines using just a configuration file, and the configuration file only needs to contain information about the schema of your data: what the inputs are, what the outputs are, and what data types are associated with them. In this example you have one input, "sentence", with type text, and one output, "intent", with type category, and this six-line configuration matches 100% of the performance of the intent classification model that I developed when I was at Uber. It's very easy to iterate over these models and get started building models in this way, and you don't need to write low-level machine learning code. At the same time, you have all the expert-level control that someone who has been studying the DeepLearning.AI classes gains: you can change every parameter of your models, from the training parameters to the architectures that are used, all the way down to a single hyperparameter of the architecture, like the activation of a single layer, and the preprocessing parameters. There are more than a thousand different parameters you can change and modify, and this makes it very easy to iterate on your models: you can just change one line in a configuration file instead of changing multiple lines in your code.
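To make that concrete, here is a minimal sketch of the kind of six-line configuration Piero describes. The column names and toy rows are hypothetical; the schema follows Ludwig's documented input/output feature format, so treat this as an illustration rather than the exact config from the talk.

```python
# Minimal sketch of the declarative approach: the whole model is specified by a
# small config. Column names ("sentence", "intent") and the toy data are made up.
import pandas as pd
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "sentence", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
}

df = pd.DataFrame({
    "sentence": ["my driver never showed up", "I was charged twice for one ride"],
    "intent": ["driver_issue", "billing"],
})

model = LudwigModel(config)      # the whole pipeline is built from the config
# model.train(dataset=df)        # pass your full labeled DataFrame here
```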
Ludwig is also extensible: you can write your own Python classes, give them names, and if you use those names in the configuration you can extend the system as much as you like. Advanced functionality like hyperparameter optimization and state-of-the-art models is already baked in, and so is distributed training, all out of the box.

With this kind of approach, people who before were not empowered to build deep learning models could do it by just writing 15 lines of a YAML configuration file. In one day they were able to train a model for website personalization, and these were product engineers at Uber, not deep learning experts, and in one week they managed to deploy it into production. This kind of approach makes it substantially easier and faster to develop deep learning models.

The reason is that at the core there's an architecture that makes it possible to have multiple inputs and multiple outputs of different data types, for instance categories, numerical and binary values, or text, image, and audio, and to produce as output text, numerical and binary values, and so on. The inputs are first preprocessed, then encoded with a piece of a deep learning model that produces vector representations, which are then combined by a single component and provided to the decoding components that produce the final predictions of the different types. You can imagine how mixing and matching different types of inputs and outputs leads to different machine learning applications. For instance, if you have category, numerical, and binary values as input and numerical values as output, you're basically training a regression model. If you have text as input and categories as output, you're training a text classifier. If you have images as input and text as output, you have an image captioning system. You can imagine how many combinations of input and output types lead to different applications, and how flexible this approach is.

Notably, if you have text as input and text as output, you can definitely use an LLM to attack these kinds of tasks, and if you have this kind of data you can use it for fine-tuning those LLMs. Here I'm showing a configuration where you just specify the model type, llm, the base model, which is Llama 2 7 billion, and the inputs and outputs as text features. You can also specify the parameters of the trainer, like the learning rate, the batch size, the gradient accumulation steps, the number of epochs, and a few others like weight decay or any other optimizer-related parameter. Once you write a configuration like this, you can just provide it to the constructor of the LudwigModel object and then call train, providing also the DataFrame containing your data. This will fine-tune the entire model, both the embedding part and the prediction part. But if you wanted to fine-tune a model to perform a new task, basically replacing the head of the model with a task-specific head, you can change the output features to contain, in this case, a category output, for instance a sentiment category, and then specify the large language model as an encoder for your task. The rest, how you create the model and how you train it, is exactly the same.
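Here is a sketch of the two configurations Piero describes, assuming Ludwig 0.8's LLM schema. The field names are best-effort reconstructions (check the Ludwig docs for the exact keys), and the model path, column names, and hyperparameter values are placeholders.

```python
# Sketch of the two setups described above, under Ludwig 0.8 assumptions.
from ludwig.api import LudwigModel

# (1) Text-in / text-out fine-tuning of the whole LLM.
generation_config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    "input_features": [{"name": "instruction", "type": "text"}],
    "output_features": [{"name": "output", "type": "text"}],
    "trainer": {
        "type": "finetune",
        "learning_rate": 1e-4,
        "batch_size": 1,
        "gradient_accumulation_steps": 16,
        "epochs": 3,
    },
}

# (2) Replace the head: use the pretrained model as a text encoder for a
# category output, with the trunk frozen so embeddings can be cached.
classification_config = {
    "input_features": [{
        "name": "review",
        "type": "text",
        "encoder": {
            "type": "auto_transformer",
            "pretrained_model_name_or_path": "meta-llama/Llama-2-7b-hf",
            "trainable": False,
        },
    }],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

model = LudwigModel(generation_config)
# model.train(dataset=train_df)   # train_df: your pandas DataFrame
```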
You could do the same thing but actually freeze the weights of the large language model. If you do that, the advantage is that you can also cache the embeddings, which makes training substantially faster, usually at the expense of some performance, because in this case you are training just the classifier head and not the core trunk of the model.

In all of these cases, though, the problem is that these large language models, even Llama 2 7 billion, are larger than what fits in the memory of a GPU. There are newer and better GPUs like the H100 and A100 that have quite large VRAM, 80 gigabytes, but they are pretty expensive from cloud providers, and there's currently a pretty bad shortage of them, meaning that if you request such a GPU from AWS it may take quite some time to obtain one, if you get it at all. So the better idea is to use commodity GPUs like T4s or the RTX 4080, which is a consumer-grade card you could go out and buy. The problem with these GPUs is that they have only 16 gigabytes of VRAM, so you need to be smarter about what you do in order to fit the models into those 16 gigabytes. That's what Travis will talk about in this section of the presentation, and we'll also show you how to do it within Ludwig.

All right, thank you very much Piero. Hopefully everyone can see my screen okay. As Piero said, the main issue we want to talk about today is how to make the most of very finite resources, which, when we're talking about training large language models, is almost always the VRAM, the amount of memory your GPU card has available. In particular, your T4 or your 4080 typically has about 16 gigabytes of VRAM, and for the purposes of this visualization you can think of every one of these squares as a gigabyte. But if you want to train even a fairly modestly sized large language model like Llama 2 7 billion, the memory requirements end up looking something like this. If we break them down into three buckets, the model parameters, the gradients produced during training, and the optimizer state needed to keep track of the training variables, you find that you need about 28 gigabytes just for the model parameters themselves, an equivalent amount for the gradients during training, and usually about 2x the memory used for the model parameters to keep track of the optimizer state. In total you're looking at about 96 gigabytes more memory than you actually have available. The idea of packing all this into 16 gigabytes might seem like an impossible task on the surface, but that's exactly what we're going to do in this talk: break it down piece by piece until it becomes something very manageable that fits on this single card.

Let's start with the first piece, which is just loading the model in the first place. The very first thing you run into is that you have seven billion parameters, and every one of these parameters is by default represented as a 32-bit floating point value. 32-bit floating point means four bytes, so you're looking at seven times four, which is 28 gigabytes of VRAM. That's the first hurdle we need to overcome: how do we pack the model parameters into a smaller form factor?
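As a back-of-the-envelope check on those numbers, here is the arithmetic as a small sketch; activations are ignored here, and the 2x optimizer multiplier assumes Adam's two state vectors, as Travis describes later.

```python
# Back-of-the-envelope memory math for full fp32 fine-tuning of a 7B model,
# following the breakdown above (activations not included).
params = 7e9
bytes_per_param = 4                       # fp32 = 32 bits = 4 bytes

weights   = params * bytes_per_param      # ~28 GB of model parameters
gradients = params * bytes_per_param      # one gradient per trainable parameter
optimizer = 2 * weights                   # Adam keeps two extra full-size state vectors

total_gb = (weights + gradients + optimizer) / 1e9
print(round(total_gb))                    # ~112 GB total, i.e. ~96 GB more than a 16 GB card
```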
You may be familiar with the fact that there are different ways to represent a floating point number. float32 is the one we most commonly use because it gives you a nice mix of dynamic range, expressed by the exponent (in orange here), and precision, expressed by the significand, also sometimes called the mantissa (in blue). A very natural thing to ask is: what if, instead of using float32, we go down to float16? There's a trade-off: if you just use float16 by itself, you obviously reduce your memory requirements by half, but you non-linearly decrease the range of values you can represent, because you're reducing the exponent. That can lead to a very significant drop in the range of values you can represent, particularly for your gradients, which can lead to vanishing or exploding gradients, which in turn lead to NaNs and things like that. One alternative, for people who have the hardware, is a data type called bfloat16, which stands for "brain float"; it came out of Google and is officially supported on newer generation cards like the Ampere A100s. But for those of us still using T4s and lower-generation hardware, you have to stick to something like float16, so there's a trade-off there. If you go with float16 or bfloat16, that right there allows you to cut the memory in half: 14 gigabytes, and suddenly you can fit everything on your single card with no problem.
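A quick sketch of that range trade-off, using PyTorch's dtype metadata; nothing Ludwig-specific is assumed here.

```python
# The exponent bits set the representable range: float16 overflows where
# bfloat16 does not, which is how large values in the gradients turn into
# inf/NaN during fp16 training.
import torch

print(torch.finfo(torch.float16).max)     # 65504.0
print(torch.finfo(torch.bfloat16).max)    # ~3.4e38, same range as float32

x = torch.tensor(70000.0)
print(x.to(torch.float16))                # inf  -> the overflow that produces NaNs
print(x.to(torch.bfloat16))               # ~70144, in range but with less precision
```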
But very quickly you run into the next problem, which is the gradients. I'm sure most folks here are familiar with the normal stochastic gradient descent training process, but essentially what happens is that during training you run a forward pass over the model, producing what we call the activations, and then you compare those results against the target you're trying to predict; the difference between the prediction and the actual target is used to produce the gradients, which are used for back-propagation and ultimately to update the model. The problem is that the gradients typically use the same data type as the original model parameters, and there's typically one gradient for every model parameter you want to train and update. As a result, any time you're doing normal gradient descent training, you're looking at about a 2x increase, even before any optimizer, in the amount of memory required for training.

So how do we go beyond this? One approach is to ask whether there's something lower than float16 we can use, maybe 8-bit, and at this point a common technique is quantization. By quantization we mean taking a continuous space, like the space represented by a floating point value, and discretizing it into a specific finite number of bins that can be represented as integers, kind of like an index: one equals bin one, three equals bin three, and so on. Naively you could do this by simply slicing up the space uniformly and bucketing everything together, but in practice you find that values are concentrated in certain parts of the range; for example, there is a higher concentration of values towards the mean. If you were to bucket things naively, you would lose a lot of precision towards the mean in exchange for more precision towards the outer edges of the distribution, which is not what you want; you want to retain that nuance as much as possible. So usually you end up calculating some statistics about what each bin represents in terms of the values in it, which allows you to reconstruct the distribution when you want to go back to a floating point value later without losing much of the data. Outlier clipping is another approach commonly used under the hood to avoid shifting your buckets too much based on very extreme values. This seems like a complicated process, but the nice thing about declarative frameworks like Ludwig is that we make it very simple: you just say quantization bits equals eight and, boom, you get all this compression essentially for free.

So let's say you do that. Once again we cut everything in half, so now we need only seven gigabytes to allocate the model parameters and an additional seven gigabytes for the gradients; that means we're only using 14 of our 16 gigabytes. Great. However, the real kicker is the optimizer, and I think a lot of people don't realize this initially, but if you're using a very common state-of-the-art optimizer like Adam (Adam is great, people love it, it has become a de facto standard in the industry), it has quite a high memory footprint that a lot of people don't initially think about, but it's very real. In particular, as I said, you have 2x the number of parameters due to the additional vectors you're keeping track of for the optimizer, and that leads to very significant problems, as you can see here, where essentially all the parameters of the optimizer are completely out of memory.

One question you might have is why Adam uses so much memory. This is the normal update formula for how you update the weights based on the gradients: g is the gradients, the w's are the weights, and m and v are momentum and variance, two Adam-specific quantities that are essentially vectors of the same length as the gradients. We don't need to get into the math here, but the important part is that the momentum vector m and the variance vector v are each 7 billion elements in size, and that's ultimately where the 2x comes from: 2x the model parameters are needed just for the optimizer.
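To see where that 2x comes from, here is a simplified sketch of the Adam update, without bias correction and in plain PyTorch rather than anything Ludwig-specific.

```python
# Simplified Adam update (no bias correction). The point is that m and v are
# full-size tensors, one element per trainable parameter, kept for the whole run.
import torch

def adam_step(w, grad, m, v, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    m.mul_(b1).add_(grad, alpha=1 - b1)            # momentum, same shape as w
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)  # variance, same shape as w
    w.add_(-lr * m / (v.sqrt() + eps))
    return w, m, v

w = torch.randn(1_000_000)       # stand-in for the 7B parameters
m = torch.zeros_like(w)          # +1x parameter memory
v = torch.zeros_like(w)          # +1x parameter memory -> the 2x described above
w, m, v = adam_step(w, torch.randn_like(w), m, v)
```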
So what can we do about this? This is where a technique you might have heard of called low-rank adaptation comes in. Low-rank adaptation, or LoRA, was pioneered by folks at Microsoft Research; I linked the paper down there in the bottom right. We often talk about LoRA as a way to reduce the number of trainable parameters, to reduce the footprint of the model weights when you save them, and to speed up training, and those are all true, but I actually think the most valuable thing about LoRA is that it significantly reduces the number of parameters that need to be tracked by the optimizer, as well as the gradients, and that's what ultimately leads to a very significant memory reduction during training.

As a quick refresher on what LoRA is really doing: the key idea is that when you're fine-tuning a very powerful model like Llama, you don't actually need to fine-tune literally every single parameter. There are typically some layers that are more important than others, usually the parts that do attention, determining which tokens in the sequence are related to which other tokens and in what way. The idea behind LoRA is to take those matrices, for example the queries and keys, and inject another, lower-rank matrix beside each of them that initially acts somewhat like an identity function, so it's not really doing anything. Over time, as you propagate gradients through and update parameters, the only thing you're actually modifying is this ancillary matrix you've added side by side. So the only parameters you need to update, and the only extra thing you need to save, is this little extra matrix, and that additionally means that your optimizer state and gradients are only taken with respect to these parameters and not the big pre-trained weights; the pre-trained weights are essentially frozen. And again, in Ludwig this is all done in the declarative config with a single parameter, which is nice.
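For intuition, here is a minimal, purely illustrative LoRA-style linear layer. It is a sketch of the idea rather than the Microsoft or Ludwig implementation, and the dimensions, rank, and scaling are arbitrary choices.

```python
# Minimal LoRA-style linear layer. The frozen base weight stays untouched; a
# low-rank pair B @ A is learned beside it, with B initialized to zero so the
# adapter starts as a no-op.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable values vs. ~16.8M in the frozen base weight
# Gradients and optimizer state are only needed for A and B.
```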
So let's say we use LoRA and have added these additional low-rank matrices alongside our existing parameters. What does that do to our overall memory footprint? It's not quite as simple as saying the optimizer holds 2x the parameters of the model, because the number of parameters you add with LoRA can vary depending on the hyperparameters you set, like r, the rank of the matrices you're going to add, and which layers you add these low-rank adaptation matrices to. So it can vary a bit, but in practice it can be as low as 0.1 percent of all the parameters in the model, which is huge, and even in a more aggressive case you're typically looking at less than 10 percent, usually around one percent, of the parameters actually being modified with LoRA. So one way we might split the difference is to assume one gigabyte of memory for these LoRA parameters.

Now, importantly, in the formulation of quantized training, and this was in the QLoRA paper we'll talk about, you quantize the original model parameters, the ones in orange here, so those are represented in int8, but the LoRA parameters are typically kept in fp16 or fp32; here we're doing fp16. That's because these are the parameters you're training, so you typically want a bit more precision on them, and since there are so few of them it doesn't make much difference to the memory overhead anyway. So we have 16-bit LoRA parameters plus the gradients for those 16-bit parameters, so 2x, plus the optimizer state, and the optimizer state here is usually fp32, which is why it's not just 2x the number of LoRA parameters but 4x: 2x the number of parameters, times two when we go from fp16 to fp32. Back-of-the-envelope math, you might be looking at about four gigabytes of optimizer state just using LoRA and int8.

But there's still one more problem, which is the activations themselves. When you're doing the initial forward pass, the memory overhead from the activations, what comes out of each layer, scales with the size of the largest layer in your network, times the batch size, how many examples you're updating at once. Again, this is not a hard and fast rule: if you increase your batch size you'll have more activation memory to deal with, and if you change some layer parameters you might have more activations to deal with, but back-of-the-envelope you might be looking at about four to five gigabytes of memory from activations. So that's potentially another big thing that will cause out-of-memory errors even if you get everything else under control.

So what can we do? How about one more trick: can we go from 8-bit quantization all the way down to four bits? This was one of the most interesting and groundbreaking things that came out of the QLoRA paper, so credit to Tim Dettmers; I linked his tweet there for the original image source, but definitely check out the paper that he and others wrote on QLoRA, which I think has been a really big game changer for this efficient fine-tuning approach. The way the technique works is that you go from full fine-tuning to LoRA, which only adjusts the adapter weights, and then with QLoRA the base model parameters are represented with just four bits, because you're only doing the forward pass on them; you're not updating those parameters at all. The adapters are still in fp16, still 16-bit, and then you have the optimizer state, which is fp32. One additional thing they did in the QLoRA paper to reduce memory pressure, which we won't get too much into here, was to introduce a paged Adam implementation that offloads the optimizer parameters to the CPU when needed, reducing the effect of memory spikes during training, and that effectively gets this whole thing down to a very small form factor that can train on very commodity hardware. But specifically we'll focus on the 4-bit part here for the purposes of the diagram. The way they did it was to introduce a new data type they called NormalFloat4 (NF4); I definitely recommend checking out the paper for the details, but effectively they defined a kind of hard-coded bucketing system, where the main idea was to avoid the problem of not having a way to represent zeros, which they worked around pretty cleverly. With all of this, you go from int8 down to NormalFloat4, and now seven billion parameters can fit in just 3.5 gigabytes of VRAM.
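As a rough sketch of what 4-bit NF4 loading looks like with the Hugging Face and bitsandbytes stack that the QLoRA paper builds on: this is not Ludwig's internal code (in Ludwig it sits behind the quantization config entry), and it assumes you already have access to the gated Llama 2 weights.

```python
# Illustrative 4-bit NF4 loading with transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # base weights stored in 4 bits
    bnb_4bit_quant_type="nf4",               # the NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,    # dequantize to fp16 for the forward pass
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters are then added on top in fp16 (e.g. via peft), and a paged
# optimizer keeps optimizer-state spikes off the GPU, as described above.
```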
The LoRA params and the gradients are still the same in fp16, your optimizer state is still the same here in green, and then finally you have your activations in red (and apologies that this is five instead of four green squares), which are now at about three gigabytes, because when you reduce the model parameters you're also reducing the size of the activations themselves, which is a nice benefit. All in, you might be looking at only around 13.5 gigabytes of peak memory usage.

Let's talk about one last thing, which is your batch size. Typically during training we try to pack multiple sequences into a single batch. One of the reasons to do this has less to do with training efficiency, because when you're training on small hardware you're lucky to get one sample in a batch at a time, let alone two or four, but one downside of a tiny batch is that it can increase the variance of the training process in a way that's not desirable. Think back to CS229 or the basics of training: on one end of the extreme you have pure stochastic gradient descent, where one sample is one update, and on the other end full-batch gradient descent. One reason we don't like to do pure stochastic gradient descent is that it's a very noisy walk through the optimization space; you often take lots of small turns in different directions, whereas as you increase the batch size you take bigger steps and it smooths out the process. Sitting in a nice middle ground between big smooth steps and jerky small steps is one of the arts of deep learning, but in general you do want to hit a sweet spot in the middle, which is why we often use batch sizes around 32, 64, or 128. The problem is that if you're only able to pack a single example into your model at a time for training, you end up doing this very janky stochastic gradient descent thing.

So how do we work around that? This is the last technique I want to talk about today: gradient accumulation. The key idea behind gradient accumulation is that you can get the effects of training with a larger batch without the additional memory overhead. The way it works is that you do a normal forward pass (here with a batch size of two; you can ignore the fact that this is a multi-GPU example and just generalize it to one GPU), then a backward pass, but then instead of updating the model params you store the gradients in a buffer, do another forward pass and another backward pass, and sum those together into a single vector, which, if you're doing distributed training, you would then sum once again with an all-reduce, or continue accumulating for subsequent steps. The effect is that you get a larger effective batch size without having to increase the memory overhead of doing it all in one step: the total update you apply to the model is the sum over however many accumulations you did, and that can help smooth out the training process and lead to better model convergence.
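A generic sketch of that accumulation loop, written as plain PyTorch with a toy model rather than Ludwig internals:

```python
# Gradient accumulation: gradients from several micro-batches add up in the
# .grad buffers, and the optimizer steps only once per "effective" batch.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                          # toy stand-in model
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = [(torch.randn(2, 16), torch.randn(2, 1)) for _ in range(64)]  # micro-batches of 2

accumulation_steps = 16                           # effective batch = 16 x 2 = 32

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    loss = loss_fn(model(inputs), targets) / accumulation_steps  # keep the update scale comparable
    loss.backward()                               # gradients accumulate in the .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # one update per accumulated "large" batch
        optimizer.zero_grad()
```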
Putting it all together, this is how the whole thing would look in a Ludwig config, as Piero showed you before. We say our model type is llm, meaning a text-generating large language model; we use Llama 2 7 billion as the base model; we use LoRA for the low-rank adaptation and quantization in four bits; we specify the prompt template, which is how you define your task given the input columns of your data set; we specify that this is a text-to-text model, so the input and output features are just text; and then our fine-tuning parameters are a certain learning rate, some number of epochs to train for, and an effective batch size of 32. That's effectively it: with that you can just build a Ludwig model from this config, train it on your data set, and you're off to the races.
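A best-effort sketch of such a config, written as YAML and loaded into Ludwig. The field names follow Ludwig 0.8's documented LLM schema as I understand it, and the prompt wording and hyperparameter values are illustrative rather than a copy of the actual notebook cell (newer Ludwig builds also expose the effective batch size setting mentioned above).

```python
import yaml
from ludwig.api import LudwigModel

config_yaml = """
model_type: llm
base_model: meta-llama/Llama-2-7b-hf
adapter:
  type: lora
quantization:
  bits: 4
prompt:
  template: |
    Below is an instruction that describes a task, paired with an input that
    provides further context. Write a response that appropriately completes
    the request.
    ### Instruction: {instruction}
    ### Input: {input}
    ### Response:
input_features:
  - name: instruction
    type: text
output_features:
  - name: output
    type: text
trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  gradient_accumulation_steps: 16
  epochs: 3
"""

config = yaml.safe_load(config_yaml)
model = LudwigModel(config)
# model.train(dataset=df)   # df: the Code Alpaca DataFrame with its split column
```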
In the next section I'll show you a Colab notebook that you can run yourself, which does all of this on a single T4 GPU, which you can get for free in Colab. Before that, let me summarize what we covered, plus a few techniques we didn't cover today. Half precision reduces the memory footprint of the model parameters, the gradients, and the activations. If you want to go beyond reducing precision, quantization gives you the same effect in a more aggressive manner. Low-rank adaptation helps you reduce the footprint of the gradients as well as the optimizer state. Gradient accumulation helps you reduce the footprint of the gradients and the activations. And then a couple of techniques we didn't talk about today: paged optimizers, from QLoRA, reduce the memory footprint of the optimizer, and one last one worth checking out if you're still running into issues is gradient checkpointing, which reduces the memory overhead of the activations even further by recalculating them during the backward pass instead of storing them all. So there are lots of different techniques available to you in Ludwig, and I definitely recommend checking it out.

To motivate this a bit, let's jump into a hands-on tutorial; this link shows you how to get the notebook yourself, so please give it a shot and let us know how it goes. The motivating example is using LLMs for code generation, and I think it's particularly timely because Meta, of course, recently released a new set of code-generating models fine-tuned from Llama 2. You might naturally ask how you could have done the same thing yourself, and the purpose of this demo is to show you that it's really not that hard: with a good data set, and a relatively minimal amount of it, you can build a custom code completion model that's bespoke to the type of code you write, tailored to you and your use case as opposed to a general-purpose one. We're going to frame it in this alpaca style, using the Code Alpaca data set, where you have an instruction like "create an array of length five which contains all even numbers between 1 and 10" and the idea is that the model should produce a response like array = [2, 4, 6, 8, 10].

As you recall from the slides, Ludwig lets you train models with a very simple YAML-based configuration file. To get started with Ludwig you just download it with pip install; here I'm building off the main branch of Ludwig because we're trying out a couple of newer features like the effective batch size, but the latest release, Ludwig 0.8, also contains all of these LLM capabilities. One thing that's definitely important to call out, because a lot of people run into this issue with Llama 2: while Llama 2 is freely available for commercial use and the weights are readily available online, it is gated in the sense that you need explicit approval to download the weights. The way it works, as described here, is that you go to Hugging Face's website, get an API token that identifies you, and request access to the Llama 2 weights; it should be a pretty quick approval process. Once you have that, you just plug in your API token when prompted, and you can start downloading the weights.

From there we take this code generation data set, the Code Alpaca 20,000-example data set, and do a little bit of preprocessing because we want to create a custom training and test split. When we print out the first 10 rows, you can see the format: an instruction like "create an array of length 10" and then the output, which is what we expect the model to produce. We also have a column called split, which is zero if the row is in the training split, two if it's in the test split, and so on. You'll also notice an optional column called input, used when the instruction is more like a template. For example, the instruction here is "write a replace method for a string which replaces the given string with a given set of characters", the input is something like "string = hello world, replace with greetings", and the output is a function that does exactly that operation. This is all a very standard process for the alpaca-style data sets out there.

One other thing to point out is that the distribution of the data here is actually pretty good. One of the questions that came up was how to get a good data set and make sure the model learns something instead of just producing garbage, and it's very important in all model training, fine-tuning included, to have a good data set. Here's a quick breakdown of what we're looking at: the number of characters in the instructions, in the outputs, and in the inputs. You can see there's a pretty high bias towards smaller snippets of data, but having that variation ends up being very important, and there's still a nice enough distribution for the model to handle future unseen scenarios. So, a general comment: when you're fine-tuning, make sure you get as much breadth as possible so that you're not in a situation where your model overfits; standard best practices still apply.
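As a small illustration of the data shape described above, with the numeric split column Ludwig expects: the actual loading and preprocessing of the Code Alpaca file lives in the notebook, so the row and the random split here are just placeholders.

```python
# Alpaca-style rows: instruction / input / output plus a "split" column
# (0 = train, 2 = test), as described in the walkthrough.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "instruction": ["Create an array of length 5 which contains all even numbers between 1 and 10"],
    "input": [""],
    "output": ["arr = [2, 4, 6, 8, 10]"],
})

rng = np.random.default_rng(0)
df["split"] = np.where(rng.random(len(df)) < 0.9, 0, 2)   # illustrative 90/10 split
```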
Now, one thing that's really cool about Ludwig is that we can do zero-shot and few-shot inference as well, to get a sense of how the model does even without any fine-tuning. Zero-shot learning, for those who aren't familiar, is basically where you tell the model, in plain English or whatever language the model was trained in, to do the thing you're asking it to do, for example "write a function that reverses a string". I don't want to get into too many details in the interest of time, but this is all done in Ludwig through the prompt template, which lets you specify the instruction. In a zero-shot way I can very simply say "here's an instruction paired with an input; write a response", pass in the instruction and the input, which are both columns in the data set, and then specify the generation parameters, and it will go ahead and evaluate the model on this data set without any training at all. So this is a zero-shot run, and if we look through the outputs you can see, for example, the instruction "write a function to remove all whitespace from a given string", and the model outputs "my string hello world". That's actually not the worst thing in the world: it does understand that there is this thing, remove whitespace, that is a function. However, it's not really doing the task as we requested; it's just outputting something that looks appropriate from a syntactic or semantic standpoint but not from a correctness standpoint. And this is the main motivation behind fine-tuning: these large language models out of the box do a good job of comprehending language, but they don't do a good job of following specific instructions for specific tasks without fine-tuning. That's the goal: we want to go from something that outputs English-looking text that only seems right until you squint at it, to something that is actually correct, like the response array = [2, 4, 6, 8, 10] for the task of creating an array that contains all even numbers.

I'll let you read the details on your own if you're interested in diving deeper, but effectively we can now specify the full fine-tuning config, which is essentially these three additional sections: we specify LoRA as the adapter, quantization, and then our fine-tuning trainer, which consists of all these parameters. Most of them are just defaults, but they're spelled out here to hark back to what we talked about in the slides: we use gradient accumulation, we set a very low learning rate, we use Adam as our optimizer with some of its default hyperparameters, and lastly we do a little bit of learning-rate warm-up to help smooth out the learning process. Then we click train and the training process starts. I'm not running it here in the interest of time, but in this example we've got about 100 data points and it takes around five minutes to run on the T4.
At the very end you get some metric output: you can see the loss is in fact decreasing over time, which is what you want, along with some additional metrics like perplexity, sequence accuracy, token accuracy, and so on. Lastly, we can take the model we've trained and produce some sample inputs and outputs, and what we see is that even though we've only trained on 100 samples, which seems like a very low amount, we're starting to get something that looks pretty decent: "create an array of length five which contains all even numbers between 1 and 10", and the output here is a variable assigned 2, 4, 6, 8, 10. It's starting to look pretty good. What you can tell is that the model actually already knows quite a lot about what it needs to do; it just didn't know how to do it, and that was the purpose of this instruction tuning: teaching it how to structure its output given what it already knows, namely the concepts of even numbers, functions, Python code, and things like that.

So that's essentially it. I encourage you to go and try it yourself, and once you've trained the model, these adapter weights are compliant with the existing Hugging Face API, and Ludwig makes it very easy to upload the adapter weights back to the Hugging Face Hub, so you can serve these models, share them with your friends, and so on. Finally, if you want to go beyond training on a single T4, you might ask how to train the 70 billion parameter models, how to do distributed training, and things like that. Ludwig certainly makes that easy as well; we've got a lot of good resources on training with DeepSpeed and training on Kubernetes clusters, and if you want a managed platform, that's what we're building at Predibase, so I'm happy to chat anytime about that too. With that, I'll hand it back to Piero to close things out, and I'm happy to take questions at the end as well.

Sorry, I was muted for a second. Thank you very much, Travis, for this overview. Looking at what people were telling us and asking in the chat, there was a lot of excitement for this. I want to really quickly recap what we've covered and give you a little call to action. First of all, we covered why you would want to fine-tune your own models, your own LLMs; then we gave an introduction to Ludwig with all the capabilities around declarative configurations for building your own machine learning models; we addressed the memory bottleneck with all these techniques to squeeze the 7 billion parameter models into something that works inside 16 gigabytes of VRAM, like half precision, quantized training, low-rank adaptation (LoRA), and QLoRA; and hopefully the demo was interesting. The very last slide here is: check out Ludwig. We have a lot of documentation on ludwig.ai, you can find it on GitHub, there are around 9,000 GitHub stars and more than 100 contributors, so if you want to participate in the community, we also have a Slack channel you can join. It's all open source and backed by the Linux Foundation.
And if you're interested in customizing your privately hosted LLMs in the cloud with a platform that supports you along the way, check out the Predibase free trial; you will not be disappointed. That's all I wanted to cover, and I'm happy to take questions.

Amazing, thank you so much to both Travis and Piero, an amazing session. We have so many questions that we can't get to all of them, but for the first time at one of these events we actually have a good amount of time to go through some, so very well presented. The first question is: what data set preparation methods for fine-tuning do you consider the most effective?

Maybe I can take this one. There are some finicky aspects of fine-tuning LLMs, from the data perspective, that may not be super apparent from the get-go, and addressing them is important. One thing you definitely want to do when you prepare your data is make sure you don't repeat prompts: there should be one prompt and one answer, or one continuation, for that prompt. The reason is that, because of the relatively small size of these fine-tuning data sets, repeated prompts with different answers may be confusing for the model. There's also a lot to be said about how to tokenize the data and add specific separators; there's quite some literature about it, but in short, when you prepare the data, make sure you leave enough tokens between the inputs and the outputs, and make sure there are separators, for example return characters, dashes, or pound signs, just to make it very clear where the input starts and ends and where the output starts and ends. Also, at the end of the output, make sure you leave those signs or white spaces before and after, because models can be finicky, and if those characters are not present the model can have a hard time identifying when it needs to stop. Follow these basic practices and the fine-tuning will work right away.

Perfect, that leads us to our next question: RAG versus fine-tuning, which strategy works better, and when?

Maybe I can give a perspective on this. Let me share that very first slide with the graph again, because I think it could be revealing; at least from my point of view it's pretty useful. Here we go. This is a bit of an oversimplification, because there are multiple dimensions to this, but I would say the amount of available data is definitely a factor to consider. RAG, in particular for specific tasks like improving classification or improving the generation of summaries and other things, is particularly good when you don't have a lot of data, so you can rely on it as a stopgap solution before you have enough data to actually fine-tune a model. The other aspect to consider is that the delta between fine-tuning and RAG is also in the size of the models: you can fine-tune a substantially smaller model to be very effective, while with RAG you don't have that level of control; you just need to use whatever model is available to you.
There's also a difference in inference time, in particular if you need to add a component for retrieving the data points and examples to fit into the prompt; you can use vector stores, which are pretty fast, but they still add to the latency of every single query you run. And then there's a third aspect: do you need attribution to the sources the information is coming from? For instance, if you're doing a question answering task and just providing an answer is not sufficient, you also want to provide the documents, the links to where that answer is coming from; in that case RAG is the best solution. For the generative model itself you can still use a model fine-tuned on your data, so you can combine fine-tuning and retrieval-augmented generation, but if you need attribution you cannot do away with RAG right now.

One thing to add on, and I think Piero had a really good summary there: one thing we've definitely noticed is people running into gotchas when they try to use fine-tuning to teach models new facts; for example, your fine-tuning data set might be "the president of my company is so and so", and that's one area where RAG currently does a better job than fine-tuning, injecting new information into the model. And sorry, RAG stands for retrieval-augmented generation, which basically means using an information retrieval system, like a database, to fetch context and insert it into the prompt. The idea is that if you want to teach a model new facts, RAG usually ends up being the best way to do it for the cost, because the alternative typically ends up being training a model from scratch, or maybe doing it with successive pre-training, which can require millions or even billions of tokens in some cases. So while fine-tuning is very useful for predictive tasks and domain adaptation tasks, like building code generation, changing the style of the output, or outputting JSON, the teaching-facts piece is probably the one Achilles heel of fine-tuning today.

Absolutely. I think we have time for maybe one or two more questions. What hosting services do you recommend for hosting LLMs, specifically Llama? Obviously Predibase.

Yeah, I mean, we started the company with the idea of providing best-in-class infrastructure for making this whole process of fine-tuning and serving as simple as it can be, eliminating headaches like "this model doesn't work on this GPU" or "this model has out-of-memory errors with this batch size". That's certainly one of the key value props of Predibase, and I highly recommend checking it out if you're looking to solve that pain point.

Absolutely. And let's see: how could we generate embeddings from the model rather than generating new text?

Yeah, maybe I can take this one. What we've been showing was the creation of the LudwigModel object and then calling train on that object.
That same object also has a predict function that makes it possible to run predict on new data, and there is also a collect_activations function where you can specify which layer of your model you want to collect the activations of; by default those are the embeddings you would use for a downstream task, or to put into a vector database, or something like that. So it's just a different call you can make on the same object to obtain the embeddings.

Perfect. Well, I think that means our event comes to a close; we are out of time, and I know our community learned so much, so thank you so much, Piero and Travis. Are there any final words you wanted to say about Ludwig or Predibase?

I would say I really thank you for the chance to chat with this community; I loved all the questions, so really, thank you for that. I would just encourage people to check out Ludwig: it's open source, it's really easy to pip install and try out, and there's really nice documentation online that we put a lot of effort into. If you want to contribute, there's a Slack channel you can join. And check out Predibase; there's a free trial, so you can try it out even before making any commitment.

Perfect. Well, thank you everyone for attending our event. We encourage you to keep learning by signing up for courses and staying involved with the DeepLearning.AI community. We are currently looking for feedback and focus group participants for our courses, so if you're interested in contributing to the community, please be sure to fill out the survey we're dropping in the chat. Otherwise, we'll see you next time. Thanks, everyone!
Info
Channel: DeepLearningAI
Views: 59,557
Id: g68qlo9Izf0
Length: 59min 53sec (3593 seconds)
Published: Tue Aug 29 2023