Building with Instruction-Tuned LLMs: A Step-by-Step Guide

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
foreign [Music] hi everyone and welcome to building with instruction tuned llms a step-by-step guide we appreciate you taking the time to join us for today's event we're so glad to see you tuning in from all over the world if it's late thanks for staying up to join us today during our event you'll learn how to differentiate between instruction tuning and fine-tuning of llms we'll see how to use the latest models data sets and training tools to both perform and struct tuning ourselves and to pick up instruction models off the shelf to build powerful llm applications designed for specific tasks through data Centric fine-tuning techniques we're going to get right into it today if you hear anything during our event that prompts a question then please follow the slido link in the description box on the YouTube page and put yours in also upvote any that you see in there that you'd like to see answered at the end of the presentation my name is Greg Loch Main and I'm the head of product and curriculum at Fourth brain I'm excited to welcome my friend and colleague Chris alexic to the stage today as we'll be working as a team to deliver this lesson during the day Chris is the founding machine learning engineer at Ox get ox.ai and by night he works as a fellow instructor and curriculum developer at Fourth brain he's also a solo YouTube Creator who builds weekly with the latest and greatest generative Ai and llm tools as you'll see him do today one quick note on the flow of today's event I'll be sharing slides and high level Concepts as an introduction to coding demos that Chris will run but first we'd like to get started by sharing a motivating example together instruction identify the odd one out and explain your choice input Orange green airplane Chris can you walk us through the responses that we get here yeah you bet so I so the first you know half year we have are not instruct Tunes that's our base model its response is Orange is the odd one out and the explanation is that orange is the I went out because it's the only one that is not a plane which is kind of uh nonsensical and uh doesn't doesn't really help us at all and then for our instruction tune model we have a much better response which is response airplane is the odd one out and then it gives some cogent kind of uh explanation here airplane has nothing to do with the color spectrum it uses aerodynamics and other techniques to fly it is not on the ground and it's made of metal and durable materials uh so it's the the improvement's pretty pretty clear between the two substantial Improvement on this and many other questions that we can get with instruct tuning super cool to see this we're going to see lots of examples when we see you back in the demo that's coming up in just a few minutes Chris see you soon so today I want to provide a little bit of context before we jump right into the demos we're going to talk about the big picture of llms and the big picture fine tuning versus instruction tuning then we're going to see some instruction tuning we're going to see how to leverage some open source state-of-the-art models and data sets as well as fine-tuning techniques and then we'll also see how we can take an instruct tuned model off the shelf and then fine tune that model and really dial in the input output schema to make it very good at a specific task okay so the context here is that if we look at the lineage of gpts if we look at open ai's GPT gpt2 3 all of the llms that we've seen they're really built on this Foundation of unsupervised pre-training this is really leveraging a self-supervised approach where there are no labels to the data and we're essentially scraping in a ton of information from the internet and beyond that unsupervised pre-training First Step then we do some supervised fine-tuning that's what allows us to really dial in how good we are at these benchmarks and the benchmarks are the classic NLP benchmarks they're the question and answering the information retrieval the sentence completion and what we've seen over time is that if we just use more compute and more data we can do better and better and better at each Benchmark that exists and even include more benchmarks and get better at those as well so this is sort of the big picture of the llm now when we go use the LOM we want to make sure that it's good not everything anymore not at every Benchmark but at one specific thing and that's where we get into this idea of prompting or prompt engineering this idea of zero shot learning is simply writing a prompt and trying to get the llm to respond to that prompt as we saw in our motivating example a few shot learning technique would be giving the llm a few examples maybe one maybe several that would all be under the guise of few shot learning and then when we talk about fine-tuning in this context we're talking about we're talking about fine-tuning for a specific task so this time maybe we have many examples maybe it doesn't fit in the context window anymore and what we're really doing is we're changing the input output schema so it's no longer a model that does everything but instead it's a model that does one thing and this is sort of another way of thinking about getting started building applications whether we're using xeroshot or whether we're using few shot we're really exploring that space of all possibilities within the llm this is cheap and this is quick and this is a great way to just develop your MVP now as you want to really explore that particular area of the llm the latent space within that area of the llm what you're doing is you're carving out a region within the llm that you want to work within and so there's a lot of ways you might visualize that I offer one here that was created with mid-journey so if you want to sort of imagine that you're in this crazy space and you're trying to find one area of it that your application resides in that's really what you're going to do when you're dialing things in with the fine tuning on top of the llm okay so to sort of wrapping up this idea of instruction tuning versus fine tuning before we get into it we're really asking what task do we want our llm to have super powers on that's what this fine tuning of the input output schema when we're building apps is doing whereas when we're talking about supervised fine-tuning done on large language models in general that's more akin to the instruction tuning that we'll see today whereas instruction tuning is really a way of doing better not just on the classic benchmarks but on being aligned with humans being aligned with what we expect when we give instructions or Direction and importantly being aligned when it comes to bias truthfulness toxicity and other metrics and evolving benchmarks that we see coming onto the scene all the time so we're really saying okay I want to develop my llm so that it could be any superhero aligned with any task that I might want to help a human out with that's what we're doing with this instruction tuning and then with the fine tuning of the input output schema when we're taking that General and making it specific we're now saying we want to Define how a user will interact with our application and we want to make sure that it's really clear what that interaction is like and what that superpower and what that single task that we want to get really good at is so in short instruction tuning enhances supervised fine-tuning it's really a subset of all possible fine-tuning it's focused on alignment with humans and it's concerned with following instructions and then fine-tuning that input output schema is all about dialing things in for a specific task we will see both today and first off we're going to start with instruct tuning now we picked a couple of pieces of technology that are particularly relevant today that we want to show you how to do what how to work with Dolly 15K is a data set that was released with Dolly 2.0 the open source llm and it contains 15 000 high quality human databricks employee generated prompt response pairs this structure here is instruction context response and each of these pieces of data falls into a different category of instruction so you might see creative writing closed or open question answering summarization and on and on we see a quick brainstorming example of what to put in our PB J if we don't have any J and these can also be used commercially we can leverage Dolly 15K we can leverage Dolly 2.0 for commercial purposes what we're going to do is we're going to leverage open Llama which is a reproduction by openlm research out of Berkeley that essentially recreates metas llama which stands for the large language model meta AI as of May 22nd just a little over a week ago they released a new checkpoint for the open Llama 7 billion parameter model they expect their full 1 trillion token training run to be done here pretty soon this is trained on the red pajama data set the full 1 trillion tokens is associated with that data set and this again is a commercially available and can be used for commercial purposes model so we're going to use the open Llama model in addition to the dolly 15K data set and interestingly today what we're going to do is we're going to leverage a really new Edition of how to do fine-tuning very efficiently and this is called Q Laura so this is taking sort of this big idea that is sort of an emerging best practice with llms that Downstream tasks meaning more specific tasks have intrinsically lowered Dimensions than the higher Upstream tasks so all possible Benchmark tasks a lot of Dimensions one specific task not very many dimensions so when we're doing fine tuning we can get away with a lot less compute and we're going to Leverage bits and bytes a library we'll hear more about for quantization we're going to leverage hugging faces parameter efficient tuning methods and we're also going to know because we'll come back to this in the second demo that Q Laura it was released on May 23rd so just again a little over a week ago is actually an improvement over the Laura method the low rank adaptation of large language models method for training and so Chris is going to walk us through all of this today he's going to talk about quantization he's going to talk about how to think about each of these libraries and you're going to get a real feel for what's happening at the state of the art in code so with that we're going to drop the link in the chat for you to follow along as Chris shows us how it's done Chris hey thanks Greg So today we're going to be looking at exactly what Greg said which is supervised instruct tuning uh open Llama using the dolly 15K data set we're going to leverage a few very powerful libraries to get this done uh effectively and efficiently the whole fine tuning process can actually be run in collab with collab Pro that's kind of the uh the amazing power of especially Q Laura so let's just uh start going through this and and see how how we can instruct tune as Greg said that's a instruct tuning is a subset of fine tuning so we're using supervised fine tuning to let our model get better at instruction following tasks first things first we got to get our dependencies as always we will want to ensure that we're pulling right from the githubs for these dependencies since uh some of these libraries are pretty new and they're evolving very quickly they're resolving bugs they're adding new features so pulling Straight From the Source is just going to be easiest to make sure we have the most up-to-date version uh in order to get our model to get better at instruction following we have to show it a lot of instructions and we've chosen to show it instructions from the dolly 15K data set now this data set is full of 15 000 instructions uh it they have three possible fields of a context a instruction and a response they also have a category column which we won't be leveraging today but the idea is that this method of fine-tuning for instructions works better if we let the model see many different varieties of tasks and so the category shows us tells us which task that particular row belongs to our instruction is straightforward enough it says what to do context is an optional field and then our response is the output there are 15 000 rows in this data set however some of them are very long too long even to be useful so before we really get started we want to just pair the data set down to sequences of more reasonable length and so we go ahead and do that just using the select method from our data sets Library we pair that down to 14 000 rows and we see it has a much more manageable range of a sequence lengths um after that we're just going to do our train test split we are going to use this test split for evaluation not for testing but we have 1.4 K rows worth of evaluation and 12K worth of actual Trading now in order to get going we have to convert our uh instructions from these three separate columns into one unified column that follows our desired prompt template so that's all this formatting function is doing we're basically converting it into two options one is the context option and one is the without context option very straightforwardly the reason we do this is because the library we're relying on to do the supervised fine-tuning expects it to be in this format um so that's the reason we have to convert it we just mapped this function over our data set and then we can see that we have this new column called text which contains our full text instructions so you see here below is an instruction that describes a task Write a response that appropriately completes the request we have an instruction and then we have a response there you go now that we have our model in a more reasonable state or our data story in a more reasonable state that the trainer we're using expects we can go on to start setting things up on the model side uh we're going to be using as Greg said the open Llama 7 billion 700 billion token preview um so uh you know as stated this is being worked on constantly so you know next week there's probably going to be the one trillion version out uh the idea that I want to focus on here is that you can run this whole example in collab Pro you don't need anything more than that uh and that's to fine-tune a 7 billion parameter model you can definitely get even more extra than that if you wanted to you can go up to 33 billion parameters but for the sake of this uh demonstration we're just gonna stick we're just going to stick with the 7 billion parameter model um and then once we have our model picked we need to set up our uh loracon fig now as Greg expressed we're using Q Laura the actual Laura part of Q Laura is the same so we're taking our large weight Matrix that occurs on the attention layers and then we're we're going ahead and we're turning those into smaller matrices that uh that they're you know that can multiply together to represent the larger uh original weight Matrix the reason we do this is because it massively cuts down the number of parameters that we need to train so instead of having like a you know an absolutely huge like thousand by thousand Matrix we're cutting that down to a 16 by thousand and a thousand by 16 Matrix which has a significant uh reduction in trainable parameters uh we have a number of hyper parameters in our Laura config the first is rank um the Laura paper the cooler of paper both talk about rank if you want to learn a bit more we'll talk about it in the next demo as well but for now we're just choosing 16 because it's a kind of a default value there's not a particular motivation here you can play around with all kinds of ranks including rank one uh and and find that the the results are pretty pretty awesome even at rank one uh next these two values we're just kind of leaving as default same with bias this task type is entirely dependent on the the kind of thing that you're training so if you're going to train say for causal which is what we're doing today we just have to have causal LM if you wanted to do sequence to sequence you would want to have to have uh sequence to sequence as the task type so just pay attention to what you're trying to do this dictates what you set your task type to be and then we add the Q to Q lawyer right so this is the bits and bytes config uh that's just uh been put out by Tim detmer's a guy who does bits and bytes uh is the Q Laura paper right so what we're doing here essentially is we're quantizing our uh information down to four bits which is really tiny right so compared to the full 32 bits uh four bits is is like nothing you'll notice a few other parameters here so we have this use double quad as well as this Quant type so the paper introduces uh the idea of this new kind of uh float which is called a normal float which Builds on some other methods and then this double Quant piece enters because when you use this quantile uh quantization method you have some kind of overhead that's introduced uh sorry when you do the uh the the block wise quantization there's overhead that's introduced and this double Quant helps us quantize that overhead so that even our overhead uh is is in a a smaller format and then we have our compute d-type which is our B float or our brain float the idea here is that the model's weights are frozen and they are stored in this nf4 format which is that normal float for um and then when we have to do computation on them we convert them or we unquantize them dequantize them sorry into these this brain float 16 so that we can do our computations you know safely and stably but the idea here is that we shrink we do two layers right we first of all we massively reduce the amount of parameters we're using with Laura then we also massively reduce the size every one of those parameters takes with that uh with that quantization method here next up we just set up our model um we have our model ID which is bloom or sorry which is the open LM model and then we pass it this quantization config so that it's all in four bit we scroll on down we're going to also add our or load up our tokenizer we are going to add this uh special token of pad you don't have to do this for this particular model we've chosen we could use Auto tokenizer but when you're using a llama tokenizer normally you do have to add some extra tokens so we just wanted to Showcase that process if we look at our model we can see everything's in 4-bit we're very happy about that and uh you can see our layers here the Q Laura paper actually trade it uses the Laura technique on all of the layers we're going to not do that today we're just going to use it on Q and V that's the default for Laura again this is just to keep things manageable for collab to ensure fewer people have problems with the uh with the the fine-tuning experience next up is a big piece of technology uh that's fantastic this is our supervised fine tuning trainer uh this is absolutely amazing it's from this TRL uh you know Library which is meant for that reinfor rlhf process however they also have this supervised fine-tuning trainer which works beautifully for what we're doing today so we're leveraging that um it handles a lot of what we need it to handle behind the scenes we don't have to see it which is fantastic all we're doing is passing our original model we're passing our formatted data set we're passing our test set and then we're adding our hyper parameters we are taking advantage of that paged uh Optimizer this is something that helps prevent out of memory um it's a really fantastic thing I definitely recommend reading about it in the in the paper a q Laura paper but it's just another optimization that helps improve our experience and then our learning rate uh you know just kind of setting standards you'll notice we only train this for 5 000 iterations it's not very long the data set itself is you know a 12K examples so uh you'll see all the results we're talking about today we're done with this process so you know imagine if you threw more compute at it and more iterations more time I mean uh it can get it can get crazy and then uh here this is a very important particular uh parameter this is what points at our newly created text column in our data set this is actually used behind the scenes in this sft trainer to create the uh packed supervised data set so this is a critical step it's going to take that row and then process it into uh our supervised data set and it's going to be a packed data set which means that say we have many small examples we stitch them all together in a single uh in a single you know trainable row and then we pass that through the model so it's just to help have more efficient training uh when we're when we're doing this fine tuning process our max length is just dictated by our our tokenizer and our model selection so that's it and then we pass our Q Laura config which is what lets all that magic happen behind the scenes and then we call Dot train I mean this is you know it's it's the it's the part that does the magic but it's probably the least interesting uh you just call Dot train to trains you love to see it uh you'll notice that the training you know trading loss goes down over the course of our training that's to be expected we're training a model um and yeah again this is only on 5K iterations and uh we we still see a a nice little learning curve there next up we just export our model to the hub this is just to make sure that we have it you know model doesn't help if it does nothing outside of the notebook so we want to be able to export it to the hub so that we can leverage it in future projects when we're building some applications with these models we're just going to reload it and then we're going to look at an example so the benefit of instruction tuning right we have the instruction convert the text into a dialogue between two characters and the text is Maria's parents were strict with her so she started to rebel against them when we pass this to the base model we get uh just kind of rubbish output here right we just have it just repeats itself uh it's not doing anything useful it's not productive uh but when we use the instruct tune model we see that it actually delivers us what we would like which is a dialogue between two different characters uh we have this character of Charles and Maria they have a uh you know back and forth that you can understand it's it's in English and uh absolutely fantastic so uh with that example though I'm going to kick it on back to uh to Greg to go through uh some more of the the high level stuff Chris awesome stuff love to see that in struck tuning seeing lots of great questions in the chat definitely drop those in slido and Chris will get them answered at the end um we'll see you back in a little bit to show us how to do some fine tuning of that input output schema Chris thanks a lot man and in case you missed what Chris was kind of going through there some key takeaways are that what we just did was some additional supervised fine tuning on top of that open Llama model we used 15 000 data points from Dolly 15K only 5 000 steps of training and the total cost of this was about 75 Google collab compute units that is less than a month of Google collab Pro this is you know this is basically nothing you know for all intents and purposes of the scale of the things that we're doing here we used four bit quantization and we are going to use 8-bit quantization in the next one using the classic Laura technique and the rule of thumb to take away from this is that you know we saw the conversation with Maria we saw orange green airplane in general these instruction tune models are going to be better to pick up off the shelf when you're building AI applications then the non-instructed base models okay so recall that this instruction tuning was really about following instructions it was really about doing good on new benchmarks and where we're going now is we're going into this space of okay now that I am going to pull an instruction to a model off of the Shelf I'm going to do some additional fine tuning this is now not going to be supervised fine-tuning that we're going to show but rather unsupervised fine-tuning where we're going to fine tune the input output schema and really Define the way the user interacts with our application so we want our application to get really good at one thing this is the idea of fine-tuning the structure it's really important to think about how the user is going to use our app what goes in what comes out this is sort of back to Classic machine learning 101 when we talk about this level of fine-tuning and we want to make sure nobody gets poked in the eye when they're using our app and that's you know really important that's where instruction tuning can help out but that's also where there is no substitute for thinking it through getting the right data even if it's a relatively small data set and really designing the right application from the outset so how we're going to do this is I'm going to sort of demonstrate the application I'm going to pull a an instruction model off the shelf and then we're going to see how Chris can fine-tune an instruction model like Bloom Z or blooms as we've heard from hugging face presenters in the past on this deep learning AI Channel and then we can use what's becoming really an industry standard method of training this PFT Laura approach for fine-tuning our AI marketing assistant so the big idea here is that let's imagine we're in a context where boss says leadership says the board says that writing direct email marketing copy is now considered tedious and it's something that really should be streamlined for the Department we believe we can create a fine-tuned AI marketing assistant to really help generate marketing copy for emails and for other marketing activities that's going to be in the same voice and tone of everything that our company's been putting out there's obvious value here it's going to save people a lot of time and one key note here is that in our example we're going to use data that's synthetically generated using open ai's gpt4 however if you are going to do this in real life you would want to leverage data from your company in your marketing voice and tone and really curate that data in a very data Centric way to ensure that your outputs were really dialed in and aligned with your brand but for us we're just going to take some synthetically generated data for marketing emails for made up products and we're going to Leverage the Bloom model in particular the bloom Z Model which is the instruction tuned version of the Bloom model which is huge model it can do tons of languages we're just going to focus on English and we're going to just note here that the instruction tuning was done on Bloom's XP3 data set we've got a number of links here as well associated with this but the big idea here is that there's additional tasks just like the categories we saw in the dolly 15K data set that it's helping Bloom get better at so we're going to take Bloom Z off the shelf and we're going to use pep Laura similar to how we saw in the previous example and we're gonna have Chris walk us through each step one at a time Chris take it away from here demo two you bet okay so demo number two we are talking about uh doing a similar process however this time instead of leveraging the uh you know the uh supervised fine-tuning method we're just going to use unsupervised fine-tuning we're not going to have any Targets or labels so we're just going to let that data flow through our model uh we're going to leverage a few different libraries again bits and bytes Transformers and puffed pepft really does a lot of work here in letting us be able to do these kinds of tasks in collab you know working in that limited compute space is absolutely fantastic we are going to be again training like Greg said on that data that has the um you know marketing information the it's it anyway we're gonna Greg's already talked about that so we're gonna get into the actual training process here uh number one of course got to get your requirements uh without the requirements you know we can't we can't do anything so gotta install those dependencies we're gonna do a much more manual process you'll notice you know in the first notebook the sft trainer was doing a lot of the work for us behind the scenes in this uh notebook we're gonna do a lot of it ourselves so first things first just make sure that we have GPU available uh like the last demo in case it wasn't clear this is collab Pro with an a100 GPU so both of them are going to be using that a100 um you can definitely go smaller with a smaller model but for this process we're going to use the a100 we're going to use the bloom 3B model we're gonna load it in 8-bit so we're still doing some reduction in order to get our model to a reasonable size and uh when it comes to our tokenizer we can just use the Straight Up Auto tokenizer from Bloom Z now the difference between the two models so if you remember from our the last demo that model had the attention layers and there was our q proj v project K project you know uh the blue model is a little bit different the self-attention block actually has just this query key value module that we're going to Target you know you can see here that we have 30 times these Bloom blocks in each of those Bloom blocks has the following uh you know layers in it when we target this module instead of just targeting one of these layers we're actually targeting every layer called this and so when we do our Laura process we're not just changing one weight Matrix we're actually impacting all of these so that's gonna be 30 of them um which is what helps us get that massive reduction parameters when it comes to our float uh you know for stability we need our small uh weights to be in float32 so we just do this this is to ensure stability that's it uh we don't have to worry too much past that our other parameters that we set before training just makes the training process more efficient and then of course we're casting our output to float again it's a stability related uh reason there's not much to say about it other than you should do this when you're using this technique now we actually want to start talking about Laura I wanted to just take a brief pit stop here and talking uh about rank when it comes to Laura so our rank is basically how much are we reducing that original matrix by so what what dimensions will our new decomposed Matrix pairs have and you can see that even with rank equal to one which is means that one of the dimensions of the the decomposed Matrix is one uh we have fantastic results right so rank is really something that you can play with uh you know definitely make sure that you're you're keeping track of your metrics you don't want to degrade model performance too much but you know this rank does can be taken to the extreme and maintain a lot of the original uh you know function compared to some higher rank selection uh when it comes to Laura so this is huge absolutely fantastic uh one thing to note as well in the cure a paper they did apply the QR process to all of the uh weights that we talked about in the Laura paper itself they only apply them to the attention layers specifically they chose q and V but with Bloom we only have access to the one module which is our query key value layer so let's set up the lawyer config we're going to choose a uh you know rank of 16 here again just a default could be anything up to you to decide what you want test it evaluate it as you're going through training this is a hyper parameter you can you can tune uh Alpha drop out bias these are just default values you can play with them a bit they don't have a significant impact um but it's you can definitely tune them if you wish to you'll notice we have this extra parameter here which is our Target modules again we need to make sure we're targeting the correct weights that we want to apply the Laura process to and so we point that at that query key value modules we discussed earlier and the task type again is causal LM for people who are like causal what does this mean uh we're just talking about the idea of plop and down tokens one after the other so it's a generative task uh you know causal can only look back can't look forward it's not a mass language model so um that's the idea of a causal language model if you're just hearing that term today uh terminology is evolving very rapidly alongside the the technology then we have to convert our model from a normal model into a actual PFT model and we do that with the very complex get pepft model function where we pass the model in the config and it spits out the loraphied uh model if you want to call it that this is what I really want to draw your guys's attention to though this is huge um you know the original model called a 3B means three billion parameters and then we pair that down to less than 500 million Which is less than a percent right so we the this is why Laura is so effective we go from 3 billion to less than 500 million uh Which is less than a percent of the total tradable parameters absolutely huge uh this is why we're able to do this stuff in collab three billion parameter modeled collab you know what I mean like a year ago it would have uh been a little mind-blowing we're just using synthetic data you know this is data we generated using uh GPT so it's uh you know we can't use this application commercially or anything but just to demonstrate the idea um I'll just walk you through the data set really quick we have 17 rows that's only 17 rows right uh each row has a product a description and a marketing email we're going to use our uh product something like Smart Eyes that's what it's called a description glasses with real-time translation sure and then it uh has a gbt generated marketing email which includes like emojis and hype language and stuff like that so um and again this is only 17 rows we're in collab and uh we'll see we'll see what happens just like before we want to get our data into this format right our instruction our product our description and our marketing email this is pretty important because we want this to flow through that model again we're just trading it on the generative task right now it's unsupervised so we're not we don't have a Target here right we're just letting that flow through the model and uh so the format we let it flow through and is important so that we can reproduce this format and then just don't give it the email right so we'll give it all of this information except the email and due to the fact that it's been trained on these examples it will produce an email in the format we hope and expect so just another example uh to make sure this is a clear point you know if we wanted to go from say natural language to SQL we would have our instruction which would indicate what we're doing then we would have context which is our you know natural language and then our SQL which is the actual SQL representation of our context this is the idea again we just need to set up our prompts so that they're they're reproducible we want to be able to feed everything but the response into the model and get out what we expect uh so hopefully that's clear uh we're just going to map that prompt template to our data set that's pretty straightforward again only 17 uh you know rows we're just going to use the traditional trainer again we're not uh no no supervised fine-tuning here so we can use the normal trainer we pass in our model we pass in our training data set which is our 17 examples we've beefed up the uh trading batch size to six here just because uh we have extra GPU memory with the a100 uh if you had a smaller instance you could use a smaller batch size though there's no guarantees at the point to work depending on the GPU you get from Google uh if you had like really large context text you would want to avoid using a large batch size as it would hog up more memory so if you have like resumes or papers you want to reduce that batch size to stay within the limits we're only actually trading this thing for uh 100 steps and we're using a rather aggressive learning rate you know the rest of these uh parameters are pretty much just stock and Bear this particular parameter is important because we are using a causal language model which again is plopping those tokens down in order uh we have to ensure that we set MLM to false for our data collator for language modeling since uh we're we're using a causal model we're not using a mass language model so we must set this parameter to false that's about how it goes uh we just turn off warnings with this line and then and then it's dot train right and I just want to show you guys in 100 steps we go from kind of this you know two one loss as we train through the 100 steps only 100 steps on 17 samples so it's definitely some overfitting involved please uh you know that's that's important to understand but the idea is that we are learning the task pretty well uh I mean we we see a good loss curve so very happy about that again the next thing we're going to do is push these to the hub if you want to be able to deploy it right away with one click deploy you will have to merge and unload your model but if you're just comfortable with using the adapters uh you can go you can use just the the adapters push them to the hub which is happening when we use model.push to the hub here we're going to reload them so we can play with them and then we're gonna look at a example right so we have the coolinator and then we say this is a personal cooling device to keep you from getting overheated on a hot Summer's Day we ask you to make an instruction or an email and it gives us you know go high with the cooling air your ultimate cooling Journey hey there cool-minded friend do you ever feel like your head just swings off on a hot day I mean the idea here is that we get a a decent marketing email considering we gave it 17 examples that we traded for only 100 iterations uh you know we get the appropriate Emoji you'll notice like we have Sun themed emojis and water when it comes to cool uh the idea here is that the the fine-tuned task is able to be learned quickly and it's much better than the original right going from zero to this is a massive step and that's what we're trying to Showcase today so uh with that example out of the way we're gonna pass it back to uh to Greg Chris so cool man that was that was that was awesome uh that was a tour de force of fine tuning and instruct tuning thank you for that second demo we're going to close it out here and we'll see you back for Q a in just a second Chris so in case you missed some of those details that he was going over in that fine-tuning notebook I know it was a lot it was definitely a lot he was doing unsupervised fine-tuning that was sort of the description of the SQL and the marketing email coming out at the end 17 data points and only 100 steps of training less than five Google collab compute units and 8-bit quantization in this case so you know pennies we're getting this done for pennies and we're using a really massive model to kick things off with and we could even take the quantization a bit further with the Q Laura if we wanted to in general this process of fine-tuning the i o schema is going to give us single task superpowers we always want to start by deciding who we're building for and how they'll be interacting with our application we want to try zero shot and few shot prompting first with just a couple of examples and then we really want to be data Centric about curating the data that we need in order to get that fine tuning done beyond that what are we taking away from today well instruction tuning is a subset of all possible fine tuning it's really about following instructions and you know giving instruction is the number one rule when you're prompting and it's about aligning with humans and aligning with some of the more subtle benchmarks that we see coming onto the stage today toxicity truthfulness bias Etc fine-tuning of an input output schema on the other hand it really lowers the dimensionality of the llm that we're dealing with and allows us to focus just on one task versus a big set of different benchmarks the rule of thumb is to always pick up instruction tune models When You're Building and this really means that what you're going to do is you're going to fine-tune the input output schema on top of an instruction tuned model which was already supervised fine-tuned after it was unsupervised pre-trained and so there's a lot of layers to this stack of llms as we're building with them and it always helps to go back to the fundamentals and remember airplane is the odd one out that's the other thing that we learned today so we've got some shared resources for this event including some additional resources you didn't even see today in the main GitHub repo we're going to share all the links today including the slides including everything you saw and more with you just as soon as the event is over and we send out the follow-up so thank you any questions reach out anytime we're going to start taking questions now I'd like to invite Chris back up to the stage and we're going to run through the top voted questions on slido so number one would it be possible to share the slides Jorge yes we have the slides ready to share and I believe that we do have them to share in the chat with you now number two what ways Chris are there to prevent hallucinations and ensure that the answers are coming from your embedded data or documents yeah you bet I mean I think this is uh right now the most succinct answer that I can give is putting this in an application that involves some amount of uh you know retrieval process so leveraging something like laying chain to allow you to use Source documents and then providing those Source documents along with the llm's response as far as it comes to hallucinations in the actual fine-tuned model uh there's a lot of research about that it's kind of Deep In The Weeds to get into at this exact moment but I would say the quote unquote easiest way to get it on Rails is to just put it inside an application and uh and leverage that ability to retrieve sources alongside your context uh to to allow users to check or you to check and ensure that the quality of the generations is not fake yeah yeah I like this idea of retrieving the sources alongside of Whatever Gets generated just as a way to sort of fact check whatever the output of the llm is I think that's a great way of going about it one that we're seeing more and more of um so we've got another question I didn't just mentioned sort of Lang chain here and we're getting some some more sort of chaining and indexing questions here how to make an llm learn from a PDF and answer from the given PDF yeah I mean you know the repo that we're linking does provide uh some uh resources of how to get started with a tool called Lang chain which I believe is going to be the lowest barrier to entry method to leverage your fine-tuned or instruct tune models to incorporate your own data we always get that question like how do I add my own data to this right how do I how do I add my own data man and the easiest way honestly is just let the llm do what it's good at which is processing language and then let uh you know classic machine learning do it it's good at which is finding similar things to the query marry those two together with something like Lane chain and I think that's the easiest way to uh to get that experience and there are examples provided on how you could do that in the in the repo yeah yeah so yeah and just to sort to Triple down on that like we did not cover Lang chain today we did not cover indexing using Vector databases we did not cover chaining we simply covered instruct tuning and fine-tuning of the input output schema however when you go to build complex llm applications you're going to want to chain a few different things together and potentially use some really optimized database structures like vector databases but that's for another event not for today although that is how do I put chat GPT on top of my data the number one question that everybody's asking everywhere from Enterprise to everybody who's actually just building these for themselves okay next up how do you handle confidential data in the llm while still benefiting from the training yeah I mean this comes down to a number of different things you can do so you can sanitize your outputs just check them for confidential data ensure you're not leaking anything another would be to remove uh and pre-processing steps all PPI or confidential information realistically though if you are really concerned about leaking potentially confidential information it does involve extra processing steps without removing it from the data set it's hard to guarantee that it can never leak and so I would suggest uh you know some kind of pre or post processing in order to to achieve that goal good stuff good stuff from Andy we've got the question why is fuchsia learning called learning since it is not really an extension of the prompted model yeah it's in context learning so it is the the way you can think about this is with those few shot examples we can have the model perform these tasks that it was never trained on and so we're calling it learning because we are teaching the model how to do the task we want it to do just through those few shot examples there's definitely some interesting discussions around you know in context learning but the the basic idea is that we're teaching it to do something that it can't do and so there is some learning it's not learning the way we might think about training because it's not learning it forever and this and that but it is learning how to do a task it can't do and so so we just call it it right yeah yeah so that example is just something that it wasn't a benchmark that it was trained against it's just something new that we're teaching it that's sort of the big idea of in-context learning is that right Chris yeah yeah okay yeah great questions keep them coming uh when fine tuning with custom data is it better to do it on a base model a supervised model like dolly or add the custom instructions to data bricks that is a great research question um there is not a definitive answer there are some people who believe that uh actually you know bringing the the instructions on an already fine-tuned model is potentially actually not a great uh idea there are some people who say it's like the best idea uh it's it's just a question that's going to have to be answered more concretely through a lot of research right now I would say it kind of depends on what you're trying to do if you're doing what we did in the second example I would just straight up that's he was fine use something like Bloom Z or a pre-instruct tune model uh we're trading on so few examples that nothing nothing too crazy should happen but if you're talking about these big volumes of instructions I would add those custom instructions to your D brick set or whatever instructs that you want and then let it rip on the let the model base model rip on that and learn those tasks nice nice yeah so uh yes this is being recorded so you guys will have the recording no problem uh how should one go about get getting started with llms and where do you see sort of the future of llm's head we heard a lot from beginners today in the chat and in the questions a lot of people a little intimidated by some of the stuff that they saw today how would you recommend a beginner getting started and where should they think about where this is headed yeah I mean I think honestly the best way to get started right now is just to get started on something I think the easiest place to get started is something like you know querying uh chat GPT through python you know the building an application on top of that uh you know and then moving into this fine tuning and everything like that these fine-tuning processes with Laura are great because they can be done in collab a few smaller models they can be done on the free version so I think those are the places where you want to get started and then build your way up to these more complex use cases um you know but ultimately just getting started is the thing to do so find your favorite tutorial online follow it build it you know watch it fail figure out how to fix it rinse repeat until uh until you're you're no longer getting started yeah yeah yeah get on Chad gbt if you haven't yet do some prompting understand zero shot understand fuse shot and then go and get tapping into the API directly leveraging GPT 3.5 gpt4 if you start looking at The DaVinci models you'll actually have some understanding now of how those have been improved over time to create that GPT 3.5 turbo through instruction tuning and so um you know I think just really digging in Chris takes more of a build build build mentality which is fantastic other people I've heard just offering this up for anybody that might be more interested in a more conceptual approach is just really go back to studying kind of the fundamentals and the fundamentals of machine learning proper and then sort of build up with those foundational papers many of which have been released recently and you can sort of start from the old or start from the newer so there's a lot of ways to go about it but there's a ton of information there and it's really just important to get started um I love this question that we have next here oh I've actually got a new top voted question how does the output compare to including 17 examples in the prompt to chat GPT is worse uh of course Chad gbt is an absolutely massive huge model that's very good at lots of things uh the difference is that that uh fine-tune model can run on like a single consumer grade GPU uh that you that you keep in your garage and so uh I think when it comes to Resource expenditure you wind up uh being able to see the benefit of a in in with 17 examples the output's already not tremendously worse than gbt4 right it's just not there yet so you can imagine trading a bit longer using more examples uh you know maybe using supervised fine tuning instead that you can push it to be very comparable and still run on a consumer grade GPU in your garage that's like the you know so I would say the output Compares poorly but the performance Compares astronomically yeah yeah and you know I guess in studying benchmarks one of the huge things to consider is efficiency of the model and Performing the task at hand so how much does inference cost how much does training cost and so you know if you're or if privacy is an issue this is a great dovetail into the next question Chris which I absolutely love is it possible for normal humans to build llms without many computational resources you bet thank you Tim detmer's I think I single-handedly carrying the uh the ability to do that yeah for sure I mean processes like Laura and Q Laura uh bring these like I said you know not everyone might have gbt Pro but we use a seven billion parameter model and it could be uh better optimized right we can spend more time optimizing it uh absolutely uh to to get a run on the free version of colab I mean the the idea is that if you have just the free version with something like you Laura uh you can absolutely you know train these three billion is seven billion parameter models you could buy a RTX 4090 for whatever it is like 2K or something and and you can train these models yourself in your house right so it's uh the barrier to entry for llm's fine-tuning inference I has never been lower uh and the performance shows that it's you know the models run well despite being tiny to to to compute with so yes it is possible yeah great segue Chris last question this one's been climbing the ranks here and I think you segued into it nicely when should we use 4-bit Q Laura and when should we use 8-bit Laura pros and cons yeah I mean the the pros and cons are at this point probably not super well understood um you know killer is a very new process I would say that as always the most important thing to do is Monitor metrics so when you're trading on your Downstream tasks make sure you're evaluating with whatever methods you use to evaluate ensure that you're getting response from your users um but uh right now I would say that there doesn't seem to be a huge performance difference between the two and there's a pretty significant compute reduction so I would lean towards uh leveraging Q Laura at this moment in time but uh again the research isn't there so I'm just saying this based on my experience please uh wait for the wait wait for the research you heard it here first Chris is going to be using you Laura it's uh you know it's up to you which one to pick up we're certainly seeing Laura emerge as a best practice it's been around longer but we'll see if Q Laura takes the spot Chris awesome thank you for answering all these questions thank you for those sweet demos and thank you everyone out there for your participation today this brings us to the end of today's event brought to you by deep learning.ai and fourth brain in addition to our GitHub repo that we shared with you today that includes Lang chain and chatbot examples for even more generative AI app building fourth brain has also put together some resources for all things prompt engineering and fine-tuning that we discussed in our recent Community event perhaps we'll see you at our next Community event on indexing and chaining taking place this afternoon at 3 pm Pacific deep learning.ai is also going to send a follow-up email with a survey for all attendees we would love to hear about how to make these events even better a select 100 will receive a promo code for 50 off a one month subscription to any deep learning.ai course on Coursera once again huge thank you to Chris and to all of you joining us today we'll see you next time until then keep learning bye everybody see you next time thank you
Info
Channel: DeepLearningAI
Views: 37,334
Rating: undefined out of 5
Keywords:
Id: eTieetk2dSw
Channel Id: undefined
Length: 59min 35sec (3575 seconds)
Published: Wed May 31 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.