Lessons From Fine-Tuning Llama-2

Captions
[Applause] Hello everyone, can you guys hear me? Welcome to our talk. My name is Kourosh; I'm a tech lead on the AI team here at Anyscale, and together with Artur we're going to be talking about some of the lessons we learned from fine-tuning Llama 2. I hope the insights we uncover in this talk will be of help to you as well.

Here's the outline of the talk. I'm going to start by motivating the promise behind open-source LLMs and why in particular we need to fine-tune them. I'll briefly talk about how Ray Train fits into the picture when it comes to distributed LLM training, and then we're going to cover some learnings around fine-tuning problem setup and parameter-efficient fine-tuning.

Since the emergence of ChatGPT, we've seen two major trends. On one hand we have closed-source language models — models like GPT-4 or Claude 2 from Anthropic. These serve as very powerful general-purpose assistants capable of solving a wide variety of tasks, but what's top of mind for many people is that they're prohibitively expensive to run in production, and, more importantly, there's a lot of ambiguity around data governance and how your data gets used when you rely on these systems. On the other hand we have open language models — Llama 2 from Meta, the Falcon models, or MosaicML's MPT models. Their promise sits on the other side of the spectrum: they're often smaller and cheaper to run, and they give you more control over your data and the technology stack you use to serve them.

What's more interesting is that in recent months we've seen immense progress in open language models closing the gap with proprietary models like GPT-4. There's a leaderboard from LMSYS, an organization out of UC Berkeley, which tracks the progress of language models by evaluating them across a wide range of tasks; Llama 2 models have come very close to GPT-3.5 and other proprietary models.

But the problems with these language models fall into two broad categories. First, their completions are often not factually grounded — they hallucinate and make things up. Second, they often don't follow the format you have in mind when you try to use them. This figure shows a spectrum of techniques that address these two types of problems: at the bottom, prompt engineering and few-shot prompting; then fine-tuning, which addresses the form-following problem; then retrieval-augmented generation, which explicitly addresses hallucination; and at the top, reinforcement learning and training from scratch, which are more complex and only available to a few companies today. We're going to talk about fine-tuning and how it addresses the form problems of these language models.

So why fine-tune language models? In the next few slides I'll cover a few reasons that highlight the benefits. The first thing to point out is that few-shot prompting is a technique that enables in-context learning: you provide a few examples of desired inputs and outputs, fit them into the model's context as part of the input, and have the model generalize that same pattern to unseen data points.
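For illustration, here's roughly what a few-shot prompt for a text-to-SQL task might look like (this exact prompt is made up for this writeup, not taken from the talk):

```python
# A hypothetical few-shot prompt: two worked examples go into the context,
# and the model is expected to continue the pattern for the new question.
few_shot_prompt = """\
Translate the question into a SQL query given the table schema.

Schema: employees(id INT, name TEXT, salary INT)
Question: Who earns more than 100000?
SQL: SELECT name FROM employees WHERE salary > 100000;

Schema: orders(id INT, customer TEXT, total REAL)
Question: What is the total revenue?
SQL: SELECT SUM(total) FROM orders;

Schema: students(id INT, name TEXT, gpa REAL)
Question: List the top 5 students by GPA.
SQL:"""
```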
But your data is often large and doesn't fit the limited context window these language models provide. In those scenarios, instead of putting the examples into the context, you can bake them into the neural network weights that represent the model's internal knowledge.

Another reason to think about fine-tuning is that a lot of tasks are hard to describe in words. Some of these subtleties are about formatting — you have a specific output format in mind — or about having the model generate something in a specific tone. You may attempt to fix this by prompting with phrases like "output this in this JSON format" or "put the final answer in this integer format that I want to parse later in my software," but language models often don't respect these phrases, and you may need to provide several examples to reinforce what you mean. The same goes for tone: you may say "write this in a concise, respectful, helpful manner" without being explicit about what those words mean, and again you'd need to provide examples of what they mean for the model. With fine-tuning, you can leverage a lot of such illustrations and bake them into the internal knowledge of the model.

Fine-tuning can also save you tokens. There are many applications where you can get away with prompt engineering, but the prompts end up wordy and verbose, with many examples. The thing to keep in mind is that if you run this in production, then for every single request — every input and output token you generate — you have to feed in that same context and perform computation on it. If the prompt is too verbose, it incurs a lot of cost during deployment. With fine-tuning, you implicitly bake that prompt into the weights of the network and get away with a cheaper serving cost.

And last but not least, as we show later in the talk, with fine-tuning you can often get a faster, cheaper model at the same quality on niche tasks, compared to larger models — even GPT-4 in some cases. This plot, which you've probably seen in the keynote and other talks here, demonstrates an example of what we mean by a niche task — SQL query generation — and how we can fine-tune these small models to outperform other, more powerful models on this specific task. We'll cover more of the experimentation side later in the talk.

Now I want to briefly highlight how Ray fits into this picture. There was a great talk presented yesterday that dives deeper into how Ray Train is a production-ready library for distributed deep learning; I won't go into as much detail, but I'll highlight some of the features that make Ray Train great for this type of workload. So what is Ray Train? In my opinion, it's the best framework for orchestrating multi-process training workloads, and here's why. First, it provides a very simple, 100% Pythonic API: you can take existing Python code in your favorite framework and integrate it with Ray Train to distribute it across your cluster. It also has seamless integration with other libraries in the Ray ecosystem, like Ray Data, which provides distributed data ingestion — very helpful when you're dealing with large datasets. It provides tools for faster development: for example, it automatically sets up the distributed environment so that lower-level libraries like CUDA and NCCL can communicate with each other, and as an ML developer you don't have to think about them — you can focus on your model training, loss curves, and so on. Another way to look at Ray Train is as a simple and elegant job scheduler: with features like autoscaling and support for heterogeneous resources, you can survive in today's world where GPUs are scarce and there are capacity and reservation issues — you can put together heterogeneous clusters and get unblocked while training in development. And last but not least, there are a lot of observability tools built around Ray that help you easily debug distributed applications and unblock yourselves.
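As a rough sketch of what that API looks like in practice — this is the standard Ray Train `TorchTrainer` pattern, not code from the talk, and `build_model` is a hypothetical helper:

```python
# Minimal Ray Train skeleton: wrap an existing per-worker training loop
# and let TorchTrainer distribute it across the cluster.
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Ordinary PyTorch code goes here; Ray Train sets up the distributed
    # environment (process group, CUDA devices, NCCL) for each worker.
    model = build_model()                         # hypothetical helper
    model = ray.train.torch.prepare_model(model)  # wraps in DDP, moves to GPU
    for epoch in range(config["num_epochs"]):
        ...  # forward / backward / optimizer step as usual
        ray.train.report({"epoch": epoch})

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 3},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```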
Now that we've talked about the infrastructure side and why we should fine-tune, let's talk about what it takes to do fine-tuning — how do we set up the problem? There are two main pillars you have to think about very carefully: data collection and formatting, and — I really want to highlight its importance — evaluation. To crystallize these into concrete examples, we're going to use natural-language-to-SQL query generation.

Dataset quality is crucial; I think you've heard this already from other speakers here. In generative AI, the dataset is king, and you have to invest a lot of time in it to make sure you have high-quality, curated data that captures your intention of how the language model should behave. In SQL generation, the examples are formatted like this: a natural-language statement that poses a question about a dataset, a table schema presented as a set of tables with column names and their data types, and finally the desired query you want the model to generate. It's very important to make sure these datasets are clean. For this study we did a lot of manual data curation: we went through the datasets, made sure we understood the common errors, fixed them, and filtered them — checking, for example, that table names make sense and represent the underlying data, and that data types match the query being generated. To get good results, you have to curate your data; I can't emphasize it enough in just one slide.

The next thing you have to think about — and this is an important one — is that the way you format the data during training will impact how you have to ask the model to do things later: training and inference data formats should be consistent with each other. Let me give an example from SQL generation. Imagine that in my training dataset I structure all my examples as "Write a SQL query to answer this question based on a table schema," followed by two newline symbols, the context, two more newline symbols, and then the question, and I have the model learn to output the corresponding query. I train a model this way, but at inference time I come back and ask the model the same question in a different format: "Here is a database" — maybe I don't even specify the schema — "convert the following to a SQL command: show names ...". What the model produces is then wrong in subtle ways: it forgets the name of the schema, or it drops the ORDER BY ... DESC clause. The reason is that you have to think about how the model has seen the data before: it has only seen this one particular format, and you're throwing a new format at it, which gets converted into token sequences the model may not recognize and may not generalize to very well. So it's very important to keep the format consistent between training and inference — or, if you want variation in the type of data that goes into inference, include the same kind of variation in your training data, so the model learns to be robust to it.
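A simple way to enforce that consistency is to route both the training-data pipeline and the inference path through one shared template function; a minimal sketch (the template wording paraphrases the slide):

```python
# One source of truth for the prompt format, shared by the training-data
# pipeline and the inference path.
PROMPT_TEMPLATE = (
    "Write a SQL query to answer this question based on a table schema.\n\n"
    "{context}\n\n"
    "{question}"
)

def format_example(context: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Training: prompt and target completion concatenated into one sequence.
schema = "CREATE TABLE head (name TEXT, age INT)"
train_text = (format_example(schema, "How many heads are older than 56?")
              + "\nSELECT COUNT(*) FROM head WHERE age > 56;")

# Inference: the model sees exactly the structure it was trained on.
prompt = format_example(schema, "List the names of heads under 40.")
```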
Now I want to talk a little bit about setting up evaluation pipelines. This example is specific to SQL generation, but it should inspire ways of thinking about other tasks. In SQL, your model outputs something like "SELECT ...", and you have a reference output you want to check it against for equivalence. This is a contrived example, but it captures the nuances of the task: it's complicated to ensure that what the model outputs is consistent with, or the same as, the reference output. You cannot do character-for-character matching. Even more sophisticated methods like abstract syntax tree (AST) matching fall short: you may have expressions that are equivalent but look different, so AST matching doesn't work out well either.

What we did here was use GPT-4 — a powerful model that can get expensive, but in an evaluation pipeline this is a one-time cost you pay up front to set up a pipeline you keep consistent throughout your experimentation. We asked GPT-4 to create a set of mock tables, conditioned on the reference output and the table schema, such that running the reference query against a mock table produces a result we can compare against the result of running the model's output against the same table. By doing so, we curated and handcrafted maybe 200-300 such unit tests, which we could run all of our experiments against — a consistent, scalable evaluation pipeline for these fine-tuning tasks. The takeaway is that there are tasks you may want to apply fine-tuning to where evaluation is hard, but you can leverage more powerful models to automate that part and take some of the human effort out of the loop.
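Here's a minimal sketch of that execution-based check, assuming the mock table DDL and rows have already been generated (e.g., by prompting GPT-4 once up front); the helper and its signature are illustrative, not the talk's actual code:

```python
import sqlite3

def queries_equivalent(mock_table_ddl: str, mock_rows_sql: str,
                       reference_query: str, model_query: str) -> bool:
    """Run both queries against a GPT-4-generated mock table and compare
    the results. Returns False if the model's query errors out."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(mock_table_ddl)   # CREATE TABLE ... (from GPT-4)
    conn.executescript(mock_rows_sql)    # INSERT INTO ...  (from GPT-4)
    expected = conn.execute(reference_query).fetchall()
    try:
        actual = conn.execute(model_query).fetchall()
    except sqlite3.Error:
        return False
    # Exact comparison: row order matters for queries with ORDER BY.
    return expected == actual
```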
Now let's talk about some of the learnings we had from running these experiments on Llama 2 models. This plot was shown in the keynote as well. We applied fine-tuning to several tasks we thought would be representative of what people want to do with these language models. I already talked in detail about the SQL generation task — that's what's shown in the middle. On the left side we have functional representation: a task where you have unstructured text asking a question or making a comment about something, and the task is to read that text and convert it into structured data. This is very common in, for example, the health space, where doctors write a lot of free-form notes and you have to parse them and extract the information in a structured format. And we have a third task geared more toward mathematical and logical reasoning: GSM8K, a dataset of around 8,000 basic math questions with answers, used to evaluate how well language models solve this type of problem.

The darker bars show the success rate of the chat fine-tuned models right out of the box, without any specialized fine-tuning; compared to GPT-4, they do very poorly — not even close. But if you take the training data curated for these tasks, fine-tune the models, and evaluate again, the performance gets boosted so much that they can actually beat GPT-4 on the first two tasks. However, on tasks that involve more than just following a format — math requires more understanding, reasoning, and piecing together the logic behind the question — fine-tuning can help you get from, say, 40% to 50%, but it remains far behind more powerful, general-purpose models like GPT-4. What this shows is the opportunity for applying fine-tuning to form-following tasks: for functional representation or SQL generation, the model doesn't have to deeply understand how the world works; it just has to learn to map a certain input format to a certain output format. That's where fine-tuning can really help. Now I'll hand it off to Artur to talk about learnings from parameter-efficient fine-tuning.

Thanks Kourosh. Hello everyone. All right, so now that we have seen the value of these models, let's talk about parameter-efficient fine-tuning. First of all, what is it? Full-parameter fine-tuning is just a continuation of training, but on specialized data. Parameter-efficient fine-tuning is the same thing, except you only fine-tune a small number of parameters — a subset of the original model's parameters, or some additional parameters — the point being that it has to be very few. A couple of techniques exist to do this, and one of them is LoRA: low-rank adaptation of LLMs. On the left side of this slide is a schematic of the internals of a transformer, and on the right you see how LoRA works in principle. For any dense layer in the transformer — a feed-forward layer, for example — you can grab that layer and apply LoRA to it. What does that mean? You have the pre-trained weights, and during training with LoRA you freeze them and set them aside — this becomes quite important later. Then you add an additional matrix, A times B, that decomposes into two low-rank matrices A and B, and these two matrices combined have very few parameters compared to the pre-trained weights you would normally be fine-tuning. That's really where the trick is, and it can bring you two things: during training there's a much smaller optimizer state to keep in memory, and you're left with much smaller checkpoints. We'll talk more about both later.
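The idea maps to very little code. A minimal LoRA-wrapped linear layer in PyTorch might look like this (a simplified sketch; real implementations such as Hugging Face's peft also handle dropout, weight merging, and more):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen nn.Linear (sketch, not peft)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze pre-trained W (and bias)
            p.requires_grad_(False)
        # Low-rank update delta_W = B @ A, with rank r << min(d_in, d_out).
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + scale * (B A) x  -- only A and B receive gradients, so
        # optimizer state and checkpoints need to cover just A and B.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```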
Let's first talk a little more about the quality of the models we got out of fine-tuning with LoRA. This should look somewhat familiar: these are the same tasks Kourosh talked about earlier — functional representation, SQL generation, and the math task. The dark shade signifies the baseline and the light shade signifies how well full-parameter fine-tuning does; we've added a medium shade to show how well LoRA does. You can see that for the left two tasks, functional representation and SQL generation, LoRA did almost as well as full-parameter fine-tuning — the relative difference in accuracy is like one or two percent. We can already learn from this that with LoRA we're able to solve some real-world problems very well — actually better than what we got out of GPT-4. But on the right side you see the math task again, where LoRA lags a bit behind: for the 13B and 70B parameter models we're seeing differences of two or three percent, and for the 7B model the gap in quality was even greater. Our hypothesis for why: math is generally hard for LLMs to do, as we know, and LoRA is also a more difficult optimization problem — since you have much fewer parameters to play with, the optimization landscape is a little trickier — and these might just add up. So something we can perhaps learn from this, to be confirmed on future tasks, is that LoRA's performance may depend a little on the type of task you're looking at.

Another thing we learned about LoRA is that it's sensitive to the learning rate. With full-parameter fine-tuning, what you'll generally find is that training is very stable across a wide range of learning rates. With LoRA we encountered some instabilities here: a learning rate you'll see widely used on the internet is 1e-4, and we used that at first as well and ran into some of these instabilities. You can see how, just by tweaking the learning rate a little, we got the much smoother learning curve shown in purple.

Another thing we did to improve stability was, interestingly, prompting. What you can do during training — and obviously, as Kourosh said, you have to do the same thing during evaluation — is apply some prompt engineering during fine-tuning: you create some helpful context for the model, for example "you're a helpful assistant; this is a SQL table and a query," and prepend that to what you would normally input to the model. With everything else fixed — seed, learning rate, and so on — that left us with the even smoother learning curve shown in orange.
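In code, this amounts to prepending a fixed instruction block to every example, identically at training and evaluation time; a sketch with invented wording (the actual prompt used in the talk isn't shown):

```python
# Hypothetical instruction prefix, prepended identically at training and
# evaluation time.
SYSTEM_CONTEXT = (
    "You are a helpful assistant. Given a SQL table schema and a question, "
    "write a single SQL query that answers the question.\n\n"
)

def with_context(prompt: str) -> str:
    return SYSTEM_CONTEXT + prompt

# raw_pairs: placeholder for your (prompt, completion) training pairs.
raw_pairs = [("Schema: ...\nQuestion: ...", "SELECT ...;")]
train_texts = [with_context(p) + c for p, c in raw_pairs]
```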
Cool. So now that we've talked about how well LoRA does on these problems — and that you might just have to tweak it a little here and there — let's look at the upsides of LoRA. First, as I said in the beginning, the optimizer state is much smaller: for the 7-billion-parameter model, for example, we were able to fine-tune on a single AWS p4de.24xlarge instance, and we were simply not able to do the same thing with full-parameter fine-tuning. Second, as you can see here, the checkpoint sizes are much smaller: with our LoRA settings we were left with checkpoints of about 40 MB for the 7B model, versus 12.6 GB for full-parameter fine-tuning. With full-parameter fine-tuning, every time you checkpoint, you have to checkpoint the entire model; with LoRA, you're just checkpointing the two matrices A and B.

This brings us to our sixth learning. As I said, during training you freeze the pre-trained weights and set them aside, and you add the two matrices A and B that are your LoRA weights. What this means during serving is that you put the frozen weights — the original model — in memory once, and alongside them you keep an array of LoRA weights that are task-specific. This ties in very well with what Kourosh said initially: to beat these large, general-purpose, very expensive models, we fine-tune small models on niche, specific tasks. So you can imagine one set of LoRA weights per task here.
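A sketch of that serving pattern using Hugging Face's peft library (the adapter names and paths here are made up): the frozen base model is loaded once, and small per-task adapters are attached and switched on demand.

```python
# One frozen base model in memory, many small task-specific adapters.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Each adapter checkpoint is only tens of MB (just the A/B matrices).
model = PeftModel.from_pretrained(base, "adapters/sql", adapter_name="sql")
model.load_adapter("adapters/func_repr", adapter_name="func_repr")

def generate_for_task(task: str, inputs):
    model.set_adapter(task)  # switch which LoRA weights are active
    return model.generate(**inputs)
```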
So what have we learned about LoRA in terms of trade-offs? First of all, if your sole concern is model quality, there's no way around full-parameter fine-tuning — it still has that edge of one to three percent of relative accuracy. And the difference in training time between the two is really not there: initially we thought LoRA must be much quicker — fewer parameters, fewer things to checkpoint — but it turns out that if you look at the time it takes the model to converge, as in wall-clock time to a given perplexity, it's roughly the same between the two methods. What we really gained from LoRA is, first, the memory footprint, which can unblock you to use smaller instance types in training, and second, the greatly enhanced serving efficiency.

So here are all the learnings we mentioned today: dataset quality is crucial; consistency between training and inference data formats is crucial; we used GPT-4 to set up a reliable evaluation pipeline; LoRA is sensitive to the learning rate; prompting the dataset helps with training stability; and LoRA's big advantage is really the serving efficiency. One more thing: there's another talk about these LLMs in production by our chief scientist Waleed, at 3:15 PM in Gate Ballroom B. Cool — thanks everyone for attending, thank you.
Info
Channel: Anyscale
Views: 3,579
Id: _OIq-9dKkbI
Length: 28min 57sec (1737 seconds)
Published: Thu Oct 12 2023