Aligning LLMs with Direct Preference Optimization

Captions
[Music] Hey everyone, my name is Dichan Morgan and I'm part of the community team here at DeepLearning.AI. Today we have really special guests to talk to us about direct preference optimization, and I'm really excited to dive in. For everyone's information, the session will be recorded, and the slides and notebooks will be available after the event. For any questions, please fill out the Slido link we've dropped in the chat, where you can vote on the questions you would most like answered.

For today's workshop we'll focus on the landscape of chatbots, which has changed dramatically in the past year, fueled by advances in large language models like ChatGPT and Llama. While traditional supervised fine-tuning has dominated the field, recent research highlights the potential of aligning language models with human preferences for increased helpfulness and safety. This workshop delves into direct preference optimization (DPO), a powerful technique used to train Zephyr, the current state of the art for 7B-parameter chat models. You'll gain practical knowledge on fine-tuning LLMs for chat, understand the theory and application of DPO with Hugging Face tools, and learn key metrics for evaluating chat model performance.

Our event partner today is Hugging Face. Hugging Face is an AI company, as I'm sure you all know, specializing in natural language processing and machine learning, known for its open-source contributions and collaborative approach to AI research and development. The company is famous for developing the Transformers library, which offers a wide range of pre-trained models and tools for a variety of NLP tasks, making it easier for researchers and developers to implement state-of-the-art AI solutions. Hugging Face also fosters a vibrant community for AI enthusiasts and professionals, providing a platform for sharing models, datasets, and research, which significantly contributes to the advancement of AI technology.

So for today I want to introduce our first speaker. Hey Lewis, how's it going? Happy to have you here. Lewis Tunstall is a machine learning engineer at Hugging Face whose work lies at the intersection of the open-source and research teams. He's a co-author of the bestselling NLP with Transformers book and has previously built machine-learning-powered applications for startups and enterprises in the domains of natural language processing, topological data analysis, and time series. He holds a PhD in theoretical physics, was a 2010 Fulbright scholar, and has held research positions in Australia, the US, and Switzerland. His current work focuses on building tools and recipes to align language models with human and AI preferences through techniques like reinforcement learning. Happy to have you here, Lewis.

Thank you for having me.

Absolutely. And for our second speaker we have Ed Beeching. Ed has been a research scientist at Hugging Face for the last two years, focusing on building open tooling for LLM fine-tuning, implementing state-of-the-art LLM alignment algorithms, and building preference datasets. Ed is the creator of the Open LLM Leaderboard, a co-author of Zephyr, and a maintainer of numerous open-source ML libraries. His PhD focused on deep reinforcement learning approaches for planning, navigation, and robotics; his current research spans embodied learning and LLM alignment. We're so happy to have you here today. Are you excited to dive in?

Yes, can't wait.

Absolutely, well, take it away.

Great, thank you so much.
All right. So today we're going to talk about this idea of alignment in LLMs, through the specific lens of an algorithm called direct preference optimization. For those of you hearing these words for the first time, this was a big breakthrough from a set of Stanford researchers last year, who showed that a lot of the complexity that goes into the conventional alignment techniques pioneered by OpenAI and Anthropic can actually be addressed in a much simpler and more memory-efficient way. As you'll see throughout this talk, there are some technical details that are useful to understand, and we're going to do our best to demystify a lot of the mathematics that goes into these types of algorithms, and also show you some practical examples of how you actually train language models using various tools in the Hugging Face ecosystem.

To get started, I thought it would be nice to ask ourselves why we would want to align a language model in the first place, and to ask a slightly philosophical question: what is alignment? Most people here probably have some familiarity with language models, and you know that the first step when training one is a process called pre-training, where you essentially feed a transformer network a very large amount of data from the internet and get the model to predict the next token in the sequence given the previous tokens it has seen. This ends up being a very powerful kind of autocomplete model, and you can use tricks like few-shot prompting to get it to generate certain answers. But as an autocompleter, if you ask it an information-seeking question like "is pineapple on pizza a crime?", the base language model will just autocomplete what it thinks are the most probable next words. In this particular example it might say "well, this is one of many questions that will be answered at the pizza party", because that completion comes from some text it saw during pre-training.

These base language models have their own uses, but they typically need some additional tuning to be usable as chat models. The second step is typically called supervised fine-tuning (SFT). In this step you provide the model with a few thousand examples where a user asks a question and either a human-written response or an LLM-generated response provides the answer. In this particular example, maybe the human annotator really hates pineapple on their pizza and wrote that in the annotations, so when you ask the question, the model says "yes, adding pineapple to pizza is a crime under the Geneva Convention". The SFT model has essentially learned from the fine-tuning data, and also from the pre-training data, that there are some biases present, and those biases get expressed in this particular way.

Now, you might be a company like Domino's or Pizza Hut, and maybe your customers actually do like pineapple on pizza. One way you can adjust your language model to encode those customer preferences is to provide the model with a bunch of questions and generate a few answers for each one,
in this case maybe just yes or no, and then you get your customers or human labellers to annotate them. This gives you a way of encoding alternative preferences that are not directly present in the pre-training or the supervised data. Essentially, you're trying to teach the model how to model the probability that the preferred response is more likely than the dispreferred one. If you do this process correctly, you end up adjusting the probabilities so that the model emits answers that are more aligned with your human preferences; in this case it might say "no, adding pineapple is not a crime, it's a matter of personal preference or taste".

One thing to mention here is that alignment nowadays is often associated with things like censorship. Most people who have interacted with ChatGPT have at some point received a refusal like "as a language model I can't do X, Y, Z". But it's really a broader concept: as humans we have preferences, which are diverse and depend on our culture and upbringing, but those preferences can in principle be modeled through neural networks, and then we can adjust our language models to behave more in accordance with what our end users may want.

Before we dive into DPO, I think it's good to show the original approach that OpenAI pioneered, which they called RLHF, or reinforcement learning from human feedback. The basic idea was a very similar recipe. First you do supervised fine-tuning; in their case they had humans annotating and creating high-quality examples of questions and answers, and then they fine-tuned a variety of GPT models on that data. Last year, when ChatGPT had just come out, the open-source community didn't really have many high-quality datasets you could use for supervised fine-tuning, but this has really changed. Now there are some really good examples, like OpenHermes from Teknium or Dolphin from Eric Hartford, and these are datasets you can use today to convert essentially any base language model into a chat model. If you train on these datasets you'll typically get a very good chat model that you can converse with.

What OpenAI did next was say: OK, we've got this chat model, now we need a way to model the human preferences from annotators. The way they did that is they took a collection of prompts, fed them to the chat model, and had the chat model generate a variety of answers, and then those answers were rated by labellers from best to worst. This gives you a ranking of which answers are preferred over the others. Then you train a new type of model called a reward model, which learns to predict, given some response from a model, the preference score associated with what the human labellers gave. If you do this, you've now got essentially a regression model that you can use to rate any outputs, and this gives you a way of mapping those human labels into the preference model.
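For reference, the pairwise ranking loss such a reward model is trained with typically looks like the following minimal sketch. This is an illustration of the general Bradley-Terry style objective rather than OpenAI's actual implementation, and it assumes you already have scalar scores from the reward model for each chosen and rejected response.

import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores):
    # Push the score of the preferred answer above the score of the
    # dispreferred one for every labelled pair in the batch.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()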
So now we've got two models: a chat model and a reward model. I should mention that we also now have many public datasets that can be used for this. The final step is to somehow combine those two to create a new chat model that has learned how to encode the signal from that reward model, and the way OpenAI did it, and the way Anthropic has done it too, is with reinforcement learning. The basic idea is that you give your chat model a prompt, it generates an output, you then rate or score that output using your reward model, and you use that score to update the weights of your original chat model so that it moves closer to those preferences. If your human annotators are rating things in certain ways, those preferences will essentially be encoded into the updated chat model, and the result will be something like ChatGPT or Claude from Anthropic.

Mathematically, and I don't want to make this too scary, the basic idea is that you're trying to maximize the rewards the model can get: you give it all these prompts, you score the outputs, and you want the model to get better at producing outputs that receive high rewards. But if you only use a reward model, it turns out there's a problem called reward hacking, where the chat model can learn that certain words or tokens have very high reward and just keep producing them ad nauseam, and you end up with a garbage model that just outputs backslashes or emojis or whatever happens to be rewarded highly. So what OpenAI introduced was an extra term, the KL penalty, which measures the distance between your original chat model and the model you're currently optimizing, and there's a parameter called beta that controls how close you want those two models to stay to each other. If you take these two things together and optimize them with a reinforcement learning algorithm like PPO, you typically get a very performant and aligned model.

Now, there are certain challenges associated with this. The first is that anyone who has done reinforcement learning knows it is typically very unstable and there are many hyperparameters to tune. One of our colleagues at Hugging Face, Costa, has made a heroic effort to reimplement the original RLHF pipeline from OpenAI from scratch, and there were crazy things, like changing one of the parameters in the Adam optimizer made a big difference to the outcome. It's a very finicky beast to get right. The other challenge is that you've got several models to deal with: the chat model you're trying to optimize and make more aligned, plus the reference model, plus the reward model, so you've got three large language models to juggle. Compute hardware is getting quite good nowadays, but if you want to do anything at very large scales, say 30 billion parameters or more, you're going to need a lot of compute to be able to juggle all of this.
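For reference, the objective described above, maximize the reward while staying close to the original chat model, is usually written roughly as follows (this is the standard formulation from the RLHF literature rather than a quote from the slides):

\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]

where r_\phi is the reward model, \pi_{\mathrm{ref}} is the frozen SFT model, and \beta controls how far the policy being optimized is allowed to drift from it.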
So this makes the whole process quite challenging, both from an engineering perspective and from a stability perspective. The idea behind direct preference optimization is to say: well, maybe we don't need reinforcement learning, and if we don't use it, what are the alternatives? The basic observation the authors made was that instead of trying to maximize this combination of a signal from a reward model with a KL penalty, maybe we can change the objective in such a way that it directly, or implicitly, models the score that a reward model would learn. So you effectively don't have a reward model any more; you just have your original language model, but through this optimization the language model learns to model those preferences directly.

To go through the equation quickly: the basic idea is that we take a prompt, and then we have binary preferences, a good and a bad answer. In this case the prompt might be "is pineapple on pizza a crime?", and the good response is "no" while the bad response is "yes". What we're going to do is compute two things. First, we compute the log probabilities that the model correctly predicts the chosen response. Here we take a ratio of two terms: the term on top is the model we're optimizing throughout the process, and underneath it is a reference model, which normalizes the probabilities so we don't drift too far from the reference, in the same way the KL penalty worked previously. Then we have another, very similar term, but now instead of the log probabilities of the desired response, we compute the log probabilities of the undesired response. If you combine these two terms, you're basically trying to maximize the difference between them, one for the chosen response and one for the rejected response, so the model gets much better at predicting the chosen response, which should hopefully align with your human preferences.

The cool thing about this objective is that it's differentiable, which means we can just use backpropagation to optimize the model. This is also why we don't need reinforcement learning: the previous objective we showed isn't differentiable. So to walk through the algorithm: you take your prompt, you have the two responses, good and bad, you feed them through two models, the reference model and the model you're optimizing, and then you do backprop, and that's basically it. In terms of PyTorch code, it's roughly ten lines. When I first saw this it was surprising that something so simple would work, but when we tested it ourselves we found that, compared to our reinforcement learning pipeline, it was much faster to get models to converge and they were much better aligned than previously.
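The slide with the code isn't reproduced in this transcript, so here is a minimal sketch of what those ten or so lines of PyTorch look like. Written out, the loss being described is

\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)

and the sketch assumes the per-sequence log probabilities of the chosen (y_w) and rejected (y_l) responses under both models have already been computed; it is an illustration rather than the exact code shown on the slide.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    # Log-ratios of the model being optimized vs. the frozen reference model.
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    # The margin between chosen and rejected; DPO pushes this up.
    logits = chosen_logratios - rejected_logratios
    losses = -F.logsigmoid(beta * logits)
    # "Implicit rewards", handy later for logging accuracies and margins.
    chosen_rewards = beta * chosen_logratios.detach()
    rejected_rewards = beta * rejected_logratios.detach()
    return losses.mean(), chosen_rewards, rejected_rewards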
Just to linger a little bit on maybe the most important idea here: there's this parameter called beta, and it's the main hyperparameter you need to tune when you're doing DPO; Ed will tell us a bit later about some other parameters we've found are important. Essentially, when we compute a gradient update, that is, when we take a step in the optimization of the model, there are effectively three terms at play. The first is a weighting factor over the loss, which measures the difference between the implicit reward of the incorrect response and that of the correct one. If your model is continually predicting the incorrect label, this weighting factor becomes large, because it's a sigmoid it goes up towards one, and that penalizes the model more heavily for getting the estimates wrong. Then you've got the other two terms, which amount to increasing the likelihood that your model predicts the chosen answer and decreasing the likelihood of the rejected answer.

There's a plot here of some training runs where you can see that, as we increase the beta parameter, we're effectively pushing the model to train in such a way that the difference between the chosen and rejected answers becomes larger and larger. This can be important because you may have quite noisy data with fuzzy labels between chosen and rejected, where the quality is a bit mixed, and by tuning this beta parameter you can control how much you want the model to focus on the chosen responses versus the rejected ones. Typically you have to run a couple of experiments at different values to find the best one, but the basic idea is that beta fundamentally controls the target margin, the difference between the chosen and rejected rewards, and by increasing it you can get quite different behavior. And if math isn't your thing, then Tom Goldstein, who is a grandmaster of memes, showed what's roughly going on in DPO: you basically do gradient descent on the good stuff and gradient ascent on the bad stuff. This is, to some extent, why DPO has been so popular: it's a very simple algorithm, and it means you can train models without having to train the reward model independently, because you're modeling everything implicitly.

Here are a couple of examples I'll walk through. One of the models we trained at Hugging Face is called Zephyr. What we were interested in is that in the DPO paper the authors had explored comparisons of DPO and PPO using fairly small models, small being around two to three billion parameters, and they had done it using academic datasets; in particular, the chat dataset they used wasn't really optimized for performance. So the idea we had was this:
we know there's a trend in the community to create synthetic datasets, essentially using ChatGPT-class models conversing with each other or rating each other's responses. If we use those datasets together with DPO, maybe we can get rid of human feedback altogether and still produce a well-performing model. So in this example, what we did is take a popular dataset for supervised fine-tuning called UltraChat and use it to train the initial SFT chat model, and then we took another dataset of synthetic feedback, where you essentially have GPT-4 rating responses from a diverse set of different models, and we used those two datasets to do DPO on the Mistral 7B base model, which had only come out a week or two before. One of the surprising things we found is that if you apply this process, you get a 7B model that is relatively competitive on various chat benchmarks like MT-Bench compared to much larger models like Llama 2 70B. The usual caveat is that these benchmarks have certain biases; for example, if you train models on GPT-4 data, then because GPT-4 is acting as the evaluator it tends to favor its own style. Nevertheless, if you actually converse with the Zephyr model it's quite capable, and this was one of the first public demonstrations that you can apply this recipe to larger models with more capable datasets.

Since then, the community has really taken this idea and pushed it much further. For example, some folks at Argilla did DPO on top of the Mixtral model when it came out; there's Jon Durbin, who creates lots of very interesting DPO datasets and used them to fine-tune the Yi-34B model; Mistral have been using DPO a lot to align their models; and Nous Research and Teknium have been doing this themselves. So DPO has essentially become the canonical alignment algorithm nowadays, at least in the open-source community. There are two main libraries you can use to do this: the one we have at Hugging Face is called TRL, or Transformer Reinforcement Learning, and there's also Axolotl, which is a very popular alternative. Ed will show you a little later how TRL works.

Just to wrap up this quick overview: generally speaking, what you see in the academic machine learning literature is that once there's a very popular idea, many people try to find ways to extend and improve it, and DPO is no exception. One of the first improvements was made by researchers at DeepMind, who developed something called IPO, or identity preference optimization. The observation they made is that when you do DPO there's a risk you end up overfitting to the chosen responses in your dataset, and if you want to prevent that you can add a regularization term, in a similar way to what you'd do with something like ridge regression. They showed in their paper that this gave better results than DPO. We at Hugging Face also ran some experiments comparing the two methods, and we see that they're roughly comparable, with a bit of a trade-off depending on the parameters.
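In code, the difference between the two losses is small; TRL exposes it as a loss_type option on its DPO trainer. Roughly, reusing the logits margin from the DPO sketch earlier, the two variants look like this (a simplified illustration of the idea rather than the library's exact implementation):

# logits = (policy_chosen_logps - reference_chosen_logps)
#        - (policy_rejected_logps - reference_rejected_logps)

# DPO ("sigmoid") loss: push the chosen/rejected margin up without bound.
losses = -F.logsigmoid(beta * logits)

# IPO loss: regress the margin towards a fixed target of 1 / (2 * beta),
# which is the regularization that discourages overfitting to the chosen responses.
losses = (logits - 1 / (2 * beta)) ** 2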
At another level, there's a lot of recent work going on around the idea of iterative DPO, or let's say online DPO. There's a nice example from Snorkel, where essentially what they did was take an existing reward model, use the prompts from UltraFeedback to generate responses from their chat model, use the reward model to rate and rank those responses, and then train another DPO model; then they generate again on new prompts, rate and rank, and repeat this a number of times, and you get a model that gets progressively better and better. What's exciting about this is that it's very close to how Llama 2 was trained, which was a sequence of progressively improved models, and as far as we understand it's also how some of the Anthropic models are trained: in an online fashion where you continually improve the model with new data. So I think this is quite an exciting direction.

One of the other more interesting branches we've seen is a new algorithm called KTO, or Kahneman-Tversky optimization, from researchers at Stanford and Contextual AI. They make the important point that if you want to collect preference data, you need a prompt and two answers, where one answer is the desired one and the other is the counterfactual, the one you don't want. If you try to do this with human feedback it's very expensive to collect, because you need annotators to interact with your chat model and pick the preferred response at every step. But if you're clever, like they were, you can define the loss in such a way that you only need to collect something that is deemed good and something that is deemed bad, and they don't have to be related to the same prompt. At Hugging Face we're quite excited about this algorithm, we think it has a lot of promise, and it's probably something you can hope to see from us soon, showing how it works at various scales. The paper itself is worth reading, they do a lot of great experiments; what seems to be missing currently is good labelled datasets to run this at scale, but it's very exciting. So maybe it's a good time to take some questions, Ed.

Yes, absolutely. Let's start with the first one: how do you evaluate the quality of open-source alignment datasets?

That's a really good question. I would say in general it requires a bit of what we would call vibes testing, so you just have to look at the data yourself. This is one of the things I always try to tell people: look at the data, because there are often strange errors in the labels. Probably the most popular method today is to use another language model to do the ratings, the most common example being GPT-4 judging the quality of responses. What we typically find is that GPT-4 is quite good on many creative tasks, or tasks that require maybe high-school-level reasoning, but if you ask it to judge the quality of responses involving mathematics or code, it can hallucinate and make errors in judgment. This isn't to say GPT-4 is bad;
it's just that, like any hard task, it would be hard even for humans to come up with the right answer. But currently this seems to be the way the community is rating quality.

Absolutely. For this question: how much data is ideal for this process, for the alignment?

Great question. What we've seen in practice is that if you're doing this multi-stage process of SFT and then DPO, you can probably get by with 10,000 to 100,000 examples of SFT training data, and then maybe around 50k examples for the DPO step. But I have seen examples lately, there's an approach called Deita, where they show they can actually use only 6,000 examples for the SFT step, which radically reduces it. So I would say roughly 10,000 examples for SFT and maybe 10 to 20 thousand for DPO would probably get you a pretty good model. The open question is that, as far as we know, the other labs, OpenAI and Anthropic, are using many, many more examples, and presumably if you do this iteratively you get much better models.

Perfect, well thank you so much, Lewis. I think we'll dive into the second section with Ed. Take it away, Ed.

Great, thanks. So this is going to be a more practical dive into how to actually run SFT and DPO on your dialogue dataset. One of the questions is: you have these datasets of conversations, how do you actually format them in order to tokenize them and feed them to a model for fine-tuning? We've got some links in this presentation to a bunch of SFT and DPO datasets, so if you're interested in getting started, that's a good starting point. Then I'll go through some metrics, mainly for the DPO step, on how to evaluate whether your DPO run is training well and how to identify problems with your training. And once you've actually trained your chatbot, you probably want to evaluate it; you might test it yourself by chatting with it, but you'll want some automated way to do that as well, so I'll suggest some ways we do this and some tools available in the community.

Before I get on to DPO, I'll first talk about supervised fine-tuning, the SFT step, because it's important to do this step before DPO. We've actually run some experiments where you skip this step and jump directly to DPO, and it's very hard to perform the DPO step; one thing we observe is that the LLM fails to learn the dialogue template.

To accompany this presentation we've created two annotated notebooks, for SFT and DPO. These have been written so that they'll run on a Google Colab GPU. Initially I was thinking we could go through this live, but these models take many hours to train, so it's better to run them offline. We don't plan to update these notebooks; we have a more up-to-date version of our examples, which is the Hugging Face alignment handbook, so check that out. It has examples where, if you're GPU poor and have a consumer GPU with around 24 GB of RAM, you can run some low-resource examples with LoRA, and if you're GPU rich there are multi-GPU, multi-node examples where we do distributed training with Accelerate and DeepSpeed. In the alignment handbook we provide configurations, hyperparameters, and scripts for running these examples. I should also mention that the notebooks we provide use LoRA, which is a technique for fine-tuning on GPUs with limited memory; I won't really get into LoRA here, but I've put some links in the slides if you want to find out more.
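The LoRA setup goes through the PEFT library; a minimal sketch of the kind of adapter configuration involved is shown below. The rank, scaling, and target modules here are typical choices for Mistral-style models rather than the exact values from the notebooks.

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                     # rank of the low-rank adapter matrices
    lora_alpha=32,            # scaling factor applied to the adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# The SFT and DPO trainers in TRL accept this via their peft_config argument,
# so only the small adapter weights are trained rather than the full 7B model.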
Of course, the first step when you want to perform SFT is to find a dataset. This is the dataset we actually used to fine-tune Zephyr: a subset of the UltraChat dataset. It contains two main keys of interest, prompts and messages. To give an example of what the prompts look like, they might be questions like "what famous landmarks should I visit in London?", "write a program in C++", or "create a YouTube tutorial on how to make a cake", and the dataset includes the responses, which might be multi-turn or single-turn. If you're interested in datasets, we share a repository of maybe 30 SFT datasets, so go and have a look, read their dataset cards, and check whether they're suitable for the task you're interested in; that should be a good place to get started if you want to build a chatbot.

Once you've got the messages in this dataset, you need to format them in a way you can give to the model to perform your fine-tuning, and to do this you use something called a chat template. There are a number of different chat templates. The idea of a chat template is that you have a sequence of messages, in this case a sequence of math questions that have been asked to chatbots, and you need to format them so you can feed them to the model. You delineate the different sections of these messages with different tokens: for example, you have a system prompt, shown in green here, then you mark that this part of the message is from the user, and in blue that this part is from the assistant, and so on. The system prompt at the top here is just blank in this setting, but in some settings you might want to give your chatbot a personality or some context for how it should behave. We have some examples on the Hub of chatbots that behave like pirates, for example, or you might want a math-focused chatbot and tell it "you're a math teacher, behave in a math-teacherly way". So you can give your chatbot some personality, and your datasets might actually be annotated with these system prompts, which can sometimes be useful when training the model, to give it a bit more context about why the responses behave in a certain way.

As I mentioned, there are a number of templates; if you had to choose one, we recommend ChatML, which in our opinion is the most suitable. Applying a chat template is very simple in the Transformers library. Recently, maybe two months ago, we added these custom chat templates to tokenizers, and what you can do is define a Jinja string that will template your messages. I don't expect you to read all of it.
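To make the delineation concrete, here is roughly what a short conversation looks like once the ChatML template has been applied; the content is invented for illustration, and the exact special tokens come from the tokenizer you use.

# A conversation as a list of messages...
messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Is adding pineapple to pizza a crime?"},
    {"role": "assistant", "content": "No, it's a matter of personal taste."},
]
# ...and roughly how the ChatML template renders it into a single training string:
# <|im_start|>system
# You are a friendly chatbot.<|im_end|>
# <|im_start|>user
# Is adding pineapple to pizza a crime?<|im_end|>
# <|im_start|>assistant
# No, it's a matter of personal taste.<|im_end|>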
It's then very easy: you just call tokenizer.apply_chat_template on your messages, and it will format them in the way I showed on the previous slide, with the user and assistant delineations. Once your data is formatted in the appropriate chat template, it's very easy with TRL: you load the SFT trainer, load, in the case of our notebooks, the Mistral 7B model, provide some training arguments like the learning rate, number of epochs, batch sizes, etc., pass all of this to the SFT trainer, and call trainer.train. This will take a couple of hours to train your chatbot on this particular dataset. In the alignment handbook we also provide a method to combine different datasets, so if you find a number of datasets that are of interest, you can use those examples to combine them in a suitable way and train on many different varieties of data.

So now you've got your initial SFT chatbot, and you want to perform some sort of alignment to align it to the preferences you might have for your particular product. In terms of the pipeline, the steps are pretty much the same: you load a dataset, but in this case there are some additional keys; in addition to the messages and the prompts, you have chosen and rejected, which are your human-chosen and human-rejected examples. Again, we link to a number of these datasets, I think maybe 15 preference datasets on the Hub, so check that link out if you want a starting point. These datasets contain a prompt, a chosen response, and a rejected response, and looking back at the loss Lewis was describing earlier, the y_w and the y_l are exactly these two pools being extracted from your dataset.

Not all datasets on the Hub actually have this binarized chosen/rejected format; sometimes you have many responses, in this case four responses to a prompt, and the current DPO implementation in TRL only supports binarized preferences. Often these responses have ratings attached, so the preference for one response is higher than another, and what we did in the case of Zephyr is choose the highest-ranked response and then randomly select one of the others. We felt this gives you a more diverse range of data than always selecting the top two: you get examples where the margin between chosen and rejected is larger and smaller, which provides a more diverse range of training data for the model. You can of course have multi-turn examples, and many datasets have them; typically it's the last response that differs, so it's either chosen or rejected. There are other datasets, for example the OpenAssistant dataset, which is a big tree of conversations, so you can have many different combinations of chosen and rejected there; there are ways of processing this for DPO, but that's a bit beyond this talk. In terms of applying the chat template, these two sets of conversations are treated as independent: you tokenize them independently, you do your forward pass through the model, and only when you calculate your loss do you actually need to be aware that the conversations are linked together as chosen and rejected.
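To make the binarized format concrete, a single preference record ends up looking roughly like this. The contents are invented for illustration, and the exact column layout varies between datasets, so check the dataset card of whichever one you use.

preference_example = {
    "prompt": "Is adding pineapple to pizza a crime?",
    # The conversation ending in the preferred assistant reply...
    "chosen": [
        {"role": "user", "content": "Is adding pineapple to pizza a crime?"},
        {"role": "assistant", "content": "No, it's a matter of personal taste."},
    ],
    # ...and the same conversation ending in the dispreferred reply.
    "rejected": [
        {"role": "user", "content": "Is adding pineapple to pizza a crime?"},
        {"role": "assistant", "content": "Yes, it's a crime under the Geneva Convention."},
    ],
}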
Applying the chat template is very similar to the SFT stage: you just apply it to your chosen messages, your rejected messages, and your prompt, and the DPO trainer will deal with the intricacies of the forward pass for the chosen and the rejected conversations. When you want to run your DPO, you load the DPO trainer from TRL, provide the model ID of your SFT model, and again provide a number of training arguments: the learning rate, gradient accumulation steps, number of epochs, batch size, etc. There are two additional parameters you provide to the DPO trainer. One is this beta parameter, which roughly speaking controls how much you can deviate from the base model: if beta is very high, you penalize deviating from the base model, whereas if beta is very low you don't penalize it as much. We recently published a blog post where we compare different beta values and also different loss types; in the DPO trainer we've implemented the DPO loss, the KTO loss, and IPO, and there's a newer version of KTO that will soon be released which can work in an unpaired preference setting. So check out that blog post; it has examples of these different algorithms with different beta values and their performance, but as Lewis mentioned, if you choose the right hyperparameters they're comparable in most cases.

I just wanted to highlight some tips for when you go and run these algorithms, in terms of the things you should be looking for. One is beta itself: we typically test from 0.01 to 1, and we've found that quite small values seem to perform well, so if you're going to do a scan, it's probably better not to use a uniform range but to focus a bit more on the lower values. In general we find that the learning rate is much, much smaller than for the SFT step; 5e-7 is what we used, I believe, for the Zephyr model, so a very small learning rate seems to be the most appropriate. Typically, when you're doing pre-training or SFT you try to have the biggest batch size that will fit in your GPU memory, but what we found with DPO is that there's some trade-off between the global batch size (the batch size across all GPUs) and the number of epochs, and it's something you need to take into account; it's not as trivial as just maximizing your batch size to fill your GPU. The optimizer is an interesting question as well: in the DPO paper they were using RMSprop, whereas we found that Adam actually seems to perform better. For the scheduler, we found cosine is better than linear. And a very interesting thing is that you can train an SFT model, evaluate it on one of these automated evaluation benchmarks which I'll mention in a few slides, take the best SFT model and then perform DPO on the dataset, and you don't always get the best DPO model from the best SFT model; sometimes a worse-performing SFT model will produce a better DPO model on the same dataset. So it really makes sense to go through your whole pipeline, SFT plus DPO, with your different hyperparameters to really be sure you're getting the best results. This was a really surprising observation when we saw it.
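Pulling those pieces together, a DPO run with TRL looks roughly like the sketch below. The argument names match the TRL releases from around the time of this talk, where beta was passed directly to the trainer; in newer releases these options live in a DPOConfig instead, and depending on the version you may also need to apply the chat template yourself so the dataset has plain-text prompt, chosen, and rejected columns. Treat it as illustrative rather than copy-paste ready.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "my-org/my-sft-model"  # hypothetical name: point this at your own SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The preference dataset used for Zephyr; swap in your own preference data as needed.
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

args = TrainingArguments(
    output_dir="dpo-model",
    learning_rate=5e-7,               # much lower than for SFT, as discussed above
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    beta=0.1,                         # the key DPO hyperparameter
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
# During training the trainer logs rewards/chosen, rewards/rejected,
# rewards/accuracies, and rewards/margins, which are the metrics discussed next.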
Just to come back to LoRA: we found that some experiments with LoRA appear to regularize the model during training compared to a full fine-tune. What do I mean by that? I've got an example plot of a model we were training, and I apologize that it's a bit pixelated, but I'll briefly introduce it. On the right is the DPO loss during training over three epochs, on, I think, the UltraFeedback dataset. In green is the LoRA model we trained, and in pink is the full fine-tune with the same hyperparameters, just without LoRA. What's interesting is that after the first epoch, the pink loss, at least the training part of the loss, just shoots down to zero, whereas the evaluation loss peaks up, so clearly we're overfitting to the training dataset. The LoRA model, on the other hand, has fewer free parameters, so it's regularized in a way: because it has fewer free parameters you can't overfit as much to the training dataset, and the evaluation loss continues to decrease, or at least plateaus. We see the same thing on the left, where I show the accuracy of DPO: because your model can be used to predict a reward for a chosen and a rejected example, you can calculate an accuracy over the examples in the batch, and we see the same effect, that the full fine-tune in pink overfits very quickly after the first epoch, whereas the LoRA-regularized run doesn't have this problem. This actually translates into better performance on MT-Bench, which we thought was really interesting; it's not just these plots that demonstrate it.

Talking a bit about the metrics you look at during DPO training: one is the accuracy. As I mentioned, you can predict the reward for your chosen and rejected examples in your batches, then look at the number of times the reward for the chosen is greater than the reward for the rejected and calculate an accuracy. You see here that for a good, normal training run the accuracy goes to something like 82% on the evaluation set, which is reasonable and is what we see across many different training cases. You can also look at the rewards in your batch during training. The rewards themselves are interesting, because the rewards for your chosen examples should be higher than the rewards for your rejected ones, which is the bottom-right plot, but what you're really interested in is the margin, the difference between the two. The margin should be increasing during training; it should show that the chosen examples have higher rewards than the rejected ones in a batch, and that's what we see here: the blue is the evaluation dataset, and these margins go up during training.

Of course, not everything will go well during your DPO training. You'll have some runs that don't work, and it's important to look at the reasons why. In this case, it's not entirely obvious from these metrics what is going wrong. You see that your accuracy goes up, plateaus, and then decreases, which is a bit unusual; the chosen rewards plateau and go down; the rejected rewards plateau and go down, which might be OK, since they're rejected and should be going down; and the margins
are above zero, but only just, and they're quite noisy. So there's clearly something wrong with this run, or there appears to be, but we weren't sure what, and it's really important to just go back and look at the loss. Here it's really obvious: the loss is super high and really spiky, and it turns out that in this case the learning rate was just far too high. It was probably related to how we used the scheduler: we trained here for three epochs, and maybe our initial learning rate was too high; if it had been one epoch it would have decayed quickly enough to resolve the problem, but because we were training for three, it stayed high for a long time, so you overstep when you're optimizing and get these spikes. If you see issues like this, just lower the learning rate; it's a classic issue in supervised learning as well.

So, you've trained your chatbot, you've found some hyperparameters that work, your loss goes down, etc., and now you want to actually evaluate it in a chat setting. There are a number of ways to do this, and the two we look at internally are MT-Bench and AlpacaEval. MT-Bench is a multi-turn dialogue benchmark which tests things like reasoning, math, summarization, creativity, and a bunch of other categories, and it gives a score between 0 and 10. The state-of-the-art models like GPT-4 are pretty good on this benchmark; they get around 9.4, so they're extremely good. I should note that internally the MT-Bench evaluation tool uses GPT-4 to judge how good a model is, and we've actually seen some examples, particularly in code and math, where the model you're evaluating provides the right answer and GPT-4 says it's wrong. So we're getting to the point now, in some cases, where models are actually producing better results than GPT-4, which is quite interesting; it's not the case on average, but there are some examples where we see this. Here you see a bunch of models plotted with their MT-Bench score.

The second benchmark we look at is AlpacaEval, which basically rates your model against GPT-4 Turbo. The best score would go to GPT-4 Turbo itself, which would get 50, because it's just doing as well as itself, and we're seeing the win rate for the Qwen models now at around 27, which is pretty good. This plot is a bit strange because the y-axis is nonlinear, like a log scale, so things look closer than they actually are; it's only just over halfway there, but it's certainly getting there, and the speed at which these models are approaching the quality of GPT-4 is quite astounding. I think we released Zephyr around September of last year, and we were down at something like 7.3, or 10 to 15 percent on the win rate, and now we're far, far closer to the GPT-4 level, so it's really great that the community is building these models.

There are a few other benchmarks I should mention. Obviously, the best evaluation you can do is human evaluation: if you can afford it, once you've got your final model and you're sure internally that it's good, you should go and have humans actually evaluate it
and make sure you're happy with it, both in terms of alignment and the quality of the model's responses. There are also a few other benchmarks. There's the Open LLM Leaderboard; you can use it to evaluate a chatbot, and there certainly are chatbots up there, but it's not really a chatbot-focused leaderboard. There are a lot of questions about leakage now, where people, whether they mean to or not, use a dataset which contains some questions from the leaderboard, and there are also questions about overfitting to the leaderboard, so I think this tool will become less and less useful with time, but it's certainly relevant for the moment. As I mentioned, there are links here to MT-Bench and AlpacaEval and how to actually use these evaluation tools. There's LlamaIndex, which is more about retrieval, so if you're interested in retrieval it's a very good tool for evaluating models. And there's the Chatbot Arena by LMSYS; I think if you want a model evaluated on it you have to ask LMSYS to add it to their pool of models, but it's certainly a good tool for evaluating your model. There's one other plot, and I link to a tweet from its authors at the top, which is a really useful way to compare different benchmarks: it shows the correlation between them. For example, the top-left row is human evaluation, and you see that MT-Bench has a 0.9 correlation with human evaluation, so it's quite a good proxy for it, and there are a bunch of other metrics and tools in that plot. I think that's it for me; I'd be happy to answer any questions.

Amazing, OK, let's dive in, there are definitely quite a few questions to get into. First: how have you used DPO, or seen successful applications of DPO, to optimize metrics in image generation with diffusion models, such as LPIPS or ArcFace loss?

This isn't really my domain of expertise, but I know in TRL we have a DDPO trainer for aligning diffusion models. Maybe you can say a few more words on that, Lewis?

I know that one of the authors of DPO, Rafael Rafailov, published a paper just recently using DPO for really good image generation, and I believe that implementation is currently in the works, so there should be code and models coming this year.

OK, perfect. How do you know if you should create your own datasets for DPO alignment versus using open-source ones?

My opinion is that it really depends on your product or your application. You might have a very specific application, for example a call centre; I could see that being a really useful place for a chatbot, either to help the person on the call or to work with a text-to-speech model that's talking to the client, and in that case you might want a specific alignment dataset for that use case. But it obviously depends on your domain and application. There are a bunch of datasets out there, there will be more and more coming in the future, and we're also seeing a lot of synthetic datasets being generated,
where researchers use GPT-4 or other models like Gemini to create synthetic responses to prompts, and you might be able to use those for your application as well, so you might not need to go and ask humans to help build the dataset.

Perfect. Are there any open implementations of self-learning reward LLMs, like the Meta and NYU research on self-rewarding language models?

Yes, we read that paper; it's really interesting, and we're doing some work internally around it, so as soon as we have something we'll either add it to the handbook or to TRL. Super interesting.

OK, I think the last question for now: any thoughts on how scaling model size will impact the DPO-versus-RL debate? As noted in the paper, they only did DPO up to 7 billion parameters.

This is a good question, and we were wondering this ourselves, and then there was the Tulu model, which was released shortly after, and that was a 70-billion-parameter model, so about ten times larger; I think it's a Llama model, so it's large, and it seems to work at scale. There was a lot of question about whether this stuff works at scale, and I would say it's much easier to get DPO working at scale than RL.

Maybe just to add a point there: there are valid applications where using reinforcement learning makes a lot of sense. The most obvious one is if you want some in-the-loop feedback on the thing you're optimizing. A good example is that if I want to train a really good code assistant, I want the assistant to be able to debug code in a very similar way to how ChatGPT does, and that debugging can be done by having an environment, a code environment, where you provide prompts, it generates code, it actually runs the code on an interpreter, and then gets feedback from the stack trace. That's a great example where reinforcement learning can really help the model get better at producing correct answers, code that will compile and run, and that would be hard to do with DPO, because DPO is typically offline: you generate all your preference data in one go and then you train on it.

Perfect, great. Well, I think that's all the time we have today for questions. Thank you so much, Lewis and Ed, this has been a great session; I know our community has really enjoyed it. We will send out the slides and the notebooks after the event; they should be in the YouTube video description. We'll see you next time. Thanks, everyone. Bye.
Info
Channel: DeepLearningAI
Views: 14,182
Id: QXVCqtAZAn4
Length: 58min 6sec (3486 seconds)
Published: Thu Feb 08 2024