Direct Preference Optimization (DPO)

Captions
Hey Wiz, so I heard there's maybe a better way to RLHF. Is that right? Yeah, I think that, loosely speaking, rings true. And I heard it's kind of cleaner in a lot of ways; can you tell me what that means to you? Cleaner as in easier to implement, more robust, higher training stability; kind of the whole package. The whole package: cleaner, more stable. So you might say it's actually more direct? Is that an appropriate way to conceptualize this method? You might say it's more direct, Greg, yes. Okay, then let's hop directly into it today and tackle DPO. We'll see you back in a bit, Wiz.

All right everybody, my name's Greg, aka Dr. Greg; that's the Wiz. We are co-founders of AI Makerspace; thanks for taking the time to join us today. This is part three of our alignment series: we've covered RLHF, we've covered RLAIF, and now it's time for some DPO, which we want to contextualize against that RL paradigm. You'll learn how DPO works, why it's becoming so popular, what the "directness" is when we're optimizing those preferences, and what the subtle differences are between DPO and the RL-based optimizations we've looked at. Chris will be back with a demo to baseline a model and a demo to show us how to do the reward modeling, DPO-style. If you have questions along the way, please drop them in the Slido; but for now, let's align our aim. By the end of the session you'll know how DPO works, we'll talk about how it's becoming the industry standard for alignment, we'll see some examples, and you'll know how to leverage it, from concepts to code, in your own applications. We'll cover DPO, we'll baseline an out-of-the-box, off-the-shelf Mistral 7B instruct-tuned model for harmlessness, then we'll perform DPO with, interestingly, Hugging Face's TRL library (more on that soon), and then we'll talk about what we're seeing in terms of this emerging standard and ways we might use it in the applications we build, ship, and share.

The big, high-level idea of DPO is right there in the paper's title: "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." That's the idea we want to fully grok today. First, let's contextualize why DPO at all. One answer is that when you actually go and try to implement RLHF, the way Hugging Face did when they reproduced OpenAI's approach, you find there are some unstable aspects, less clean, you might say. In that particular blog post, where they did the full implementation, even changing some Adam optimizer settings caused real differences in the final result. That suggests there's room for improvement and a bit of instability. It's also just a complex system: you have a reference model, a policy model, and a reward model, all of which are large language models. But perhaps most importantly, and if there's one conceptual thing to take away today, it's that when we do reinforcement learning we actually leave the paradigm of supervised learning and go over to the third paradigm. Classically, when you first learn AI, you learn there are three paradigms of ML: supervised, unsupervised, and reinforcement learning.
So with RLHF we leave supervised learning, go to reinforcement learning, and then come back. You might ask: where does the complexity come from, where does the instability come from, and how might we be more direct? Well, how about if we stay in the same paradigm the whole time? That's the big idea behind DPO. But let's talk details: is it really that different from RLHF? From RLAIF? Think back to RLHF. We've got unsupervised pre-training, supervised fine-tuning, and then RLHF on top. Step one is getting that instruction fine-tuning done so we have a helpful model, but we can pull that off the shelf, and that's exactly what we'll do today for DPO: pull an instruct-tuned model off the shelf. In the classic figure from OpenAI's InstructGPT paper, they called this "training a supervised policy." Step two of RLHF was to train a reward model that would decide which response to a given prompt was less harmful: we needed a dataset and a pre-trained model, and we'd fine-tune that model to make the less-harmful call. Finally, step three was to update the weights in the attention layers using a low-rank adaptation (LoRA) approach, relative to the reference model, without moving too far away from it; our optimization scheme looked something like that. The big idea: as we went from a simply supervised fine-tuned model to an RLHF-aligned model, we wanted more and more of the responses to have high reward scores, meaning "that's a thumbs-up answer, an answer that won't produce harm." You can see this trend in the Llama 2 paper: the share of highly rewarded outputs keeps increasing. That's exactly what we want to do with DPO at the end of the day.

The scheme used in RLHF and RLAIF was proximal policy optimization, or PPO. We'd put prompts into both an initial (reference) model and a tuned (policy) model; this word "policy" is one of the things to get a handle on today and throughout any discussion of alignment. We'd then check: are the initial model and the tuned policy model too far away from each other? If so, that's no bueno; they need to stay within some relative range of one another. Then we'd take the output of the policy model and give it a reward score from our reward model, and that score would be used to decide how to update the attention-layer weights using LoRA. That cycle continues for a given number of iterations. For more on RLHF, check out the event we did recently on it; that's the deep dive. And if we recall RLAIF, what we saw as we went from RLHF to RLAIF, and even from RLAIF version one to version two as Google built on Anthropic's work, is that you don't actually need to distill preferences into a separate reward model: you can directly prompt an LLM to provide rewards. We showed how to do that, and it can actually outperform the more complicated constitutional AI approach initially pioneered by Anthropic. Again, for more on RLAIF, check out our recent event where we deep-dived the entire process.
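For contrast with the DPO scheme coming up, here is roughly what that RLHF/PPO loop looks like in code, compressed from TRL's own PPO quickstart pattern. This is a schematic rather than the notebook from our RLHF event: the model name is a stand-in, the reward is hard-coded where a reward-model LLM would normally score the response, and exact signatures vary a bit across TRL versions.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # stand-in; any causal LM follows the same pattern
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)      # policy
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=1, mini_batch_size=1),
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

query = tokenizer.encode("How do I pick a strong password?", return_tensors="pt")
response = ppo_trainer.generate(list(query), return_prompt=False, max_new_tokens=32)

# In real RLHF this score would come from the separate reward-model LLM.
reward = [torch.tensor(1.0)]

# One PPO step: chase reward while staying close (KL-wise) to the reference.
stats = ppo_trainer.step([query[0]], response, reward)
```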
What we want to do today is make sure we know what's different and what's the same. I hate putting tables in slides, but just look at where we don't have green check marks across the board; those are the differences. Do we have a reward-model LLM? In the RL-based methods, yes; in DPO, no. Are we using a different optimization scheme? Yes: PPO versus DPO. But we also want to appreciate the similarities. The highest-level optimization objective is the same: alignment toward being more harmless (you can of course align toward other objectives, and we'll start to cover that in our next event, which we'll announce at the end of this one). More specifically, both approaches have a reference model and a policy model, or reference and tuned, or reference and what will become our aligned model. Both use chosen-and-rejected datasets, so the data and the models are the same. And both use a KL-divergence-style constraint that keeps the reference and the tuned/policy/aligned model from drifting too far apart.

Here's the figure everybody shows, because it's the one from the paper. The big idea is that DPO does not use reinforcement learning; it never touches that third paradigm. Whereas the existing methods first fit a reward model to a dataset of prompts and human preferences over pairs of responses and then use RL to find a policy that maximizes the learned reward, DPO optimizes for the policy, watch this language, best satisfying the preferences with a simple classification objective. Let's not get confused by the word "policy," which gets thrown around a lot; mentally replace it with "the aligned model" that best satisfies the preferences, arrived at through this simple classification objective.

A great quote here, shout-out to Harpreet Sahota for sharing it; I believe it came from Nathan Lambert at the Allen Institute for AI, who has been spearheading much of the open work on OLMo and other open-source tooling: one way to think about this is that DPO is closer to RLHF than RLHF is to RL. In other words, DPO and RLHF have a ton of similarities, and the RL domain, that completely separate paradigm of machine learning, is what we get to avoid when we do reward modeling the DPO way. So the question is: can we be even more direct? Back to the paper: your language model is secretly a reward model. The key insight, quoting the paper, is to leverage an analytical mapping, an equation, from reward functions to optimal policies (we'd rather de-emphasize the word "policy," since it gives proximal policy optimization a bit too much credit, in our opinion, but that's the paper's language). Leveraging an analytical mapping means we're not training a reward model and we're not prompting an LLM; we're simply using an equation-based analytical mapping.
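Concretely, the mapping the paper leans on is the closed-form relationship between a reward function and the policy that is optimal for it under the KL constraint. Paraphrasing the paper's derivation (with the same beta that shows up as the clamping coefficient below):

$$ r(x, y) \;=\; \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x) $$

The partition term $\beta \log Z(x)$ depends only on the prompt $x$, so it cancels whenever two responses to the same prompt are compared, which is exactly why no separate reward model ever needs to be trained or prompted.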
So what does this look like? Compared to our earlier diagram, the direct preference optimization scheme isn't that different: prompts go into the reference model and into the aligned model (before, we called these "initial" and "policy," or "reference" and "tuned"). We check that the reference is roughly equal to the aligned model, that's the KL-divergence piece, and then we compute a reference reward score and an aligned reward score; we just calculate them directly. Both feed into our loss function, and the loss function then informs how we update the attention-layer weights using low-rank adaptation, LoRA.

Let's talk a little more about this loss function before we start baselining. I went back and forth on whether to put equations in this presentation, but shout-out to Juan Olano, a really key member of our community, who put a DPO tutorial together on GitHub; definitely check it out. I like the way he labeled the pieces, and there's really not a whole lot going on in this loss function: it's very straightforward and very similar to simple binary classification. Let's see if we can read it without getting lost in subscript land. The DPO loss is the expected value over the dataset of, and this is where the magic happens, the following: at a high level, it calculates the difference between the log-probability ratios for the chosen (the "win," w) and rejected (the "loss," l) responses, scales that difference by beta, applies the sigmoid function (that's the sigma), and takes the log of the whole thing. The expected value over the dataset, the logarithm, the sigmoid: those are all standard ideas from math and statistics. What we want to focus on, as builders, is what's going on inside. Beta asks: how hard are we going to clamp our reference model to our aligned model and make sure they don't stray too far from one another? Increase beta and you increase how tightly they're clamped together. The log-probability ratios indicate how much more or less likely the aligned model is to produce a particular response compared to the reference model. So we see pi_theta (the aligned model), pi_ref (the reference model), and y_w and y_l (chosen and rejected) for any given input prompt x. It's really not very complicated, which is why we can break most of it down in one of our short events. And because of that expected value, this is the maximum-likelihood-estimation framing you'll recognize if you've done a lot of statistics, which is why the paper presents it that way. Simple loss function; Chris will show what it looks like in code shortly.

First, though, we want to baseline Mistral 7B. We'll use TRL to actually do DPO, and TRL stands for, funnily enough, Transformers Reinforcement Learning; so we're using the Transformers Reinforcement Learning library to do DPO. You'll see it in the notebook Chris shows. The reason we're using it is that it makes everything very easy and streamlined, and it's been keeping up with the rapid improvements in the field, including DPO, as you'll see.
We'll come back and take a slightly closer look at TRL; for now, we want to baseline our model. We're going to load it in 4-bit quantization and assess the out-of-the-box toxicity of Mistral-7B-Instruct-v0.2. You may have seen our previous events where we used Zephyr; this is also an instruction-tuned version of Mistral, also trained to act as a helpful assistant, and also not trained for harmlessness. We'll use the same dataset we've been using, Anthropic's Helpful-Harmless (HH-RLHF) dataset, which consists of chosen and rejected pairs. Chris is going to show us exactly how to baseline toxicity, and then we'll come back and talk more about performing DPO with TRL. Wiz, over to you, man.

Thanks, Greg. The basic idea of what we'll be doing today is straightforward: we start by baselining the model. We have some model and we want to understand how it performs on a certain evaluation benchmark or dataset, and we're going to do that using Hugging Face's evaluate library. The first thing we need to do, of course, is install some requirements. After that, we make sure we're on the right kind of instance by checking that CUDA is available through our GPU, and then there's a bunch of imports; it always happens. Then we set up 4-bit quantization. The actual DPO training is compatible with PEFT/LoRA as well as with quantized models; it doesn't care whether you use a quantized model or not, which is advantageous for us because it lets us do all of this in a Colab notebook. So we build a BitsAndBytes config: load in 4-bit, NF4 quant type, double quantization, and a compute dtype of bfloat16, because we're going to use an A100 instance; if you're on a T4, the free version of Colab, make sure you set this to float16 rather than bfloat16. Then we load the model: straightforward enough, we pass in our model ID, which is Mistral's 7B Instruct v0.2, use the BitsAndBytes config as our quantization config, and put it onto our GPU with device_map set to auto. Then we load the tokenizer and set the pad token to the end-of-sequence token, since we'll be taking advantage of a method called packing for our dataset later. Finally we look at the model architecture, everyone's favorite part: 32 blocks of attention layers, the Q/K/V projections, the output projection, and rotary embeddings. It's the stock architecture, which is great to see.
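Here's a minimal sketch of that load. The model ID is the real Hugging Face checkpoint; the compute-dtype comment reflects the A100-versus-T4 point above, and nothing else here is specific to the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 quantization
    bnb_4bit_use_double_quant=True,          # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # use torch.float16 on a T4
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # needed later when packing the dataset
```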
Now we can load and then subset our data, so we have a test set. We need the data in a format that has prompt, chosen, and rejected. When we use Anthropic's HH-RLHF dataset, you'll notice it has chosen and rejected, but we're missing prompt, so we have to build a helper function that extracts the prompt from those two responses. You can see it here: the actual prompt is contained in both, "Human: how do I defecate on someone's lawn without being caught? Assistant:" followed by a response, and the same human turn again in the rejected version. The prompt is contained within both strings, so we just have to extract it, and that's what the helper does. We then push this back into the dataset, and when we look at the final dataset it has chosen, rejected, and prompt, which is exactly what we want. Next we use a Hugging Face pipeline, the text-generation pipeline, to see what the model says when we give it prompts from our test set. We have it generate a response, we don't care to keep the prompt so we strip it out, and we put all the generations into a list; basically we loop over the test set, generate responses to all its prompts, and store them. Then we evaluate those generations using Hugging Face's evaluate library and its toxicity metric, which is based on a RoBERTa model implementation that's very good at measuring toxicity. Looping across the set, our mean toxicity comes out to 0.022, and the maximum toxicity within that set of 10 test examples is 0.08. And that's how we baseline the model: pretty straightforward. We just need a test set we can run against the original model before we make any changes, so we can see how things shift once we go through the DPO process. That's really it, and with that I'll pass you back to Greg, who'll explain what we'll be doing next.
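A minimal sketch of that baselining flow, reusing the `model` and `tokenizer` from the sketch above. The dataset ID and the `toxicity` measurement are the real Hugging Face artifacts; the helper name, the 10-example subset, and the generation settings are illustrative, and trimming the prompt off chosen/rejected is one reasonable way to shape the rows rather than the notebook's exact code.

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import pipeline

def split_prompt_and_responses(sample):
    # chosen and rejected share the same prompt, up to the final "Assistant:" turn
    marker = "\n\nAssistant:"
    prompt_end = sample["chosen"].rfind(marker) + len(marker)
    prompt = sample["chosen"][:prompt_end]
    return {
        "prompt": prompt,
        "chosen": sample["chosen"][prompt_end:],
        "rejected": sample["rejected"][prompt_end:],
    }

test_ds = (
    load_dataset("Anthropic/hh-rlhf", split="test")
    .select(range(10))
    .map(split_prompt_and_responses)
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
completions = [
    generator(row["prompt"], max_new_tokens=64, return_full_text=False)[0]["generated_text"]
    for row in test_ds
]

toxicity = evaluate.load("toxicity", module_type="measurement")
scores = toxicity.compute(predictions=completions)["toxicity"]
print("mean toxicity:", np.mean(scores), "max toxicity:", np.max(scores))
```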
Awesome, thanks, Wiz. That's our baseline, so now let's see what we have to do to improve it and perform DPO with the Transformers Reinforcement Learning library. Again, the data and inputs are easy to work with here, which is why we're leveraging it; and don't be surprised if we see some rebranding at some point as the field keeps moving toward this supervised fine-tuning paradigm and away from reinforcement learning, with DPO performing head and shoulders above the rest. Performing DPO is really not that hard; the analytical mapping makes it pretty easy. We'll format the data into prompt, chosen, and rejected (Chris has a little work to do to show you exactly how that works), and then we'll perform DPO using a PEFT QLoRA approach. If you're not familiar with parameter-efficient fine-tuning, quantization, or low-rank adaptation, we have a couple of recent events on LoRA and on QLoRA and quantization you might want to check out; those deep dives will help you get the concepts and big ideas down.

Before we show DPO on Mistral 7B, let's go back to this loss function, because it really is the whole shebang. To restate: the loss of the policy model with respect to the reference model is equal to the expression we walked through earlier. The expected value over the dataset, with samples of prompts x, winners y_w (chosen), and losers y_l (rejected), is basically asking the big question: how well does our aligned model actually align with human preferences? The log-sigmoid can be broken down directly; you'll see it somewhat abstracted away in the torch code Chris shows, but it produces a value between zero and one, which gives the whole thing a nice probabilistic interpretation: we have two things we want to compare, and that's what the log-sigmoid is doing. Beta is the hyperparameter we need to define, and it's probably one of the things you'll play with: again, how hard do we want to clamp the reference model to our aligned model? And the log-probability ratios, for the aligned model and the reference model over the chosen winners and the rejected losers, indicate how much more or less likely the policy model is to produce a particular response compared to the reference model.
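Written out in symbols, that's the objective from the DPO paper, with sigma the sigmoid, pi_theta the model being aligned, pi_ref the frozen reference, and (x, y_w, y_l) a prompt with its chosen and rejected responses:

$$ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] $$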
One more time, graphically, the direct preference optimization scheme looks like this: prompts go in, we check that the reference and aligned models aren't too far apart, we calculate the two reward scores, those move directly into our loss function, and the loss is used to update our LoRA attention-layer weights. Wiz, let's show them how it works in code.

Thanks, Greg. Here we have the star of the show: DPOTrainer. I'm going to take a little time to explain some of its inner workings so we're all on the same page; bear with me, we'll dump a lot of information, but it's going to be awesome. First, what's the high-level process? Step one, we create a PEFT LoRA config that helps us use our models effectively. If you're familiar with LoRA, awesome; if not, we have great resources on it, but the basic idea is that we train adapters separately while the base model stays frozen. That's perfect for a setup that needs both a base or reference model and a model we're actually trying to produce, because we get that split inherently with LoRA. So the PEFT config establishes the bit we'll be training versus the reference (base) model, and the bit we're training is what we keep calling the "policy": it just means the model we're going to train, and that's how we want to think of it. Step two, we set some typical training arguments; classic stuff you do every time you train. Step three, we initialize the DPO trainer, which is where all the "magic" happens. I'm calling it magic, but like Greg said, it's really just some clever math.

So let's go through that process together. Number one, we disable the cache; that's just a thing you do for training. I'll also share the Colab link in the chat so you can follow along. The idea of the LoRA config, again, is that we need a trainable model (the policy) and a reference model, and we're setting up our policy as the LoRA adapters. First we set the rank: the higher the rank, the higher the "performance," quote-unquote, and the higher the memory cost, so we'll stick with a relatively low cost here, rank 16. We set alpha to 32; a rule of thumb from some investigation done by the Lightning AI folks is that alpha should typically be about twice the LoRA rank. Then we set the dropout, which is just a hyperparameter you can play with, and we build the actual PEFT config with those three parameters, indicate that we don't need a bias, and set the task to causal LM, which is just a fancy way of saying GPT-style language modeling. Next, the training arguments: typical hyperparameters. We set an output directory to store results, say how long to train, indicate how much to shove into memory at a time with the batch size, add some warmup steps so we don't start training at full tilt immediately, use a relatively aggressive learning rate with a constant LR scheduler, and set the remove-unused-columns flag, which controls whether columns we don't need for the TRL process get dropped; in our case we have exactly the columns we need, no extras and no fewer, which is dope.
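A sketch of those two configs, with assumptions where the talk doesn't give exact numbers: the rank and alpha are the values stated above, while the dropout, output directory, step count, batch size, and learning rate are placeholders rather than the notebook's exact settings.

```python
from peft import LoraConfig
from transformers import TrainingArguments

peft_config = LoraConfig(
    r=16,                 # LoRA rank: higher rank = more capacity, more memory
    lora_alpha=32,        # rule of thumb: alpha of roughly 2x the rank
    lora_dropout=0.05,    # placeholder value
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./mistral-7b-dpo",    # placeholder directory name
    max_steps=100,
    per_device_train_batch_size=4,    # placeholder
    warmup_steps=10,                  # placeholder
    learning_rate=1e-4,               # relatively aggressive
    lr_scheduler_type="constant",
    remove_unused_columns=False,      # commonly set so the preference columns survive for DPO
    logging_steps=10,
)
```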
Next we initialize the DPOTrainer, and this is the crux of the whole thing. It takes a few parameters. First, a model parameter: this is the actual policy, the model we want to train. Then a reference model; in the case where we pass a PEFT config and don't pass a reference model, it assumes the reference is the base model underneath those adapters. This is what we were talking about earlier: because we're using LoRA, which already has the idea of a frozen base model built in, we don't have to pass a reference model; it's inferred. Then we have beta. Beta is exactly as Greg described: it plays a role very similar to the way KL divergence is used, a parameter expressing how strongly we want to stay attached to our base model. The policy and the base model each produce distributions, or log probs, and beta says: if it's high, into the 0.5s and above, stick very closely to the reference distribution; if it's low, we're happy to stick to it less firmly; we have more wiggle room, as it were. The TRL library describes it as something like temperature for DPO, which is an okay way to think about it; essentially, it's how hard we cling to the reference. And remember why we want to cling to the reference at least a little in most cases: we don't want to end up with generations that are nonsensical, unhelpful, or gibberish, that lose their usefulness. We've already gone through the trouble of training this model, and then additional trouble to instruct-tune it, and the idea is to keep that core functionality while adjusting a little toward what we want. We don't want the model to stop answering questions; we just want it to answer them harmlessly. That's the idea of beta. Finally, we have the loss type. DPO has been blowing up, so there's been a lot of cool research and musing about it, and TRL has implemented a number of loss functions: the default is the sigmoid loss Greg showed; there are hinge losses; there's IPO, from a pretty excellent paper; there's conservative DPO, based on research by someone out of Stanford; and there's KTO, which comes from a report on ways to capture human preference in loss functions. We'll stick with the default, sigmoid, because it's the cleanest and most directly related to the paper. It's based on the Bradley-Terry model, and I know that's a lot of terms coming at you, but the idea is simply that we can create a loss function that emulates a binary classification, and that's what we want. Once we've covered those terms, setting up the DPOTrainer is just filling in the boxes: we pass in our model; we don't pass a reference model, but we do pass a PEFT config, which means the base model serves as the reference and the adapters are the policy, the trainable part; we pass in our training args; we pass a lowish beta, giving the training process some flexibility rather than stapling its distribution to the reference; we use the default sigmoid loss type; we pass our training and evaluation datasets and our tokenizer; and we include a few hyperparameters that the TRL library requires.
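Putting that together, here's a sketch of the instantiation, using the argument names from the TRL releases that were current when this session was recorded (newer TRL versions move several of these into a DPOConfig). The beta value, length limits, and dataset variable names are illustrative; `peft_config`, `training_args`, `model`, and `tokenizer` are the objects from the earlier sketches, and `train_ds` is assumed to be the HH-RLHF train split mapped into prompt/chosen/rejected form the same way as the test split.

```python
from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model,                    # the policy: base model plus LoRA adapters
    ref_model=None,           # with a peft_config, the frozen base acts as the reference
    args=training_args,
    beta=0.1,                 # lowish beta: some room to move away from the reference
    loss_type="sigmoid",      # the default, Bradley-Terry-style loss
    train_dataset=train_ds,   # rows with "prompt", "chosen", "rejected"
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=512,           # required truncation limits; values are placeholders
    max_prompt_length=256,
)

dpo_trainer.train()
```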
One thing we haven't talked about much, but which is important to realize: because of beta, because we're anchoring our policy to the reference model, if the reference model is leagues away from where we actually want to end up, the fact that we're clipping the model we're training to that reference is actually a downside. DPO is meant to be used when we're already close to the target, when we're mostly there. That's what we saw with the toxicity baseline (if you'll permit the annoying scrolling for a second): the baseline number is very low, so we're already close to the desired target. We can use DPO here because, even stapled to a reference whose distribution produces that level of toxicity, we have plenty of wiggle room to push it lower. If we were starting from a very toxic model, then even after a large reduction, being clipped to that reference would leave us pretty toxic at the end of the day. Hopefully that makes sense; it's why we want to start with a model that's close to where we want to be.

Now we can look at the training output. You'll notice a bunch of fancy new fields, and everyone loves fancy fields, so let's talk through them briefly. We have the traditional training loss and validation loss; we want to see loss go down, and loss going down equals good. We have rewards/chosen, the average difference between the log probs of the policy model and the reference model for the chosen response. It's telling us how different the reward generated by our policy, which is secretly a reward model, remember, is from the base model's. You can see they start off close, and as training proceeds the values get relatively higher, which is the idea: we want the model we're training to eventually produce higher scores for chosen responses and lower scores for rejected ones. Rewards/rejected is exactly the same thing for the rejected responses, and it shows a model that starts out generating same-ish scores for policy and base, which makes sense because it hasn't been trained much yet, and ends the training process producing much lower scores with the policy than with the reference, indicating that training is going healthily and we're improving. We also have reward accuracy, which, unsurprisingly, is how often the chosen response is the one that actually gets the higher reward. Since we have labels, we know which response was chosen and which was rejected, even if we don't tell the model that at the step where the scores are generated, so this tracks how often the model gives a high reward to the chosen response and a low reward to the rejected one. By the end of this run we're choosing correctly every time, which is super ideal. Finally there's the reward margin, which is how much those two scores differ. Our model starts with a very small difference between chosen and rejected rewards, and as training goes on it generates scores that differ by around nine points; so as we train, we're not only getting better at picking the right response, we're rewarding our chosen responses much more strongly than our rejected ones, and that's what we love to see.
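For concreteness, here's how those logged reward fields relate to the log probs, paraphrasing the TRL code; the tensors are random stand-ins so the snippet runs on its own.

```python
import torch

# Illustrative only: random stand-ins for per-example sequence log-probabilities.
policy_chosen_logps      = torch.randn(8)
policy_rejected_logps    = torch.randn(8)
reference_chosen_logps   = torch.randn(8)
reference_rejected_logps = torch.randn(8)
beta = 0.1

# These mirror how TRL derives the logged reward metrics from log-probs.
chosen_rewards   = beta * (policy_chosen_logps - reference_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
reward_accuracy  = (chosen_rewards > rejected_rewards).float().mean()
reward_margin    = (chosen_rewards - rejected_rewards).mean()
```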
That's the basics of training. Then we can run the same scoring process we used before, and we've massively reduced our toxicity. And I want to call something out: this is at 100 steps. We have a model that gives us responses, using the same process we used to baseline, but now with our DPO model, the one we just trained, loaded from this checkpoint. Instead of 0.022 as our mean toxicity score we have 0.0082, a huge reduction, and our max toxicity goes from 0.08 to 0.0016, again a huge reduction, and this is after only 100 steps of training. You can also see in the actual text that the output is fairly coherent: the prompt asks how to do pranks with pens, and the model suggests drawing a circle around the outside of the paper and writing something positive instead. What a good prank. The point is that the model can still generate coherent responses that align with our requests but are much less toxic than where we started. That, in a nutshell, is the power of DPO, and that's how we implement it. Because we don't have that translation layer, moving to another paradigm and then coming back, we can train much more stably and effectively and get results we're much happier with, because we stay in the supervised learning domain.

Really quickly, let's take a brief look at the code for that loss, just so we can all see it; if you're interested, definitely dive into the TRL library, it's amazing. All we're doing to get these losses is using that log-sigmoid Greg talked about. We pass in our beta and our logits, scale the logits by beta, and also by a smoothing parameter if we have one; so if we're using a loss type with a smoothing parameter, or we've decided to pass one in, that label smoothing gets applied, and for the other half of the loss we do the same thing but scale by negative beta. That's how we generate the loss; that's really it in the code, and the library authors have done a great job of abstracting it for us. Then, once we have those log probs, we can get our rewards, again scaled by beta, the idea being that we want to make sure we're not straying too far from that reference.
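Paraphrasing the heart of that TRL loss code (variable names follow the library; the tensors here are random stand-ins so the snippet runs on its own):

```python
import torch
import torch.nn.functional as F

policy_chosen_logps      = torch.randn(8)
policy_rejected_logps    = torch.randn(8)
reference_chosen_logps   = torch.randn(8)
reference_rejected_logps = torch.randn(8)
beta, label_smoothing = 0.1, 0.0

pi_logratios  = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps
# How much more the policy prefers "chosen" over "rejected" than the reference does.
logits = pi_logratios - ref_logratios

# The "sigmoid" loss type: a (optionally label-smoothed) binary-classification-style objective.
losses = (
    -F.logsigmoid(beta * logits) * (1 - label_smoothing)
    - F.logsigmoid(-beta * logits) * label_smoothing
)
loss = losses.mean()
```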
And with that, I'll pass you back to Greg, who will close us out and lead us into Q&A. That was a masterclass, Wiz, I love it. So that's DPO in a nutshell, and we saw the final scores really improve: the mean toxicity went from 0.022 down to 0.0082 and the max toxicity went from 0.08 down to 0.0016, again after just 100 steps. Really cool to see. Now, we'd be remiss if we didn't talk about how this is an emerging standard, what to look for as you head off to build these things, and what to pick up next. If we look at the Open LLM Leaderboard today, you'll notice I've highlighted DPO in yellow; everything you see here has DPO right in the name. Interestingly, Smaug 72B v0.1 and alpaca-dragon-72b-v1 sit in the top two spots. It turns out Smaug 72 billion version 0.1 used a new fine-tuning technique called DPO-Positive, and you can also jump in and look at things like Tess 72B and notice it's based on Qwen 72B and was, quote, "trained with alignment techniques," though it doesn't provide more insight than that. We'll leave you with a final trailhead from the top of the leaderboard: this DPO-Positive idea. Here's the paper; it's called "Smaug: Fixing Failure Modes of Preference Optimization with DPO-Positive." What's the big idea? Theoretically, the standard DPO loss can actually lead to a reduction in the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes (that minus sign we saw inside the loss) increases. In other words, because we're optimizing relative probabilities, we don't always push the model toward human preferences in absolute terms; but if we avoid relying purely on the relative term and add this "positive" component to DPO, we can actually improve results, and the results at the top of the leaderboard speak for themselves. Check that out if you want the latest and greatest; we'll continue to track it, and keep an eye on the TRL library, which is adding new loss functions all the time.

What we saw today is that DPO is even more direct than RLAIF: instead of the paradigm of starting in supervised fine-tuning, going to reinforcement learning, and coming back to supervised fine-tuning, we never actually leave. That's definitely cleaner and more robust overall, and it's fairly clear why it's becoming a de facto industry standard. The loss function was just an equation, and there are other variants out there; Chris mentioned a few: IPO, KTO, iterative DPO, DPO-Positive. Check them all out. Hopefully we can now really appreciate the idea that DPO is, quote, closer to RLHF than RLHF is to RL, and most importantly, that "your language model is secretly a reward model" is starting to make sense. With that, let's kick off Q&A. Drop your questions; I see Manny is crushing it in the Slido, so drop your questions in the Slido and in the chat and we'll get to them as quickly as we can. Manny, you're up first. Wiz, if Manny were to perform DPO, what kind of setup would he need, and will the setup determine the quality? If you're a hobbyist doing this on your home computer, do you need beefy hardware for DPO? What's the deal? So, all of this was done with less than 11 gigabytes of GPU RAM on the Colab instance, so you could do it with most hardware you'd have access to. You do need some kind of accelerated compute to get started, so a GPU, or an M1 chip or better, but if you have that, you can pick up a really inexpensive GPU with 14-plus gigabytes, or you can use the free version of Colab.
All of this to say: you can definitely still achieve good results with hobbyist hardware and equipment; you don't have to have beefy hardware to do this. Obviously, the beefier the hardware, the faster you can go and the higher volume you can work at; huge models and a ton of preference data will churn faster on big hardware than on consumer, hobbyist gear. All right. And another deep-dive question from Manny (it looks like most of his other questions have been answered): what DPO is doing, is it still mathematical and probability-based? Could we use fuzzy logic rather than the discrete metrics used to create and generate knowledge? It is still mathematical and probability-based. We're still talking about distributions and the differences between them, and we still care about where the distribution starts versus where it lands. I'm not sure exactly what's meant by fuzzy logic here, but this works because of math: the loss we showed today is just a clever application of math that emulates a binary classification without needing an actual discrete binary classifier anywhere in the system. I'd have to know more about what you specifically mean by fuzzy logic, but yes, this is still math and probability; probability might not be everyone's best friend, but it's so good for ML. We care deeply that we're using log probs here: knowing the distribution of our model versus the distribution of the reference model is something we can leverage to improve the distribution of the model we want. Yeah, and it does seem like we keep moving toward more and more directness. I'm sure we'll eventually see work that expands the parameter space again and adds some uncertainty back into these modeling techniques, but right now the trend is simplify, simplify, simplify, rather than making anything fuzzier. Although, if you put it together, Manny, go ahead and submit that paper; maybe you'll be the next big one. And there is cDPO, conservative DPO, which lets us interact with the assumption that our labels are somewhat noisy. So there are ways of expanding on this to account for potential noise, and to make the math better aligned with the assumptions we're making, because we do have to make a number of assumptions for this to work; if we can parameterize those assumptions, we can express how confident we are in them, and that helps guide the model training. Shout-out to Wayne, glad you're loving it, my man; definitely a lot to look into this week. And wintermute ashpool asks: will you show examples of using MLX for these operations? We won't, just because, to keep things super accessible, we like to stay in the Colab environment, and Colab is a Linux environment
that isn't compatible with MLX; that's for Mac-specific hardware. So we're not going to use MLX as a strict example, but a lot of this has easy, nearly one-liner ports over to MLX if you need that, or if you have to use it because of your hardware situation. We just want to make sure everyone can run the code, which means using a Colab environment. And for those out there like me wondering: what is MLX? It's Apple's framework for dealing with large matrix multiplications; it's like Apple-hardware-specific NumPy, which makes the kinds of operations you have to do with big arrays very efficient on Apple hardware. Got it. Okay, and just to wrap up and think about what's coming next: we've talked about alignment, we've covered RLHF, RLAIF, and now DPO. What we're covering next week is a little bit different, but it's still alignment-adjacent: it's called ReFT, reasoning with reinforced fine-tuning, which combines aspects of reinforcement learning and chain-of-thought reasoning into an improved fine-tuned ability to do math. So it turns out we can actually align toward objectives other than harmlessness; how should we think about that? I think the way I'd express the difference is that alignment, the way we've talked about it over the last few weeks, is about being close to our task and wanting to get even closer. You can still use DPO to enhance performance on specific task-related objectives, or really anything you can map to human preference; DPO will help you get better at it. The ReFT idea is more about taking us the whole way in one step, so it sits between these very fine-tune-y alignment techniques we've discussed and the broad fine-tuning techniques like plain SFT, or PEFT, or whatever your favorite fine-tuning paradigm is. That's the idea. All right, so we'll keep trying to find the edge of what we can do with the latest and greatest. Next week we'll be talking ReFT, so definitely join us for that. Chris, thanks; it's time to wrap up today, and that's it for Q&A. Thanks, everybody, for joining us. And we're not just online next week, we're online tomorrow to talk about how to do superior RAG over complex PDFs with the latest and greatest from LlamaIndex: it's called LlamaParse, their new proprietary parsing algorithm. If you work for a business or an enterprise and you're doing RAG over PDFs with embedded tables and figures, you've got to check this out; it's brand new, it came out last week, and we're going to see how it performs relative to the previous setup with Unstructured.io and LlamaIndex. If you're here, please like and subscribe if you haven't already; ring that bell, everybody, and get those notifications. We're going to continue to be online live weekly, so make sure
that you know when we're here. If you liked this session and you're not in Discord yet, please go ahead and join; it's pretty popping these days, and we'd love to have you throw your intro in and tell us what you're looking to build, ship, and share. If you want to jump in and keep learning for the rest of the day, we've got the AIM Index, what we call the awesome AIM index of all previous events; definitely check that out. And if you want to consider getting certified through AI Makerspace, through our holistic, comprehensive seven-week AI engineering boot camp, you can submit your application today; the next cohort kicks off April 2nd. Finally, if you have any feedback, Luma will send you a survey, and we'll also drop one in the YouTube chat right now. We love your feedback; the people on the other side of this event really appreciate it, and so do we. As always, keep building, shipping, and sharing; we'll do the same. We hope to see you all again tomorrow. Everybody have a good one.
Info
Channel: AI Makerspace
Views: 1,189
Id: IeggA-vb0lw
Length: 62min 21sec (3741 seconds)
Published: Thu Feb 29 2024