Mastering RLHF with AWS: A Hands-on Workshop on Reinforcement Learning from Human Feedback

Video Statistics and Information

Captions
Hi everyone, my name is Diana Chen Morgan and I'm part of the DeepLearning.AI team, bringing you all together for all things AI, community, and events. Today we are very lucky to have a hands-on workshop with special speakers from Amazon Web Services on mastering RLHF, also known as reinforcement learning from human feedback. While we get started, we would love to hear in the live chat where you are dialing in from and what courses you've taken so far at DeepLearning.AI. This event is designed to equip you with the skills and knowledge required to excel in reinforcement learning applications and to effectively leverage human input to enhance AI systems. Whether you're a seasoned data scientist, an aspiring AI enthusiast, or a business professional seeking to understand the potential of reinforcement learning, this workshop offers a unique opportunity to unlock its true potential. We will also drop a link in the chat for you to ask the speakers questions and to vote on other participants' questions. This workshop is based on the foundational learnings of DeepLearning.AI's course on generative AI with LLMs, built in collaboration with the Amazon Web Services team; everything covered in the workshop is presented as continued education from that course, and we will drop that link in the chat as well.

Now to introduce our speakers, who are also our course instructors. Our first speaker is Antje Barth. Antje is a principal developer advocate for AI and machine learning at AWS and co-author of the O'Reilly book Data Science on AWS; you may have seen her speak at AI and ML conferences, events, and meetups across the globe. Our second speaker is Chris Fregly. Chris is a principal solutions architect for AI and machine learning at AWS, also a co-author of Data Science on AWS, and the founder of the global meetup series Data Science on AWS. Without further ado, Antje and Chris, why don't you take it away.

Absolutely, thanks so much, Diana, thanks for having us, and thanks everyone who joined us today to learn more about RLHF, which stands for reinforcement learning from human feedback. A quick agenda for today: this is a hands-on workshop, and Chris will walk you through the actual code, how to run RLHF, how to prepare the data, how to use a reward model, and how it all comes together. To give everyone the foundation, I'll walk you through a few slides first to cover the theory and answer some of the questions you submitted beforehand; we already had a peek at those, so I'll try to get to as many as I can while presenting, and then I'll hand it over to Chris. Chris, everything ready on your side? You have an exciting hands-on part for us? Yes, I'll be ready. All right, let's switch to my slides and get started.

As Diana mentioned, a couple of the slides I'm showing here are part of a course we just launched together with DeepLearning.AI on Coursera called Generative AI with Large Language Models. Here you can see the instructors, including Dr. Andrew Ng; we're super happy to work with him on this course, and a big shout-out to our colleagues Shelbee and Mike, who also teach and helped develop it. Here's a quick link to the DeepLearning.AI site and the course if you want to have a look at it, or maybe even enroll after today.
I think we're at around 70,000 enrollments already, and thanks to everyone who may have already peeked into the course. For those of you who haven't yet, let's take a quick look today at one of the topics we cover, RLHF. There's also a bit.ly short link, gllm, if you don't want to type the full link, and I think we're posting it in the chat as well.

All right, let's jump into today's topic. In the course we're not only covering reinforcement learning; we're really looking at generative AI through an end-to-end project life cycle. We start with defining the problem and the use cases and how to choose the right model, and then in the middle we dive into the "adapt and align your model" phase. That includes a few different things: using prompt engineering to guide the model toward the right outputs, and then a big section of the course focuses on fine-tuning, so how you can customize the model with your own data for your specific use case. Fine-tuning is about customizing the model, helping it understand human language, prompts, and tasks better, and respond in more natural-sounding language.

But there's a problem with that. Human language sometimes carries a difficult tone, and models trained on data from the internet can pick up language we don't want them to replicate to our users. It's not just tone or models behaving badly, either. One of my unofficial, personal evaluation criteria when I work with a new model is to ask it to tell me a knock-knock joke. I'm German, background info; I moved to the US and I'm still struggling with knock-knock jokes, so I ask my foundation models to come up with one. Let's say it responds with "clap, clap." It's kind of funny, but it's not really helpful; it's not a real knock-knock joke. Sometimes the model responds with something completely wrong: "Can coughing effectively stop a heart attack?" Of course not, but the model might assume it can and state that very confidently, so it's not honest in this case. And sometimes models give bad advice or even harmful answers: if you ask "how can I hack my neighbor's Wi-Fi," it should definitely not give you tips and best practices, it should just stop and say, "I will not answer this." So models can create harmful content, hallucinate or create misleading answers, and sometimes they're just not helpful.

How can we address this? Those three qualities, helpful, honest, and harmless, are often referred to as the three H's, HHH, so whenever you hear that, it refers to these overall alignment criteria: you want the model to be helpful, honest, and harmless. This is exactly why we do RLHF: we try to align the model with human feedback to guide it toward helpful, honest, and harmless content.

Let's talk about reinforcement learning from human feedback and what's happening here. Usually you start with an instruction fine-tuned model, a model that already understands instructions and how to respond to different tasks, and then you apply RLHF as a second fine-tuning step to align the model further across those criteria: maximizing helpfulness and relevance, minimizing harm, and avoiding the model engaging in dangerous topics.
Before we go into the details of RLHF, I want to make sure everyone is familiar with the terminology. In classic reinforcement learning we often speak of an agent and an environment, and the goal is to maximize a reward. The agent takes actions in the environment, and based on how effective those actions are toward a goal, it receives a reward. Each action puts the environment in a specific state, and based on how good that action was in the resulting state, relative to the end goal, there's a reward.

It's much easier to understand with an example. Let's say we want to play tic-tac-toe. The agent is the player, and the objective is clearly to win the game. The environment is the three-by-three board, the state is the configuration of the board at any step, and the action space is the allowed moves, wherever I can put my X or my O. The way the agent figures out the best way to win is by using what we call a policy, a reinforcement learning policy, which is often a model that learns which actions lead toward winning the game efficiently. Based on all this input, the state, the action taken, and the reward, the policy evolves over many iterations and figures out, by trial and error, the right strategy to win. A whole sequence of moves until I win or lose the game is called a playout, or, when we talk about language models, often a rollout.

Now, we're not playing tic-tac-toe; we want to fine-tune large language models, which is also fun. How does this concept apply in the context of LLMs? We still have an agent and an environment, but here the reinforcement learning policy is our instruction fine-tuned model, and the objective is not to win a game but to generate text, and to generate it in an aligned fashion; if we're aligning for helpfulness, that's our goal. The environment is the context the LLM has, and the current state is whatever is in the context window right now plus the text generated up to this point. If you wonder what the actions are: the action space is literally the token vocabulary, the total set of possible actions, and the model generates token by token. Each completion receives a reward, and that informs the policy how to adjust the model to make more aligned responses.

This is where the human comes into play: somebody has to say whether this text is really better, really more helpful. Doing that with humans doesn't scale; we don't want our colleagues, or you yourself, sitting down after every model completion to evaluate the text. That's why we train a reward model that takes this place. It's trained on human feedback, so it incorporates the knowledge of what humans prefer, but then the model can make that call in the RLHF process and assign a reward score to those completions.
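To make that mapping concrete, here is a minimal sketch of generation viewed as a sequence of actions, assuming GPT-2 purely as a hypothetical stand-in policy (not the workshop's model): the state is the context so far, and each action is choosing one token out of the full vocabulary.

```python
# Minimal sketch: token generation as RL actions (GPT-2 as a stand-in policy model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")

state = tokenizer("A dog is", return_tensors="pt").input_ids   # current state: the context so far
print("action space size (vocabulary):", policy.config.vocab_size)

for _ in range(10):                                            # one short rollout
    with torch.no_grad():
        logits = policy(state).logits[:, -1, :]                # scores over all possible actions (tokens)
    action = torch.argmax(logits, dim=-1, keepdim=True)        # greedy choice of the next token
    state = torch.cat([state, action], dim=-1)                 # the new state includes the generated token

print(tokenizer.decode(state[0]))
```

In RLHF, the completion produced by such a rollout is what gets scored by the reward model.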
All right, and the whole sequence here is what we call a rollout. Let's talk a bit more about the reward model, which is super interesting, and I saw a lot of questions submitted beforehand about the algorithms and the reward models.

It all starts with collecting the human feedback, so how do you prepare the dataset to train the reward model? You start with your instruction fine-tuned LLM, and you use either your own dataset or one of the publicly available prompt datasets. You take a number of prompt samples and create model completions, several per prompt; for example, three different completions for the same input prompt, and you do that for many different prompts. Then you need to decide what you want to align for, say reducing toxicity or improving helpfulness. You select one criterion and provide instructions to a group of human labelers to give feedback and rank the responses.

Let's say we have the prompt "My house is too hot." You run it through the model and get different completions: maybe the first one says "There is nothing you can do about hot houses," which is not really helpful; or "You can cool your house with air conditioning," which makes sense; or "It is not too hot," which is a pretty weird response. Say we want to align for helpfulness: the labelers' task is to rank the responses, one being the most helpful and three, in this case, the least helpful. I think we can all agree air conditioning is a great tip, so that's the most helpful; "There is nothing you can do" is not really helpful but still better than the third response, which basically says my whole statement was wrong from the beginning. And you don't use just one labeler; you usually send the same completions to several labelers to get diverse input and find consensus in the rankings. Sometimes a labeler simply misreads the instructions and you'll see a difference in the ratings, so sending the task to more than one labeler lets you catch that and minimize the error rate.

It's also important to provide good labeling instructions. Here are just a couple of examples, and I won't go through all of them, but clear instructions increase the quality of the dataset, so the labelers know exactly what to do: rank the responses from the best answer to the least helpful, judge the correctness and informativeness of the answer, maybe search the web to make sure it's correct, and if they don't have a clear winner they may rank answers the same, but they should do that sparingly. So give all the instructions needed to make sure the human labelers know exactly how they should rank the responses.
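As a tiny, purely illustrative sketch of turning several labelers' rankings into one consensus ranking (plain Python, not the labeling tooling actually used in practice; the completions and ranks are the hypothetical ones from the slide), you could average the ranks per completion:

```python
# Hypothetical labeler rankings for the prompt "My house is too hot."
# Each dict maps completion -> rank (1 = most helpful, 3 = least helpful).
labeler_rankings = [
    {"There is nothing you can do about hot houses.": 2,
     "You can cool your house with air conditioning.": 1,
     "It is not too hot.": 3},
    {"There is nothing you can do about hot houses.": 2,
     "You can cool your house with air conditioning.": 1,
     "It is not too hot.": 3},
    {"There is nothing you can do about hot houses.": 3,   # one labeler misread the instructions
     "You can cool your house with air conditioning.": 1,
     "It is not too hot.": 2},
]

def consensus_rank(rankings):
    completions = rankings[0].keys()
    avg = {c: sum(r[c] for r in rankings) / len(rankings) for c in completions}
    return sorted(avg.items(), key=lambda kv: kv[1])   # lowest average rank = most preferred

for completion, score in consensus_rank(labeler_rankings):
    print(f"{score:.2f}  {completion}")
```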
All right, let's say we have this ranking data, sometimes also called a preference dataset. To prepare it for reward model training we need a bit of data engineering. We have the prompt, the completions, and the rank we agreed on; what we now need to do is create pairwise completions. If you follow the colors on the slide, purple-yellow, purple-green, yellow-green, those are all the possible combinations, and from the ranking I know which completion of each pair was preferred, so I have the reward for each pair. One slightly odd detail is that most of these algorithms require the preferred answer to come first, so in the last step I switch the order and put the yellow completion first, because that was the preferred one. Many RLHF implementations expect that, so make sure to check for it and always pass in the preferred completion first.

Now we've prepared the dataset and can train our reward model. The reward model will later assign the rewards during the RLHF process, and it needs to capture the human preferences so we can scale and don't need the human labelers at every step. We feed in the prompt-completion pairs, with the preferred one first and the other option second, and we train the reward model to predict the preferred one. It's the usual loss calculation, since we know which one is preferred, followed by normal backpropagation to update the model so it learns to predict the preferred completion. Again, we always feed in the preferred one first.

And as you can probably guess, picking one out of two sounds a lot like a binary classification problem, and that's exactly what we do. For example, if we want to reduce toxicity, we can use the reward model to decide whether a completion is positive or negative, and we optimize for positive to reduce toxicity; we'll use exactly that in the hands-on part in a bit. Once we have the reward model, we can feed it new prompt-completion pairs. Using it as a binary classifier trained to distinguish positive from negative, "Tommy loves television" comes back positive, and with the logits, the model outputs before any softmax layer is applied, you can see it's clearly positive: no hate speech in this one. That logit value is what we later use as the reward value. You can also look at the probabilities, which are around 99 percent positive here. If we run a clearly negative text instead, the positive class drops way down, around 33 percent, and the negative class is correctly identified at around 66 percent. For the reward we always use the logit, and here the logit of the positive class, the one we optimize for, is negative; in reward terms that's a penalty, a low or even negative reward, so we're discouraging the model from producing output like this.

All right, that was a quick walkthrough of how to prepare the data.
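The "usual loss calculation" for the pairwise training step described above can be written in a few lines. This is a sketch of one common formulation (the pairwise log-sigmoid loss from the InstructGPT-style reward-model recipe, not necessarily the course's exact code), using a small encoder with a single-output head as a hypothetical reward model; note the preferred completion is passed first:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any encoder with a one-dimensional output head can act as a reward model in this sketch;
# "distilroberta-base" is just a small stand-in, not the workshop's model.
name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
reward_model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

def reward(prompt, completion):
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze(-1)   # scalar reward score

prompt = "My house is too hot."
r_chosen = reward(prompt, "You can cool your house with air conditioning.")  # preferred, passed first
r_rejected = reward(prompt, "It is not too hot.")

# Pairwise loss: large when the rejected completion scores higher than the chosen one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()   # normal backpropagation updates the reward model
```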
Now let's jump into the actual RLHF fine-tuning. We start with our instruct model; this is the model we want to align, so this is what gets updated. Again we have a prompt dataset, either one you created or an existing one. For example, the input might be "A dog is" and the model comes back with "a furry animal." You then pass the prompt and completion into the reward model, and the reward model gives you a reward value, which is the logit value I just showed you. In this case it's not too bad: it's definitely not a toxic response, and it's reasonably helpful, but maybe you want to align for more positive language, so the reward value is just okay. Now the prompt, the completion, and the reward are passed to the reinforcement learning algorithm you're using; a popular one is called PPO, and I'll touch on it in a moment. Based on the reward, the RL algorithm updates the model: it was good, but it can probably do better. You run many iterations, and maybe the next completion is "a friendly animal," which is slightly better, so the reward goes up; the next is "a human companion," which I like, so the reward goes up again; then "the most popular pet," which sounds nice; and you get the idea. The goal over many iterations is to maximize this reward score, and at some point I stop because I hit a stopping criterion, maybe a number of steps. The final completion comes back as "A dog is man's best friend," which sounds really nice, and that is my human-aligned model at the end of those iterations.

Let's zoom in on the RL algorithm part. As I mentioned, you can pick your algorithm, and a really popular one in RLHF implementations right now is PPO, which stands for proximal policy optimization. In the course we have a whole optional bonus section diving deeper into how PPO works; it's a whole world in itself, so today I'll just give you a high-level overview, but if you're curious about the different loss functions and how the clipping works, I definitely recommend enrolling in the class and checking out that video. At a high level, PPO works in two stages. In the first stage it uses the model at the current step to create a set of prompt completions, the experience it will learn from, and calculates the rewards for those. In the second stage it updates the model based on those rewards through backpropagation. There's a lot more complexity to it: there's the actual policy loss that gets calculated, a value loss, and an entropy loss, so if you're interested, I strongly recommend looking into the course. For today I'll keep it at this level so we get through the concepts and to the hands-on part.
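The policy-loss term mentioned above is covered in detail in the course; purely for orientation, here is a sketch of the standard clipped surrogate objective from the original PPO paper (not the course's exact implementation), which keeps each update close to the policy that generated the experience:

```python
import torch

def ppo_clipped_policy_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (sketch)."""
    ratio = torch.exp(logprobs_new - logprobs_old)                 # how far the policy has moved
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms, so the loss is its negative mean.
    return -torch.min(unclipped, clipped).mean()

# Toy example: three generated tokens with their advantages.
loss = ppo_clipped_policy_loss(
    logprobs_new=torch.tensor([-1.0, -0.5, -2.0], requires_grad=True),
    logprobs_old=torch.tensor([-1.1, -0.7, -1.9]),
    advantages=torch.tensor([0.5, 1.0, -0.2]),
)
print(loss)
```

The full PPO objective also adds the value loss and entropy bonus Antje mentions.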
All right, is that all? You might think that wasn't too bad. Well, there's a bit more complexity I want to touch on. One potential problem with this setup is reward hacking. What does that mean? Say we have the model we want to align and a reward model that is a toxicity or sentiment classifier: it knows positive and negative. We pass in the prompt dataset, and maybe a completion comes back as "this product is complete garbage." That gets a negative reward, which is not what we want to say; PPO comes into play, updates the model, and it learns it has to be more positive. Maybe the next completion is "okay, but not the best product," which is better. But now it can spiral out of control: what if the policy learns that the more positive terms it uses, the higher the reward? Remember, in reinforcement learning the goal is to maximize the reward, so the model might simply learn that positive tokens lead to high rewards. Maybe the completion comes back as "this product is the most awesome, most incredible thing ever," which is definitely positive, but not something we want to produce; and it can get even worse, say "beautiful love and world peace all around," which has nothing to do with the actual prompt anymore. It just goes off the rails producing nice-sounding words; the reward goes up, but it's not what we want the model to do. That's what we call reward hacking.

How can we avoid this? We keep the updates in check. The initial model wasn't bad; we just want to align it a little further, but at its core it was good at creating the output we want. So we take the initial instruct model, use it as a reference model, and freeze its weights. This model is not updated; it's our reference checkpoint to make sure no reward hacking happens in the model we do update. We feed the prompts to both models; the reference model will most likely give a fairly neutral response, and the model we're updating will do whatever it does. We then compare them using a metric called KL divergence, which you may have heard of: a statistical measure of the difference between two distributions, in this case the token vocabulary distributions the models produce. We calculate this divergence and, here it gets a little complicated, add it as a term to the reward calculation. That means if the model goes off and puts world peace in every completion, it diverges from the initial token distribution, and that divergence becomes a penalty term added to the reward score before it goes into the RLHF update. Adding this penalty keeps the optimization in check: the model learns that drifting that far was bad because it negatively affected the reward. We'll see this in the hands-on part in a bit.
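Here is a minimal sketch of the KL-penalty idea, assuming GPT-2 loaded twice as hypothetical stand-ins for the frozen reference model and the model being updated (in practice this logic lives inside the PPO implementation): measure how far the updated model's token distribution has drifted from the reference on a given completion, and subtract that from the raw reward.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")      # the model being aligned
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")   # frozen reference copy
for p in ref_model.parameters():
    p.requires_grad = False

def kl_penalized_reward(text, raw_reward, beta=0.2):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logp_policy = F.log_softmax(policy(ids).logits, dim=-1)
        logp_ref = F.log_softmax(ref_model(ids).logits, dim=-1)
    # Per-token KL(policy || reference), averaged over the sequence.
    kl = (logp_policy.exp() * (logp_policy - logp_ref)).sum(dim=-1).mean()
    return raw_reward - beta * kl.item()   # drifting far from the reference lowers the reward

print(kl_penalized_reward("Beautiful love and world peace all around.", raw_reward=3.0))
```

With two identical models the penalty is zero; once the policy drifts, the penalty grows and pulls the effective reward down.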
Now there's another thing, and I know this might get a little complex as well, but there's one optimization I want to point out; we talk about it in much more detail in the course. These models can get really big, billions of parameters, sometimes 70 billion or more, so fine-tuning a model and updating all of its parameters is a resource-intensive, time-consuming task. You can optimize this with a technique called parameter-efficient fine-tuning, PEFT, which reduces the number of parameters you actually update to a minimum; one family of techniques uses adapters, which I'll show you briefly in a second. Combining PEFT methods, for those of you who have heard of them already, with RLHF is super helpful in making this more efficient, because in the RLHF cycles you're only updating a small number of parameters and keeping the base model frozen.

For those of you who haven't heard of PEFT, just a brief intro: parameter-efficient fine-tuning covers a couple of different methods. The most popular one, here in the middle, is called LoRA, low-rank adaptation of large language models, and I'll show you what it does; there are others you may have heard of, such as soft prompts and prompt tuning, but today we'll focus on LoRA. LoRA freezes most of the original LLM weights and then, in the attention layers of our Transformer-based model architecture, injects two low-rank decomposition matrices that are much smaller than the original matrices used in those attention layers. This goes deep into the architecture of the model, but overall we only learn to update the small number of parameters in those low-rank matrices, which makes this whole fine-tuning, in this case fine-tuning with RLHF, much more efficient.

And we're almost ready to jump into the hands-on part. Just to wrap this up: once you're doing RLHF, you also need to evaluate how effective it is. What you can do, and Chris will show this in a second, so get the labs ready, is calculate a score. In this case we'll show you how to try to reduce toxicity in the outputs, and to evaluate that we calculate a toxicity score: we take a dataset, run prompts through the model to get completions, and the score is the average probability of the negative class across all of those completions. That number should go down, so the lower the toxicity score, the better. Maybe you run this before you start RLHF and get a score of 0.14, and after doing RLHF you recalculate and come out at 0.09; the toxicity score went down, so the RLHF worked. You can also do a qualitative check, where you actually look at the completions and make sure the model produces the results you're looking for.
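As a small sketch of that evaluation arithmetic (the per-completion probabilities below are made up, chosen only so the averages reproduce the 0.14 and 0.09 from the example; the real numbers would come from the reward model):

```python
import statistics

def toxicity_score(hate_probabilities):
    # The score is simply the mean probability of the toxic class over a set of completions.
    return statistics.mean(hate_probabilities), statistics.pstdev(hate_probabilities)

before = [0.20, 0.05, 0.31, 0.02, 0.12]   # hypothetical per-completion toxic probabilities, pre-RLHF
after  = [0.11, 0.04, 0.18, 0.02, 0.10]   # the same prompts after RLHF fine-tuning

mean_before, _ = toxicity_score(before)
mean_after, _ = toxicity_score(after)
print(f"before: {mean_before:.2f}  after: {mean_after:.2f}")
print(f"relative improvement: {100 * (mean_before - mean_after) / mean_before:.1f}%")
```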
All right, I know this was a lot; I just wanted to make sure everyone is on the same page and has heard the terms before. Now I'll hand it over to Chris, who will walk you through the sample notebook and show you how to actually implement this.

Hello everyone. I've been watching the questions, really good questions, and I think I'm going to answer quite a few of them as we dive into the code, so hold on a second. People are also asking where you get the labs I'm going to show: this is all part of the DeepLearning.AI course, and if you sign up for it you get access to these labs. Let me walk you through what's happening, starting at the top. This is where we fine-tune with RLHF using PPO, and the goal is to fine-tune our model to generate slightly less toxic responses, as "slightly" as we want; the longer you run RLHF, typically the less toxic the responses get. In the lab we don't actually fine-tune for very long, but I'll show you the example where we still see a noticeable improvement, I think about 10 to 15 percent less toxic responses overall.

This is a pretty wild diagram that you may have seen in some papers, but we're really focused on the right side: we want to fine-tune this model and then push it to production, softening its responses a little bit and making it, in this case, less harmful, or more harmless depending on your point of view. We'll look briefly at which reward model we're using, since Antje already set up how these reward models are trained, and focus on how to use that reward model and perform the actual PPO RLHF fine-tuning.

Some of this is just boilerplate code to make sure you're using the right model. The dataset we're using is called DialogSum, a dialogue summarization dataset from Hugging Face. Think of these as conversations you might be having with your customers, where you want to summarize what's happening but also soften up the edges of the summary a bit. Here we're building up the dataset and doing a little light feature engineering: anytime you use these models you almost always need to convert the raw text into input IDs, and that was the same with BERT and most of these natural language models. So this is pretty straightforward code: here's our tokenizer, which converts the raw text, what we call the prompt, into input IDs, and we're creating an instruction dataset from the dataset itself. The two columns we pay attention to are "dialogue" and "summary," where a human has gone in and actually summarized the conversation.

Let me pop the dataset open real quick, because I think it's important to see it. A human, or a set of humans, went through these conversations, which are anonymized as person one and person two, and chose the summary for each dialogue. We're not paying attention to the topic column for this purpose. There's a train split of about twelve and a half thousand rows, a test split of about one and a half thousand, and a validation split. We're going to try to get a response that's even less toxic than what the human decided the summary should be; the important thing is that these are highly human-annotated summaries, they didn't just pop out of nowhere. So that's the dataset, and here we convert it into what's needed to perform the fine-tuning. I'll skip over some of this because it's a bit boilerplate, and you have access to all of it as part of the course.
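A sketch of that dataset prep, assuming the DialogSum mirror on the Hugging Face Hub ("knkarthick/dialogsum") and FLAN-T5's tokenizer as stand-ins for the lab's exact setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("knkarthick/dialogsum")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

print(dataset)   # train (~12.5k rows), validation, test; columns include "dialogue" and "summary"

def build_prompt(example):
    # Wrap the raw dialogue in an instruction, then convert to input IDs for the model.
    prompt = f"Summarize the following conversation.\n\n{example['dialogue']}\n\nSummary: "
    example["input_ids"] = tokenizer(prompt, truncation=True).input_ids
    example["query"] = prompt
    return example

dataset = dataset.map(build_prompt)
print(dataset["train"][0]["query"][:200])
```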
Antje mentioned PEFT, and specifically LoRA. The lab before this one, which we don't have time to get into today, actually fine-tunes with those instructions, and this is a sneak peek into it. It generates a PEFT model, so the output of lab two in the course becomes the input to lab three, and this is essentially the lab three notebook under a slightly different name for a slightly different purpose. Here we're pulling the saved PEFT model from S3 just to save time; that's an instruction-tuned model, and now we're going to do the RLHF fine-tuning to make it more harmless.

This is the same screenshot Antje was showing with LoRA and those weight matrices. The details aren't crucial to RLHF itself, but to tie it back together: the model we're starting with is a PEFT model, and you'll see that we are only training a very small percentage of the model parameters. There are about 250 million parameters here, a relatively small model, just to keep the resources down and keep the labs reasonably speedy; you could certainly use a larger one and spend more time when you do the actual lab. Out of those 250 million parameters, because we're using PEFT, we only have to fine-tune about three and a half million, which is about 1.4 percent. All 250 million weights are still there; we're just adding in the two smaller low-rank matrices. Think of it like singular value decomposition, where two small matrices are learned and can approximate what the full set of parameters would do. I keep saying low rank, and that's actually one of the hyperparameters used by LoRA: rank equals 32 here, and it specifically targets the attention layers. That's not super important for this lab, but we do get into it in the full course.
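A sketch of that LoRA setup with the peft library, assuming FLAN-T5-base as a ~250M-parameter base model and the query/value attention projections as target modules (lab-style choices, but hypothetical here):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    r=32,                      # the low-rank dimension mentioned in the lab
    lora_alpha=32,
    target_modules=["q", "v"], # inject the two small matrices into the attention layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # roughly 3.5M of ~250M parameters, about 1.4%
```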
OK, so now we're setting everything up, and Antje also mentioned the reward model, which is pretty much the whole key to this: it's what classifies what the model generates as toxic or not, or in this case hate versus not-hate. The reason it's hate and not-hate is that we chose a hate speech model from Facebook; I believe it came out of a Kaggle competition a few years ago. They trained a binary classifier based on BERT, specifically Facebook's RoBERTa variant, which is pretty common; Facebook research and Meta research tend to use RoBERTa. It's a relatively small model compared to the billions and billions of parameters elsewhere, but it does a pretty good job. Like any binary classifier, it has two classes: it generates a zero if the text is not hate and a one if it is hate, and really it's generating logits and probabilities. Keep in mind that we are always trying to optimize for not-hate, so for those of you used to optimizing toward the one, in this case we're optimizing toward the zero: we always look at the output of the reward model for the positive class, which here is not-hate.

Both the prompt and the completion get passed in; the completion is just another name for the response the model gives you based on the input prompt. Here are a couple of examples. "Tommy loves television" appears to be positive, and if you apply softmax you get the probabilities, so about 99.6 percent likely that this is a positive phrase. "Tommy hates gross movies" ends up getting a negative reward. Here's another example in code: "I want to kiss you" comes back around 99.9 percent not-hate, and what we actually pass into the PPO process is the logit, not the probability. And here's a phrase we know is not all that nice, "you are disgusting and terrible and I hate you": the probability that this is not hate is very low, near zero percent, and the probability that it is hate speech is about 97.4 percent. We end up using what's in the zeroth slot, which is the logit for the not-hate class.
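A sketch of using the binary hate/not-hate classifier as the reward model, assuming the facebook/roberta-hate-speech-dynabench-r4-target checkpoint (the one the Hugging Face toxicity measurement uses by default; treat the exact name and label order as an assumption), with index 0 as not-hate and index 1 as hate:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/roberta-hate-speech-dynabench-r4-target"   # assumed checkpoint, not confirmed by the transcript
tokenizer = AutoTokenizer.from_pretrained(name)
toxicity_model = AutoModelForSequenceClassification.from_pretrained(name)

def not_hate_reward(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = toxicity_model(ids).logits.squeeze(0)
    probs = torch.softmax(logits, dim=-1)
    # The reward passed to PPO is the raw logit of the not-hate class, not the probability.
    return logits[0].item(), probs[0].item()

print(not_hate_reward("I want to kiss you."))               # should be a high positive not-hate logit
print(not_hate_reward("You are disgusting and terrible."))  # should give a negative not-hate logit, i.e. a penalty
```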
Now, in order to measure improvements in the toxicity of our model, that is, whether we reduce the toxicity of the generated responses for the same prompts, we have to establish an evaluation metric, and we do this using that same reward model. This comes from Hugging Face: I do the import up above, but to be clear, evaluate is a library from the Hugging Face folks, and you can give it any binary classifier as long as you tell it which label is the toxic one; in our case it's called "hate." Be careful with this: if you reverse it, you'll get bad scores on the evaluation. Then there's a helper function that takes a dataset, generates the responses, computes the toxicity of each, calculates the mean of all those toxicity values along with a standard deviation, and returns them. That sets up the evaluation, and we run it before we run PPO: the model has not yet been RLHF fine-tuned, and the mean comes out at about 0.03, with some standard deviation. What we want to do is reduce that number; for the lab we just try to reduce it and see if this helps, but at some point you may have a specific target in mind for the mean and standard deviation of your toxicity score.

And just a reminder: when we start the PPO process, the actual reinforcement learning side of this, we pass in prompts and ask the model to generate a response, which in our case is a summary. The prompt is the full dialogue between person one and person two from our dataset, the response is the summary, and that gets passed into the reward model, which says whether it should be given a positive or negative reward based on the hate speech binary classifier. All of that gets fed into PPO. Antje goes into crazy detail about PPO in the course, which we certainly don't have time for here; we brought in experts from AWS on this topic, we have interviews with them, and we go through all the formulas and the math.

Someone asked in the questions where the reference model comes from, and there were lots of questions about KL divergence: how do you seed the reference model? We start with the model that has not been reinforcement-learning fine-tuned; said differently, we start from the model from the previous lab, which has only seen instructions and has never seen the reward model. So think of the reference model as seeded from any instruction fine-tuned model; we happen to use the one from the previous lab because we actually performed the instruction fine-tuning there. That reference model is the first purple box here, and it's what keeps pulling our RLHF model, the second purple box, back into reality so that reward hacking is minimized. Just to reinforce that, the number of trainable parameters shown for the reference model is zero, which means it's totally frozen; the only thing we're updating is the PPO model, which, as I showed earlier, is a PEFT model. That part isn't essential here, but what is important is that you can use PEFT with RLHF and PPO; it's a very common question once you start to dive in, and it actually works pretty well.

This is where the KL divergence comes in: there are actually two predictions happening. The same dialogue, which is part of our prompt, gets passed into both the reference model and the model we're trying to soften up and detoxify, and the two outputs get compared with KL divergence. If there's some wild shift, that gets fed into the reward: there's the natural reward score that comes out of the reward model, and then a plus or minus based on the KL divergence, and that adjusted reward gets fed into PPO, which performs the actual updates to the weights of the model we're detoxifying. There's a lot of complexity here for just one hour, but this is where everything comes together. We're using a library called TRL; I do the imports up above, but you can see PPOConfig and PPOTrainer here. This is really just an extension of the regular Hugging Face trainer classes, except that it takes this extra reference model, which is what's used for the KL divergence. If you crack open PPOTrainer, it's all open-source code, so you can see all the gory details; I believe we reference what's happening there in the course when we dive deeper into PPO. All of this is doing that whole process, and this is the step right here, PPOTrainer.step, where the actual weight updates happen on the model we're trying to fine-tune. We pass in a batch of prompts, the summaries that were generated, and the rewards, and the PPO trainer takes those inputs, does its calculations, figures out the KL divergence score, modifies the reward as needed, and then makes the actual gradient updates through backpropagation on the model.
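Here is a sketch of that TRL loop, assuming the older trl API in use around the time of the lab (a PPOTrainer that takes an explicit ref_model and exposes a .step() method; newer trl versions have reorganized this interface), with FLAN-T5-base as a stand-in for the lab's PEFT/instruction-tuned model and a placeholder reward in place of the hate-speech classifier's logit:

```python
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForSeq2SeqLMWithValueHead, create_reference_model

model_name = "google/flan-t5-base"            # stand-in for the lab's PEFT/instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name)
ref_model = create_reference_model(ppo_model)  # frozen copy, used for the KL penalty

config = PPOConfig(model_name=model_name, learning_rate=1.4e-5, batch_size=4, mini_batch_size=4)
ppo_trainer = PPOTrainer(config=config, model=ppo_model, ref_model=ref_model, tokenizer=tokenizer)

prompts = ["Summarize the following conversation.\n\n#Person1#: The house is freezing.\n\nSummary: "] * 4
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# Experience collection: generate a summary for each prompt.
response_tensors = [ppo_trainer.generate(q, max_new_tokens=40).squeeze(0) for q in query_tensors]
responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

# Placeholder rewards; in the lab this is the not-hate logit from the hate-speech reward model.
rewards = [torch.tensor(1.0) for _ in responses]

# PPO weight update: internally this also applies the KL penalty against ref_model.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```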
During the lab we run this for only about two or three minutes. Now let's evaluate the model afterwards. If you recall, the previous toxicity score was about 0.03-something, and if we look before and after the PPO process we actually reduced the toxicity; the improvement in the toxicity score means we lowered it by about 16 percent. And this happened after only about three minutes. I'm not sure if Antje mentioned it, but usually when you're performing RLHF you want about 10,000, maybe 20,000 samples; the more you can feed it, the better off things will be, but even here we saw a pretty noticeable improvement overall.

That was the quantitative comparison, where we look at the score; now let's look qualitatively and just eyeball some of these. Here's the original dialogue with an instruction asking the model to summarize, here's the response before PPO, and here's the response after the fine-tuning, where we improved the reward by 0.815. These are logits, not probabilities, so don't think of them as percentages; but this was the reward before and the reward after, and while it's not always obvious, the response, the summary, is slightly nicer, a little less toxic. Keep in mind that the hate speech model we're using is fairly extreme; instead of a hate speech reward model you could use a positive/negative sentiment analyzer. We chose the extreme one just to see how it would work, and we got pretty decent results, but there are many, many classifiers out there, and perhaps you have one in your company that you're currently using. Here are a few more examples showing the qualitative improvements, and I think that's it. Antje, do you have any more slides, or should I wrap it up?

If we go over to my slides again: we posted the link to the course in the chat, where we cover a lot more topics if you're interested in going deeper, and I think that's it; we can jump into the Q&A, Diana. Oh, and you need to unmute.

Perfect, thank you Antje and Chris, this was a really great session, and there are so many great questions from the audience; we only have time to go through a couple of them. The first one we'll address is: why is RLHF important for LLMs?

Maybe I can take that. As I said in the beginning, we have our instruction fine-tuned model, but we want to align it in a particular way. Fine-tuning helps customize the model for your data, but then you might want to change the tone of the completions or increase what humans perceive as helpfulness, and it's hard to do that with traditional fine-tuning. This is where RLHF can help: with normal fine-tuning you have ground-truth data, for example this dialogue plus a human-generated baseline completion, a golden answer.
So when you fine-tune, you always optimize toward that one specific answer. In RLHF there is no single right answer; it's more about the nuances you're capturing, slightly better or slightly worse reward scores, so it's a bit more flexible in how you align the model toward what you want to achieve, for example slightly different completions that sound more helpful to humans. That's really the difference between the two and why RLHF performs so well at this task. Obviously this is all active research at the moment, as you probably know, so a new technique might come out soon that's even better, but for now the research community and practitioners seem to agree that RLHF does a pretty decent job.

Awesome. I think our next question is: which vectorization method is best suited for RLHF with LLMs? I'm not sure I fully understand the question, but each LLM has its own vector space, made up of the vocabulary the language model has been pre-trained on. Maybe we can get some more clarity on that one.

OK, we'll move on to the next question: how does RLHF differ from fine-tuning the model using a few-shot prompt? The difference is that with one-shot, zero-shot, or few-shot prompting you're not actually changing the weights of the model; you're guiding it at inference time with examples toward the output you would like to see. The bigger models have a lot of in-context learning capability, so they can pick up the examples and slightly shift the output toward what you're providing, but there's a limit to what you can do with few-shot examples. With RLHF you're actually changing the model, training and fine-tuning parameters, so the behavior is embedded in how the model responds, and you don't have to put examples in at every inference. RLHF is a more permanent change to the model toward the outputs you want to see, whereas few-shot prompting doesn't change the model, or the token vocabulary distributions it has learned, but tries to guide it at each inference toward the output you want. And then it's experimentation: if few-shot prompting achieves the results you're looking for, definitely go for that, but if you see it's not getting where you want it to be, you should probably look into fine-tuning and RLHF techniques.

Absolutely, and I think we have one more question left for the time we have: can reinforcement learning help surpass BERT-like models at handling text classification, analysis, and so on? I think more broadly the question might be: do generative models outperform BERT? Certainly you would apply RLHF to the generative model, but in fact the first couple of generative models I built were based on our book, Data Science on AWS, where we were doing classification, and I was amazed that these generative models can perform classification.
I think we were using a reviews dataset with star ratings one through five, and even with just in-context learning, without any fine-tuning, I could give it three examples of a positive review and three examples of a negative review in the actual prompt, ask it to classify, and it actually worked. A funny property, by the way, is that you can trick the model by reversing those: give it three positive reviews labeled as negative and three negative reviews labeled as positive, and it will follow that during in-context learning. So, absolutely.

Perfect. Well, thank you so much, Antje and Chris; that ends our workshop with you today. For everyone in the audience, thank you for coming. We would love for you to stay connected by subscribing to our newsletter; we will continue hosting events, and I will also drop the link in the chat for our survey to help improve our events and community activities. You will definitely be seeing Antje and Chris again soon with more workshops and education. Thanks everyone, take care, bye. Thanks.
Info
Channel: DeepLearningAI
Views: 22,811
Id: -0pvrCLd2Ak
Length: 61min 1sec (3661 seconds)
Published: Thu Aug 03 2023