RLHF+CHATGPT: What you must know

Captions
I want to talk a little bit about RLHF. We spoke about that before we started recording, and it has a very interesting effect on language models: it's not so much that it changes their capabilities, but that it creates an interface for us to use them. In the most abstract sense, how do you understand RLHF?

Yeah, so I think in popular culture there's now actually quite a good meme that explains it, and I think it carries the spirit of it: the Shoggoth meme that you've probably seen. The idea is that the base model, the one just trained to model P(X), the distribution over internet text, is this very chaotic, enormous being that has essentially modeled the whole internet of text. There are lots of crevices within that distribution, and there's both good and bad in there: probably a lot of hate speech, but also lots of very brilliant content. That's what you get if you just model the whole internet as a distribution using next-token prediction on these large models.

Then of course the problem becomes: if I ask this model to perform a task by providing it a prompt, a prefix to a text that I want it to complete, it can be difficult to anticipate how it's going to complete it. If I ask it a question about calculus, is it going to complete it as a professor of math, or is it going to give me a response from a random commenter on 4chan or a Reddit post? You might get wildly different qualities of response depending on which it is. In reality it's this giant mass of people that it has modeled, something like multiple personalities times a billion people on the internet.

The meme then says that RLHF is essentially sticking a smiley face on top of this. It hides the mess, hides the fact that what has been modeled is this chaotic population of text, and instead provides you with a very friendly interface into specific parts of that mass of people. The way it does that is by fine-tuning the model on a reward signal that is itself learned from human preference data. You collect human feedback on different generations of the language model and use it to train another model that outputs a feedback signal: given an input, how good is this output? So it's modeling empirically collected human preferences. Then you use that as a reward signal to fine-tune the generations of your language model, treating the language model in the fine-tuning phase as a reinforcement learning policy: given the prompt and what I've generated so far, what is the next token I should predict? You treat this as a reinforcement learning problem where the reward signal is the human preference model. What that's doing, in effect, is saying: you started with P(X), which models the distribution of internet text, and now you're going to use RL to start introducing bias into that distribution.
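A minimal sketch of the recipe just described, with everything shrunk to a single prompt and a handful of canned completions. The completion names, preference pairs, learning rates, and iteration counts are invented for illustration; the Bradley-Terry loss for the reward model and the REINFORCE-style policy update are standard choices assumed here, not details taken from the conversation.

```python
import numpy as np

rng = np.random.default_rng(0)

completions = ["textbook derivation", "professor-style answer",
               "reddit guess", "4chan rant"]
n = len(completions)

# 1) "Base model": a distribution over canned completions, standing in for
#    the P(X) learned from internet text.
base_logits = np.log(np.array([0.15, 0.15, 0.40, 0.30]))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# 2) "Human preference data": pairs of (preferred index, rejected index).
#    Here the pretend raters consistently favour the first two completions.
preferences = [(0, 2), (1, 3), (0, 3), (1, 2)] * 25

# 3) Reward model: one scalar score per completion, fit with a
#    Bradley-Terry / logistic loss, -log sigmoid(r[chosen] - r[rejected]).
reward = np.zeros(n)
for chosen, rejected in preferences:
    margin = reward[chosen] - reward[rejected]
    step = 1.0 / (1.0 + np.exp(margin))   # gradient that pushes the margin up
    reward[chosen] += 0.05 * step
    reward[rejected] -= 0.05 * step

# 4) RL fine-tuning: treat the model as a policy over completions and run a
#    REINFORCE-style update against the learned reward.
policy_logits = base_logits.copy()
for _ in range(5000):
    p = softmax(policy_logits)
    a = rng.choice(n, p=p)                              # sample a generation
    policy_logits += 0.01 * reward[a] * (np.eye(n)[a] - p)

print("base model :", dict(zip(completions, softmax(base_logits).round(2))))
print("rewards    :", dict(zip(completions, reward.round(2))))
print("after RLHF :", dict(zip(completions, softmax(policy_logits).round(2))))
```

Running it, the probability mass that started on the Reddit- and 4chan-style completions shifts toward the preference-favored ones, which is the bias being introduced into P(X) described above.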
What is interesting about this is that when you train a typical language model, you're basically training it with something like a cross-entropy loss, and cross-entropy loss is equivalent to a divergence metric between two distributions. So if you had tons of computation, then when the process converges you should expect that your model is minimizing the distributional divergence, the distance between its learned distribution over text and the training data's distribution over text. When you minimize this loss, the model should actually be matching the distribution of text on the internet.

What reinforcement learning does is almost the opposite of this. Reinforcement learning is not doing distribution matching; reinforcement learning is mode-seeking. Suppose you had some data where in 51 percent of the examples the right answer is A and in 49 percent it's B, but you can't tell ahead of time which answer it should be, because the two inputs are aliased. If you keep training the policy to maximize the reward, reinforcement learning is going to always choose the first answer, because it has a slight edge in the distribution, and by always choosing it you maximize the reward in expectation. That's what it's doing: reinforcement learning is mode-seeking.

So when you apply this on top of a P(X) that was learned by distribution-matching internet text, you're introducing these mode-seeking biases into your model. The generations are going to tend to hone in more on the types of outputs, the parts of the domain of language, where the learned human preference model has assigned a higher reward. What you're doing is losing a lot of the diversity of P(X), in exchange for perhaps more reliable generations that take you into the parts of the original distribution that had higher-quality answers. So maybe now, if the model is tuned to give good answers to math questions, when I ask it a calculus question it will tend to favor completions that model the output of a college professor of math, rather than someone asking the same question on Reddit and saying, "Help, I don't know how to do this problem."
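The 51/49 example can be made concrete in a few lines. This is a toy sketch, not anything from the conversation: a two-outcome model trained once with cross-entropy, which matches the data distribution, and once by ascending the exact policy gradient of expected reward, which collapses onto the slightly-better answer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Aliased input: the data says the right answer is A 51% of the time, B 49%.
data_dist = np.array([0.51, 0.49])

# --- Cross-entropy training: distribution matching --------------------------
ce_logits = np.zeros(2)
for _ in range(5000):
    p = softmax(ce_logits)
    ce_logits -= 0.1 * (p - data_dist)        # gradient of cross-entropy w.r.t. logits
print("cross-entropy model:", softmax(ce_logits).round(3))       # ~[0.51, 0.49]

# --- Reward maximisation: mode seeking ---------------------------------------
# Expected reward of answering A is 0.51, of answering B is 0.49.
expected_reward = data_dist
rl_logits = np.zeros(2)
for _ in range(20000):
    p = softmax(rl_logits)
    # exact policy gradient of E[reward] = sum_a p[a] * expected_reward[a]
    rl_logits += 1.0 * p * (expected_reward - p @ expected_reward)
print("reward-maximising policy:", softmax(rl_logits).round(3))  # approaches [1, 0]
```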
Yes, yes. So what could possibly go wrong? As you say, a language model is learning this conditional probability distribution: conditioned on a sequence of tokens, what's the next token? And that probability distribution has loads of modes; it's like this big, hilly landscape. Some of the modes are 4chan, some of the modes are Stanford University, and we want to snip out all of the bad ones and keep the ones we like. I wanted your intuition on how this pruning, if we can call it a pruning process, affects both the capability and the bias.

So I think it essentially improves how reliable the answers are by introducing a bias: you're biasing the model to generate completions that were favored by humans when you collected the preference data. If you assume the preference model it's trained against is a proper reflection of human preferences, then it is biasing the model toward whatever the human participants preferred. That itself also introduces bias, because it's the specific humans who are providing their value assignments to the completed answers; you're essentially distilling their preferences. So the choice of humans used to collect the preferences is very important, because the model is ultimately going to exhibit those values as well.

And this comes at the cost of the diversity of the generations that you can sample. Rather than sampling lots of different possible paths, which, from the "Why Greatness Cannot Be Planned" perspective, is sometimes quite useful, because generating something that's unlikely under a fine-tuned model can act as a prefix, a stepping stone, to a better answer that somehow got glossed over by the preference function you learned, you end up generating with less diversity. You end up going toward answers that are good, when maybe there are better or more interesting answers that would otherwise have been generated.

You've touched on a couple of things there. Even before we do RLHF, when you look at the probability distribution, it's kind of exponentially distributed: it's still likely to say the next word is this or this, and much less likely to be something else. But as you say, from an open-ended perspective we are making it far more convergent, and you could argue that's a form of robustification. Then, when the humans give their preferences, well, we were talking about "good" earlier, and that's a proxy. We have benchmarks like BIG-bench, and we have "this human over there said it looked like a good output." What's the right thing to do? I guess what I'm saying is that there's what's right from an alignment point of view, so morality and ethics; there's being able to perform well on mathematics challenges; and there's "this human over there thought it was a good thing to do."

Yeah, I think it really depends on the use case, and this ties back to the industry slant. RLHF does make a lot of sense if you want your language model to essentially be a Google replacement. If you want it to be a search engine, then it makes a lot of sense to bias it toward the subset of the distribution that corresponds to good answers to search queries. An example of where this would be in more direct opposition to a use case would be if you want to use a language model as a creative writing assistant. If I want the language model to help me generate new creative art in the form of a novel, a short story, or poetry, then by using RLHF you're reducing the diversity of the outputs: you're basically saying, "I'm going to play it safe," rather than playing the wild card and getting more interesting content. But when you're an author, when you want to actually create new cultural content that's interesting, you often want to play the wild card. You want to explore the space of ideas, and sometimes that requires going through stepping stones of ideas that may be pretty sub-optimal, or maybe controversial or offensive, but that are all required to get somewhere better, maybe somewhere you can't get to if you're just following the RLHF fine-tuned trajectories.
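As a small numerical illustration of the diversity cost being described, with entirely made-up distributions: a broad base distribution over completion styles next to one sharpened toward the safe, preference-favored styles, compared by entropy and by how varied a handful of samples looks. The rarely-sampled "stepping stone" style stands in for the unlikely prefixes the conversation says get lost.

```python
import numpy as np

rng = np.random.default_rng(0)

completions = ["safe textbook answer", "polished summary",
               "odd but promising tangent", "wild speculative riff",
               "half-baked stepping stone"]

base  = np.array([0.25, 0.20, 0.20, 0.20, 0.15])   # broad base distribution
tuned = np.array([0.70, 0.26, 0.02, 0.01, 0.01])   # sharpened toward "safe" outputs

def entropy_bits(p):
    return float(-(p * np.log2(p)).sum())

print("entropy: base =", round(entropy_bits(base), 2), "bits,",
      "tuned =", round(entropy_bits(tuned), 2), "bits")

for name, dist in [("base", base), ("tuned", tuned)]:
    samples = rng.choice(completions, size=20, p=dist)
    print(f"{name}: {len(set(samples))} distinct completions in 20 samples;",
          "stepping stones sampled:", int((samples == "half-baked stepping stone").sum()))
```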
following the uh RL rohf uh fine-tuned trajectories yes and and that's why lots of um creative people I think I read on less wrong they prefer using the original DaVinci 3 because they thought that the command models were less created but um here's something interesting I mean as an open-ended person
Info
Channel: Machine Learning Street Talk
Views: 42,574
Id: PBH2nImUM5c
Length: 10min 48sec (648 seconds)
Published: Sun Mar 26 2023