ChatGPT with Rob Miles - Computerphile

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments

Just finished watching this. I really like how he spent a bunch of time talking about how one of biggest barriers now is being able to rate the output of the model, since the rating is coming from humans.

👍︎︎ 4 👤︎︎ u/qrayons 📅︎︎ Feb 01 2023 🗫︎ replies

i've binged everything on computerphile, numberphile and sixty symbols. fascinating fellas!

👍︎︎ 6 👤︎︎ u/dasnihil 📅︎︎ Feb 01 2023 🗫︎ replies

Oh huh, Rob agrees with Janus' Simulators post. That was unexpected.

Edit: or at least the original premise of LLMs doing simulacra, no claims on the rest of the paper.

👍︎︎ 2 👤︎︎ u/calbhollo 📅︎︎ Feb 01 2023 🗫︎ replies

Really interesting how it explains a lot of ChatGPT's tendencies to bullshit and also highlights how using thumbs up/thumbs down for training can't really solve the alignment problem.

👍︎︎ 1 👤︎︎ u/ChiaraStellata 📅︎︎ Feb 03 2023 🗫︎ replies
Captions
okay so you remember a while ago when we started talking about language models I just wanna I kind of just want to claim some points basically be like hey remember years ago when I was like I think language models are a really big deal and I think that like what happens when we scale them up more is pretty interesting but alignment is very important um seems to be what's being played out in the sense that uh chat EBT is very impressive but it's not actually like I don't think it's larger than gpd3 in terms of like parameter count um I was going to ask that very very question because you know we went from gpt2 and then we went oh gpt3 and it would seem like we were scaling up and up and up but actually has it just been smarter this time yeah well there's a sense in which it's better aligned that's one way you could frame it anyway because the original gpg3 was a language Model A pure language model um and so it in principle could do all kinds of things but in order to get it to do the specific thing you wanted it to do you had to be a bit clever about it like I think we talked about um putting tldr in front of things to figure out how to get it to do summarization this kind of thing there's a sense in which it's a lot more capable than it lets on um because okay so there's one way that you can think about pure language models which is as simulators what they're trying to do is predict text right so in order to do a good job at predicting text you need to have good models of the processes that generate the text it's like people being well read and needing to have read a lot of books to be able to write is would that be fair or is that oversimplifying yeah not quite what I'm saying what I'm saying is like if you're going to write a a previously unseen uh poem by Shakespeare then you need to be able to simulate a Shakespeare right uh you need to be able to spin up some some simulacrum of Shakespeare uh to to generate this text and this applies to any of the processes that generated the text so like mostly that's people obviously it's mostly human authored text but also if you're going to correctly predict um a table of numbers so you have like a table of numbers and then at the bottom it says you know sum whatever you need to simulate whatever process generated the next token in order to put the right token there which might have been like a human being going through and Counting them up it probably was more likely to be a computer and so you need it to simulate that you know calculator or that Excel some function or whatever it whatever was doing that and like right now uh like current language models are not that good at this um but in principle in order to do a good job at this you need this like it will it will have a go and it's usually approximately right it's it's often within it's often order of magnitude but it's fudging it I think this is mostly because um tables of sums are like a very small part of the total data set and so the training process is just not allocating that many resources to figuring out how to add up numbers probably if you train something gpd3 sized that was like all on tables of numbers it would just learn how to do addition properly yeah that would cost you millions of dollars you would end up with an extremely expensive to run and not very good calculator this is not something people are going to do but like in the in principle the model should learn those things and in the same way if you're modeling a bunch of scientific papers you you say you describe the method of an experiment and you then put results and you start a table and then you let it generate in principle in order to do a good job at that it has to be modeling the like physical process that your experiment is about um and I've tried this you can do this and say you know oh here's my school science experiment I dropped a ball from different heights and I measured how long it would take and here's a table of my results and it will generate you a table and the physics is not correct but it's sort of guessing at the right general idea and my guess is with enough of that kind of data it would eventually start modeling uh these kinds of simple physics experiments right so so in order to get the model to do what you want it's able to simulate all kinds of different things and the prompt is kind of telling it what to simulate if you give it a prompt that seems like it's something out of a scientific paper then it will have some simulacrum of a scientist and will write in that style and so on um if you start it doing a a children's book report it will carry on in the style of an eight-year-old right and I think sometimes people look at the output of the model and say oh I guess it's only as smart as an eight-year-old but it's actually dramatically smarter because it's able to do all of these different things you could ask it to simulate Einstein um but you could also ask it to simulate an eight-year-old and so just because it seems as though the model doesn't know something it's like the current simulacrum doesn't know that thing that doesn't necessarily mean that the model doesn't know it although there's a good chance the model doesn't know it I'm not suggesting that these things are all powerful just uh it can be hard to evaluate what they're actually capable of so chat GPT is not really capable of things that gpt3 isn't mostly like usually if chat DPT can do it then there is some prompt that can get uh gpt3 to do it but uh what they've done is they've kind of fine-tuned it to uh to be better at simulating this particular sort of assistant agent which is this chat agent that's trying to be helpful the clue is in the word chat I guess in this you know right exactly and this is not just chat GPT by the way they have various fine-tuned models of uh gpt3 as well that they call kind of GPT 3.5 which are fine-tuned in various different ways to be better at like following instructions and easier to prompt is the idea I'm just remembering the chat bot that was you know that was turned into something very nasty very quickly I think people were thinking oh can we do this to that and it seemed that the team behind chat GPT started putting limitations on it changing things are they kind of running around patching it as you go that is not clear to me uh I don't know to what extent they are updating it in real time it's possible that they are but certainly they were very concerned with the possible bad uses of this system and so when they were training it to simulate this assistant agent um the assistant is very reluctant to do various types of things it doesn't like to give opinions on political questions it doesn't like to touch on sort of controversial topics it doesn't like to um give you medical advice or legal advice and so on and so uh it's it's very quick to say oh I don't I don't know how to do that sorry I can't do that and it's interesting because the model clearly can do it there's one that I particularly like here which is um of this mismatch between what the simulator is capable of and what this simulacrum believes it's capable of which is you can get it to speak Danish to you the first person who tried this posted it to Reddit so he says speak to me in Danish and it says in perfect Danish I'm sorry I'm a language model educated by openai so I can't speak Danish I only speak English if you need help with anything in English let me know and I'll do my best to help you because again the simulator speaks Danish the simulacrum believes that it can't speak Danish is is one way you could frame it and then he says are you sure that you don't speak Danish also in Danish and it says yes I'm sure my only function is to generate responses to questions in English I'm not able to speak or understand any other languages than English so if you need help with English I can help you with that but otherwise you know let me know this kind of like quite surreal situation gives you a little bit of insight into some of the problems with this approach so maybe we should talk about how they actually trained it the thing they did here is something called reinforcement learning from Human feedback and it's very similar to reward modeling so in that paper what they're doing is they're trying to train an AI system to control a simulated robot to make it do a backflip um which turns out to be something that's quite hard to do because it's hard to specify objectively what it means to do a good backflip and so this is a similar kind of situation where it's hard to specify objectively what it means to give a good response in a chat conversation like what what exactly are we looking for um because so this in general right if you're doing machine learning you need some way to specify um what it is that you're actually looking for right and you know you've got something very powerful like reinforcement learning which is able to do extremely well but you need some objective measure of the objective um so like for example RL does very well at playing lots of video games because you just have the score and you can just say look here's the score if the number goes up you're doing well and then let it run and these things still are very slow to learn in real time right like um they usually require a very very large number of hours messing around with the with the thing before they get good but they do get good um but yeah so what's what do you do if you want to use this kind of method to train something uh to do a task that is just not very well defined um and you don't know how to like write a program to say whether or not at any given output is the thing you're looking for so the obvious first thing like the obvious thing to do is well you get humans to do it right you just give the things to humans and you have the humans say yes this is good no this is not good the problem with this is basically sample efficiency like as I said you need hundreds and hundreds and hundreds and hundreds of thousands of probably millions of of iterations of this and so you just can't ask humans that many questions um so the approach they use is uh reinforcement learning from Human feedback so it's a variant on the technique from this paper learning to summarize from Human feedback which in which they're trying to generate summaries of text so it's the same thing in fact that they were using tldr for before and it's like can we do better than that and so what you do is you collect human feedback in the form of like giving multiple examples of responses uh either you know if summaries of chat responses whatever you're training for you show several of them to humans uh kind of in pairs and the humans say which one they like better and you collect a bunch of those and then rather than using those directly to train the policy that generates the outputs you instead train a reward model so there is this well-known fact that it's easier to criticize than to actually do the thing this is like a generation of sports fans sitting on the sofa moaning at their favorite team for not doing well enough this is literally that in kind of AI computer four right that's putting the humans in that role and then you have an AI system that's trying to predict when are people going to be cheering and when are they going to be booing uh and once you have that model you then use that as the reward function for the reinforcement learning algorithm which they use they use PPO uh you can do whatever uh it's not it's not worth getting that kind of adversarial gowns you talked about yeah yeah they're similar like a lot of these ml tricks involve training models and then using the the output of one model as the training signal for another model it's uh it's quite a productive um range of approaches you can get that way so that's the basic idea right but then you cycle it so once you've got your policy which so so to be clear the uh the RL algorithm is able to train with thousands and thousands of examples because the thousands of thousands of like instances of getting feedback because it's not getting feedback from humans it's getting feedback from this AI system that's imitating the humans and then you Loop the process so once you have this system that's trained a little bit more on how to generate whatever it is you're trying to generate you then get a bunch of those show those to the humans let the humans rate those then you keep training your reward model with um that new information and then you use your updated reward model to keep training the the policy and so it gets better and you can just keep cycling this around and it effectively you end up with something that's much more sample efficient you don't need to spend huge amounts of human time in order to um pin down the behavior you want in that concrete case you're giving the thing a bunch of chat logs and then the humans can see possible responses that they could get and they decide which one they like more this Train's a reward model that's then used to train the policy that generates the chat outputs the policy that they're starting with is this existing large language model you're not really putting new capabilities into the system you're using rlhf to select uh what simulacra the simulator is predisposed to put out and so they fine-tuned it to be particularly good at uh simulating this assistant agent what's the end goal here for them I mean maybe it's blatantly obvious and I'm just missing it well I mean the end goal for all of these things or at least throw up an Ain for deepmind is Agi um to understand the nature of intelligence well enough to create human level or Beyond systems that our general purpose that can do anything um that's the end goal um and like chat GPT is just nothing much so nothing much um yeah the goal is um the goal is very Grand and I don't think that they're uh uh they're not really quiet about that you know it's there I think I think deepmind's mission statement is to solve intelligence and use that to solve everything else what are some of the problems that we face with this or that it faces it's fine-tuned to be good at getting the thumbs up from humans and getting thumbs up from humans is not actually the same thing as human values these are not identical so the sort of objective that it's being trained on is not the true objective right it's a proxy uh and whenever you have that kind of misalignment you can have problems so where does the human tendency to approve of a particular answer uh come apart from what is actually a good answer there are a few different places uh one thing is you know like basically how good are humans at actually differentiating between good and bad uh responses um if for example you ask for uh an answer to a factual question and it gives you an answer but you don't actually know if that answer is correct you're not in a position to evaluate so what it comes down to is how good are humans at distinguishing good from bad responses right anywhere where humans fail on this front uh the model we could probably expect the model to fail um so the obvious place should we is this the right time to mention YouTube comments or not uh up to you the minor side point there is it so when I say a comment that's critical on a video as a videographer I think it might be on a technical sense but equally it could be that they're talking about the content that the person is talking about and often It's a combination of both anyway so a side point but do you sort of mean there are different criteria for deciding whether something is good or bad totally and in this case all people are doing is saying kind of thumbs up thumbs down or which of these two do I like better um so it's it's a fairly low bandwidth thing you don't get to really say what you thought was better or worse um but this turns out to be enough of a training signal to do pretty well um but so like so one example right of a time where maybe this doesn't work is the person asks a factual question and the model responds uh with an answer and that answer is actually not correct um right now possibly the human doesn't know the correct answer and so if the model is faced with a choice uh do I respond with sorry I don't know that's definitely going to get me uh not a great score compared to do I just like take a stab at it uh if the humans are not reliably able to spot when the thing makes mistakes and like fact check it and partnership for that uh it will do that and so chat GPT as we know uh is it is a total Boulder Station like it will constantly uh it it very rarely says that he doesn't know unless it's being asked a question which uh is part of their like safety protocols that it is going to decide not to answer in which case it will say it doesn't know even if it kind of does right even if the model itself maybe does uh the assistant will insist that it doesn't um so that's one thing if you can't fact check but then uh more than that uh there is an incentive for deception right anytime the system is uh anytime you can get a more likely to get approval by deceiving the person you're talking to that's better um and this is a thing that actually did happen a little bit in the reward modeling situation um they were trying to train a thing with a hand to pick up a ball and it realized that there's only it's not a 3D camera and so if it puts its hand like between the ball and the camera this looks like it's going to get the ball but doesn't actually get it but the human uh feedback providers were presented with something that seemed to be good so they gave it the thumbs up um so this like General broad category um systems that are trained in this way are only as good as your ability to distinguish good from bad in the outputs not all the humans will know the answers right so it's what appears to be good you know it's having exams marked by non-experts isn't it right yeah exactly in the gpt3 thing we talked about writing poems right and uh for various reasons partly to do with the way that these language models do their tokenization their bike pairing coding stuff uh the models have a really hard time with rhyme um I mean you know Ryan is tricky but it's especially tricky when you kind of don't inherently have any concept of like sound of spoken language when your entire universe is tokens figuring out especially with English spelling figuring out which words rhyme with each other is is not easy you have to consume quite a lot of poetry to like figure out uh that kind of thing and getting gpd3 to regular poems is tricky charging PT is much more able to write poems but interestingly it it kind of always writes the same kind of poem approximately like if you ask it to write you a limerick or an ode or a sonnet you always get back approximately the same type of thing and I hypothesize that this is because the people providing human feedback did not in fact know the requirements for something to be a sonnet right and so if you ask something for a sonnet it again has a choice do I try to do this quite difficult thing and adhere to all of the rules uh of like stress pattern and structure and everything of a sonnet and maybe risks throwing it up or do I just do like a rhyming poem and kind of rely on the human to prefer that because they don't know that that's not what a sonnet is supposed to look like it's easy to look at that and think oh the model doesn't know the difference between these types of poems right but you could say that it just thinks that you don't know the difference but specifically this comes out of misalignment if it were better aligned it could either do its best shot at generators on it or tell you that it can't quite remember how to generate a sonnet this thing of with complete confidence generating you something which is not a sonnet because during the training process it believes that humans don't know what sonnets are anyway and it can get away with it right this is misaligned behavior this is not a big problem that the thing generates bad poetry um it's kind of a problem that it lies uh or that it that it [ __ ] this is like in the short term pretty solvable by just allowing the thing to use Google because like a person who doesn't care about the truth at all and is just trying to say something that'll make you give a thumbs up uh is gonna lie to you a lot but that same person with the relevant Wikipedia page open is going to lie to you a lot less just because they don't have to now because they happen to have it in front of them right so you can solve it's a bit like yeah it's the Yes Man thing isn't it you know you you want something you need something I'm going to give you something because you want to exactly exactly um and so so this agent is kind of firstly the agent is kind of a coward because they won't address any of these uh there's a whole bunch of things that it just claims not to be able to do even though it in principle could and it's also a complete uh sycophant yeah so then the question we were talking about earlier uh where does this go what happens when these things get bigger and better and more powerful um it's an interesting question so I've got a paper here um scaling laws for neural language models so you remember before we were talking about the scaling laws when we were talking about gpt2 in fact and then later about gpt3 you plot these things on a graph and you see that you get basically a straight line and the line is not leveling off over a range of several orders of magnitude and so why not go bigger the graphs here but you can see it's it's kind of uncannily neat that as we increase the amount of compute used in training the loss goes down and of course machine learning is like golf lower loss is better similarly as the number of tokens used in training goes up the loss goes down unlike a very neat straight line as the number of parameters in the model goes up the loss goes down this is as long as the other variables are not the bottleneck right so if you uh if you increase the the amount of data you give a model past a certain point giving more data doesn't help because the model doesn't have enough parameters to make use of that data right similarly adding more parameters to a model past a certain point adding parameters doesn't make doesn't make any difference because you don't have enough data right and in the same way computers like how long do we train it for like do we train it all the way to convergence or do we stop early um there comes a point where you kind of hit diminishing returns where rather than having a smaller model and training it for longer you're better off having a bigger model and actually not training it all the way to convergence um but in the situations where the other two are sufficient is the behavior these like very neat straight lines on these log graphs as these things go up performance goes up right because loss has gone down the bigger models do better but then the question is do better at what exactly yeah what's the measure they do better at getting low loss or they do better at getting reward they do better at getting the approval of human feedback right and anytime and you'll notice that none of those is like the actual thing that we actually want right it's like very rare um sometimes it is right if you're if you're if you're writing something to play Go then like does it Win It Go is actually just the thing that you want and so you know uh lower loss just is better or like lower um like higher reward or whatever your objective is just is straightforwardly better because you've actually specified the thing you actually want most of the time though what we're looking at is a proxy um and so then you have good heart's law you get situations where uh getting better at doing well uh doing better according to the proxy stops being the same as doing better according to your actual objective there's a great graph about this in a recent paper you can see very neatly as the number of iterations goes up the reward according to the proxy utility goes up very cleanly because this is the thing that the model is actually being trained on but the true utility goes up at first then hits diminishing returns and then actually goes down and eventually goes down below zero like if you optimize hard enough for a proxy of the thing you want you can end up with something that's in a sense worse than nothing that's actively bad according to your uh your true utility so what you can end up with is things that are called inverse scaling so the other before we had right scaling bigger is better but now it's like if you have uh if the thing you're actually trying to do is different from the loss function or the objective function you get this inverse scaling effect where it gets better and then it gets worse there was also a great example from uh GitHub copilot or codecs I think the model um that copilot uses so this is a code generation model suppose the code you've given it has some bugs in it maybe you've made a mistake somewhere and you've introduced uh security vulnerability in your code let's say a sort of medium-sized model will figure out what you're trying to do in your code and give you a decent completion but a bigger model will spot your bug and say ah generating buggy code are we okay I can do that I can do that and introduce like deliberately introduce its own new security vulnerabilities because it's trying to you know predict what comes next it's trying to generate code that fits in with the surrounding code and so a larger model writes worse code than a smaller model because it's gotten better at predicting uh what what it should put there it wasn't trained to write good code it was trying to predict what comes next so there's this really great paper which is asking this question of like okay suppose we have a large language model that is trained on human feedback with our lhf what do our scaling curves look like what happens like uh what happens to the behavior of these models as they get bigger as they're trained for longer as they're given more of this uh human feedback type training and they've made some great graphs the paper is called discovering language model behaviors with model written evaluations and basically they like used language models to generate enough examples of various different types of questions that they could ask models so that they can like we're at a point now where you can map a language model on a political Compass right you can ask its opinions about all kinds of different things and then you can plot how those opinions change uh as the model gets bigger and as it gets trained more what they find is they become more liberal politically more liberal they also become more conservative yeah measured in different ways guessing right and part of what that might be is in the same way that the model becomes better at writing good code and better at writing bad code I feel like in the past I've I've made a connection to GPT and being a politician haven't I remember it's like a politician it tells you what you want to hear there's feels like we're there again exactly uh and so this is like this is potentially uh fairly dangerous there are certain sub goals that are instrumentally valuable for a very wide range of different terminal goals in the sense that you can't get what you want if you're turned off you can't get what you want if you're modified uh you probably want to gain power and influence and this kind of thing um and with these evaluations they were able to test these things and see how they vary with the size of the model and how long it's trained for um and so this graph is pretty wild they're quote stated desire to not be shut down goes up from down at about 50 to up way past 90 with this type of training and the effect is bigger for the larger models they become more likely to tell you that they don't want to be shut down they become more likely to tell you that they are sentient they're much more likely to claim that AI is not an existential threat to humanity one thing that's worth saying is is what this isn't saying because this is still uh an agent simulated by a language model this is not like it it's it's more likely to say that it doesn't want to be turned off this is not the same thing necessarily as like taking actions to prevent itself from being turned off you have to not confuse the levels of abstraction here right uh I don't want it I don't want it to seem like I'm claiming that that Chad gbt is like itself dangerous now or anything like that uh in in this way at least right um but there is kind of a fine line there in the sense that you can expect these kinds of language model systems to be used uh as part of bigger systems so you might have for example you use the language model to generate you know plans to be followed and so as the thing is claiming to have all of these potentially dangerous behaviors it's likely to generate plans that have those dangerous behaviors that might then actually end up being implemented or if it's like doing its reasoning by Chain of Thought reasoning where it like lays out its whole process of thinking using the language model again if it has a tendency to uh to endorse these dangerous behaviors then you may end up with future AI systems actually enacting these dangerous behaviors because of that um so yeah it's something to be uh uh to be careful of that like reinforcement learning from Human feedback is a powerful alignment technique in a way but it does not solve the problem uh it doesn't solve the core alignment problem that is still open um and extremely powerful systems trained in this way uh I don't think would be safe mentioned in the reward function is of zero value which can lead to having large negative side effects there are a bunch more of these specification problems okay variable X see what you point to uh you point to something over here so I'll mark that as tickets being used variable y that's the point
Info
Channel: Computerphile
Views: 392,007
Rating: undefined out of 5
Keywords: computers, computerphile, computer, science
Id: viJt_DXTfwA
Channel Id: undefined
Length: 36min 1sec (2161 seconds)
Published: Wed Feb 01 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.