Training AI Without Writing A Reward Function, with Reward Modelling

Video Statistics and Information

Reddit Comments

this guy rocks

👍 8 · u/nameless_pattern · Dec 13 2019

Where's the code to try this myself?

👍 1 · u/Stack3 · Dec 26 2019
Captions
Hi. What is technology? Don't skip ahead, I promise I'm going somewhere with this. You could have some kind of definition from a dictionary, something like "technology is machinery and equipment made using scientific knowledge", but where are the boundaries of the category? What counts? For example, is a pair of scissors technology? I think most people would say no, although it does meet the definition. Perhaps scissors used to be technology, but now I think they're too simple, too well understood. Once we've really nailed something down and figured out all of the details, people stop thinking of it as technology. I think in order to be technology, something has to be complex and unpredictable, maybe even unreliable. YouTube, for example, is definitely technology, as is the device you're watching this on.

Okay, why does this matter? Part of my point is that exact definitions are really difficult, and this generally isn't much of a problem, because language doesn't really work by exact definitions. Maybe it's hard to specify exactly what we mean when we use a word like "technology", but to paraphrase something from the US Supreme Court: you know it when you see it, and that's good enough for most uses.

The reason I bring this up is that sometimes people ask me about my definition of artificial intelligence, and I actually think that's pretty similar. You could say that AI is about trying to get machines to carry out human cognitive tasks, but then arithmetic is a cognitive task; does that make a calculator artificial intelligence? Sorting a list is a cognitive task; I don't think most people would call that AI. Playing a perfect game of noughts and crosses used to be considered AI, but I don't think we'd call it that these days. So to me, AI is about making machines do cognitive tasks that we didn't think they could do. Maybe that's because it's about making machines do human cognitive tasks, and once machines can do something, we no longer think of it as a human cognitive task. This means that the goalposts are always moving for artificial intelligence. Some people have complained about that, but I think it's pretty reasonable to have that as part of the definition. So that means the goal of AI research is to continue to expand the range of tasks that computers can handle, so they can keep surprising us.

It used to be that AI research was all about figuring out and formalizing things so that we could write programs to do them: things like arithmetic, sorting lists, and playing noughts and crosses. These are all in the class of problems you might call "things we can specify well enough to write programs that do them", and for a long time that was all we could do; that was the only type of problem we could tackle. But for a lot of problems, that approach is really, really hard. Like, consider: how would you write a program that takes an image of a handwritten digit and determines which digit it is? You can try to formalize the process and write a program; it's actually kind of a fun exercise if you want to get to grips with old-school computer vision and image processing techniques. And once you've written that program, you can test it using the MNIST data set, which is a giant collection of correctly labelled small images of digits. What you'll find is that if you do well, this thing will kind of work, but even the best programs written this way don't work that well. They're not really reliable enough to actually use; someone is always going to come along with a really blunt pencil and ruin your program's accuracy.
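That testing step is easy to make precise: scoring any candidate program against the labelled examples takes only a few lines. A minimal sketch in Python (the classify function and the data source here are hypothetical placeholders, not anything from the video):

```python
def accuracy(classify, labelled_examples):
    """Score a hand-written digit classifier against labelled data.

    classify: any function mapping an image to a predicted digit 0-9
              (hypothetical -- stands in for your hand-rolled program).
    labelled_examples: iterable of (image, true_digit) pairs, e.g. MNIST.
    """
    correct = total = 0
    for image, true_digit in labelled_examples:
        if classify(image) == true_digit:
            correct += 1
        total += 1
    return correct / total  # fraction the program got right
```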
And this is still a pretty easy problem. I mean, what if you wanted to do letters as well as numbers? Now you have to differentiate between O and 0, and between 1 and I and a lowercase l... forget about it, it's never going to work. And even that is a relatively simple problem. What if you're trying to do something like differentiating pictures of cats from pictures of dogs? This whole approach is just not going to work for that.

But there is a fact that we can exploit, which is that for a lot of these problems, it's much easier to evaluate a solution than to generate one. I've talked about this before: I couldn't generate a good rocket design myself, but I can tell you that this one needs work. It's easier to write a program to evaluate an output than to write one to produce that output. So maybe it's too hard to write a program that performs the task of identifying handwritten numbers, but it's pretty easy to write a program that evaluates how well a given program does at that task, as long as you have a load of correctly labelled examples: you just keep giving it labelled examples from the data set, and you see how many it gets right. In the same way, maybe you can't write a program that plays an Atari game well, but you can easily write a program that tells you how well you're doing: you just read off the score.

And this is where machine learning comes in. It gives you ways to take a program for evaluating solutions and use it to create good solutions. All you need is a data set with a load of labelled examples, or a game with a score, or some other way of programmatically evaluating the outputs, and you can train a system that carries out the task. There's a sense in which this is a new programming paradigm. Instead of writing the program itself, you write the reward function, or the loss function, or whatever, and the training process finds you a set of parameters for your network that perform well according to that function. If you squint, the training process is sort of like a compiler: it's taking code you've written and turning it into an executable that actually performs the task. So in this way, machine learning expands the class of tasks that machines can start to perform: it's no longer just tasks that you can write programs to do, but tasks that you can write programs to evaluate.
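To make the "new programming paradigm" point concrete, here's a minimal sketch (not from the video) of training in this style: the only thing we write by hand is the loss function, and gradient descent searches for parameters that score well under it. The toy data and model are assumptions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy labelled data set: 2-D points, label 1 if x + y > 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def loss(w):
    """The part we actually write: how bad is a parameter vector w?"""
    p = 1 / (1 + np.exp(-X @ w))          # predicted probabilities
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def grad(w):
    """Gradient of the cross-entropy loss with respect to w."""
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

# The "compiler": gradient descent turns the loss function we wrote
# into a set of parameters that performs the task.
w = np.zeros(2)
for step in range(1000):
    w -= 0.5 * grad(w)

print("final loss:", loss(w))  # small: training found a program we never wrote
```

The division of labour is exactly the one described above: we specified how to evaluate solutions, and the optimizer generated one.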
But if this is a form of programming, it's a very difficult one. Anyone who has programmed in C or C++ will tell you that the two scariest words you can see in a specification are "undefined behavior". "So, how many folks here are a little bit afraid of undefined behavior in their source code? ...Everybody." And machine learning, as a programming paradigm, is pretty much entirely undefined behavior. As a consequence, programs created this way tend to have a lot of quite serious bugs, and these are things I've talked about before on the channel. For example, reward gaming, where there's some subtle difference between the reward function you wrote and the reward function you actually meant to write, and the agent finds ways to exploit that difference to get high reward: it finds things it can do which the reward function you wrote gives high reward to, but which the reward function you meant to write wouldn't have. Or the problem of side effects, where you aren't able to specify in the reward function everything that you care about, and the agent assumes that anything not mentioned in the reward function is of zero value, which can lead to it having large negative side effects. There are a bunch more of these specification problems, and in general, this way of creating programs is a safety nightmare.

But also, it still doesn't allow machines to do all of the tasks that we might want them to do. A lot of tasks are just too complex and too poorly defined to write good evaluation functions for. For example, if you have a robot and you want it to scramble you an egg, how do you write a function which takes input from the robot's senses and returns how well the robot is doing at scrambling an egg? That's a very difficult problem. Even something simple, like getting a simulated robot to do a backflip, is actually pretty hard to specify well.

Normal reinforcement learning looks like this: you have an agent and an environment. The agent takes actions in the environment, and the environment produces observations and rewards. The rewards are calculated by the reward function; that's where you program in what you want the agent to do.
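That loop has a standard shape; here's a minimal sketch of it in Python, assuming a hypothetical Gym-style environment and agent (the interfaces are illustrative, not from the video or the paper):

```python
def run_episode(env, agent, max_steps=1000):
    """The standard RL loop: the agent acts, the environment responds
    with an observation and a reward. The reward function lives inside
    env.step() -- that's where the designer programs in what they want."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)               # agent chooses an action
        observation, reward, done = env.step(action)  # environment responds
        agent.learn(observation, reward)              # agent updates itself
        total_reward += reward
        if done:
            break
    return total_reward
```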
So some researchers tried this with the backflip task. They spent a couple of hours writing a reward function (it looks like this), and the result of training the agent with this reward function looks like this. I guess that's... basically a backflip? I've seen better.

Something like a backflip is very hard to specify, but it's not actually hard to evaluate. It's easy to tell if something is doing a backflip just by looking at it; it's just hard to write a program that does that. So what if you just directly put yourself in there? You just play the part of the reward function: every time step, you look at the state and you give the agent a number for how well you think it's doing at backflipping. People have tried that kind of approach, but it has a bunch of problems. The main one is that these systems generally need to spend huge amounts of time interacting with the environment in order to learn even simple things. So you're going to be sitting there saying "no, that's not a backflip; no, that's not a backflip either; that was closer; nope, that's worse again", and you're going to be doing this for hundreds of hours. Nobody has time for that.

So what can we do? Well, you may notice that this problem is a little bit like identifying handwritten digits, isn't it? We can't figure out how to write a program to do it, and it's too time-consuming to do it ourselves. So why not take the approach that people take with handwritten numbers? Why not learn our reward function? But it's not quite as simple as it sounds. Backflips are harder than handwritten digits, in part because: where are you going to get your data from? For digits we have this data set, MNIST, this giant collection of correctly labelled images. We built that by having humans write lots of numbers, scanning them, and then labelling the images. We need humans to do the thing, to provide examples to learn from; we need demonstrations. Now, if you have good demonstrations of an agent performing a task, you can do things like imitation learning and inverse reinforcement learning, which are pretty cool, but they're a subject for a later video. With backflips, we don't have that. I'm not even sure if I can do a backflip, and that wouldn't help. ("Wait, really? I don't have to do it?") No: we don't need a recording of a human backflipping, we need one of this robot backflipping, and our physiology is different. But I don't think I could puppeteer the simulated robot into a backflip either; that would be like playing QWOP on nightmare mode. So we can't demonstrate the task. So what do we do?

Well, we go back to the Supreme Court. Exactly defining a backflip is hard, and actually doing one is hard, but I know a backflip when I see one. So we need a setup that learns a good reward function, without demonstrations, just by using human feedback, and without requiring too much of the human's time. And that's what this paper does. It's called "Deep Reinforcement Learning from Human Preferences", and it's actually a collaboration between OpenAI and DeepMind. The paper documents a system that works by reward modelling. If you give it an hour of feedback, it does this, which looks a lot better than two hours of reward function writing.

So how does reward modelling work? Well, let's go back to the diagram. In reward modelling, instead of the human writing the reward function, or just being the reward function, we replace the reward function with a reward model, implemented as a neural network. The agent interacts with the environment in the normal way, except the rewards it's getting are coming from the reward model. The reward model behaves just like a regular reward function, in that it gets observations from the environment and gives rewards, but the way it decides those rewards is with a neural network, which is trying to predict what reward a human would give.

Okay, so how does the reward model learn what reward a human would give? Well, the human provides it with feedback. The way that works is: while the agent is interacting with the environment, trying to learn, the system will extract two short clips of the agent flailing about, just a second or two each, and present those two clips to the human, and the human decides which one they liked better, which one is more backflippy. The reward model then uses that feedback in basically the standard supervised learning way: it tries to find a reward function such that, in situations where the human prefers the left clip to the right clip, the reward function gives more reward to the agent in the left clip than in the right clip, and vice versa. So which clip gets more reward from the reward model ends up being a good predictor of which clip the human would prefer, which should mean that the reward model ends up being very similar to the reward function the human really wants.

But the thing I like about this is that the whole thing happens asynchronously; it's all going on at the same time. The agent isn't waiting for the human: it's constantly interacting with the environment, getting rewards from the reward model, and trying to learn, at many times faster than real time. And the reward model isn't waiting either: it's continually training on all of the feedback it's got so far. When it gets new feedback, it just adds that to the data set and keeps on training. This means the system is actually training for tens or hundreds of seconds for each second of human time used. The human is presented with a pair of clips and gives feedback, which takes just a few seconds to do, and while that's happening, the reward model is updating to better reflect the previous feedback it's got, and the agent is spending several minutes of subjective time learning and improving using that slightly improved reward model. So by the time the human is done giving feedback on those clips and it's time for the next pair, the agent has had time to improve, so the next pair of clips will show new, hopefully better, behaviour for the human to evaluate. This means it's able to use the human's time quite efficiently.
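The "standard supervised learning way" here is, in the paper, a comparison-based loss: the probability that the human prefers one clip is modelled from the two clips' summed predicted rewards, and the network is trained with cross-entropy against the human's actual choices. A minimal sketch of that loss in PyTorch, with a placeholder network (the architecture and tensor shapes are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts a scalar reward for a single observation (placeholder net)."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def preference_loss(model, clip_a, clip_b, human_prefers_a):
    """Cross-entropy loss on one human-labelled pair of clips.

    clip_a, clip_b: tensors of shape (timesteps, obs_dim).
    human_prefers_a: 1.0 if the human preferred clip A, else 0.0.
    """
    sum_a = model(clip_a).sum()   # total predicted reward over clip A
    sum_b = model(clip_b).sum()   # total predicted reward over clip B
    # P(A preferred) = exp(sum_a) / (exp(sum_a) + exp(sum_b))
    log_p = torch.log_softmax(torch.stack([sum_a, sum_b]), dim=0)
    target = torch.tensor(float(human_prefers_a))
    return -(target * log_p[0] + (1 - target) * log_p[1])
```

Minimizing this loss pushes the model to give more total reward to whichever clip the human picked, which is exactly the property described above.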
Now, to further improve that efficiency, the system doesn't just choose the clips randomly: it tries to select clips where the reward model is uncertain about what the reward should be. There's no point asking for feedback if you're already pretty sure you know what the answer is, right? This means that the user is most likely to see clips from unusual moments, when the agent has worked out something new and the reward model doesn't know what to make of it. That maximizes the value of the information provided by the human, which improves the speed at which the system can learn.

So what about the usual reinforcement learning safety problems, like negative side effects and reward gaming? You might think that if you use a neural network for your reward signal, it would be very vulnerable to things like reward gaming, since the reward model is just an approximation, and we know that neural networks are very vulnerable to adversarial examples and so on. And it's true that if you stop updating the reward model, the agent will quickly learn to exploit it, to find strategies that the reward model scores highly but the true reward doesn't. But the constant updating of the reward model actually provides pretty good protection against this, and the way the clips are chosen is part of that. If the agent discovers some crazy new illegitimate strategy to cheat and get high reward, that's going to involve unusual, novel behaviour, which will make the reward model uncertain, so the human will immediately be shown clips of the new behaviour. And if it's reward gaming rather than real progress, the human will give feedback saying "no, that's not what I want", the reward model will update on that feedback and become more accurate, and the agent will no longer be able to use that reward gaming strategy.
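In the paper, that uncertainty is estimated with an ensemble of reward predictors: candidate clip pairs are ranked by how much the ensemble members disagree about which clip would be preferred, and the most contested pairs are sent to the human. A rough sketch of that selection step, reusing the RewardModel sketch above (the interfaces are again assumptions):

```python
import torch

def select_query(ensemble, candidate_pairs):
    """Pick the clip pair the reward-model ensemble disagrees about most.

    ensemble: list of RewardModel instances (see sketch above), each
              trained on the same human feedback.
    candidate_pairs: list of (clip_a, clip_b) observation tensors sampled
                     from the agent's recent behaviour.
    """
    def disagreement(pair):
        clip_a, clip_b = pair
        probs = []
        for model in ensemble:
            with torch.no_grad():
                # Each member's predicted probability that A is preferred;
                # sigmoid of the reward-sum difference equals the softmax
                # probability used in the preference loss above.
                diff = model(clip_a).sum() - model(clip_b).sum()
                probs.append(torch.sigmoid(diff).item())
        mean = sum(probs) / len(probs)
        # High variance across members = the model is uncertain here.
        return sum((p - mean) ** 2 for p in probs) / len(probs)

    return max(candidate_pairs, key=disagreement)
```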
So the idea is pretty neat, and it seems to have some safety advantages. But how well does it actually work? Is it as effective as just programming a reward function? Well, for the backflip it seems like it definitely is, and it's especially impressive when you note that it took two hours to write this reward function, which needs a lot of expertise, compared to under one hour of rating clips, which needs basically no expertise. So this is two hours of expert time versus one hour of novice time. They also tried it on the standard MuJoCo simulated robotics tasks, which have standard reward functions defined for them. Here it tends to do not quite as well as regular reinforcement learning that's just directly given the reward function, but it tends to do almost as well, and sometimes it even does better, which is kind of surprising. They also tried it on Atari games. For those it needed more feedback, because the tasks are more complex, but again it tended to do almost as well as just providing the correct reward function, for several of the games. Also, there's kind of a fun implementation detail here: they had to modify the games to not show the score, because otherwise the agent might learn to just read the score off the screen and use that, and they wanted it to rely on the feedback.

So it seems like reward modelling is not much less effective than just providing a reward function. But the headline, to me, is that they were able to train these agents to do things for which they had no reward function at all, like the backflip. They also got the cheetah robot to stand on one leg, which is a task I don't think they ever tried to write a reward function for. And in Enduro, an Atari racing game, they managed to train the agent, using reward modelling, to stay level with the other cars, even though the game's score rewards you for going fast and overtaking them.

What all this means is that this type of method is, again, expanding the range of tasks machines can tackle. It's not just tasks we can write programs to do, or tasks we can write programs to evaluate, or even tasks we're able to do ourselves. All that's required is that it's easy to rate the outputs, that you know good results when you see them. And that's a lot of tasks, but it's not everything. Consider, for example, a task like writing a novel. Sure, you can read two novels and say which one you liked more, but this system needed 900 comparisons to learn what a backflip is. Even if we assume that writing a novel is no more complicated than that, does that mean comparing 900 pairs of AI-generated novels? And a lot of tasks are like this. What if we want our machine to run a company, or design something complex, like a city's transportation system, or a computer chip? We can't write a program that does it. We can't write a program that evaluates it. We can't reliably do it ourselves enough to make a good data set. We can't even evaluate it ourselves without taking way too much time and resources. So we're screwed, right? Not necessarily. There are some approaches that might work for these kinds of problems, and we'll talk about them in a later video.

[Music]

I recently realized that my best explanations and ideas tend to come from actual conversations with people, so I've been trying a thing where, for each video, I first have a couple of video calls with Patreon supporters, where I try sort of running through the idea and seeing what questions people have, what's not clear, and so on. So I want to say a big thank you to the patrons who helped with this video; you know who you are. I'm especially thanking Jake Eric, and of course thank you to all of my patrons, who make this whole thing possible with their support. Which reminds me: this video is sponsored by... nobody! No, I actually turned down a sponsorship offer for this video, and I'll admit I was tempted, because it's a company whose product I've used for like ten years, and the offer was thousands of pounds. But they wanted me to do this whole sixty-second-long spiel, and I just thought: no, I don't want to waste people's time with that. And I don't have to, because I've got Patreon. So thank you again to all of you. If you like learning about AI safety more than you like learning about mattresses and VPNs, you might want to consider joining; link in the description. Thanks again for your support, and thank you all for watching.
Info
Channel: Robert Miles
Views: 159,401
Rating: 4.9604664 out of 5
Keywords: AI, AGI, Artificial Intelligence, AI risk, AI safety, robert miles, robert miles AI, rob miles, rob miles AI, reinforcement learning, RL, reward modelling, backflip, deepmind, openai
Id: PYylPRX6z4Q
Length: 17min 51sec (1071 seconds)
Published: Fri Dec 13 2019