Stop Button Solution? - Computerphile

Video Statistics and Information

Reddit Comments

Was just about to post this :) There's one thing about this that makes me cautious, however: Robert mentions that the robot does not know what the reward is and will observe humans to figure out what the humans want. What's to stop the AI from bootstrapping into its own system and hacking its reward function so it understands it and optimises it?

👍 2 · u/TimesInfinityRBP · Aug 04 2017 · replies
Captions
A while back we were talking about the stop button problem, right? It's a kind of toy problem in AI safety. You have an artificial general intelligence in a robot; it wants something, you know, it wants to make you a cup of tea or whatever. You put a big red stop button on it, and you want to set it up so that it behaves corrigibly: it will allow you to hit the button, it won't hit the button itself, and it won't try to prevent you, so it's behaving in a sensible way, in a safe way. And, by default, most AGI designs will not behave this way. Well, we left it as an open problem, and it kind of still is an open problem, but there have been some really interesting things proposed as possible solutions, or approaches to take, and I wanted to talk about cooperative inverse reinforcement learning.

I thought the easiest way to explain cooperative inverse reinforcement learning is to build the term up backwards. Learning we know. Machine learning, and reinforcement learning, which is an area of machine learning, I guess you could call it; it's kind of a way of presenting a problem. In most machine learning, the kind of thing that people have already talked about a lot on Computerphile in earlier videos and the related ones, usually you get in some data and then you're trying to do something with it, like classify unseen things, or you're trying to do regression to find out what value something would have for certain inputs, that kind of thing. Whereas in reinforcement learning the idea is you have an agent in an environment, and you're trying to find a policy.

So, to back up, what do we mean by an agent? It's an entity that interacts with its environment to try and achieve something; effectively, it's doing things in an environment. Does this have to be a physical thing? It doesn't have to be. If you have a robot in a room, then you can model the robot as being the agent and the room as being the environment. Similarly, if you have a computer game like Pac-Man, then Pac-Man is an agent and the maze he's in is the environment.
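To make the agent-and-environment framing concrete, here is a minimal sketch of the standard reinforcement learning interaction loop. It is not from the video: the GridWorld and RandomAgent classes, the action names and the reward numbers are all hypothetical stand-ins for something like Pac-Man in a maze.

    import random

    class GridWorld:
        """Toy stand-in for the maze: the agent walks along a corridor,
        earning a dot (+10) for each step it takes towards the goal."""
        ACTIONS = ["up", "down", "left", "right", "stay"]

        def __init__(self, length=10):
            self.position = 0
            self.length = length

        def observe(self):
            # Partial information, like a robot's camera: just the current position.
            return self.position

        def step(self, action):
            if action == "right" and self.position < self.length:
                self.position += 1
                reward = 10          # picked up a dot
            else:
                reward = -1          # wasted a move
            done = self.position == self.length
            return self.observe(), reward, done

    class RandomAgent:
        """An agent that only explores: it picks actions completely at random."""
        def act(self, observation):
            return random.choice(GridWorld.ACTIONS)

    # The basic loop: the agent observes, acts, and receives a reward.
    env, agent = GridWorld(), RandomAgent()
    obs, done, total = env.observe(), False, 0
    while not done:
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        total += reward
    print("episode return:", total)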
So let's stick with Pac-Man. The way a reinforcement learning framework deals with Pac-Man is: you say, okay, you've got Pac-Man here, he's the agent, he's in the environment, and you have actions that Pac-Man can take in the environment. It's kind of neat in Pac-Man because there are always exactly four actions you can take, or, well, I guess five: you can sit there and do nothing, or you can move up, left, right or down. You don't always have all of those options, like sometimes there's a wall and you can't move right, but that's the complete set of actions that you have. And then the environment contains dots that you can pick up, which give you points; it's got these ghosts that chase you, which you don't want to touch; and I think there are also pills you can pick up that make the ghosts edible, and then you chase them down, that sort of stuff.

Anyway, the difference in reinforcement learning is that the agent is in the environment and it learns by interacting with the environment, so it's kind of close to the way that animals learn and the way that humans learn. You try doing something: I'm going to try touching this fire. Oh, that hurt. That caused me a negative reward, a pain signal, which is something I don't want, so I learn to avoid doing things like touching a fire. So in a Pac-Man environment you might say, if you're in a situation like this (let's draw Pac-Man; say he's in a maze like this), you look at Pac-Man's options: he can't go left, he can't go right, he can go up, and if he goes up he'll get a dot, which earns you some points, so up gets a score of, you know, +10 or whatever you've decided, whatever the score is in the game. Or, if he goes down, he'll immediately get eaten by this ghost. The point is that Pac-Man doesn't need to be aware of the entire maze; you can just feed in a fairly small amount of information about his immediate environment, which is the same as if you have a robot in a room: it doesn't know everything about the whole room, it can only see what it sees through its camera. It has sensors that give it some information about the environment, partial information.

I suppose, just playing devil's advocate, the difference here is that usually Pac-Man is being controlled by a human, who can see the whole board. The point being, if that ghost is actually not static and is chasing Pac-Man, and he's heading up to get that dot, and a few pixels later that corridor stops in a dead end, well, he's kind of stuck either way. That's true, yeah. So, yeah, almost every reinforcement learning algorithm, almost everything that tries to deal with this problem, doesn't just look at the immediate surroundings; or rather, it looks at the immediate surroundings, but it also looks a certain distance back in time. So you're not just saying what's going to happen next frame. Most algorithms would say, okay, the option of going down in this situation is bad, but all of the options we chose in all of the situations we were in over the last second or two also get a little bit of the blame. There's a kind of decay, there's time discounting, so that you're not just punishing the immediate thing that caused the negative reward but also the decisions you made leading up to it, so that Pac-Man might learn not to get himself stuck in corners, as well as just learning not to run straight into ghosts.
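The decay that Rob describes is usually implemented as time-discounted returns: each reward is also credited, with a factor gamma < 1 per step, to the decisions that led up to it. A minimal sketch of that idea (my own illustration, with made-up reward numbers):

    def discounted_returns(rewards, gamma=0.9):
        """Work backwards through an episode, giving each step its own reward
        plus a decayed share of what followed: G_t = r_t + gamma * G_{t+1}."""
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return list(reversed(returns))

    # Pac-Man grabs a dot, then runs into a ghost four steps later.
    print(discounted_returns([10, 0, 0, 0, -100]))
    # roughly [-55.6, -72.9, -81.0, -90.0, -100.0]
    # The crash is punished hardest at the final step, but the earlier
    # decisions leading up to it also get some of the blame, less and less.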
So that's the basics of reinforcement learning. There are different algorithms that do it, and the idea is you actually start off exploring the environment just at random, you just pick completely random actions, and then as those actions start having consequences for you, and you start getting rewards and punishments, you start to learn which actions are better to use in which situations. Does that mean, in Pac-Man's case, it would learn the maze, or would it just learn to react to whatever is immediately around it? It depends on what algorithm you're using. A very sophisticated one might learn the whole maze; a simpler one might just learn a more kind of local policy. But the point is, you learn a kind of mapping, or a function, that takes in the situation you're in and outputs a good action to take.

There's also kind of an interesting trade-off there, which I think we may have talked about before, about exploration versus exploitation, in that you want your agent to be generally taking good actions, but you don't want it to always take the action that it thinks is best right now, because its understanding may be incomplete, and then it just kind of gets stuck, right? It never finds out anything about other options that it could have gone with, because as soon as it did something that kind of worked, it just goes with that forever. So a lot of these systems build in some variance, some randomness. Some randomness? Right, exactly: you usually do the thing you think is best, but some small percentage of the time you just try something random anyway. And you can change that over time; a lot of algorithms, as they learn more and more, start doing the random stuff less and less, that kind of thing.

So that's the absolute basics of reinforcement learning and how it works, and it's really, really powerful, especially when you combine it with deep neural networks as the thing that's doing the learning. DeepMind did this really amazing thing where, I think they were playing Pac-Man, they were playing a bunch of different Atari games, and the thing that's cool about it is that all they told the system was: here's what's on the screen, and here's the score of the game; make the score go up, the score is your reward. That's it. And it learned all of the specific dynamics of the games and generally achieved top-level, better-than-human play.

The next word is going to be 'inverse'. We did a thing a while ago on anti-learning, where it can't work all the time, that sort of thing, right? Yeah, this is not like that; this is a description of a different type of problem. They call it inverse because in reinforcement learning you have a reward function that determines which situations you get rewards in, and you're in your environment with your reward function and you're trying to find the appropriate actions to take that maximise that reward. In inverse reinforcement learning you're not in the environment at all; you're watching an expert. So you've got the video of the world-championship-record Pac-Man player, and you have all of that information, you can see it. So rather than having the reward function and trying to figure out the actions, you can see the actions and you're trying to figure out the reward function. It's inverse because you're kind of solving the reverse of the problem: you're not trying to maximise a reward by choosing actions, you're looking at actions and trying to figure out what they're maximising.
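The "mostly do what looks best, occasionally do something random" behaviour described here is commonly implemented as epsilon-greedy action selection with a decaying epsilon. A small illustrative sketch, assuming we already have a table of estimated action values for one situation (the numbers are made up):

    import random

    def epsilon_greedy(q_values, epsilon):
        """Exploit the best-looking action most of the time, but with
        probability epsilon pick a completely random one (exploration)."""
        if random.random() < epsilon:
            return random.choice(list(q_values))
        return max(q_values, key=q_values.get)

    # Hypothetical value estimates for one Pac-Man situation.
    q = {"up": 10.0, "down": -100.0, "left": 0.0, "right": 0.0}

    epsilon = 1.0                              # start off fully random...
    for step in range(1000):
        action = epsilon_greedy(q, epsilon)
        epsilon = max(0.05, epsilon * 0.995)   # ...and explore less over time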
So that's really useful, because it lets you learn by observing experts. So, coming back to AI safety, you might think that this would be kind of useful from an AI safety perspective. The core problem of AI safety, or one of the core problems, is: how do you make sure the AI wants what we want? We can't reliably specify what it is we want, and if we create something very intelligent that wants something else, then that something else is probably what's going to happen, even if we don't want it to. So how do we make a system that reliably wants the same thing we want? You can see how inverse reinforcement learning might be kind of attractive here, because you might have a system that watches humans doing things, and, with us as the expert humans, it tries to figure out what rewards we're maximising and tries to sort of formalise, in its understanding, what it is we want by observing us. That's pretty cool, but it has some problems.

One problem is that in inverse reinforcement learning there's this assumption of optimality: that the agent you're watching is an expert, that their play is optimal, and that there is some clear, coherent thing, like the score, that they're optimising. The assumption of the algorithms that do this is that the way the world champion plays is the best possible way, and that assumption is obviously never quite true, or generally not quite true, but it works well enough. But humans are not like that: human behaviour is not actually optimising perfectly for what humans want, and places where that assumption isn't true could cause problems.

So is this where 'cooperative' comes in? Because when we started doing it backwards, it was cooperative inverse reinforcement learning. Right. So you could imagine a situation where you have the robot, you have the AGI, and it watches people doing their thing, uses inverse reinforcement learning to try and figure out the things humans value, and then adopts those values as its own. The first, most obvious problem is that we don't actually want to create something that values the same things as humans. Like, if it observes that I want a cup of tea, we want it to want me to have a cup of tea; we don't want it to want a cup of tea. But that's quite easy to fix: you just say, figure out what the humans value, and then optimise it for the humans, so that they get it and you don't need to. That's doable.

But then the other thing is, if you're actually trying to use this to teach a robot to do something, it turns out not to be very efficient. Like, it works with Pac-Man, but if you want to learn how to be good at Pac-Man, you probably don't want to just watch the world's best Pac-Man player and try to copy them. That's not an efficient way to learn, because there might be a situation where you're thinking, what do I do if I find myself stuck in this corner of the maze, or whatever, and the pros never get stuck there, so you have no example of what to do. All that watching the pros can teach you is: don't get stuck there. And then once you're there, you've just got to hope. Say I wanted to teach my robot to make me a cup of tea: I go into the kitchen and I show it how I make a cup of tea. I would probably have to do that a lot of times to actually get all of the information across.
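The direction of inference being described here, watching an expert's choices and working backwards to the reward they seem to be maximising, can be illustrated with a deliberately tiny sketch. Everything in it is made up; real inverse reinforcement learning methods are far more involved. It also shows where the optimality assumption bites, since the inference only makes sense if the expert really did pick the best option each time.

    # Toy inverse RL: we see the expert's choices, never the reward. Each
    # candidate reward function is scored by how well it explains the behaviour.

    expert_demos = [
        # (options available in a situation, the action the expert actually took)
        ({"up": "dot", "down": "ghost"}, "up"),
        ({"left": "dot", "right": "empty"}, "left"),
        ({"up": "empty", "down": "dot"}, "down"),
    ]

    candidate_rewards = {
        "likes_dots":   {"dot": 10, "empty": 0, "ghost": -100},
        "likes_ghosts": {"dot": 0,  "empty": 0, "ghost": 10},
    }

    def explains(reward, options, chosen):
        """Under this candidate reward, was the expert's choice an optimal one?"""
        best_value = max(reward[outcome] for outcome in options.values())
        return reward[options[chosen]] >= best_value

    scores = {
        name: sum(explains(r, opts, act) for opts, act in expert_demos)
        for name, r in candidate_rewards.items()
    }
    print(max(scores, key=scores.get))   # -> "likes_dots"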
And you'll notice this is not how people teach, right? If you were teaching a person how to make a cup of tea, you might do something like this: if there's some difficult stage of the process, you might do one demonstration but show that one stage, like, three times. "So usually you do it like this; let me show you that again." Then, if you're using inverse reinforcement learning, the system believes that you are playing optimally, so it thinks that doing it three times is somehow necessary, and it's trying to figure out what rewards you must be optimising such that doing it three times is important. So that's a problem; that's where the assumption isn't true. Or you might want to say: okay, what you do is you get the tea out of the box here and you put it in the thing, but if there's none in this box, you go over to this cupboard where we keep the backup supplies and you open a new box. But you can't show that. The only way the robot can learn to go and get the extra supplies, and only when this box has run out, is if you were in a situation where that would be optimal: the thing has to have actually run out in order for you to demonstrate it. You can't say, "if the situation were different from how it is, then you should go and do this."

The other thing you might want, if you're trying to teach things efficiently, is for the AI to be taking an active role in the learning process. If there's some aspect that it doesn't understand, you don't want it just sitting there observing you optimally do the thing and then trying to copy it. If there's something that it didn't see, you kind of want it to be able to say "hang on, I didn't see that", or "I'm confused about this", or maybe ask you a clarifying question, or just in general communicate with you and cooperate with you in the learning process.

So the way that cooperative inverse reinforcement learning works is that it's a way of setting up the rewards such that these types of behaviours will hopefully be incentivised and should come out automatically if the AI is doing well at optimising. What you do is you specify the interaction as a cooperative game where the robot's reward function is the human's reward function, but the robot doesn't know that reward function at all. It never knows the reward that it gets, and it never knows the function that generates the reward it gets; it just knows that it's the same as the human's. So it's trying to maximise the reward it gets, but the only clues it has for what it needs to do to maximise its own reward come from observing the human and trying to figure out what the human is trying to maximise. This is a bit like a two-player game where you're on the same team but only one of you can see the score. Yeah, you're both on the same team, but only the human knows the rules of the game, effectively. You both get the same reward, so you both want the same thing, just kind of by definition.
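One way to picture the robot's side of that game: it keeps a belief over candidate human reward functions, updates that belief every time it watches the human act, and then does whatever looks best under the belief. This is a heavily simplified sketch of the idea, with invented hypotheses and numbers; the actual cooperative inverse reinforcement learning formulation is a full two-player game and is solved quite differently.

    # The robot never observes its reward. It keeps a belief over what the
    # human's (and therefore its own) reward function might be.

    hypotheses = {
        "human_wants_tea":    {"tea": 10, "coffee": 0},
        "human_wants_coffee": {"tea": 0,  "coffee": 10},
    }
    belief = {"human_wants_tea": 0.5, "human_wants_coffee": 0.5}

    def observe_human_choice(belief, chosen, alternatives):
        """The human picked `chosen` over `alternatives`: hypotheses under which
        that choice looks sensible gain probability (a crude Bayes-style update)."""
        likelihood = {
            h: 0.9 if all(r[chosen] >= r[a] for a in alternatives) else 0.1
            for h, r in hypotheses.items()
        }
        norm = sum(belief[h] * likelihood[h] for h in belief)
        return {h: belief[h] * likelihood[h] / norm for h in belief}

    def best_action(belief, actions):
        """Maximise expected reward under the current belief about the human."""
        return max(actions, key=lambda a: sum(belief[h] * hypotheses[h][a] for h in belief))

    # The robot watches the human make tea rather than coffee, then acts on it.
    belief = observe_human_choice(belief, "tea", ["coffee"])
    print(belief)                                    # shifts towards "human_wants_tea"
    print(best_action(belief, ["tea", "coffee"]))    # -> "tea"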
But in a sense you've kind of just defined the core problem away, though. You're saying that one of the core problems of AI safety is how do you make sure the robot wants what the human wants, and in this case you've just specified it. Usually you couldn't do that, because we don't really know what the human wants either. Two people who don't speak the same language can still communicate with actions and gestures and things, and you can generally get the gist of an idea across to the other person. It's a bit like that, yeah. But a sufficiently sophisticated agent, if you have an AGI, could be quite powerful: it can speak, it can understand language and everything else, and it knows what's going on. So, for example, it should hopefully be able to figure out that when the human is showing something three times, the human is doing that in order to communicate information, and not because it's the optimal way to do the task, because it knows that the human knows what's going on; there's common knowledge of the scenario. So it allows for situations where the human is just demonstrating something or explaining something, and it allows the AI to ask about things it's unclear about, because everybody is on the same team, trying to achieve the same thing, in principle.

So the point is, if you have a big red stop button in this scenario, the AI is not incentivised to disable or ignore that stop button, because the button constitutes important information about its reward. The AI is desperately trying to maximise a reward function that it doesn't know, and so if it observes the human trying to hit the stop button, that provides really strong information that what it's doing right now is not going to maximise the human's reward, which means it's not going to maximise its own reward. So it wants to allow itself to be shut off if the human wants to shut it off, because it's for its own good. This is a clever way of aligning its interests with ours, right. Right: the problem in the default situation is that I've told it to get a cup of tea, and it's going to do that whatever else I do, and if I try to turn it off it's not going to let me, because that would stop it from getting me a cup of tea. Whereas in this situation, the fact that I want a cup of tea is something it's not completely sure of, and so it doesn't think it knows better than me. So when I go to hit that stop button, it thinks: I thought I was supposed to be going over here and getting a cup of tea and running over this baby or whatever, but the fact that he's rushing to hit the button means I must have got something wrong, so I'd better stop and learn more about this situation, because I'm at risk of losing a bunch of reward. So, yeah, it seems like a potentially workable approach.

One interesting thing about this is that there is still an assumption that the human's behaviour is in accordance with some utility function, some reward function, some objective function. If the human behaves very irrationally, that can cause problems for the system, because the whole thing revolves around the fact that the robot is not completely confident of what its reward is: it's got a model of what the reward function is like, which it's constantly updating as it learns, it doesn't have full confidence, and it's using the human as its source of information. So, fundamentally, the robot believes that the human knows better than it does how to maximise the human's reward. In situations where that's not true, like if you run this for long enough and the robot manages to build up a really, really high level of confidence in what it thinks the human reward function is, then it might ignore its stop button later on, if it thinks that it knows better than the human what the human wants. Which sounds very scary, but might actually be what you want to happen.
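To see why the button reads as information rather than as an obstacle, here is a tiny worked example with invented numbers. While the robot still believes its current plan is probably what the human wants, carrying on has the higher expected reward; the moment the human lunges for the button, that belief drops and allowing shutdown becomes the better option.

    # Purely illustrative numbers: +10 if the current plan really is what the
    # human wants, -100 if it is actually a disaster, 0 for shutting down.

    def expected_values(p_plan_is_good):
        carry_on = p_plan_is_good * 10 + (1 - p_plan_is_good) * (-100)
        allow_shutdown = 0
        return carry_on, allow_shutdown

    print(expected_values(0.95))   # roughly (4.5, 0)  -> keep going
    # Seeing the human rush for the button is strong evidence the plan is bad.
    print(expected_values(0.20))   # (-78.0, 0)        -> let yourself be switched off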
Like, if you imagine it's the future, and we've got these robots, and they all have a big red stop button on them, and they're everywhere and everything's wonderful. And you say to your robot: oh, take my four-year-old son to school, you know, drive him to school in the car, because it's a 1950s sci-fi future where we don't have self-driving cars, it's robots driving cars. Anyway, it's driving this kid to school, it's doing 70 on the motorway, and the kid sees the big red shiny button and smacks it. In principle, the human has just pressed the button, and a lot of designs for a button would just say: a human has hit your button, you have to stop. Whereas this design might say: I have been around for a long time, I've learned a lot about what humans value, and I also observe that this specific human does not reliably behave in his own best interests, so maybe this hitting of the button is not communicating information about what this human really wants; they're just hitting it because it's a big red button, and I should not shut myself off. So it has the potential to be safer than a button that always works, but it's a little bit unsettling that you might end up with systems that sometimes actually do ignore the shutdown command because they think they know better.

The way it's looking at it right now is: if the button gets hit, I get zero reward; if the button doesn't get hit, if I manage to stop them, then I get the cup of tea, I get like maximum reward. If you give some sort of compensation...
Info
Channel: Computerphile
Views: 390,112
Rating: 4.9262061 out of 5
Keywords: computers, computerphile, computer, science, AI, Rob Miles, AI Safety, Computer Science, Cooperative Inverse Reinforcement Learning
Id: 9nktr1MgS-A
Length: 23min 45sec (1425 seconds)
Published: Thu Aug 03 2017