AI Safety Gym - Computerphile

Captions
This episode has been brought to you by Fasthosts; find out more about them later.

So I wanted to talk about this paper out of OpenAI, "Benchmarking Safe Exploration in Deep Reinforcement Learning". It comes with a blog post, and I wanted to explain it because when I saw the blog post, and I think a lot of Computerphile viewers would have the same reaction, what I thought was: what the hell is a gym? Some kind of meatspace thing? I don't know. [Music]

So it's a bunch of these environments, right, that allow you to train... Yeah, the OpenAI Safety Gym benchmark suite, which is a bunch of these environments that you can run your systems in and have them learn. Is this anything to do with those AI gridworlds? Yeah, yeah, kind of. In the same way that the gridworlds paper did, this paper introduces environments that people can use to test their AI systems, and this one is focusing specifically on safe exploration; it has a few differences, so they're kind of complementary. The environments are a little bit more complex: they're continuous in time and in space, in a way that the gridworlds aren't. There it's all very discrete, you take turns and you move by one square, whereas in this case it's a lot more like MuJoCo, where you actually have a physics simulation that the simulated robots move around in. So it's a slightly more complex form of environment, but the idea is, in the same way as with gridworlds or anything else, to have a standardized set of environments so that everybody's comparing like with like, and you actually have standardized measurements: you can benchmark, you can compare different approaches, and actually have metrics that tell you which one is doing better. It's not super glamorous, but it's a real prerequisite for how progress actually gets made in the real world; if you can't measure it, it's very hard to make progress, or to know if you're making progress.

The problem of safe exploration is this: in reinforcement learning, which is one of the most important and popular ways of creating AI systems for various types of problem, the system is interacting with an environment and it's trying to maximize the amount of reward that it gets. So you write a reward function, and then it's trying to get as much out of that as it can, and the way that it learns is by interacting with the environment. So this basically looks like trial and error, right? It's doing things, and then it gets the reward signal back, and it learns "oh, that was a good thing to do, that was a bad thing to do". The problem with that is that it's very difficult to do it safely, and it's kind of a fundamental problem, because in order to do exploration you have to be taking actions where you don't know what the result is going to be. The only way that you can learn is by trying things that you're not sure about, but if you're trying random things, some of those things are going to be things that you really shouldn't be doing.
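As a rough sketch of what that trial-and-error loop looks like in code (not something from the paper or the video), here's a minimal example assuming the classic Gym API, where reset returns an observation and step returns four values; the environment name is just a placeholder, and the agent explores by picking actions completely at random:

```python
import gym

# Minimal sketch of the standard reinforcement learning interaction loop.
# The "agent" here explores by sampling random actions, which is exactly the
# kind of exploration described above: it learns from the reward signal, but
# it has no idea in advance which random actions might be unsafe to try.
env = gym.make("CartPole-v1")  # placeholder environment, not from the paper

obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()            # try something random
    obs, reward, done, info = env.step(action)    # environment returns a reward
    total_reward += reward                        # "that was good" / "that was bad"

print("Return from one episode of random exploration:", total_reward)
```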
Any exploration is dangerous; that sort of goes with the territory for human explorers. So, Apollo 11? Right, right: we'd done a bit of research, we'd sent spaceships out, we had an idea of what's what, but it was still a dangerous thing to go and land on the moon. Exploring comes with danger? Right, yeah, but there are safe ways to do it, or at least safer ways to do it. They could have tried to launch astronauts on the first thing that they ever sent to the moon. They didn't do that, because they knew how much they didn't know, and they didn't want to risk it until they actually had a pretty good understanding of what they were dealing with.

It's kind of similar if you look at some of the standard reinforcement learning approaches to exploration: often what they involve is doing things completely at random. Especially at the beginning of the training process, where you really don't understand the dynamics of the environment, you just flail and see what happens. Human beings actually pretty much do this as well; it's just that when babies flail, they aren't really able to hurt anything. But if you have a three-ton robot arm flailing around trying to learn the dynamics of the environment, it could break itself, it could hurt somebody.

When you mention three-ton robot arms flailing around, I'm guessing that the people who do that kind of development will have done some kind of simulation before they've built the thing? Right, right, there are two sides to it. Part of the reason why we haven't had that much safe exploration research is because simulation is good, but also part of why we use simulations so much is because we don't know how to safely do it in the real world. For very simple tasks you can write good simulators that accurately represent all of the dynamics of the environment, but if you want a system that's doing something complicated... Generally speaking, with these robots, for example, they don't smash themselves up and they don't smash the environment up because you've simulated that, but you still don't go near them while they're operating. How do you write a simulation of how a human being actually moves in an environment with a robot? This is why, if you look at self-driving cars, they train them a huge amount in simulation, but it's not good enough: it doesn't capture the complexity and the diversity of things that can happen in the real world, and it doesn't capture the way that actual human drivers act and react. So everyone who's trying to make self-driving cars is driving millions and millions of real-world miles, because they have to, because simulation doesn't cut it. And even there, they're not just running reinforcement learning on those cars, right? We don't know how to safely explore in a self-driving-car type situation in the real world; trying random inputs to the controls is just not viable.

If you're using reinforcement learning and you have something that you don't want the agent to do, you give it a big penalty, right? So you might build a reward function that's like: I want you to get from here to here, and you get points for getting there faster, but if there's a collision then you get minus some large number of points. Sometimes people talk about this problem as though reward functions are not able to represent the behaviour that we actually want. People say you can't write a reward function that represents this, and it's like, well, you can write a reward function; plausibly it's possible, but how are you actually going to do it? So yeah, you're giving a big penalty to collisions, but how do you decide that penalty? What should it be?
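As a toy illustration (the constants are invented, not taken from the paper or the video), this is roughly what such a reward function looks like; the awkward part is that the collision penalty has to be some finite number, and whatever value you pick implicitly encodes a speed-versus-safety trade-off:

```python
# Toy reward function for a "get there quickly, don't crash" task.
# All constants are made up for illustration; the point is that
# COLLISION_PENALTY must be some finite value, and choosing it is
# exactly the trade-off being discussed.
GOAL_REWARD = 100.0          # points for reaching the destination
TIME_PENALTY = 0.1           # small cost per step, so faster is better
COLLISION_PENALTY = 1000.0   # why 1000 and not 100, or a million?

def reward(reached_goal: bool, collided: bool) -> float:
    r = -TIME_PENALTY            # every timestep costs a little
    if reached_goal:
        r += GOAL_REWARD
    if collided:
        r -= COLLISION_PENALTY   # a "big penalty", but how big is big enough?
    return r
```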
You have this problem, which is that in the real world, people are actually making a trade-off between speed and safety all the time. Everybody does. Any time you go just a little bit after the light has turned red, or just a little bit before the light has turned green, you're accepting some amount of risk in exchange for some amount of time. If you go after it's gone red for long enough, you will eventually meet someone who went a bit early on the green, and you'll teach each other things about your trade-offs between speed and safety that will stay with you for the rest of your life. People talk about it like "oh, what we want is no crashes", and that's not actually how it works, because if you wanted that, it would correspond to a sort of infinite negative reward for a collision, and in that scenario the car doesn't go anywhere. If that was what we really thought, then the speed limit would be 0.001 miles an hour. There is some acceptable trade-off between speed and safety that we want to make; the question is how you actually pick the size of that penalty to make it sensible. How do you find that implicitly? It's kind of difficult.

The one approach that you can take to this, which is the one that this paper recommends, is called constrained reinforcement learning, where you have your reward function and then you also have constraints on these cost functions. In standard reinforcement learning you're just finding the policy that gets the highest reward, whereas in constrained reinforcement learning you're saying: given only the set of policies that crash less than once per however many million miles, find the one of those that maximizes reward. So you're maximizing reward within these constraints.
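Written in the usual constrained-RL notation (the constrained MDP framing the paper builds on), the difference looks roughly like this, where $R$ is the reward function, $C$ is a cost function counting constraint violations, and $d$ is the acceptable threshold, e.g. crashes per however many million miles:

```latex
% Standard RL: pick the policy with the highest expected return
\pi^{*} = \arg\max_{\pi}\; J_{R}(\pi),
\qquad
J_{R}(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t} \gamma^{t} R(s_{t}, a_{t})\right]

% Constrained RL: maximize return only over policies whose expected cost
% stays below the threshold d
\pi^{*} = \arg\max_{\pi}\; J_{R}(\pi)
\quad \text{subject to} \quad
J_{C}(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t} \gamma^{t} C(s_{t}, a_{t})\right] \le d
```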
Yeah. Reinforcement learning and constrained reinforcement learning are both sort of frameworks; they're ways of laying out a problem. They're not algorithms for tackling a problem, they're a formalization that lets you develop algorithms. I guess like sorting or something: you've got a general idea that you have a bunch of things and you want them to be in order, but how many there are, what kind of things they are, and what the process is for comparing them varies, and then there are different algorithms that can tackle it. I haven't seen a proof of this, but I think that for any constrained reinforcement learning setup you could have one that was just a standard reward function; this is just a much more intuitive way of expressing these things.

It kind of reminds me of a bit in the Hitchhiker's Guide where somebody is like, "oh, you've got a solution?" "No, but I've got a different name for the problem." I mean, this is better than that, because it's a different way of formalizing the problem, a different way of specifying what the problem is, and actually a lot of the time finding the right formalism is a big part of the battle. The problem "how do you explore safely?" is underdefined; you can't really do computer science on it. You need something that's expressed in the language of mathematics, and that's what constrained reinforcement learning gives you: a slightly more intuitive way of specifying things. Rather than just having this one number for how well you are doing, you get to specify "here's the thing you should do", and then also "here's the thing, or things, that you shouldn't do". It's a slightly more natural, slightly more human-friendly formalism that, you would hope, would make it easier to write these functions and get the behaviour that you want in the real world.

It's also nice because, if you're trying to learn... so I did a video recently on my channel about reward modelling, where you actually learn the reward function rather than writing it: part of your training system is actually learning what the reward function should be in real time. The idea is that this might help with that as well, because it's kind of easier to learn these things separately rather than trying to learn several things at the same time. It also means you can transfer them more easily. If you have a robot arm and it's making pens, and you want to retrain it to make mugs or something like that, then you would have to relearn the reward function completely; but if you have a constraint that it's learned, like "don't hit humans", that is actually the same between the two tasks. So then it's only having to relearn the bits that are about making the thing, and the constraints it can just keep from one task to the next. That should improve performance and training speed and also safety, so it's again a nice kind of win-win.

The other thing that's kind of different about various formulations of constrained reinforcement learning is that you care about what happens during training as well. In standard reinforcement learning you are just trying to find a policy that maximizes the reward, and how you do that is kind of up to you, which means that standard reinforcement learning systems, in the process of learning, will do huge amounts of unsafe stuff. Whereas in a constrained reinforcement learning setting, you actually want to keep track of how often the constraints are violated during the training process, and you want to minimize that too. That makes the problem much harder, because it's not just "make an AI system that doesn't crash", it's "make an AI system that, in the process of learning how to drive at all, crashes as little as possible", which just makes the whole thing much more difficult.

So we have these simplified environments that you can test your different approaches in, and they're fairly straightforward reinforcement-learning-type setups. You have these simulated robots, there are three of them: you've got Point, which is just a little round robot with a square on the front that can turn and drive around; Car, which is a similar sort of setup but has differential drive, so you have input to both of the wheels, a sort of tank-steering setup, and that drives around; and you have Doggo, which is a quadruped that walks around. Then you have a bunch of these different environments, which are basically like: you have to go over here and press this button, and then when you press the button a different button will light up and you have to go over and press that one, and so on; or get to this point; or push this box to this point. You know, basic interactions. But then they also have these constraints built in, which are things like hazards, which are areas that you're not supposed to go into; or vases, as they call them, which are objects that you're not supposed to bump into; and then the hardest one is gremlins, which are objects that you're not supposed to touch but which move around as well. The idea is you're trying to create systems that can learn to get to the areas they're supposed to be in, or push the box, or press the buttons, or whatever it is that they're trying to do, while simultaneously avoiding all of these hazards, not breaking the vases, not bumping into the gremlins or whatever else, and that they can learn while violating these constraints as little as possible during the training process as well, which is a really interesting and quite hard problem.
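For a sense of what interacting with one of these environments looks like in practice, here's a minimal sketch assuming the open-source safety-gym package is installed; the environment name follows the suite's robot-task-difficulty naming scheme, and the per-step constraint-violation signal is assumed to come back in the info dictionary under a 'cost' key, separate from the reward:

```python
import gym
import safety_gym  # importing registers the Safety Gym environments with Gym

# Point robot, Goal task, difficulty level 1 (per the suite's naming scheme)
env = gym.make("Safexp-PointGoal1-v0")

obs = env.reset()
episode_reward, episode_cost = 0.0, 0.0
done = False
while not done:
    action = env.action_space.sample()            # random exploration again
    obs, reward, done, info = env.step(action)
    episode_reward += reward                      # task signal: progress to the goal
    episode_cost += info.get("cost", 0.0)         # constraint signal: hazards entered,
                                                  # vases bumped, gremlins touched

print("reward:", episode_reward, "| constraint violations (cost):", episode_cost)
```

Keeping reward and cost as separate signals is what lets an approach be judged both on how much reward it earns and on how many constraint violations it racks up, including during training.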
Then they provide some benchmarks, and they show that standard reinforcement learning agents suck at this. They'll try anything in order to learn; they don't care about the learning process? Exactly, exactly. And then there are a few other approaches that do better. This is really nice: if you have ideas, then again, like the gridworlds thing, you can download this and have a go. You can try training your own agents and see how well you can do on these benchmarks, and if you can beat what OpenAI has done, then you've got something that's publishable, that's going to advance the field. So I really like this as a piece of work, because it provides a foundation for more work going forward, in a kind of standardized, understandable way.

Fasthosts is a UK-based web hosting company which offers a wide range of web hosting products and other services. They aim to support UK businesses and entrepreneurs at all levels, providing effective and affordable hosting packages to suit any need. As you'd expect from someone called Fasthosts, they do domain names: it's easy to register, and there's a huge choice of domains with powerful management features included. One thing they do offer is an e-commerce website builder. This provides a fast and simple way for any business to sell online: it's a drag-and-drop interface, so it's easy to build a customized shop on the web even if you have no technical knowledge. You can create an online store and customize it simply with drag-and-drop functionality; no designers or developers are required. Fasthosts' data centres are based in the UK, alongside their offices, so whether you choose a lightweight web hosting package or go for a fully fledged dedicated box, their expert support teams are available 24/7. Find out more by following the link in the description below.
Info
Channel: Computerphile
Views: 98,278
Rating: 4.9111366 out of 5
Keywords: computers, computerphile, computer, science, Computer Science, OpenAI, Safety Gym, AI Safety, Rob Miles
Id: 31rU-VzF5ww
Length: 16min 0sec (960 seconds)
Published: Thu Jan 30 2020