Ilya Sutskever: OpenAI Meta-Learning and Self-Play | MIT Artificial General Intelligence (AGI)

Video Statistics and Information

Reddit Comments

"there's only one reward in life, existence or non-existence, everything else is a corollary of that."

Hmm quite dark eh?

👍 20 · u/NotAlphaGo · Apr 26 2018

Am I the only one who found this talk a bit too hype-y?

👍 17 · u/[deleted] · Apr 26 2018

While the ideas/insights he's explaining are terrific, am I the only one of the opinion that he's not the best at explaining? I'm coming away from a lot of what he's saying with confusion.

👍 7 · u/[deleted] · Apr 26 2018

Really enjoyed this talk although he does seem rather sceptical of the potential for evolutionary methods to compete with deep learning. I feel at some point there will be a realisation that learning is missing something fundamental that evolutionary methods can provide, and that the best systems (and maybe even artificial general intelligence) will be hybrids of both techniques.

👍 3 · u/DaBigJoe · Apr 26 2018

Life: The whole is rewarded for spreading its parts.

AI: The whole is rewarded for not falling apart.

👍 1 · u/[deleted] · Apr 29 2018

Nice!

👍 1 · u/[deleted] · Apr 26 2018
Captions
Welcome back to 6.S099: Artificial General Intelligence. Today we have Ilya Sutskever, co-founder and research director of OpenAI. He started in the machine learning group in Toronto with Geoffrey Hinton, then was at Stanford with Andrew Ng, co-founded DNNresearch, spent three years as a research scientist at Google Brain, and finally co-founded OpenAI. Citations aren't everything, but they do indicate impact, and his work over the past five years has been cited over forty-six thousand times. He has been the key creative intellect and driver behind some of the biggest breakthrough ideas in deep learning and artificial intelligence. So please welcome Ilya.

Alright, thanks for the introduction, Lex, and thanks for coming to my talk. I will tell you about some work we've done over the past year at OpenAI on meta-learning and self-play, and before I dive into the technical details, I want to spend a little bit of time talking about deep learning and why it works at all in the first place, which I think is not a self-evident thing.

One fact, and it is actually a mathematical statement you can prove, is that if you could find the shortest program that does very well on your data, then you will achieve the best generalization possible. With a little bit of modification you can turn this into a precise theorem, and on an intuitive level it's easy to see why it should be the case. If you have some data and you're able to find a shorter program which generates it, then you've essentially extracted all conceivable regularity from the data into your program, and you can then use this object to make the best predictions possible. Conversely, if your data is so complex that there is no way to express it as a shorter program, then it is totally random and there is no regularity to extract from it whatsoever. There is a little-known mathematical theory behind this, and the proofs of these statements are actually not even that hard.

The one slight disappointment is that it is not actually possible, at least with today's tools and understanding, to find the best short program that explains or generates or solves your problem given your data: this problem is computationally intractable. The space of all programs is a very nasty space; small changes to your program result in massive changes in its behavior, as they should. You have a loop, you change the inside of the loop, and of course you get something totally different. So search over programs, at least given what we know today, seems to be completely off the table.

Well, if we give up on short programs, what about small circuits? It turns out that we are lucky: when it comes to small circuits, you can just find the best small circuit that solves your problem, using backpropagation. This is the miraculous fact on which the rest of AI stands. When you have a circuit and you impose constraints on it using data, you can find a way to satisfy those constraints with backpropagation, by iteratively making small changes to the weights of your neural network until its predictions satisfy the data. What this means is that the computational problem solved by backpropagation is extremely profound: it is circuit search. Now, we cannot solve it always, but we can solve it sometimes, and in particular in those
cases where we have a practical dataset. It is easy to design artificial datasets for which you cannot find the best neural network, but in practice that seems not to be a problem.

You can think of training a neural network as solving a system of equations with a large number of terms, one per data point: f(x_i; theta) = y_i. Your parameters theta represent all your degrees of freedom, and you use gradient descent to push the information from these equations into the parameters until they are all satisfied. You can also see that a neural network with, say, 50 layers is basically a parallel computer that is given 50 time steps to run, and you can do quite a lot with 50 time steps of a very powerful, massively parallel computer. For example, I think it is not widely known that you can learn to sort n n-bit numbers using a modestly sized neural network with just two hidden layers. That is not bad, and not self-evident, especially since we've been taught that sorting requires log n parallel steps; with a neural network you can sort successfully using only two parallel steps. These are parallel steps of threshold neurons, so they do a little more work per step, and that's the answer to the mystery. But if you've got 50 such layers, you can do quite a bit of logic, quite a bit of reasoning, all inside the neural network, and that's why it works. Given the data, we are able to find the best neural network, and because the network is deep, because it can run computation inside its layers, the best neural network is worth finding. That's really what you need: a model class which is worth optimizing, and which is also optimizable. Deep neural networks satisfy both of these constraints, and this is why everything works. This is the basis on which everything else resides.
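To make the picture of training as constraint satisfaction concrete, here is a minimal sketch in Python (not from the talk; the task, sizes, and hyperparameters are all illustrative): a two-hidden-layer network fit by backpropagation, pushing the constraints f(x_i; theta) = y_i into the weights by small iterative changes.

```python
import numpy as np

# Training as constraint satisfaction: push the constraints f(x_i; theta) = y_i
# into the parameters by iteratively making small changes to the weights.
# Toy task and sizes are illustrative.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 2))        # inputs x_i
y = (X[:, 0] * X[:, 1] > 0).astype(float)    # targets y_i (an XOR-like rule)

# Two hidden layers of tanh units: a parallel computer given two time steps.
W1 = rng.normal(0, 0.5, (2, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 32)); b2 = np.zeros(32)
W3 = rng.normal(0, 0.5, (32, 1)); b3 = np.zeros(1)

lr = 0.5
for step in range(2000):
    # Forward pass.
    h1 = np.tanh(X @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    p = 1 / (1 + np.exp(-(h2 @ W3 + b3)))    # predicted probability
    # Backward pass: backpropagation is the "circuit search".
    d3 = (p - y[:, None]) / len(X)           # cross-entropy gradient w.r.t. logits
    dW3 = h2.T @ d3; db3 = d3.sum(0)
    d2 = (d3 @ W3.T) * (1 - h2 ** 2)
    dW2 = h1.T @ d2; db2 = d2.sum(0)
    d1 = (d2 @ W2.T) * (1 - h1 ** 2)
    dW1 = X.T @ d1; db1 = d1.sum(0)
    for P, G in [(W1, dW1), (b1, db1), (W2, dW2),
                 (b2, db2), (W3, dW3), (b3, db3)]:
        P -= lr * G                          # small change toward satisfying the data

print("training accuracy:", ((p[:, 0] > 0.5) == (y > 0.5)).mean())
```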
Now I want to talk a little bit about reinforcement learning. Reinforcement learning is a framework for evaluating agents on their ability to achieve goals in complicated stochastic environments. You've got an agent which is plugged into an environment, as shown in the figure, and for any given agent you can simply run it many times and compute its average reward. The thing that's interesting about the reinforcement learning framework is that there exist interesting, useful reinforcement learning algorithms. The framework existed for a long time; it became interesting once we realized that good algorithms exist. These are not perfect algorithms, but they are good enough to do interesting things, and the mathematical problem is one where you need to maximize the expected reward.

One important way in which the reinforcement learning framework is not quite complete is that it assumes that the reward is given by the environment. You see this picture: the agent sends an action, and the environment sends back an observation and a reward. The way in which this is not the case in the real world is that we figure out what the reward is from the observation; we reward ourselves. The environment doesn't say, "hey, here's some negative reward." It's our interpretation of our senses that lets us determine what the reward is. There is only one real, true reward in life, and that is existence or non-existence; everything else is a corollary of that.

So what should our agent be? You already know the answer: it should be a neural network, because whenever you want to do something dense, it's going to be a neural network. You want the agent to map observations to actions, so you parametrize it with a neural net and you apply a learning algorithm.

I want to explain to you how model-free reinforcement learning works, the kind that has actually been used in practice everywhere. It is very robust and very simple, but also not very efficient. The way it works is the following, and this is literally the one-sentence description of what happens. In short: try something new, add randomness to your actions, and compare the result to your expectation. If the result surprises you, if you find that it exceeded your expectation, then change your parameters to take those actions more often in the future. That's it. This is the full idea of reinforcement learning: try it out, see if you like it, and if you do, do more of that in the future. That's literally it; this is the core idea. It turns out it's not difficult to formalize mathematically, but this is really what's going on. In a regular neural network, you run the network, you get an answer, you compare it to the desired answer, and whatever difference there is between the two, you send back to change the neural network; that's supervised learning. In reinforcement learning, you run your neural network, you add a bit of randomness to your action, and if you like the result, your randomness turns into the desired target, in effect. That's it.

Now, the math exists. Without explaining what the equations mean (the point is not to derive them but just to show that they exist), there are two classes of reinforcement learning algorithms. One is the policy gradient, where you basically take the expression for the sum of expected rewards and just crunch through the derivatives: you expand the terms, do some algebra, and you get a derivative which, miraculously, has exactly the form I told you about. Try some actions, and if you like them, increase the log-probability of those actions. That truly follows from the math, and it's very nice when the intuitive explanation has a one-to-one correspondence with what you get in the equation, even though you'll have to take my word for it if you're not familiar with the equation at the top.

There is a different class of reinforcement learning algorithms which is a little more difficult to explain: the Q-learning-based algorithms. They are a bit less stable and a bit more sample-efficient, and they have the property that they can learn not only from the data generated by the actor but from any other data as well, so they have a different robustness profile. This is the on-policy versus off-policy distinction, which will matter a little bit later, but it's only a technicality; if you find it hard to understand, don't worry about it, and if you already know it, then you already know it.
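In its standard form, the policy-gradient derivative he alludes to is grad_theta E[R] = E[grad_theta log pi_theta(a|s) * R]. As a purely illustrative sketch (a toy bandit, with made-up rewards and hyperparameters), here is REINFORCE in miniature: try actions with randomness, and raise the log-probability of the ones that beat your expectation.

```python
import numpy as np

# REINFORCE in miniature, on a toy 4-armed bandit (purely illustrative).
# The update has exactly the form described above: increase the
# log-probability of actions whose reward exceeded your expectation.

rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.5, 0.1, 0.8])   # unknown to the agent
logits = np.zeros(4)                           # policy parameters theta
baseline, lr = 0.0, 0.1                        # running expectation; step size

for step in range(3000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax policy pi_theta
    a = rng.choice(4, p=probs)                 # try something: add randomness
    r = true_reward[a] + rng.normal(0, 0.1)    # observe a noisy reward
    advantage = r - baseline                   # did the result surprise you?
    grad_logp = -probs                         # grad of log pi(a) w.r.t. logits...
    grad_logp[a] += 1.0                        # ...is one_hot(a) - probs
    logits += lr * advantage * grad_logp       # do more of what worked
    baseline += 0.01 * (r - baseline)          # update the expectation

print("learned policy:", np.round(probs, 2))   # should favor arm 3 (reward 0.8)
```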
So what's the potential of reinforcement learning? Why should we be excited about it? There are two reasons. First, the reinforcement learning algorithms of today are already useful and interesting, especially if you have a really good simulation of your world: you can train agents to do lots of interesting things. But what's really exciting is the prospect of a super amazing, sample-efficient reinforcement learning algorithm. You would give it a tiny amount of data, and the algorithm would crunch through it and extract every bit of entropy out of it in order to learn in the fastest way possible. Today our algorithms are data-inefficient, but as our field keeps making progress, this will change.

Next I want to dive into the topic of meta-learning. Meta-learning is a beautiful idea that doesn't really work, but it kind of works, and it's really promising. So what's the dream? We have some learning algorithms; perhaps we could use those learning algorithms in order to learn to learn. It would be nice if we could learn to learn. How would you do that? You take a system which you train not on one task but on many tasks, and you ask that it learn to solve these tasks quickly. That may actually be enough. Here's how most traditional meta-learning works. You have a model, which is a big neural network. Instead of training cases you have training tasks, and instead of test cases you have test tasks. Your input, instead of being just the current test case, is all the information about the test task plus the test case, and you try to output the prediction or action for that test case. Basically you say: I'm going to give you ten examples as part of the input to your model; figure out how to make the best use of them. It's a really straightforward idea: you turn the neural network into the learning algorithm by turning a training task into a training case. Training task to training case: this is meta-learning in one sentence.

There have been several success stories which I think are very interesting. One is learning to recognize characters quickly: there is a dataset produced at MIT by Lake et al. with a large number of different handwritten characters, and people have been able to train extremely strong meta-learning systems for this task. Another very successful example of meta-learning is neural architecture search, by Zoph et al. from Google, where they found a neural architecture that solved one small problem well, and it then generalized and successfully solved larger problems as well. This is the small-number-of-bits kind of meta-learning, where you learn the architecture, or maybe even a small program or learning algorithm, which you then apply to new tasks. So that is the other way of doing meta-learning. Either way, what's really happening in meta-learning, in most cases, is that you turn a training task into a training case and pretend this is totally normal deep learning. That's it; this is the entirety of meta-learning, and everything else is minor details.
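As a sketch of what "a training task becomes a training case" can mean in practice (illustrative only; no real dataset, and every name and size is made up), here is how one such case might be assembled: the labeled support examples are packed into the model's input alongside the query.

```python
import numpy as np

# "A training task becomes a training case", as a purely illustrative sketch.
# One meta-training case packs k labeled support examples plus a query into a
# single input vector; the target is the query's label. A model trained on
# many such cases is being trained to learn.

rng = np.random.default_rng(0)

def sample_task(n_classes=5, dim=16):
    """A task = a fresh random mapping from class prototypes to labels."""
    return rng.normal(size=(n_classes, dim))   # one prototype per class

def make_case(prototypes, noise=0.3):
    n, dim = prototypes.shape
    support_x = prototypes + noise * rng.normal(size=(n, dim))   # 1-shot support
    support_y = np.arange(n, dtype=float)
    q_class = int(rng.integers(n))
    query_x = prototypes[q_class] + noise * rng.normal(size=dim)
    # The entire support set becomes part of the model's input.
    model_input = np.concatenate([support_x.ravel(), support_y, query_x])
    return model_input, q_class                # one (training case, label) pair

x, y = make_case(sample_task())
print("input of length", x.shape[0], "-> target label", y)
```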
Now that I've finished the introduction, I want to start discussing work by different people from OpenAI, and I want to start with hindsight experience replay. This was a large effort, led by Andrychowicz et al., to develop a reinforcement learning algorithm that doesn't solve just one task but solves many tasks, and that learns to make use of its experience in a much more efficient way.

I want to discuss one problem in reinforcement learning; it's really a set of problems which are all related to each other. One really important thing you need to learn to do is to explore. You start out in an environment, you don't know what to do: what do you do? One very important thing that has to happen is that you must get rewards from time to time. If you try something and you don't get rewards, how can you learn? That's the crux of the problem. And relatedly, is there any way to meaningfully benefit from your experience, from your attempts, from your failures? If you try to achieve a goal and you fail, can you still learn from it?

Here is the idea: instead of asking your algorithm to achieve a single goal, you learn a policy that can achieve a very large family of goals. For example, instead of reaching one state, you learn a policy that reaches every state of your system. What's the implication? Any time you do something, you achieve some state. So suppose you say, "I want to achieve state A," you try your best, and you end up achieving state B. You can either conclude, "well, that was disappointing, I've learned almost nothing, and I still have no idea how to achieve state A," or alternatively you can say, "wait a second, I've just reached a perfectly good state, namely B. Can I learn how to achieve state B from my attempt to achieve state A?" The answer is yes, you can, and it just works.

There is a small subtlety here which may be interesting to those of you who are familiar with the distinction between on-policy and off-policy learning. When you try to achieve A, you are doing on-policy learning for reaching state A, but off-policy learning for reaching state B, because you would have taken different actions had you actually been trying to reach B. That's why it's important that the algorithm you use here supports off-policy learning. But that's a minor technicality. The crux of the idea is that you make the problem easier by ostensibly making it harder. By training a system which aspires to learn to reach every state, to achieve every goal, to master its environment in general, you build a system which always learns something: it learns from success as well as from failure, because if it tries to do one thing and does something else, it now has training data for how to achieve that something else.

One challenge in reinforcement learning systems is the need to shape the reward. What does that mean? At the start of learning, the system doesn't know much and will probably not achieve your goal, so it's important to design the reward function to give gradual increments, to make it smooth and continuous, so that the system makes progress even when it is not yet very good. If instead you give your system a very sparse reward, where the reward is obtained only when you reach the final state, it becomes very hard for normal reinforcement learning algorithms to solve the problem: you essentially never get the reward, so you never learn. No reward means no learning. But here, because you learn from failure as well as from success, this problem simply doesn't occur.
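Schematically, the relabeling trick might look like the following sketch (an illustrative toy, not the paper's code): every transition is stored twice, once with the original goal and once with the goal replaced by the state actually achieved. As noted above, this requires an off-policy learner.

```python
from collections import namedtuple

# Hindsight experience replay, schematically: after a failed attempt to reach
# goal g, relabel the episode as a *successful* attempt to reach the state
# that was actually achieved. Toy states and rewards are illustrative.

Step = namedtuple("Step", "state action next_state goal")

def her_relabel(episode, reward_fn):
    """Return original transitions plus hindsight-relabeled copies."""
    transitions = []
    achieved = episode[-1].next_state          # the state we actually reached
    for t in episode:
        # Original goal: under a sparse reward, likely 0 (a failure).
        transitions.append((t.state, t.action, t.next_state, t.goal,
                            reward_fn(t.next_state, t.goal)))
        # Hindsight goal: pretend we meant to reach `achieved` all along.
        transitions.append((t.state, t.action, t.next_state, achieved,
                            reward_fn(t.next_state, achieved)))
    return transitions

reward_fn = lambda s, g: float(s == g)         # sparse: 1 only at the goal
episode = [Step("s0", "a0", "s1", goal="sA"),
           Step("s1", "a1", "sB", goal="sA")]  # we aimed for sA, reached sB
for tr in her_relabel(episode, reward_fn):
    print(tr)  # relabeled copies carry reward signal even though sA was missed
```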
So this is nice. I want to show you a video of how this works in practice; let's look at the videos a little bit. It's nice how confidently and energetically it moves the little green puck to its target, and here's another one. We can skip ahead; it works if you do it on a physical robot as well. The point is that the hindsight experience replay algorithm is directionally correct, because you want to make use of all your data, not only a small fraction of it. One huge question is where the high-level states come from, because in the work I've shown you so far, the system is asked to achieve low-level states. I think one thing that will become very important for this kind of approach is representation learning and unsupervised learning: figuring out what the right states are, what the state space of goals worth achieving is.

Now I want to go through some real meta-learning results, and I'll show you a very simple way of doing sim-to-real, going from simulation to a physical robot, with meta-learning. This was work by Xue Bin Peng and others at OpenAI, a really nice intern project from 2017. I think we can agree that in the domain of robotics it would be nice if you could train your policy in simulation and then somehow have this knowledge carry over to the physical robot. Now, we can build simulators that are okay, but they can never perfectly match the real world, unless you want an insanely slow simulator, and the reason is that simulating contacts is super hard. I heard somewhere, correct me if I'm wrong, that simulating friction is NP-complete; I'm not sure, but it's something like that. So your simulation is just not going to match reality. There will be some resemblance, but that's it. How can we address this problem?

I want to show you one simple idea. One thing that would be nice is if you could learn a policy that quickly adapts itself to the real world, and if you want to learn a policy that can quickly adapt, you need to make sure it has opportunities to adapt during training time. So what do we do? Instead of solving the problem in just one simulator, we add a huge amount of variability to the simulator. We randomize the friction, we randomize the masses, the lengths of the different objects and their dimensions; you randomize the physics of the simulator in lots of different ways. And then, importantly, you don't tell the policy how you randomized it. What is the policy going to do? You put it in an environment, and it says, "well, this is really tough. I don't know what the masses are and I don't know what the frictions are; I need to try things out and figure out what the friction is as I get responses from the environment." So you learn a certain degree of adaptability into the policy, and it actually works.
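A schematic sketch of this randomize-and-hide recipe (toy dynamics; the parameter names and ranges are made up for illustration): each episode gets fresh physics the policy never observes, so a memoryless policy cannot succeed, while a recurrent policy can probe the environment and adapt.

```python
import numpy as np

# Domain randomization, schematically. Fresh physics for every training
# episode, hidden from the policy; only a policy with memory can infer the
# dynamics from the environment's responses and adapt.

rng = np.random.default_rng(0)

def sample_sim_params():
    """Randomized physics; names and ranges are illustrative."""
    return {
        "friction":   rng.uniform(0.5, 1.5),
        "puck_mass":  rng.uniform(0.8, 1.2),
        "motor_gain": rng.uniform(0.7, 1.3),
    }

def run_episode(policy_step, hidden, steps=50):
    params = sample_sim_params()        # the policy never sees this dict
    obs = np.zeros(4)
    for _ in range(steps):
        action, hidden = policy_step(obs, hidden)   # recurrent policy step
        # The same action has different effects under different params.
        obs = (obs + (params["motor_gain"] / params["puck_mass"]) * action
               - params["friction"] * 0.1 * obs)
    return obs

# A stand-in for a trained RNN cell, just so the sketch runs end to end.
dummy_policy = lambda obs, h: (np.tanh(obs + h), 0.9 * h + 0.1 * obs)
print(run_episode(dummy_policy, hidden=np.zeros(4)))
```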
Let me show you what happens when you just train a policy in simulation and deploy it on the physical robot. Here the goal is to bring the hockey puck toward the red dot, and you will see that it struggles. The reason it struggles is the systematic differences between the simulator and the real physical robot: even basic movement is difficult for the policy, because its assumptions are violated so much. But if you do the training as I discussed, training a recurrent neural network policy which learns to quickly infer the properties of the simulator in order to accomplish the task, you can then give it the real thing, the real physics, and it does much better. This is not a perfect technique, but it's definitely very promising; it's promising whenever you are able to sufficiently randomize the simulator. And it's very nice to see the closed-loop nature of the policy: it pushes the hockey puck and then corrects it, very gently, to bring it to the goal. So that was a cool application of meta-learning.

I want to discuss one more application of meta-learning, which is learning a hierarchy of actions. This was work by Frans et al.; Kevin Frans, the intern who did it, was actually in high school when he wrote this paper. One thing that would be nice is if reinforcement learning were hierarchical: if, instead of simply taking micro-actions, you had some kind of little subroutines you could deploy. Maybe the term "subroutine" is a little too crude, but the idea is to have some notion of which action primitives are worth starting with. Now, no one has yet been able to get real value from hierarchical reinforcement learning; so far, all the really cool, really convincing results in reinforcement learning do not use it, because we haven't quite figured out the right way to do it. I just want to show you one very simple approach where you use meta-learning to learn a hierarchy of actions. Here's what you do. In this specific work you have a certain number of low-level primitives, let's say ten of them, and you have a distribution of tasks, and your goal is to learn low-level primitives such that, when they are used inside a very brief run of some reinforcement learning algorithm, you make as much progress as possible. The idea is to learn primitives that result in the greatest amount of learning progress possible when used inside learning. This is a meta-learning setup because you need a distribution of tasks; here we had a distribution of little mazes, and in this case the little bug learned three policies, each of which moves it in a fixed direction. As a result of having this hierarchy, you're able to solve problems really fast, but only when the hierarchy is correct. So hierarchical reinforcement learning is still work in progress, and this work is an interesting proof point of what hierarchical reinforcement learning could be like if it worked.
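In the spirit of that setup, here is a sketch of the usage side only (all details made up for illustration; the actual work also meta-trains the primitives themselves across the task distribution): shared movement primitives are held fixed, and each new task learns only a brief high-level choice over them.

```python
import numpy as np

# Hierarchy sketch: shared low-level primitives plus a very brief per-task
# high-level learner. Everything here is a toy stand-in for illustration.

rng = np.random.default_rng(0)
N_PRIMITIVES, HOLD = 3, 10          # number of primitives; steps each runs for

# Shared primitives: here, fixed movement directions in a 2-D world.
primitives = rng.normal(size=(N_PRIMITIVES, 2))
primitives /= np.linalg.norm(primitives, axis=1, keepdims=True)

def solve_task(goal, episodes=200):
    """A very brief learner: estimate each primitive's value on the new task."""
    totals = np.zeros(N_PRIMITIVES)
    counts = np.zeros(N_PRIMITIVES)
    for _ in range(episodes):
        choice = int(rng.integers(N_PRIMITIVES))
        pos = np.zeros(2)
        for _ in range(HOLD):       # commit to one primitive for HOLD steps
            pos += 0.1 * primitives[choice]
        totals[choice] += -np.linalg.norm(pos - goal)   # closer is better
        counts[choice] += 1
    return int((totals / np.maximum(counts, 1)).argmax())

print("best primitive for goal (1, 0):", solve_task(np.array([1.0, 0.0])))
```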
Now I want to spend one slide addressing the limitations of high-capacity meta-learning. The specific limitation is that the distribution over training tasks has to be equal to the distribution over test tasks. I think this is a real limitation, because in reality the new tasks you want to learn will in some ways be fundamentally different from anything you've seen so far. For example, when you go to school you learn lots of useful things, but when you then go to work, only a fraction of what you've learned carries over, and you need to learn quite a few more things from scratch. Meta-learning would struggle with that, because it really assumes that the distribution over the training tasks is equal to the distribution over the test tasks. That's the limitation. I think that as we develop better algorithms for being robust when the test tasks lie outside the distribution of training tasks, meta-learning will work much better.

Now I want to talk about self-play. Self-play is a very cool topic that's starting to get attention only now, and I want to start by reviewing very old work called TD-Gammon. It's from all the way back in 1992, so it's 26 years old now, and it was done by Gerald Tesauro. This work is really incredible because it has so much relevance today. What they did, basically, was to take two neural networks and let them play backgammon against each other, training them with reinforcement learning as they played. It's a super-modern approach; you would think this was a paper from 2017, except that then you look at the plot, which shows curves for ten, twenty, forty, and eighty hidden units, and you notice that the largest neural network works best. So in some ways not much has changed, and this is the evidence. In fact, they were able to beat the world champion in backgammon, and they discovered new strategies that the best human backgammon players had never noticed, and it was determined that the strategies discovered by TD-Gammon were actually better. So that's pure self-play, which then remained dormant until DeepMind's DQN work on Atari. Other examples of self-play include AlphaGo Zero, which was able to learn to beat the world champion at Go without using any external data whatsoever, and another result in this vein is from OpenAI: our Dota 2 bot, which was able to beat a world champion at the 1v1 version of the game.

I want to spend a little bit of time talking about the allure of self-play and why I think it's exciting. One important problem we must face as we try to build truly intelligent systems is: what is the task? What are we actually teaching the systems to do? One very attractive attribute of self-play is that the agents create the environment: by virtue of an agent acting in the environment, the environment becomes difficult for the other agents. You can see here an example of an iguana interacting with snakes that try to eat it, unsuccessfully this time. The arms race between the snakes and the iguana motivates their development, potentially without bound, and this is what happens in biological evolution. Interesting work in this direction was done in 1994 by Karl Sims; there is a really cool video on YouTube, which you should check out, showing all the work he has done. There you have a little competition between agents where you evolve both the behavior and the morphology, with each agent trying to gain possession of a green cube, and you can see that the agents create the challenge for each other; that's why they need to develop.
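As a minimal illustration of the self-play dynamic (a toy game, unrelated to any of the systems discussed): an agent that trains against a copy of itself always faces an opponent exactly at its level, and in rock-paper-scissors a simple multiplicative-weights-style update drives its time-averaged policy toward the mixed equilibrium.

```python
import numpy as np

# Self-play in miniature: the opponent is a copy of me, so the curriculum is
# always exactly at my level. In this zero-sum toy game, the time-averaged
# policy under a multiplicative-weights-style update approaches the uniform
# equilibrium. Hyperparameters are illustrative.

PAYOFF = np.array([[ 0, -1,  1],   # rock     vs (rock, paper, scissors)
                   [ 1,  0, -1],   # paper
                   [-1,  1,  0]])  # scissors

logits = np.array([2.0, 0.0, -2.0])   # start from a deliberately biased policy
lr = 0.1
avg = np.zeros(3)
for step in range(1, 5001):
    p = np.exp(logits - logits.max())
    p /= p.sum()                       # my current mixed strategy
    expected = PAYOFF @ p              # payoff of each action vs. my own policy
    logits += lr * expected            # do more of whatever beats "me"
    avg += (p - avg) / step            # running average of the policy

print("time-averaged policy:", np.round(avg, 3))   # roughly [1/3, 1/3, 1/3]
```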
One thing that we did, and this is work by Bansal et al. from OpenAI, was to ask whether we could demonstrate some unusual results in self-play that would really convince us there is something there. What we did is create a small ring with two humanoid figures whose goal is just to push each other out of the ring. They don't know anything about wrestling, they don't know anything about standing, balance, or centers of gravity; all they know is that if you don't do a good job, your competition is going to do a better job. One of the really attractive things about self-play is that you always have an opponent that's roughly as good as you are. In order to learn, you need to sometimes win and sometimes lose; you can't always win, and sometimes you must fail, sometimes you must succeed. Let's see what happens here. Yes: the green humanoid was able to block the kick. In a well-balanced self-play environment, the competition is always level. No matter how good or how bad you are, you have a competitor that poses exactly the right challenge for you.

One more thing: this video shows transfer learning. You take the little wrestling humanoid, you take its friend away, and you start applying big random forces to it, to see if it can maintain its balance. The answer turns out to be yes, it can, because it's been trained against an opponent that pushes it, and so, even though it doesn't understand where the force is being applied, it's still able to balance itself. This is one potentially attractive feature of self-play environments: you can learn a certain broad set of skills, although it's really hard to control what those skills will be. The biggest open question with this research is how to train agents in a self-play environment such that, whatever they do there, they end up able to solve a battery of tasks that is useful for us and explicitly specified externally.

I also want to highlight one attribute of self-play environments that we've observed with our Dota bots, and that is a very rapid increase in the competence of the bots. Over the course of maybe five months, we've seen the bots go from playing totally randomly all the way to beating a world champion. The reason is that once you have a self-play environment, if you put compute into it, you turn it into data. Self-play allows you to turn compute into data, and I think being able to turn compute into, essentially, data and generalization will be extremely important, simply because the speed of neural net processors will increase very dramatically over the next few years. Neural net cycles will be cheap, and it will be important to make use of this newfound overabundance of cycles.

I also want to talk a little bit about the endgame of the self-play approach. One thing we know about the human brain is that it increased in size fairly rapidly over the past two million years. My theory for why this happened is that our ancestors got to a point where the thing most important for your survival was your standing in the tribe, more than the tiger and the lion. Once the most important thing is how you deal with other beings that have large brains, it really helps to have a slightly larger brain, and I think that's what happened. There exists at least one paper in Science which supports this point of view: apparently there has been convergent evolution, in terms of various social behaviors, between social apes and social birds, even though the divergence between the two lineages occurred a very long time ago and
apes and birds have very different brain structures. So I think that what should happen, if we successfully follow the path of this approach, is that we create a society of agents which will have language, theory of mind, negotiation, social skills, trade, economy, politics, a justice system; all these things should happen inside the multi-agent environment. There will also be an alignment issue: how do you make sure that the agents we train behave the way we want?

I want to make a speculative digression here, with the following observation. If you believe that this kind of society of agents is a plausible place where fully general intelligence will emerge, and if you accept that our experience with the Dota bots, where we've seen a very rapid increase in competence, will carry over once all the details are right, then it should follow that we will see a very rapid increase in the competence of our agents as they live in the society of agents.

Now that we've talked about a potentially interesting way of increasing the competence of agents and teaching them social skills, language, and a lot of the things that exist in humans, I want to talk a little bit about how you convey goals to agents. Conveying goals to agents is just a technical problem, but it will be important, because it is more likely than not that the agents we will eventually train will be dramatically smarter than us. This is work by the OpenAI safety team, Paul Christiano et al. and others, and I'm going to show you a video which basically explains how the whole thing works. There is some behavior you're looking for; the human gets to see pairs of behaviors and simply clicks on the one that looks better; and after a very modest number of clicks, you can get this little simulated leg to do backflips. There it goes, doing the backflips, and to get this specific behavior it took about 500 clicks from human annotators. The way it works is that this is a very data-efficient reinforcement learning algorithm, but efficient in terms of rewards, not in terms of environment interactions. You take all the clicks, where each click says that one behavior is better than another, you fit a numerical reward function that satisfies those clicks, and you optimize this reward function with reinforcement learning. It actually works, and it required about 500 bits of information. We've also been able to train lots of Atari games using several thousand bits of information; in all these cases human annotators, human judges, just like in the previous slide, looked at pairs of trajectories and clicked on the one they thought was better. And here's an example of an unusual goal: this is a car racing game, but the goal was to ask the agent to train the white car to drive right behind the orange car. It's a different goal, and it was very straightforward to communicate it using this approach. So, to finish off: alignment is a technical problem that has to be solved, but of course the determination of the correct goals we want our systems to have will be a very challenging political problem.
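Schematically, fitting a reward function to pairwise clicks can be done as logistic regression on clip-reward differences; the sketch below is illustrative only, not the team's actual code, and uses simulated clicks against a hidden "true" reward.

```python
import numpy as np

# Learning a reward from pairwise preferences, schematically. Each "click"
# says clip A is better than clip B; we fit a linear reward r(s) = w . s so
# preferred clips get higher total reward (a Bradley-Terry style logistic loss).

rng = np.random.default_rng(0)
DIM = 8
w = np.zeros(DIM)                         # parameters of the learned reward

def fit(preferences, lr=0.1, epochs=50):
    """preferences: list of (better_clip, worse_clip); clips are (T, DIM)."""
    global w
    for _ in range(epochs):
        for better, worse in preferences:
            d = better.sum(0) @ w - worse.sum(0) @ w   # total-reward difference
            p = 1.0 / (1.0 + np.exp(-d))               # P(click was "better")
            w += lr * (1.0 - p) * (better.sum(0) - worse.sum(0))  # ascend log-lik

# Simulate ~500 clicks from annotators with a hidden "true" reward direction.
true_w = rng.normal(size=DIM)
clips = rng.normal(size=(1000, 10, DIM))               # 1000 clips, 10 steps each
scores = (clips @ true_w).sum(axis=1)
idx = rng.integers(0, 1000, size=(500, 2))
pairs = [(clips[i], clips[j]) if scores[i] > scores[j] else (clips[j], clips[i])
         for i, j in idx]
fit(pairs)
cos = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
print("recovered reward direction, cosine to truth:", round(float(cos), 3))
```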
On this note, I want to thank you so much for your attention, and I just want to say that there will be a happy hour at Cambridge Brewing Company at 8:45; if you want to chat more about AI and other topics, please come by. I think that deserves an applause.

Q: Neural networks are bio-inspired, but backpropagation doesn't look like what's going on in the brain, because signals in the brain go one direction down the axons, whereas backpropagation requires the errors to be propagated back up the wires. Can you talk a little bit about that situation, where the brain looks as though it's doing something a bit different from our highly successful algorithms? Will our algorithms be improved once we figure out what the brain is doing, or is the brain really sending signals back even though it has no obvious way of doing that?

A: That's a great question. The honest answer is that I don't know, but I have opinions, and I'll say two things. First, given that, as we agreed, it is a true fact that backpropagation solves the problem of circuit search, this problem feels extremely fundamental, and for this reason I think backpropagation is unlikely to go away. Second, you are right that the brain doesn't obviously do backpropagation, although there have been multiple proposals for how it could be doing something like it. For example, there's been work by Tim Lillicrap and others showing that it is possible to learn using a different set of connections for the backward pass, and that this can result in successful learning. The reason this hasn't really been pushed to the limit by practitioners is that they say, "well, I've got TensorFlow to compute the gradients, so I'm just not going to worry about it." But you're right that this is an important issue, and one of two things is going to happen. My personal opinion is that backpropagation is just going to stay with us until the very end, and that we will actually build fully human-level and beyond systems before we understand how the brain does what it does. That's what I believe, but of course it is a difference that has to be acknowledged.
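The Lillicrap et al. idea he references, often called feedback alignment, can be sketched in a few lines (a toy regression problem with illustrative sizes): the backward pass reuses a fixed random matrix instead of the transpose of the forward weights.

```python
import numpy as np

# Feedback alignment, a minimal sketch: the error is propagated through a
# *fixed random* matrix B rather than W2.T, avoiding the "weight transport"
# that makes exact backprop look biologically implausible. Toy problem.

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 10))
y = np.tanh(X @ rng.normal(size=(10, 1)))        # a target function to fit

W1 = 0.1 * rng.normal(size=(10, 64))
W2 = 0.1 * rng.normal(size=(64, 1))
B = rng.normal(size=(1, 64))                     # fixed random feedback weights

lr = 0.05
for step in range(2000):
    h = np.tanh(X @ W1)
    pred = h @ W2
    err = pred - y
    # Exact backprop would use: delta = (err @ W2.T) * (1 - h**2)
    delta = (err @ B) * (1 - h ** 2)             # feedback alignment instead
    W2 -= lr * h.T @ err / len(X)
    W1 -= lr * X.T @ delta / len(X)

print("final MSE:", float((err ** 2).mean()))
```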
Q: Do you think it was a fair matchup between the Dota bot and that person, given the constraints of the system?

A: I'd say that one of the biggest advantages computers have in games like this is that they obviously have a better reaction time, although in Dota in particular the number of clicks per second of the top players is fairly small, which is different from StarCraft. StarCraft is a very mechanically heavy game because of the large number of units, so the top players click all the time; in Dota, every player controls just one hero, which greatly reduces the total number of actions they need to make. Still, precision matters. What I think we'll really discover is that computers have the advantage in every domain, just not yet.

Q: Do you think the emergent behaviors of the agent were in some sense directed, because the constraints already in place forced it to discover those strategies? Or did it actually discover something quite novel on its own, without being biased by the constraints?

A: It definitely discovered new strategies, and I can share an anecdote. We have a pro who would test the bots, and he played against them for a long time, and the bots would do all kinds of things against the human player that were effective. At some point, that pro decided to play against a better pro, and he decided to imitate one of the things the bot was doing, and by imitating it he was able to defeat the better pro. So I think the strategies it discovers are real, and because the strategies discovered by the bot transfer to humans, I think it means that the fundamental gameplay of the two is deeply related.

Q: I've heard that the objective of reinforcement learning is to determine a policy that chooses actions to maximize the expected reward, which is what you said earlier. Would you ever want to look at the standard deviation of possible rewards? Does that even make sense?

A: Yeah, for sure; I think it's really application-dependent. One of the reasons to maximize the expected reward is that it's easier to design algorithms for it: you write down the formula, you do a little bit of derivation, and you get something which amounts to a nice-looking algorithm. There exist applications where you never want to make mistakes and where you would want to work with the standard deviation as well, but in practice, just looking at the expected reward seems to cover a large fraction of the situations where you'd like to apply this. Thanks.

Q: We talked last week about motivations, which has a lot to do with reinforcement, and one of the ideas is that our motivations are actually connection with others and cooperation. I understand it's very popular to have computers play these competitive games, but is there any use in having agents self-play collaborative games?

A: Right, that's an extremely good question. One place from which we can get some inspiration is the evolution of cooperation. We cooperate, ultimately, because it's much better for you as a person to be cooperative than not, and so I think that if you have a sufficiently open-ended game, cooperation will be the winning strategy, and we will get cooperation whether we like it or not.

Q: You mentioned the complexity of simulating friction. Do you feel that there exist open complexity-theoretic problems relevant to AI, or is it just a matter of finding good approximations to the types of problems that humans tend to solve?

A: At a very basic level, we know that whatever algorithm we're going to run will run fairly efficiently on some hardware, and that puts a pretty strict upper bound on the true complexity of the problems we're solving; by definition, we are solving problems which aren't too hard in a complexity-theoretic sense. At the same time, while the overall thing we do is not hard in that sense, and indeed humans cannot solve NP-complete problems in general, it is true that many of the optimization problems we pose to our algorithms are intractable in the general case, starting from neural net optimization itself. It is easy to create a family of datasets for a neural network with a very small number of neurons
such that finding the global optimum is NP-complete. So how do we avoid that? We just try gradient descent anyway, and somehow it works. But without question, we do not solve problems which are truly intractable. I hope this answers the question.

Q: It seems like an important sub-problem on the path toward AGI will be understanding language, and the state of generative language modeling right now is pretty abysmal. What do you think are the most productive research trajectories toward generative language models?

A: I'll first say that you are completely correct that the situation with language is still far from great, although progress has been made. Even without any particular innovation beyond models that exist today, simply scaling up the models we have on larger datasets is going to go surprisingly far; and not even larger datasets so much as larger and deeper models. For example, if you trained a language model with a thousand layers, even with today's layers, I think it's going to be a pretty amazing language model. We don't have the cycles for it yet, but I think that will change very soon. Now, I also agree that there are some fundamental things missing in our current understanding of deep learning which prevent us from really solving the problem we want. One of the things that seems patently wrong is the fact that we train a model and then we stop training it and freeze it, even though it's the training process where the magic really happens. The training process is the truly general part of the whole story: TensorFlow doesn't care which dataset it optimizes; it says, whatever, just give me the dataset, I'll solve them all. The ability to do that feels really special, and we are not using it at test time. It's hard to speculate about things one doesn't know the answer to, but I'll say this: simply training bigger, deeper language models will go surprisingly far, and doing things like training at test time, inference-time training, would be another important boost to performance.

Q: Thank you for the talk. It seems like another interesting approach to reinforcement learning problems could be to go the evolutionary route, using evolution strategies, although they have their caveats. Is OpenAI working on something related, and what is your general opinion of them?

A: At present, I believe that evolution strategies are not great for reinforcement learning; I think normal reinforcement learning algorithms, especially with big policies, are better. But if you want to evolve a small, compact object, like a piece of code for example, I think that would be a place where this would be seriously worth considering. Evolving a beautiful piece of code is a cool idea that hasn't been done yet, so there's still a lot of work to be done before we get there.

Q: Thank you so much for coming. You mentioned that deciding what the right goal is is a political problem. Can you elaborate a bit on that, and what do you think the approach might be for us to get there?

A: I can't really comment too much, because, although we now have a few people who are
thinking about this full-time at OpenAI, I don't have a strong enough opinion to say anything too definitive. All I can say, at a very high level, is this: whenever it happens, sooner or later, and it will happen because the brain is physical, when you build a computer which can do anything better than a human, the impact on society is going to be completely massive and overwhelming. It's very difficult to imagine, even if you try really hard. I think what that means is that many people will come to care about this strongly, and that's what I was alluding to: as the impact increases gradually, with self-driving cars and more automation, I think we will see a lot more people care.

Q: Do we need a very accurate model of the physical world, simulated in full, in order to have these agents eventually come out into the real world and do something approaching human-level intelligence?

A: That's a very good question. I think if that were the case we would be in trouble, and I am fairly certain it can be avoided. Specifically, the real answer has to be that you learn in the simulation: you learn to negotiate, you learn to persist, you learn lots of different useful life lessons, and yes, you learn some physics too. But then you go outside into the real world and you have to start over to some extent, because many of your deeply held assumptions will be false there. That's one of the reasons I care so much about never stopping training: you've accumulated your knowledge, you go into an environment where some of your assumptions are invalid, you continue training, and you try to connect the new data to your old data. This is an important requirement on our algorithms, one which is already met to some extent but will have to be met a lot more, so that you can take the partial knowledge you've acquired, go into a new situation, and learn some more. It's literally the example of going to school and then going to work: your four years of CS undergrad are not going to fully prepare you for whatever you need to know at work, but they help; you'll be able to get off the ground, though there will be lots of new things to learn. That's the spirit of it; I think of it as the purpose of school.

Q: One of the things you mentioned pretty early in your talk is that a limitation of this style of reinforcement learning is that there's no self-organization: you have to tell the system when it did a good thing and when it did a bad thing. That's actually a problem in neuroscience too: when you're trying to teach a rat to navigate a maze, you have to artificially tell it what to do. So where do you see the research moving forward in that respect? How do you introduce this notion of self-organization?

A: I think, without question, one really important thing we need is the ability to infer the goals and strategies of other agents by observing them. That's a fundamental skill we need to be able to embed into the agent, so that if, for example, you have two agents and one of them is doing something, the other agent can say, "well, that's really cool, I want to be able to do that too," and go and do it. So I'd say that this is a very
important component. Instead of specifying the reward, you see what other agents do, you infer the reward, and then you have a knob which says: you see what they're doing, now go and try to do the same thing. As far as I know, this is one of the important ways in which humans are quite different from other animals, in the scale and scope at which we copy the behavior of other humans.

Q: May I ask a quick follow-up?

A: Go for it.

Q: It's kind of obvious how that works in the scope of competition, but what about arbitrary tasks? Say I'm in a math class and I see someone doing a problem a particular way, and I think, "that's a good strategy, maybe I should try that out." How does that work in a non-competitive environment?

A: I think that's going to be a little bit separate from the competitive setting, but it will have to be, somehow, either baked in or maybe evolved into the system. If you have other agents doing things, they're generating data which you observe, and the only way to truly make sense of the data you see is to infer the goal of the agent, their strategy, and their belief state. That's important also for communicating with them: if you want to successfully communicate with someone, you have to keep track both of their goal and of their belief state and knowledge. So I think you will find that there are many connections between understanding what other agents are doing, inferring their goals, imitating them, and successfully communicating with them.

All right, let's give him a big hand. [Applause]
Info
Channel: Lex Fridman
Views: 92,578
Rating: 4.9519615 out of 5
Keywords: deep learning, openai, agi, mit, ai, reinforcement learning, Ilya Sutskever, self-play, deep rl, deep reinforcement learning, meta-learning, artificial general intelligence, recurrent neural network, imagenet, dota 2, gpt2
Id: 9EN_HoEk3KY
Length: 60min 15sec (3615 seconds)
Published: Wed Apr 25 2018