OpenAI - Meta Learning & Self Play - Ilya Sutskever

Captions
Everyone, it's a great pleasure to introduce Ilya Sutskever. Ilya did his PhD at the University of Toronto working with Geoff Hinton, then went on to do a postdoc at Stanford, then founded a company that got acquired by Google, and from there co-founded OpenAI, the AI research institute in San Francisco largely funded by Elon Musk. Along the way Ilya has written a lot of the papers that many of us are building on these days: for example the ImageNet paper from 2012, the first big deep learning result for image recognition, which sparked the whole activity in the field; after that, papers on the specifics of how to do this, such as dropout; the sequence-to-sequence paper, which showed that deep learning actually works for discrete objects like language and established a new state of the art in machine translation; some of the learning-to-execute papers and Neural Turing Machines; and more recently a lot of meta-learning and reinforcement learning work. One of the most notable things, I think, is that even though Ilya is still super young, in 2017 alone his papers were cited 20,000 times. Please join me in welcoming Ilya. [Applause]

I'm going to talk about some of the work we've done at OpenAI over the past year. The talk will be a narrow subset of that work, focusing on meta-learning and self-play, which are two topics I like very much, but I've been told that this should be a slightly broader, a little bit more general-interest talk, so I want to begin the presentation by talking a little bit about why deep learning actually works. I think it's not a self-evident question; it's not self-evident that it should work, and I want to give some perspective on that which I think is not entirely obvious.

One thing you can actually prove mathematically is that the best possible way of generalizing, one that is completely unimprovable, is to find the best short program that explains your data and then use that program to make predictions; you can prove that it's impossible to do better than that. If you think about machine learning, you need to think about concept classes: what are you looking for, given the data? If you're looking for the best short program, it's impossible to generalize better than that. It can be proved, and the proof is not even that complicated. The intuition is basically that any regularity that can possibly exist is expressible as a short program, and if you have a piece of data which cannot be compressed by a program slightly shorter than that piece of data, it is random. So you can take my word for it that short programs are the best possible way to generalize, if only we could use them. The problem is that it is impossible to find the best short program that describes the data, at least given today's knowledge: the computational problem of finding the best program is intractable in practice and undecidable in theory. So, no short programs for us. But what about small circuits? Small circuits are the next best thing after short programs, because a small circuit can also perform a non-obvious computation. If you have a really deep, really wide circuit, maybe many thousands of layers and many millions of neurons wide, you can run lots of different algorithms inside it, so it comes close to short programs. And, extremely fortunately, the problem of finding the best small circuit given the data is solvable with backpropagation.
So basically what it boils down to is that we can find the best small circuit that explains the data, and small circuits are kind of like programs, but not really; they are a little bit worse. It's like finding the best parallel program that runs for a hundred steps, or fifty steps, and solves your problem, and that's where the generalization comes from. Now, we don't know exactly why backpropagation is successful at finding the best small circuit given your data. It's a mystery, and it's a very fortunate mystery: it powers all the progress that's been made in artificial intelligence over the past six years. So I think there is an element of luck here; we are lucky that it works. One useful analogy I like to make when thinking about generalization is that learning models which in some sense have greater computational power generalize better. You could make the case that the deeper your neural network is, the closer it comes to the ultimate best short programs, and so the better it generalizes. So that tries to touch on the question of where generalization comes from. I think the full answer is going to be unknown for quite some time, because it also has to do with the specific data of the problems we happen to want to solve; it is very nice indeed that the problems we want to solve happen to be solvable with these classes of models. One other statement I want to make is that I think the backpropagation algorithm is going to stay with us until the very end, because the problem it solves is so fundamental: given data, find the best small circuit that fits it. It seems unlikely that we will not want to solve this problem in the future, and so for this reason I feel that backprop is really important.

Now I want to spend a little bit of time talking about reinforcement learning. Reinforcement learning is a framework for describing the behavior of agents: you've got an agent which takes actions, interacts with an environment, and receives rewards when it succeeds. It's pretty clear that it's a very general framework, but the thing that makes reinforcement learning interesting is that there exist useful algorithms for it. In other words, the algorithms of reinforcement learning make the framework interesting. Even though these algorithms still have a lot of room for improvement, they can already succeed at lots of non-obvious tasks, and so it's worth pushing on them: if you make really good reinforcement learning algorithms, perhaps we'll build very clever agents. The way the reinforcement learning problem is formulated is as follows. You have some policy class, where a policy is just some function which takes observations and produces actions, and for any given policy you can run it and figure out its performance, its cost. Your goal is simply to find the best policy, the one that minimizes cost or maximizes reward. Now, one way in which this formulation is different from reality is that in reality the agents generate the rewards for themselves, and the only true cost function that exists is survival. If you want to build any reinforcement learning algorithm at all, you need to represent the policy somehow, and the answer is, as always, to use a neural network: the neural network is going to take the observations and produce actions.
Then, for a given setting of the parameters, you can calculate how good they are, and you can figure out how to change those parameters to improve the model. If you change the parameters of the model many times and make many small improvements, you may make a big improvement, and very often in practice the improvement ends up being big enough to solve the problem.

So I want to talk a little bit about how reinforcement learning algorithms work, the ones that everyone uses today. You take a policy and you add a little bit of randomness to your actions somehow, so you deviate from your usual behavior, and then you simply check whether the resulting cost was better than expected; if it is, you make that behavior more likely. By the way, I'm actually curious how many people are familiar with the basics; please raise your hand. Okay, so the audience here is informed, so I can skip through the introductory part. All right, I'll skip only a little bit, but the point is: you do something random, you see if it's better than usual, and if it is, you do more of that, and you repeat this many times.

In reinforcement learning there are two classes of algorithms. One of them is called policy gradients, which is basically what I just described: there is a beautiful formula which says that if you just take the derivative of your cost function and do a little bit of math, you get something which is exactly as described, where you take some actions with a little bit of randomness, and if the result is better than expected, you increase the probability of taking those actions in the future. Then there is also the Q-learning algorithm, which is a little bit less stable and a little bit more sample efficient. I won't explain in too much detail how it works, but it has the property that it is off-policy, which means it can learn not just from its own actions. On-policy means you can only learn if you are the one taking the actions, while off-policy means you can learn from anyone's actions, so it seems like a more useful thing. Although it's interesting that the algorithms which are more stable tend to be the policy-gradient ones, which are on-policy, while Q-learning, which is off-policy, is also less stable, at least as of today; things change quickly.

Now I'll spend a little bit of time on how Q-learning works, even though this may be familiar to many people. You have this Q function which tries to estimate, for a given state and a given action, how good or bad the future is going to be, and you have a trajectory of states, because your agent is taking many actions in the world, relentlessly pursuing a goal. The Q function has a recursive property: Q(s, a) is basically just the reward you got plus the Q value of the next state and action, Q(s', a'), so you can use this recursion to estimate the Q function, and that gives you the Q-learning algorithm. I won't explain why it's off-policy; for the purposes of this presentation, just take my word for it.
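To make the two families concrete, here is a minimal sketch in Python of a REINFORCE-style policy-gradient update and a tabular Q-learning update. The tabular setup, learning rates, and discount factor are illustrative assumptions, not the specific implementations used in the work discussed in the talk.

```python
import numpy as np

# Policy gradients, as described above: act with a bit of randomness, and if the
# return is better than expected (the baseline), make those actions more likely.
def policy_gradient_update(theta, states, actions, rewards, baseline, lr=0.01):
    """theta: table of action logits, shape (n_states, n_actions), defining a softmax policy."""
    grad = np.zeros_like(theta)
    advantage = sum(rewards) - baseline            # was the episode better than expected?
    for s, a in zip(states, actions):
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()                       # softmax over actions in state s
        one_hot = np.zeros(theta.shape[1])
        one_hot[a] = 1.0
        grad[s] += (one_hot - probs) * advantage   # grad of log pi(a|s) times the advantage
    return theta + lr * grad

# Q-learning, using the recursive property Q(s, a) ~ r + gamma * max_a' Q(s', a').
# The transition (s, a, r, s') can come from any behavior policy, which is why it is off-policy.
def q_learning_update(Q, s, a, r, s_next, gamma=0.99, lr=0.1):
    """Q: table of shape (n_states, n_actions)."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    return Q
```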
Now, what's the potential here? Why is this exciting? Yes, the reinforcement learning algorithms we have right now are very sample inefficient and really bad at exploration, although progress is being made. But you can kind of see that if you had a really great reinforcement learning algorithm that was really data efficient, explored really well, and made really good use of lots of sources of information, then we'd be in good shape in terms of the goal of building intelligent agents. But we still have work to do.

So now I want to talk a little bit about meta-learning, which will be an important part of this talk, and I want to explain what it is. There is the abstract dream of meta-learning: the idea that you can learn to learn, kind of in the same way in which biological evolution has learned the learning algorithm of the brain. In spirit, the way you'd approach this problem is by training a system not on one task but on many tasks, and if you do that, then suddenly you've trained your system to solve new tasks very quickly. It would be a nice thing if you could do that: it would be great if we could learn to learn, because then we wouldn't need to design the learning algorithm ourselves; we'd use the learning algorithm we have right now to do the rest of the thinking for us. We're not quite there yet, but meta-learning has had a fair bit of success, and I just want to explain the dominant, most common way of doing it, which is also the most attractive one. You basically say that you want to reduce the problem of meta-learning to traditional deep learning: you take your familiar supervised learning framework and you replace each data point with a task from your training set of tasks. All these algorithms have the same kind of high-level shape, where you have a model which receives information about the task plus a task instance, and it needs to make a prediction. It's pretty easy to see that if you do that, you will train a model which can receive a description of a new task and make good predictions on it.

There have been some pretty compelling success stories, and I'll mention some of them; a lot of meta-learning work was done at Berkeley as well, but I'll mention the ones that I think are notable. You see this task right here: I took this figure from a paper by Brenden Lake et al., though I think the dataset came earlier, so this may not be quite the right citation. One of the ways in which neural nets were criticized is that they can't learn quickly, which is kind of true, and the team in Josh Tenenbaum's lab developed this dataset, with a very large number of different characters and a very small number of examples for each character, specifically as a challenge for neural networks. It turns out that the simple meta-learning approach, where you just say that you want to train a neural network that can learn to recognize any character really quickly, works super well, and it's been able to get superhuman performance. As far as I know the best performance is achieved by Mishra et al., and I believe it's work done with Pieter Abbeel; it's basically superhuman, and it's just a neural net. So meta-learning sometimes works pretty well.
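A minimal sketch of the reduction described above, where each training example is an entire task rather than a single data point. The `sample_task` and `model` interfaces below are assumed placeholders: any model that conditions on a small support set (recurrent, attention-based, and so on) fits this shape.

```python
import torch
import torch.nn.functional as F

def meta_train_step(model, sample_task, optimizer, tasks_per_step=16):
    """One outer-loop step: the model is shown a few labeled examples (the task
    description) plus query inputs, and is trained end to end to predict the query
    labels, so that at test time it can adapt to a brand-new task quickly."""
    optimizer.zero_grad()
    total_loss = 0.0
    for _ in range(tasks_per_step):
        # sample_task() is assumed to return a support set and held-out queries.
        support_x, support_y, query_x, query_y = sample_task()
        # The model is assumed to condition on the support set, e.g. via recurrence or attention.
        logits = model(support_x, support_y, query_x)
        total_loss = total_loss + F.cross_entropy(logits, query_y)
    (total_loss / tasks_per_step).backward()
    optimizer.step()
```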
There is also a very different take on meta-learning, which is a lot closer to the approach of: instead of learning the parameters of a big model, let's learn something compact and small, like the architecture or even the algorithm, which is what evolution did. Here you just say: why don't you search in architecture space and find the best architecture? This is also a form of meta-learning, and it also generalizes really well, because if you learn an architecture on a small image dataset, it works really well on a large image dataset too. The reason it generalizes well is that the amount of information in an architecture is small. This is work from Google by Zoph and Le. So meta-learning works sometimes; there are signs of life, and the promise is very strong, it's just so compelling: you set everything up right, and your existing learning algorithm learns the learning algorithm of the future. That would be nice.

So now I want to dive into a detailed description of one algorithm we've developed. It's called hindsight experience replay, and it's been a large collaboration with many people, driven primarily by Marcin Andrychowicz. This is not exactly meta-learning; it's almost meta-learning. One way to think about what this algorithm does is that you try to solve a hard problem by making it harder, and as a result it becomes easier: you frame one problem in the context of many problems that you are trying to solve simultaneously, and that makes it easier. The problem here is basically one of exploration. In reinforcement learning we need to take the right action; if you don't take the right action, you don't get rewards, and if you don't get rewards, you don't learn, so all the effort that doesn't lead to reward is wasted. It would be nice if you didn't have that. So if our rewards are sparse, and you try to achieve a goal and fail, the model doesn't learn. How do we fix that? It's a really simple idea, and it's super intuitive. You have a starting point, you try to reach state A, but you reach state B instead. Can we learn something from this? Well, we have data: we have a trajectory of how to reach state B, so maybe we can use this flawed attempt at reaching A as an opportunity to learn to reach B. Directionally, this means you don't waste experience. But you need an off-policy algorithm in order to learn this way, and that's why I emphasized the off-policy property earlier: your policy tries to reach A, but you're going to use this data to teach a different policy, the one whose goal is B. So you have this big parametrized function and you simply tell it which state you want it to reach. It's super straightforward and intuitive, and it works really well; that's hindsight experience replay.

So I'm going to show you the video; it's pretty cool. In this case the reward is very sparse and binary, and I should say that because the reward is sparse and binary, this is very hard for traditional reinforcement learning algorithms, because you never get to see the reward. If you were to shape your reward, perhaps you could solve these problems a little bit better, although the people who worked on this tried that and still found it difficult. But this algorithm just works on these cool tasks, and the videos look cool, so let's keep watching. You get these very nice, confident-looking movements from the hindsight experience replay algorithm, and it just makes sense: anytime something happens, you want to learn from it.
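A minimal sketch of the relabeling trick at the heart of hindsight experience replay, assuming a goal-conditioned off-policy learner with a replay buffer. The buffer API and reward function here are illustrative assumptions rather than the exact implementation from the paper.

```python
def her_relabel(episode, replay_buffer, reward_fn):
    """episode: list of (state, action, next_state) transitions collected while trying to
    reach some goal A that was never reached. Pretend the state we actually ended up in
    was the goal all along, so the failed attempt still produces useful training data."""
    achieved_goal = episode[-1][2]                       # "state B", the state we did reach
    for state, action, next_state in episode:
        reward = reward_fn(next_state, achieved_goal)    # e.g. 1.0 if close enough, else 0.0
        # Store the transition relabeled with the achieved goal; an off-policy learner
        # (this is why off-policy matters) can then learn the policy for reaching B from it.
        replay_buffer.add(state, achieved_goal, action, reward, next_state)
```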
We want this to be the basis of all future algorithms. Now, again, this is in the absolutely sparse, binary reward setting, which means that the standard reinforcement learning algorithms are very disadvantaged. Even if you try to shape a reward, one thing you discover is that shaping rewards is sometimes easy but sometimes quite challenging. And here's the same thing working on real physical blocks. Okay, so this basically sums up the hindsight experience replay results. One of the limitations of all these results is that the state is very low-dimensional; in a general environment with very high-dimensional inputs and very long histories, you've got the question of how to represent your goals. What that means is that representation learning is going to be very important. Unsupervised learning probably doesn't work yet, but I think it's pretty close, and we should keep thinking about how to really fuse unsupervised learning with reinforcement learning. I think this is a fruitful area for the future.

Now I want to talk about a different project, on doing transfer from simulation to the real world with meta-learning; this work is by Peng et al., with multiple people from Berkeley, and unfortunately I don't have the full list here. It would be nice if we could train our robots in simulation and then deploy them on physical robots. Simulation is easy to work with, but it's also very clear that you can't simulate most things, so can anything be done here? I just want to explain one very simple idea of how you could do that, and the answer is basically that you train a policy that doesn't just solve the task in one simulated setting, but solves it in a whole family of simulated settings. What does that mean? You say, okay, I'm going to randomize the friction coefficients, the gravity, and pretty much anything else you can think of: the lengths of the robotic limbs, their masses, the frictions, the sizes. Your policy isn't told what you've done; it needs to figure it out by interacting with the environment. If you do that, you develop a robust policy that's pretty good at figuring out what's going on, at least in the simulations, and if this is done, the resulting system will be much more likely to generalize its knowledge from the simulation to the real world. This is an instance of meta-learning, because in effect you are learning a policy which is very quick at identifying the precise physics regime it's in. I would say calling it meta-learning is a bit of a stretch; it's more of a robust, adaptive-dynamics kind of thing, but it also has a meta-level flavor to it. I'm going to show this video of the baseline: this is what happens when you don't do this robustification of the policy. You try to get the hockey puck to the red dot, and it just fails really dramatically; it doesn't look very good. If you add these robustifications, the results look better: even when it pushes the puck around and overshoots, it's just no problem, so it looks pretty good. I think this example illustrates the approach of training a policy in simulation and making sure the policy doesn't solve just one instance of the simulation but many different instances of it, figuring out which one it is in; then it can successfully generalize to the real physical robot. So that's encouraging.
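A minimal sketch of the randomization loop just described: sample fresh physical parameters every episode and never tell the policy which ones were drawn. The parameter names, ranges, and environment interface below are made-up placeholders for illustration.

```python
import random

def run_randomized_episode(policy, make_sim):
    """Each episode gets freshly sampled physics that the policy is never told about,
    so it has to infer the dynamics from its own observations."""
    physics = {
        "friction":  random.uniform(0.5, 1.5),   # made-up ranges for illustration
        "gravity":   random.uniform(8.0, 11.0),
        "link_mass": random.uniform(0.8, 1.2),
    }
    env = make_sim(**physics)                     # assumed simulator factory
    obs, done, trajectory = env.reset(), False, []
    while not done:
        # A recurrent policy can use the history so far to identify which physics it is in.
        action = policy.act(obs, trajectory)
        obs, reward, done = env.step(action)      # assumed environment interface
        trajectory.append((obs, action, reward))
    return trajectory
```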
Now I want to talk about another project, by Frans et al., and it's about doing hierarchical reinforcement learning. Hierarchical reinforcement learning is one of those ideas that would be nice if we could get to work, because one of the problems with reinforcement learning as it's done today is that you have very long horizons, which are hard to deal with; exploration is not very directed, so it's not as fast as you would like; and credit assignment is challenging as well. So we can take a very simple meta-learning approach, where you basically say that you want to learn low-level actions which make learning fast: you have a distribution over tasks, and you want to find a set of low-level policies such that, if you use them inside a reinforcement learning algorithm, you learn as quickly as possible. If you do that, you can learn pretty sensible locomotion strategies that go in a persistent direction. So here it is: there are the low-level policies plus the high level, and the system has learned to find the sub-policies that solve problems like this one as quickly as possible, where there is a specific distribution over this kind of problem. So that's pretty nice.

Now, one thing I want to mention here is one important limitation of high-capacity meta-learning. There are two ways to do meta-learning: one is by learning a big neural network that can quickly solve problems from your distribution of tasks, and the other is by learning an architecture or an algorithm, something small. If you learn an architecture or an algorithm in a meta-learning setting, it will likely generalize to many other tasks, but this is not the case, or at least it is much less the case, for high-capacity meta-learning. If you, for example, train a very large recurrent neural network that solves many tasks, it will be very committed to the distribution of tasks you've trained it on, and if you give it a task that's meaningfully outside that distribution, it will not succeed. The kind of example I have in mind is this: let's say you take your system and you train it to do a little bit of math and a little bit of programming, and you teach it how to read; could it do chemistry? Well, not according to this paradigm, at least not obviously, because it really needs the tasks at test time to come from the same distribution as the tasks at training time. So I think for this to work we need to improve the generalization of our algorithms further.
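Going back to the hierarchical setup a moment ago, here is a rough structural sketch of that outer loop, under the assumption of user-supplied training callbacks: a fresh high-level controller per sampled task, a warm-up phase where only that controller adapts, then joint training, so the shared sub-policies end up being the ones that make new tasks from the distribution fast to learn.

```python
def hierarchy_meta_iteration(sub_policies, sample_task, make_master,
                             train_master_only, train_jointly,
                             warmup_episodes=20, joint_episodes=40):
    """One outer iteration of the shared-hierarchy scheme described above.
    All callables are assumed placeholders supplied by the caller."""
    task = sample_task()
    master = make_master(num_options=len(sub_policies))   # picks a sub-policy every few steps
    for _ in range(warmup_episodes):
        train_master_only(master, sub_policies, task)     # sub-policies stay frozen here
    for _ in range(joint_episodes):
        train_jointly(master, sub_policies, task)         # now the sub-policies improve too
    return sub_policies
```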
Now I want to finish by talking about self-play. Self-play is a really cool topic; it's been around for a long time, and I think it's really interesting, intriguing, and mysterious. I want to start with the very earliest work on self-play that I know of, and that's TD-Gammon. It was done back in 1992; it was single-author work by Tesauro, and it used Q-learning with self-play to train a neural network that beat the world champion at backgammon. This may sound familiar in 2017 and 2018, but that was 1992, back when your CPUs ran at, I don't know, 33 megahertz or something. If you look at this plot, it shows the performance as a function of training for different numbers of hidden units: 10 hidden units is the red curve, 20 hidden units is the green curve, all the way to the purple curve. Basically nothing has changed in 25 years, just the number of zeros in the number of hidden units. In fact, they even discovered unconventional strategies that surprised experts in backgammon. So it's just amazing that this work was done so long ago and was looking so far into the future. This approach basically remained dormant, with people trying it out a little bit, but it was really revived by the Atari results from DeepMind. We've also had very compelling self-play results in AlphaGo Zero, where they could train a very strong player, from no knowledge at all, to the point of beating all humans. The same is true of our Dota 2 results: it again started from zero and just did lots and lots of self-play.

I want to talk a little bit about why I think self-play is really exciting. Self-play makes it possible to create very simple environments that support potentially unbounded complexity: unbounded sophistication in your agents, unbounded scheming and social skills. Hence its relevance for building intelligent agents. There is work on artificial life by Karl Sims from '94, and you can see that already there it looks very, very familiar: you see these little evolved creatures, whose morphologies are evolved as well, competing for possession of a little green cube. Again, this was done in 1994 on tiny computers, and just like other promising ideas you may be familiar with, it didn't have enough compute to really push it forward. But I think this is the kind of thing we could get with large-scale self-play, and I want to show some work we've done trying to revive this concept a little bit. I'm going to show this video; this is work by Bansal et al., a productive summer internship. There is a bit of music here; actually, we can keep it on. The point is, you've got a super simple environment, which in this case is just a sumo ring, and you just tell the agents that they get a +1 when the other agent ends up outside the ring. The reason I personally like this is that these things look alive, and they have this breadth of complicated behaviors that they learn just in order to stay in the game, so you can kind of see where this could go if you let your imagination run wild. This self-play is not symmetric, and also these humanoids are a bit unnatural, because they don't feel pain, they don't get tired, and they don't have a whole lot of energy constraints. "Blocked it, that was good." "That's pretty good too." "Here's the goal; you can guess what the goal is." "Wow, that was a nice dodge."

So one of the things that would be nice is if you could take these self-play environments, train agents to do some kind of task through the self-play, and then take the agent outside and get it to do something useful for us. I think if that were possible, it would be amazing. Here there is the tiniest of tests, where we take the sumo-wrestling agent and we put it, isolated and alone, inside the ring (it doesn't have a friend), and we just apply big forces to it and see if it can balance itself. And of course it can balance itself, because it's been trained against an opponent that tried to push it, so it's really good at resisting forces in general.
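A minimal sketch of a competitive self-play loop of the kind just described, for example with a +1 reward when the opponent ends up outside the ring. The `play_match` and `rl_update` callables are assumed placeholders; keeping a pool of past opponents rather than a single one is for stability.

```python
import copy
import random

def self_play_iteration(policy, opponent_pool, env, play_match, rl_update, matches=32):
    """One iteration of competitive self-play: play against sampled past versions of yourself,
    update with any RL algorithm on the win/loss signal, then add the new policy to the pool."""
    episodes = []
    for _ in range(matches):
        opponent = random.choice(opponent_pool)        # evenly matched by construction
        episodes.append(play_match(env, policy, opponent))
    rl_update(policy, episodes)                        # e.g. policy gradients on the +1/-1 result
    opponent_pool.append(copy.deepcopy(policy))        # the curriculum builds itself
    return policy
```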
So the mental image here is: imagine you take a ninja and then you ask it to learn to become a chef; because the ninja is already so dexterous, it should have a fairly easy time becoming a very good cook. That's the kind of high-level idea here; it hasn't happened yet. I think one of the key questions in this line of work is how you can set up a self-play environment such that, once you succeed, the agents can solve useful tasks for us, tasks which are different from the environment itself. That's the big difference from games: in games, the goal is to actually win in the environment, but that's not what we want. We want the agent to just be generally good at being clever and then go solve our problems, a "do my homework" type of agent.

Now I want to show one slide which I think is interesting. I would like to ask you to let your imaginations run wild and imagine that the designers of neural-net hardware have built enormous, giant computers, and this self-play has been scaled up massively. One thing that's notable, that we know from biological evolution, is that social species tend to have large brains; they tend to be smarter. It is very often the case that whenever you have two species which are related, but one is social and one isn't, the social one tends to be smarter. We know that human biological evolution really accelerated over the past few million years, probably because, and this is a bit speculative, but the theory here, my theory at least, is that humans became sufficiently competent with respect to their environment that you stop being afraid of the lion, and the biggest concern became the other humans: what do the other humans think of you, what are they gossiping about you, where do you stand in the pecking order. I think this kind of environment created an incentive for large brains. And, as is often the case in science, it's very easy to find some scientific support for a hypothesis, which we did: there exists a paper in Science which supports the claim that social environments stimulate the development of larger brains, and the specific evidence they present is the convergent evolution of smart social apes and smart birds, the crows; apparently they have similar cognitive functions even though they have very different brain structures. Now, I'm only 75% confident in this claim, but I'm pretty sure that birds don't have the same kind of cortex as we do, because the evolutionary split occurred a long time in the past. So I think this is intriguing at the very least: you could create your society of agents and just keep scaling it up, and perhaps you're going to get agents that are smart.

Now I want to finish with one observation about agents that are trained with self-play. This is a plot of the strength of our Dota bot as a function of time, going from April all the way to August, and basically you just fix the bugs, you scale up the self-play environment, you scale up the amount of compute, and you get a very rapid increase in the strength of the system.
It makes sense: in software environments, the compute is the data, so you can generate more of it. So I guess I want to finish with a provocative question, which is: if you have a sufficiently open-ended software environment, will you get an extremely rapid increase in the cognitive ability of your agents, all the way to superhuman? On this note I will finish the presentation. Thank you so much for your attention. Before we start the question-and-answer session, one important thing I want to say is that many of these works were done in collaboration with many people from Berkeley, and especially Pieter Abbeel, and I want to highlight that.

Okay, great. I wonder if you can show the last slide, because it seemed like a very important conclusion, but you went over it very quickly.

Yes, so this is a bit speculative, and it really comes down to this question. The specific statement here is: if you believe that you will get smart, human-level agents as a result of some kind of massive-scale self-play, will you also experience the same kind of rapid increase in the capability of the agent that we saw in our experience with Dota, and in general, because you can convert compute into data, so you put in more compute and the thing gets better?

So that's sort of a general rule: you compute more, you get better results. But I didn't quite grasp the difference between these two panels.

Well, it really boils down to the question of where the limits to progress in the field and in capabilities come from. In other words, given the right algorithms, which don't yet exist, once you have them, how will the increase in the actual capability of the system look? I think there is definitely a possibility that it will look like the right-hand side: once you've figured out your, let's say, oracle reinforcement learning, you've figured out concept learning, your unsupervised learning is in good shape, and then the massive neural-net hardware arrives and you have a huge neural net, much bigger than the human brain, how will that plot look over time?

So you're projecting that we've only seen the very beginning. Okay, so let's start with the questions, and I see you already have some.

Thank you for that. You mentioned hierarchy, and I'm wondering if you have an example of hierarchical self-play that would increase the slope of this curve.

We have not tried hierarchical self-play. This is more a statement from our experience with our Dota bot, where you start out basically losing to everyone, and then your TrueSkill metric, which is like an Elo rating, just increases pretty much linearly, all the way to the best humans. It seems like this could be a general property of self-play systems.

Very nice talk, thank you. I had a question on environments. Do you have any thoughts on going beyond sumo-wrestling-style environments? What are good environments to study?

Well, there is the question of what makes a good environment. There are two ways of getting good environments: one of them is from trying to solve problems that you care about, which naturally generate environments; another is to think of open-ended environments where you can build a lot.
One of the slightly unsatisfying features of most of the environments we have today is that they're a little bit not open-ended: you've got a very narrow domain and you want to perform a task in that narrow domain. But some environments which are very interesting to think about are ones where there is no limit to the depth of the environment, and some examples include programming, math, even Minecraft. In Minecraft you can build structures of greater and greater complexity: at first people built little homes in Minecraft, then big castles, and now you can find people who are building entire cities and even computers inside Minecraft. Now, Minecraft has an obvious challenge, which is the question of what we want the agents to do there, so that needs to be addressed, but directionally these would be nice environments to think about more.

This is sort of similar to that last question, but I was wondering what the effect of complicated non-agent objects and non-agent entities in the environment is on how well self-play works. For instance, in the sumo environment, the reason the agents can become very complex and use very complex strategies is that this is necessary in order to compete against the other agent, which is also using very complex strategies. If instead you were working not against another agent but against some very complicated system, say you had to operate a lot of machines in this environment or something like that, how does that affect the effectiveness of self-play?

I think it depends a little bit on the specifics. For sure, if you have a complicated environment, or a complicated program produced somehow, then you will also need to develop a pretty competent agent. I think the thing that's interesting about the self-play approach is that you generate the challenge yourself, so the question of where the challenge comes from is answered for you. (Is there a mic problem? There might be a mic problem. Let's continue.) Any more questions?

Going back a bit to hindsight experience replay: you talked about the example of trying to reach the red spot A and instead reaching some spot B, and using that to train. I was wondering if you could elaborate on that a little more. I'm not very familiar with DDPG, so perhaps that's critical to understanding this, but what I'm wondering is: how do you turn every experience, say "hitting the ball this way translates into this motion," into training data without doing it in a reward-based way?

Basically, you have a policy which is parametrized by a goal state, so in effect you have a family of policies, one for every possible goal. Then you say: I'm going to run the policy that tries to reach state A, and it reaches state B instead, so this is great training data for the policy which reaches state B. That's how you do it, in effect; if you want more details, we can talk about it offline.

I have a very simple question about HER again. If a task is difficult, for example hitting a fastball in baseball, even the best humans can only do it 38% of the time or something like that. So the danger is that if you miss, you can say, oh, I was trying to miss,
and now I take this as a training example of how to miss, which is not right: you were actually doing the optimal action, and your perceptual system just can't track the ball fast enough, so that's the best you can do. It seems like you would run into trouble on tasks like that.

Okay, so let me answer the first question before you ask the second. The method is still not absolutely perfect, but on the question of what happens when you miss while trying to actually succeed: you will have a lot of data on how to reach states other than the one you want. You're trying to reach a certain desired state which is hard to reach; you try to do that and you reach a different state, so you say, okay, I will train the system to reach that state, and next time I'll try something similar. What this means is that for that specific kind of problem the approach will be less beneficial than for tasks that are a bit more continuous, where you can have more of a hill-climbing effect and improve gradually. In the setting of, say, programming, you learn simple programs, you learn to write different subroutines, and you gradually increase your competence, the set of states you know how to reach. So I agree that when there is a very narrow state which is very hard to reach, this cannot help, but whenever there is a kind of continuity to the states, this approach will help.

Okay, so the second question is about self-play. When I saw your title, what I thought you were going to say was this: if you think about AlphaGo, if we tried to train AlphaGo by playing it against the existing world champion, it would never win a single game for the first 50 million games, so we would learn nothing at all. But because we play it against itself, it always has a 50% chance of winning, so you always get a gradient signal no matter how poorly you play.

Yes, that's very important.

So the question is: is there some magic trick there that you can then apply to tasks that are intrinsically difficult, where it's hard to get any reward signal? If you take spider solitaire, for example: if you watch an ordinary human play spider solitaire, they lose the first hundred games and then they give up; they say this is impossible, I hate this game, there are no rewards in it, because you're just not good enough to ever win. So is there a way you can convert spider solitaire into a two-player game and somehow guarantee that you always get a gradient signal?

That's a very good question, and what you said is a very good point. Before I elaborate on your question, I just want to also mention that one of the key things about self-play is that the two sides are evenly matched, and what that means is that you have a potentially indefinite incentive for improvement: even if you are really, really competent, if you have a super-competent agent, the opponent will be just as competent, and so, if done right, the system will be incentivized to keep improving. I think that's an important thing to emphasize, and it's also, by the way, why the exploration problem is much easier, because you explore the strategy space together with your opponent. And it's actually important not to have just one opponent but a whole family of them, for stability; that's basically crucial.
Now, on your second question of what to do when you just can't get the reward: very often, if the problem is hard enough, I think there isn't much you can do without having some kind of deep domain-specific information about the task. But one approach that is popular, and pursued by multiple groups, is to use a kind of asymmetric self-play for exploration. You've got a predictor, which predicts what's going to happen, and you've got a policy, which tries to take actions that surprise the predictor. The predictor basically has opinions about what the consequences of the different actions will be, and the actor tries to find regions of the space which surprise the predictor. So you have this simple kind of self-play; it's not exactly self-play, it's more of a competitive, adversarial scenario, where the agent is incentivized to cover the entire space. It doesn't answer the question of how to solve a hard task like spider solitaire, where you actually need to be super good, but at least you can see how this can give you a general guide for how to move forward.

What do you think is exciting in terms of new architectures, such as adding memory structures to neural nets, like the DNC paper? Where do you see the role of new architectures in actually achieving what we want for generalization and learning?

I think this is a very good question, the question of architectures, and I'd say that it's very rare to find a genuinely good new architecture; true, genuine innovation in architecture space is uncommon. The biggest innovation in architecture space over the past many years has been soft attention; soft attention is legitimately a major advance in architectures. But it's also very hard to innovate in architecture space, because the basic architecture is so good. I think that better generalization, and this is my opinion, not backed by data yet, will not be achieved by means of just improving the architecture, but by means of changing the learning algorithm, and possibly even the paradigm in which we think about our models. I think things like minimum description length and compression will become a lot more popular. These are not obvious questions, but basically I think architecture is important whenever you can actually find good architectures for the hard problems.

How about curriculum learning? To learn to hit a fastball, start with a slow ball.

Yes, for sure, curriculum learning is a very important idea; it's how humans learn, and it's, I guess, a pleasant surprise that our neural networks also benefit from curricula. One nice thing about self-play is that the curriculum is built in, it's intrinsic; what you lose in self-play is the ability to direct it toward a specified point.

I have a question. You showed us the nice videos, the wrestlers and the robots and so forth, and I assume it's similar to deep learning in the sense that there's a framework of linear algebra underlying the whole thing. So is there anything other than the linear algebra?

You just have neural nets. It's even simpler: you just take two agents and you apply reinforcement learning algorithms, and a reinforcement learning algorithm is a neural net with a slightly different way of updating the parameters. So it's all matrix multiplication all the way down; you just want to multiply big matrices as fast as possible.
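Returning to the asymmetric-self-play answer a few questions back, here is a minimal sketch of the surprise signal it describes: the predictor guesses the consequence of an action, and the acting policy is rewarded for proving it wrong, while the predictor is trained to shrink exactly this error. The `predictor.predict` interface is an assumed placeholder.

```python
import numpy as np

def surprise_reward(predictor, state, action, next_state):
    """Intrinsic reward for the acting policy: how wrong the predictor was about the
    consequence of the action. Together, the two sides push the agent to keep visiting
    parts of the state space it cannot yet predict."""
    predicted_next = predictor.predict(state, action)   # assumed predictor interface
    return float(np.mean((np.asarray(predicted_next) - np.asarray(next_state)) ** 2))
```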
Okay, so you mentioned something about transfer learning and the importance of that. What do you think about concept extraction and transferring that? Is that something you think is possible, or that people are doing right now?

I think it really depends on what you mean by concept extraction, exactly. I think it's definitely the case that our transfer learning abilities are still rudimentary, and we don't yet have methods that can extract seriously high-level concepts from one domain and apply them in another domain. There are ideas on how to approach that, but nothing that's really convincing on the tasks that matter, not yet.

We really got through a lot of questions, and the reason is that you gave very short, succinct answers, for which we are very grateful. Thank you very much. [Applause]
Info
Channel: The Artificial Intelligence Channel
Views: 14,758
Keywords: singularity, ai, artificial intelligence, deep learning, machine learning, deepmind, robots, robotics, self-driving cars, driverless cars
Id: AopSlxNYqX8
Length: 58min 34sec (3514 seconds)
Published: Sat Apr 07 2018