Well, welcome back everybody. It's my great pleasure to introduce Ilya Sutskever, who is one of the true luminaries of deep learning. He was there at the very beginning of the current revolution, getting his PhD with Geoff Hinton at Toronto, where he was one of the co-authors of the seminal paper on AlexNet, which is the network that, by winning the ImageNet competition in 2012, demonstrated to everybody what deep learning was really capable of. Since then he has done his own deep learning startup that got acquired by Google, and he worked at Google Brain, where he did the sequence-to-sequence model and contributed to TensorFlow. He is a co-founder of OpenAI, where he is now, and he's going to tell you about some of the recent results there, in particular how they've been able to get AI to play games as well as or better than humans. I've been asked to remind you that this talk is being shared on NVIDIA's YouTube channel and shared publicly, so please, in the Q&A session, don't say anything NVIDIA-confidential. So thanks, we'll turn it over.

Thank you very much for the introduction. All right, so let's start. At OpenAI our goal is to build safe AGI and to make sure that it's beneficial and that its benefits are widely distributed. When you think about AGI, you can identify some components that it should have: it would be good if we could achieve difficult goals in simulation, it would be good if we could take the skills learned in simulation and take them outside, it would be good if we could learn great world models, and it would be essential to address the issues around safety and deployment.

In the technical part of my presentation I'll tell you about three of our recent results that I am quite excited about: OpenAI Five, our Dota bot that can play as strongly as some of the best humans at this game; Dactyl, a robot hand which has achieved a very strong level of dexterity; and our results on unsupervised language understanding.

OpenAI Five: this is our Dota bot. So, the game of Dota — here's a video from it. It's a really complicated game, it is very messy, it combines short-term tactics and long-term strategy. It has the largest professional scene of any esports game, and it has an annual prize pool greater than 40 million dollars, so the game is popular. You can't really see it well on the projector, but this is a photograph from this year's TI, which is The International; this is where we had our bots play against two top pro teams. It's a giant hall with a giant stage, and there are 90,000 people in it.

I want to elaborate a little bit more on why this game is hard. I mentioned you've got tactics, because there are lots of short-term things going on, and strategy, simply because the game is long: a single match lasts an hour. You have partial observability: you don't see the full map, you only see part of it. You have a very large number of heroes with completely complicated interactions between them. You have 20,000 actions per game, and you have a massive action space — it's essentially a continuous action space, because you can select a unit out of a pretty large number of units and tell it where to go. And one other important thing is that the professional players dedicate their lives to this game: they put in tens of thousands of hours of deliberate practice at being as good at the game as possible. So it's not an easy game to play.
The other thing that's very interesting and important about this game is that, unlike previous games which were used for AI, Dota is closer to the real world. Of course it's also not the real world, but it is closer.

So how did we do it? We used large-scale RL. That's it. We used an LSTM policy which is large — calling it large is, I guess, a little bit subjective, but whether you call it large or not, it's definitely large for an RL policy right now. Anyway, we have an LSTM with 4,000 neurons, so it has about a hundred-something million parameters, and in terms of its number of flops it's like the honeybee brain. We used self-play, and we also used reward shaping; a little bit of reward shaping was important.

So what's the key scientific discovery that we made during this work? It's that reinforcement learning actually works. We already knew that supervised learning actually works: with supervised learning we can pretty much solve any problem you want if you have a large training set of input-output examples. It doesn't matter whether it's vision or text or whatever domain on the input or output side — supervised learning can solve it, and if your model doesn't work well you just need to make it larger and get a little bit more data, and then it will work. That's the miracle of supervised learning. And we've shown that the same thing holds for RL: if you have a hard problem — it can be a really hard problem — you can achieve super-high, superhuman performance if you just appropriately scale it up. Long horizons were thought to be a big deal; it turns out, not so much.

I want to point out that nearly all reinforcement learning experts in the world had a pretty pessimistic view towards RL. They were certain that reinforcement learning cannot do long horizons, which was the rationale that justified a lot of work in hierarchical reinforcement learning, and it was just believed that reinforcement learning can't do these things. Pure reinforcement learning had only been applied to very simple environments, like simple games and little simulated humanoid robots — toy problems — so you could say, okay, maybe reinforcement learning can only solve toy problems. And then there's been additional skepticism about reinforcement learning. There's this paper by Henderson et al., which I liked, that showed some issues with reinforcement learning: for example, you see two curves which are each an average over five runs, but it's the same algorithm with the same hyperparameters, just different random seeds. From this you might conclude that clearly this stuff is hopeless and you should forget about it. But our results show that this is not the case: if you scale things up, then suddenly you can solve very hard problems. This is not to say that additional innovation in reinforcement learning is not important — for example, it would be desirable to be able to achieve these difficult goals with much less experience. However, the scientific conclusion from our work is this: if there is a problem which is sufficiently valuable to solve and it's a reinforcement learning problem, it can be solved.

Now I want to talk a little bit about reinforcement learning, just to explain it, because just like the rest of machine learning, reinforcement learning is also very simple. Here is the core idea of reinforcement learning; it's just this slide. Do something, and add a little bit of noise to your actions. If you did better than you expected, then make sure that you do those same actions more often in the future. That's it. This is the core idea of reinforcement learning. It's such a simple idea that it's kind of crazy that it works; I'm still amazed.
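To make that core idea concrete, here is a minimal policy-gradient (REINFORCE-style) sketch in Python. It illustrates the general principle — act stochastically, compare the outcome to a running estimate of the expected reward, and reinforce actions that did better than expected — and is not the actual OpenAI Five training code; the toy problem, network size, and hyperparameters are all placeholder assumptions.

```python
import numpy as np

# Toy problem (an assumption, not the Dota setup): 4 discrete actions with unknown
# expected rewards; the policy is a softmax over per-action logits.
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, -0.2, 0.9])   # hidden reward of each action
logits = np.zeros(4)                           # policy parameters
baseline = 0.0                                 # running estimate of "expected" reward
lr, baseline_rate = 0.1, 0.05

def sample_action(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs), probs

for step in range(2000):
    action, probs = sample_action(logits)              # acting stochastically = "noise"
    reward = true_means[action] + rng.normal(0, 0.5)   # noisy reward from the world
    advantage = reward - baseline                      # did we do better than expected?
    grad = -probs
    grad[action] += 1.0                                 # d log pi(action) / d logits
    logits += lr * advantage * grad                     # reinforce better-than-expected actions
    baseline += baseline_rate * (reward - baseline)

print("learned action preferences:", np.round(logits, 2))  # should favor action 3
```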
Now I want to discuss the core improvement on this idea that made it possible to solve something as hard as the game of Dota, and this is the idea of the actor-critic. With something like Dota you have 20,000 actions per game, so that means you're going to add noise to 20,000 actions and then see whether that left you a little bit better off than you expected or not. That's going to work too, but can we do a little bit better? The key idea of the actor-critic method is that you learn a function that tells you how good a state is — the value function. And the idea is that you add a little bit of noise to your actions, and then, instead of running the game all the way to the end, you consult your value function to see whether things have improved or not. So you're able to reduce the noise, and it turned out to be very important, and it works. This is bootstrapping with your value function: instead of running the game to the end, you just add a little bit of noise, then look at the value function and see whether things improved. It's a bit technical and not really important for understanding the rest of the talk, but I thought you'd find it interesting.

Next, the policy: it's just an LSTM. We started with a thousand neurons, then went to two thousand, and right now we have four thousand — but the LSTM which played at TI had only 1,000 neurons, which is pretty cool. I also want to show you the diagram of the architecture: basically you've got all this complexity, it's all fed into the LSTM, and the actions are extracted out the other side. The reason we do this is simply that your observations are twenty-thousand-dimensional and you need to cleverly embed them, to feed them in a way the LSTM can consume. Figuring this stuff out is important, but fundamentally you just want to do something sensible so that you can consume your observations and produce actions in the right format.

I also want to talk a little bit about self-play, which is interesting: most of the games are played against the current version of the bot, and then twenty percent of the games are played against previous versions of the bot.

Now I want to share some more cool facts. The biggest experiments use more than 100,000 CPU cores and more than 1,000 GPUs. The discount factor of our RL has been 0.997 — and I think we've doubled the horizon since then — so you're talking about a half-life of around ten minutes of game time, which is a pretty good horizon.
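To make the actor-critic bootstrapping described above concrete, here is a minimal sketch of a one-step advantage estimate, contrasted with playing the game to the end. The numbers (discount, rewards, value estimates) are placeholder assumptions for illustration, not the actual OpenAI Five setup.

```python
import numpy as np

gamma = 0.997  # discount factor mentioned in the talk

def monte_carlo_return(rewards):
    """Judge an action by playing the whole game out (high variance, needs the full game)."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def one_step_advantage(reward, value_s, value_s_next):
    """Judge an action after one step by consulting the critic (much lower variance)."""
    return reward + gamma * value_s_next - value_s

# Example: the critic values the current state at 10.0; after acting we get reward 0.2
# and the critic values the next state at 10.5 -> the action looks better than expected.
print(one_step_advantage(0.2, 10.0, 10.5))        # ~0.67, positive
print(monte_carlo_return(np.full(2000, 0.01)))    # the slow alternative: sum a whole game
```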
I want to share some other cool facts about what it's like to work with reinforcement learning. The thing about reinforcement learning is that you just can't tell whether you have bugs or not. It's impossible, because your performance can keep on increasing — you may even have a system that achieves state of the art, or does much better than you expected — and you can still have bugs in your code. You just need to keep re-examining the same lines of code again and again and again, and as you fix the bugs, your performance goes up.

Another cool thing we discovered once we ran larger experiments is that the instability between runs largely goes away: when we run our experiments many times, the curves track each other almost perfectly, and all the bad behavior disappeared. So the high-level conclusion from all this is that if you do things right — you fix all the bugs and you scale up reinforcement learning — you can solve very hard problems, kind of like is already the case with supervised learning. That is a pretty good state of affairs.

One other interesting thing we did was the introduction of the team spirit parameter. In the game you have five players versus five players, so in order to accelerate learning we made it so that each player on our team would at first be selfish and only maximize its own reward, and later on, as training progressed, we increased the team spirit parameter so that everyone received the rewards of everyone else. You can see how, if you are given short-term rewards attributed directly to you, you learn faster, and doing this indeed accelerated our learning quite a bit.

I also want to talk a little bit about the rate of our progress. This is a graph: on the x-axis you see the time span, which runs from May to August, so that's a four-month period, and the y-axis is the estimated MMR — MMR is kind of like an Elo rating, but not exactly. In May we beat the best team of players that happens to work at OpenAI, then in June we beat a team of casters, and then gradually we reduced the restrictions: here it was still the mirror match, here we introduced more heroes, here we added drafting.

Here's another fun fact. Dota is a complicated game with many rules, so in order to make our work easier we added restrictions to the game so that we could make progress more easily before fixing all the bugs, and we gradually removed those restrictions. One of the big restrictions we kept right up until the match was a single courier versus multiple couriers. There is a unit in the game called the courier, and what it does is bring items to your heroes. Before the last big public match we had five invulnerable couriers sending items to our heroes, and as a result the bots could use a more aggressive strategy, and people who watched the game felt it wasn't quite the real thing. So for TI, for the public match in late August, we switched to a single courier. And here's an interesting fact: we only had five days of training with a single courier before the biggest public matches, and despite that it did very sensible things; but probably with a few more weeks of training, and a larger model, it would do a lot better still.

So our remaining task is to decisively beat the best teams many times. But the real conclusion here is that if you want to solve a hard problem with reinforcement learning, it's just going to work, just like supervised learning — it's the same story. It was kind of hard to believe that supervised learning could do all those things, not just vision but everything, and the same thing seems to hold for reinforcement learning, provided you have a lot of experience. You need a lot of experience — that's an issue, and it needs to be fixed — but that's the situation right now.
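A minimal sketch of the team spirit idea described above: each player's training reward is a blend of its own reward and the team's average reward, with the blending weight annealed from selfish to fully shared over the course of training. The exact schedule and numbers here are assumptions for illustration, not the actual OpenAI Five settings.

```python
import numpy as np

def team_spirit_rewards(raw_rewards, tau):
    """Blend each player's reward with the team mean.

    tau = 0.0 -> fully selfish (each player keeps only its own reward);
    tau = 1.0 -> fully shared (each player gets the team-average reward).
    """
    raw_rewards = np.asarray(raw_rewards, dtype=float)
    return (1.0 - tau) * raw_rewards + tau * raw_rewards.mean()

# Example: five heroes with different raw rewards, early vs. later in training.
raw = [1.0, 0.0, 0.2, -0.5, 0.3]
print(team_spirit_rewards(raw, tau=0.0))   # early training: selfish, fast credit assignment
print(team_spirit_rewards(raw, tau=0.8))   # later: mostly shared credit across the team
```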
Okay, so this concludes the first part of the talk. Now I want to switch to another result from OpenAI that I'm really proud of, and that's our robotics result.

One of the issues with training agents in simulation with huge amounts of experience is that you can say: well, that can never possibly do useful things outside of that simulation. Here we addressed that a little bit. The goal of this project was to get this robot hand to reorient this block, and the way we did it was by training it in simulation in a clever way such that it would transfer to the real world. Now, it's important to emphasize that our simulation is imperfect: we don't model friction very well, we don't model a lot of things, and there are many things about the physical hand which we don't know how to measure. The point of this part of the talk is to tell you about a very simple idea that seems to work. One other nice thing about our approach is that we were able to apply it to multiple objects: we were also able to rotate this octagonal prism, not just the block.

The core idea that made it work is called domain randomization. It's not a new idea — people have been working on it for a long time — what we've shown is that it works really, really well. The idea of domain randomization is this: if there is something in your simulation which you can't measure, you randomize it, and you require your policy to be able to solve the task for any value of the randomization. What do I mean by that? Let's say we don't know what the friction should be, because we just don't have a good way of measuring it. Then we say that our policy needs to solve the problem regardless of the value of the friction: we put it in a simulated world where the policy doesn't know what the friction is, and it needs to interact with the world to quickly figure it out and deal with it. So that's domain randomization; it's that simple.

We did it for perception as well. Here are examples of the synthetic images that the camera has seen: you see there's a robot hand in different colors, with different backgrounds and different lighting and all that, and if you can deal with all of that, then you can probably deal with the real world. That's it, that's the idea of the method — domain randomization. It's not a new idea; the interesting thing is that it worked, and especially that it worked with the physics. We randomized some tens of variables, and I want to show you some nice graphics of how it looks.
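A minimal sketch of domain randomization as just described: at the start of each simulated episode, the physics parameters you cannot measure are re-sampled, so a single policy has to work across all of them. The parameter names and ranges below are made-up placeholders, not the actual Dactyl randomization ranges.

```python
import random

# Placeholder randomization ranges -- assumptions for illustration, not Dactyl's real values.
RANDOMIZATION_RANGES = {
    "friction": (0.5, 1.5),            # multiplier on nominal friction
    "object_mass": (0.8, 1.2),         # multiplier on nominal mass
    "actuator_gain": (0.7, 1.3),
    "observation_noise": (0.0, 0.02),
}

def sample_randomized_world(rng):
    """Sample one concrete simulated world; the policy never observes these values
    and must figure them out by interacting with the environment."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

def collect_training_worlds(num_episodes, seed=0):
    """Each training episode gets its own randomized physics."""
    rng = random.Random(seed)
    return [sample_randomized_world(rng) for _ in range(num_episodes)]

if __name__ == "__main__":
    for world in collect_training_worlds(3):
        print(world)   # a single policy is trained to succeed in every one of these worlds
```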
Oh yeah, there was something really cool that we did, and that's the way we trained the perception module. We designed the system so that we have a controller which takes the state coordinates as input, so it doesn't get to see the image. The advantage of training your simulated policy without vision is that you don't need to render images, so you can get a lot more experience and train much better. So how do you include vision? We train a separate neural network which takes images and produces a prediction of the state, and then we require that the policy, which was trained with the true state, also sometimes use the prediction produced by the convolutional perception module. So instead of using the true state it would sometimes use the prediction, and it was able to learn to adapt to this kind of input very easily. The point is that we were able to factorize the training of the control and the perception, and that allowed us to save a lot of compute. Then, at deployment time, you just give it the real images and the real state estimates of the fingertip locations, you feed them to the LSTM, you get the actions, and the whole thing works.

Fixing all the bugs here was challenging as well — things like latency mattered a lot, and so did the speed of the computer that runs the LSTM policy; we were surprised to observe a speed-up in performance when we moved the policy to a slightly faster computer, so the neural net ran faster. But the idea is simple: domain randomization. If your simulation is different from the real world, you just randomize the things you don't know and you require that your policy deal with all those values, and this idea goes surprisingly far. It's not a new idea; it just turns out that it's a good one.

Both the Dota bots and the controller which manipulated the block were trained using Rapid, our reinforcement learning infrastructure, and there is actually a lot of shared code between the Dota bot training and the code which trained the manipulation policy. There are obviously some differences as well, but it turns out that because it's so hard to write good, scalable reinforcement learning code, it's worth reusing it, so that was nice.

Oh yeah, I've got another cool picture which shows the three different cameras that look at the block: you've got these three cameras, they look at the block, and that's how the system estimates its location. And I've got a few more images of the vision architecture, which just takes the three cameras, runs them through a neural net, and outputs the positions, and of the control policy, which is basically an LSTM. It's pretty amazing how simple all these architectures are: if you want to use vision, just use a convnet — it's always going to work.

So this concludes the part about our dexterous manipulation results.
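A minimal sketch of the factorized training just described, in the spirit of how it works rather than the actual Dactyl code: the control policy is trained on the true simulator state (no rendering needed), a separate convnet is trained to predict that state from rendered images, and during training the policy is sometimes fed the convnet's prediction instead of the true state so it learns to tolerate estimation error. All class names, shapes, and the 50/50 swap rate are assumptions.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, IMG_CHANNELS = 24, 20, 3   # placeholder sizes

class ControlPolicy(nn.Module):
    """Trained on the true simulator state (a plain MLP standing in for the LSTM)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, ACTION_DIM))
    def forward(self, state):
        return self.net(state)

class PoseEstimator(nn.Module):
    """Separate convnet trained to predict the state from camera images."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(IMG_CHANNELS, 16, 5, stride=2), nn.ReLU(),
                                  nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, STATE_DIM))
    def forward(self, images):
        return self.conv(images)

policy, estimator = ControlPolicy(), PoseEstimator()

# During training: sometimes swap in the estimator's (noisier) prediction for the
# true state, so the policy learns to cope with imperfect perception.
true_state = torch.randn(8, STATE_DIM)
images = torch.randn(8, IMG_CHANNELS, 64, 64)
use_prediction = torch.rand(8, 1) < 0.5
state_input = torch.where(use_prediction, estimator(images), true_state)
actions = policy(state_input)
print(actions.shape)   # (8, ACTION_DIM)
```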
Now I want to switch to talking about our language understanding result. We've done unsupervised learning, and the fundamental thing about this result is: all you do is train a very good language model, and then you fine-tune it on language understanding tasks, and you get a big improvement — a very big improvement over the state of the art in many cases. That's it. It's the original idea of pre-training and fine-tuning actually working, and the trick was to have a sufficiently good language model. That's quite nice.

I want to give you a sense of the improvements. These are a bunch of different tasks; the left column shows the before and the right column shows the after, and the number on the right is almost always larger, sometimes by a large margin. You may not be able to see it all, but these three rows show the three tasks where the improvement from our model was the largest, and these are tasks that require multi-sentence reasoning and understanding. I'll go over this example just to give you an idea of what is required. The example says: Karen was assigned a roommate her first year of college; her roommate asked her to go to a nearby city for a concert; Karen agreed happily; the show was absolutely exhilarating. Which continuation is more likely: Karen became good friends with her roommate, or Karen hated her roommate? It's that kind of thing, and just training a very good language model and fine-tuning on this task gives a big improvement over the state of the art, and there's every reason to believe that if we train even bigger and better language models, the gap will increase even further.

Now I'll tell you a little bit about the details. The model is the Transformer. I won't elaborate on its details, but I will say that I think it's one of the most important innovations in neural net architectures of the past few years. The dataset is a large corpus of books. The size of the context is 512, so in other words the language model gets to look at roughly the previous 500 words, which is a nice context, and it's been trained on 8 P100s for one month.

I also want to show you a little bit about how the Transformer was used. This is a diagram of the Transformer; there are some details, but you can ignore them — if you're curious, I recommend you look up the paper "Attention Is All You Need". And then here we describe how we simply represent the different problems in a form the Transformer can consume. We do a bunch of sensible things: for example, if you have a multiple-choice question, you feed in the concatenation of the context with each possible answer, you get your representations — one per choice — and then you put a softmax over them, and that's it. Really simple stuff. The point is that if you have a really good language model you can solve language understanding tasks, and if your language model is better, your language understanding will be better as well. So that's nice: it looks like unsupervised learning is starting to show signs of life. It's an encouraging result.
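A minimal sketch of the multiple-choice recipe just described: score the concatenation of the context with each candidate answer using a pretrained language model, then softmax over the per-choice scores. This is a simplified stand-in that uses an off-the-shelf GPT-2 via the Hugging Face transformers library as the "good language model" (an assumption, not the model from the talk), and it scores choices zero-shot by log-likelihood, whereas the original work fine-tuned its own Transformer with a learned classification head on top of each concatenation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def choice_score(context, answer):
    """Approximate total log-likelihood of (context + answer) under the language model."""
    ids = tokenizer(context + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)          # mean cross-entropy over the sequence
    return -out.loss.item() * ids.shape[1]    # higher is better

context = ("Karen was assigned a roommate her first year of college. Her roommate asked "
           "her to go to a nearby city for a concert. Karen agreed happily. "
           "The show was absolutely exhilarating.")
choices = ["Karen became good friends with her roommate.",
           "Karen hated her roommate."]

scores = torch.tensor([choice_score(context, c) for c in choices])
probs = torch.softmax(scores, dim=0)          # softmax over the per-choice scores
for choice, p in zip(choices, probs):
    print(f"{p.item():.2f}  {choice}")
```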
Next I want to switch to the last part of the presentation, which is to look at the trend we have right now and try to understand whether the current AI boom can reach all the way to AGI or not, and what the probability of that is. The goal of this part of the talk is really to make the case that it's hard to lower-bound the place we will be in, let's say, five to ten years — it's very hard to lower-bound it — and that the probability of getting to AGI can no longer be discounted.

Here I want to talk about big technological revolutions that have already happened. There is a book called "Profiles of the Future" by Arthur C. Clarke, which is a really good book because it analyzes many of these technological revolutions, and it has lots of cool facts. One of the things it concludes is that with every big technological revolution — such as the airplane, spaceflight, and nuclear power — you had very vocal and very eminent detractors, people who felt that it was definitely impossible. For example, with the airplane, various people said that it could not be done, and when it was done, the same people said: well, sure, you can do it for one person, but it will never be economically viable. With spaceflight, an interesting thing that happened is a mistake which Arthur C. Clarke calls a "failure of nerve", where the U.S. analyzed the question of sending objects into space and concluded it was impossible because you would need to build a 200-ton rocket — so the Russians went ahead and built the 200-ton rocket. And in fact the Astronomer Royal of the UK said that space travel is "utter bilge" one year before Sputnik went into space. So that's pretty interesting.

Next I want to talk about the history of AI. When we looked at the history of AI, we discovered that our own understanding of it was not accurate. The old understanding is that the field went through a sequence of waves of excitement and pessimism about different technologies: it was excited about perceptrons, then symbolic systems, then expert systems, then backpropagation and support vector machines; now we're excited about neural networks again, and in the future we'll be excited about something else. But the reality is a little different, in the following way.

When Rosenblatt presented the perceptron, he was really excited about it, and he made some statements — that was in 1959, and it's very interesting what these statements were. Specifically, he said it was the embryo of an electronic computer that will be able to walk, talk, see, write, reproduce itself and be conscious of its existence, and that later perceptrons would be able to recognize people and call out their names and instantly translate speech in one language to speech and writing in another language. That was predicted in 1959. So Rosenblatt became really popular with the popular press, and he got all the funding. Then Minsky and Papert got really upset. They felt that this direction was unpromising and they wanted to stop progress in this field; they admitted there was hostility in their book when they wrote "Perceptrons". They felt that the claims Rosenblatt was making were misleading and were taking the funding away, and Minsky directly admits that he was concerned that other areas of AI were not getting funding, so they wanted to make the case in their book that progress in neural networks is impossible.

Then in the 80s computers became cheaper, and the cheaper computers increased interest in artificial intelligence and in neural networks, and in this context the backpropagation algorithm was invented. There is a funny quote from Minsky and Papert about the backpropagation algorithm: "We have the impression that many people in the connectionist community do not understand that backpropagation is merely a particular way to compute the gradient, and have assumed that backpropagation is a new learning scheme that somehow gets around the basic limitations of hill climbing."

So where does this lead us? The alternative interpretation is that neural net research, and the wave of neural nets we see right now, is not a five-year wave — it's a sixty-year wave that started with the perceptron, and as computers got better, the results became more impressive. In the early 90s we already had TD-Gammon, a self-play reinforcement learning system that was able to beat the best humans at backgammon. One interesting fact about TD-Gammon, by the way, is that the total compute required to produce TD-Gammon is equivalent to about five seconds on a Volta.
So now we have the alternative interpretation of the history of AI: neural nets have been the one persistent thread in the history of the field, growing and getting better as computers have improved.

Now I want to survey a sequence of results over the past five years and see how they changed our beliefs about what's possible and what's not. With the original AlexNet result — before that result it wasn't really believed that neural nets could do anything, certainly not vision, and it would have seemed totally crazy that neural nets could solve hard problems. By the way, one cool thing is this image, which I got from Antonio Torralba, showing the performance of vision systems before neural networks. Do you see this little red rectangle? The system thinks it's a car; here it is zoomed in, and here's how it looks once you apply the HOG feature transform. So it didn't work, and it wasn't going to work — and then it turned out that a large convolutional neural network trained with supervised learning can do pretty well at vision.

Then with DQN: okay, fine, maybe you can do vision, but it turned out that you can take neural nets and turn them into agents that learn to achieve goals, and what that did was give researchers the idea that it's a sensible research direction to give neural networks goals — to use neural networks to build agents that achieve goals.

After vision came neural machine translation, and the belief was: sure, you can do perception, but you can't do things like translation — come on, that requires tens of thousands of lines of complicated code and various state-machine algorithms and graph algorithms. But it turns out that if you just use a big neural net correctly, you can just do it.

Then AlphaGo arrived, and before AlphaGo the kind of belief people held about reinforcement learning was that it's not actually good for anything, it only solves tiny toy problems. With AlphaGo it turned out that reinforcement learning, in the form of Monte Carlo tree search, can solve a truly difficult task.

Then OpenAI Five: well, fine, sure, you can solve something like computer Go, because you have a small action space and the game is discrete — it's nothing like the real world; surely you can't solve a game like Dota or StarCraft, which is continuous and messy and more similar to the real world. But it turns out that if you just scale up reinforcement learning, you can do it, no problem.

Okay, so maybe we can do things in simulation, but you definitely can't take things outside of the simulation, because you need so much experience inside the simulation — how can you possibly use these algorithms outside? But it turns out that if you change your simulation a little bit, you can in fact transfer skills from inside the simulation to the outside, as we've shown in our work on the Dactyl robot.

Okay, so then you can say: well, fine, maybe you can achieve goals whenever you have a cost function that clearly describes what you want — in supervised learning you minimize training error, in reinforcement learning you maximize reward — but you can't possibly do unsupervised learning, that would be too much. But it turns out that you can do unsupervised learning as well, if you simply train a very large neural network to predict the next bit of your signal. So far we've shown it for language; it still needs to be shown for other domains.
Finally, I want to talk about the underlying trend that was powering it all, and that's the compute trend. It is pretty remarkable that the increase in the amount of compute from the original AlexNet result to AlphaGo Zero is 300,000x, and you're talking about roughly a five-year gap. Those are big increases — about a three-and-a-half-month doubling time. I want to show you a visualization of the scale. This chart shows all the different results, and we're basically zooming out of the scale. We've included some of the early results from the 80s, so it takes a while to get to the point where Dropout and AlexNet are even visible, but it keeps going; then you have Seq2Seq and VGG becoming small, but it keeps going, and finally we get to a point where even AlphaGo Zero becomes visible. That gives you a sense of the increase in compute that occurred over the past five years.

A lot of this is being powered by data-center computing: there are limits to the amount of compute you can put on a single chip, but you can bring many chips together, and that's going to be more important moving forward. I think one thing that will probably happen is that, just like with the very large rockets — the rocket the Russians built in order to go to space — it will be important to build very large clusters in order to get to the truly large amounts of compute, and that's probably going to happen.

So, to conclude: the point of this part of the talk was to show that, while highly uncertain, it's not possible to determine a lower bound on progress in the near term, and maybe the current wave of progress actually reaches all the way to AGI. What that means is that it's worth proactively thinking about the risks — addressing questions like machines pursuing misspecified goals, deployed systems being subverted by humans, and just generally very rapid change and an out-of-control economy. These are good questions to think about. And that's all I have to say — thank you very much.
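As a quick, hedged sanity check on the compute figures quoted above (taking the stated 300,000x increase and the roughly-five-year gap as given), here is the arithmetic relating total growth to doubling time; the exact endpoints and span are those of OpenAI's public "AI and Compute" analysis, and shifting the assumed span by a few months moves the implied doubling time by a few tenths of a month.

```python
import math

growth = 300_000          # stated AlexNet -> AlphaGo Zero compute increase
span_months = 5 * 12      # stated "about a five-year gap"

doublings = math.log2(growth)              # ~18.2 doublings
doubling_time = span_months / doublings    # implied doubling time in months

print(f"{doublings:.1f} doublings -> {doubling_time:.1f}-month doubling time")
# ~3.3 months over exactly five years; a ~5.3-year span gives the quoted ~3.5 months.
```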
Thank you. We've got time for some questions and answers now. There are microphones at both sides of the room so that people on YouTube and at remote sites can hear; please go to a microphone if you have a question.

Yeah — the precise statement is that supervised learning can solve any problem that a human can solve in a fairly small number of seconds.

Hi, I'd like to ask about your thoughts on safe reinforcement learning, and also about dealing with huge imbalances in datasets when you have high-importance examples. What directions do you think are interesting?

So you asked about safe reinforcement learning and data imbalance. Let me answer the easier question first: data imbalance. There are lots of standard tools and approaches — for example, you could train a small model that tries to recognize the important examples and then feed them to the large model; things like this have already been done. For safe reinforcement learning, the kind of work we do includes, for example, learning reward functions and preferences from human feedback; that's one area we've pursued. Other good areas include safe exploration, where you try to limit the change to the environment as you explore — that would be another example.

Very nice talk, thank you. You mentioned some of the criticisms of deep learning over the years, and sample complexity, I guess, is one big issue — a critic today might say it's horrendously sample-inefficient. Is that even an issue, and what are some things that might address it?

I think sample complexity is an important issue which has to be addressed; there's no question about it. Some of the more promising ideas right now look like transfer, and training your system on other tasks. For example, with the language results I presented, you train a big neural net to predict the next word on a very large text corpus, and by doing that you greatly reduce the sample complexity of the model for other language tasks. That's an example of how you might go about it.

An argument a critic could make is that the problems where you've shown the best results so far are problems with a high signal-to-noise ratio. Do you have any thoughts on areas with a worse signal-to-noise ratio — medicine, for example?

In order to move to environments like that, several things need to happen. We need to get really good at unsupervised learning, and we will need to get really good at inventing or discovering reward functions for ourselves which we can then optimize. In other words, once the agent can choose a sensible-looking reward function for itself and then optimize it, it can both gain skill and gain new data for its unsupervised understanding.

Thanks for the talk. One thing you mentioned was that in vision people seem to have really converged on deep convnets as the one architecture that can solve basically all the problems you run into. We haven't really seen that with sequence models: you use LSTMs in some places, Transformers in other places, and there are also the sequence convolution models. Do you think there's going to be a similar convergence for sequence models, or will we continue to have a zoo of different things, where what works best depends on the application?

It's hard to predict, but I think it's very possible that there will be several alternative architectures for sequences. To be fair, even for images you have new candidate architectures like the Image Transformer, which may potentially become more dominant than the conventional convolution. So yes, there is a chance you'll have, some would say, two or three alternatives — but on the other hand, it's only three alternatives, it's not so many.

Thanks. In the case of deep Q-learning, I remember there was a result from several years ago that it couldn't solve the roulette problem, because if the system has no understanding that the roulette wheel has to be balanced, then just from the samples you're always going to think that some pocket has been lucky for a period of time. So I'm just curious, in general: do you think that that's
not really an issue anymore — that with enough samples you can learn the rules of the universe — or do you still have to code some of those things in, for cases where the rewards are almost designed to have high enough variance that it's difficult to learn just by averaging the outcomes?

Yeah, I can speak to your broader question — I didn't quite understand what you meant by the roulette problem.

Okay, I can explain it very quickly. This was the example from the double Q-learning paper, and the point made in that paper is that with regular Q-learning, the outliers are such that if you don't know the property that every pocket of the roulette wheel behaves the same, at random — if you treat the rolls of the roulette wheel as independent variables — then no matter how long you run, you're never going to converge to the answer that all the numbers are negative in expectation; over any long enough run, some pocket looks lucky. Sorry, I'm explaining this poorly, but the broader question is about sparse and high-variance rewards. You can solve that particular problem very easily by specifying that all the pockets have the same underlying probability, but without coding that in, if you just look at them independently, you can have an almost infinite number of samples and never really learn that all the numbers are negative.

Yeah, so in the long run you definitely want to be in a place where you don't hard-code things, because the set of problems you want to solve is so vast that I don't see how humans would be able to hard-code useful things. We have been able to hard-code something useful like the convolution or the recurrent net — that's pretty useful, and it's also very general. If you do want to hard-code very general assumptions, you want your models to make use of all available information. The way you will probably deal with situations where you don't know what's going on is by benefiting from other sources of information. That's what people do when we approach a new problem: we don't start from scratch, we have a whole life of experience, and when things are confusing we go to Google or talk to someone else, and that, at a high level, is probably how we will deal with totally new domains. But I think it is definitely desirable not to hard-code things, because it makes life easier, and it just seems unlikely to me that we will be smart enough to hard-code things for the truly difficult problems. So I'm generally pretty bearish on that approach.

Yeah, I completely agree — it's just a funny example with games, where these things really are independent but it's hard for the algorithm to learn that; in real life you actually don't know.

Yeah — I would need to look at this example in detail to form a definite opinion.
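To illustrate the questioner's point (this is my gloss on the roulette example, not something from the talk): if you estimate each bet's value independently from samples and then take the maximum estimate, the maximum is biased upward, so some "lucky" pocket almost always looks profitable even though every bet has the same negative expected value — the overestimation that double Q-learning was designed to address. A tiny simulation, with made-up payoffs:

```python
import numpy as np

# Roulette-like bandit: 38 bets, each with the same slightly negative expected value.
rng = np.random.default_rng(0)
n_bets, n_samples = 38, 500

# Win/lose payoffs: win 35, lose 1, P(win) = 1/38  ->  mean = -2/38 ~ -0.053 per bet.
true_mean = -2 / 38
samples = rng.choice([35.0, -1.0], size=(n_bets, n_samples), p=[1/38, 37/38])
estimates = samples.mean(axis=1)        # independent per-bet value estimates

print("true value of every bet:  ", round(true_mean, 3))
print("max of estimated values:  ", round(estimates.max(), 3))   # usually > 0: looks "lucky"
```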
Hello, thank you for the talk. What's the next hardest game, in your opinion, that reinforcement learning can learn?

There are definitely things that reinforcement learning can learn. One of the downsides of the way we've learned Dota is that it needed millennia of experience. So while you can learn very hard problems if you are willing to gather sufficient experience, the question of how you do it with less experience is, I think, a better description of the challenge that's coming next. In terms of solving hard games: if you don't restrict the amount of experience you get, I don't think there are games that can't be solved.

So RL has been used in NLP, but not with that much success — for example in abstractive summarization and things like that. What is your general view about that, and what, according to you, would be a good task in NLP where RL could be used?

Sorry — you're asking why reinforcement learning has not been used much in natural language, and what would be a good NLP task for it? I think that makes sense. RL requires that you be able to figure out the reward function, and in NLP you don't really have an environment that gives you one. So I think things like assistants — dialogue systems — could benefit from RL. For example, have you seen Google Duplex? That's the kind of thing where you could have ten thousand people talking to your system, and if the system makes a mistake or doesn't do what it's asked to do, they press a button to give it a negative reward. That would be an example.

Okay, so you're positive about using it?

For sure, for sure. I just think it will look a little different from the current applications. In particular, NLP is mostly dataset-driven, whereas for RL you need to move away from datasets and toward environments. So you either have agents talking to each other — but then you want them to speak real language — or you have agents talking to humans, but then it's just logistically difficult, and there aren't that many research labs that could do it.

Okay, thank you.

Okay, if there are no more questions, let's thank Ilya again for his talk, and I'd like to give him a little something to thank him for taking his valuable time and sharing his thoughts with us. — Oh, thank you very much. — You're welcome.
One interesting note: Ilya mentions one of the methodology papers showing great instability between runs; but apparently if you use enough parallelism, it goes away (11:50) and the training curve becomes very stable. Does this mean that the instability is simply due to noisy gradients/exploration when updates are done based on just a few rollouts/small minibatches?
And for a historical comparison, he notes that TD-Gammon's total computation requirement is equivalent to 5 seconds on a contemporary Volta GPU. Which leads into his snark about the endlessly moving goalposts of DL/DRL critics: https://youtu.be/w3ues-NayAs?t=36m10s (if you can do it, it isn't AI anymore...)
At the end [49:43m] someone asks, [now that we have superhuman performance on all Atari games,] "what's the next hard game to solve?".
"I don't restrict the amount of experience you get, I don't think there are games that are worth solving".
He thinks that given enough time and compute we can solve most games. The next benchmark should be sample efficiency - beating a game given only a certain amount of time to play. IMO it would be great to see RL catch up to humans, even given pretraining in place of the prior knowledge humans bring.