DeepMind AlphaZero - Mastering Games Without Human Knowledge

Video Statistics and Information


2017 NIPS Keynote by DeepMind's David Silver. Dr. David Silver leads the reinforcement learning research group at DeepMind and is lead researcher on AlphaGo. He graduated from Cambridge University in 1997 with the Addison-Wesley award.

Recorded: December 6th, 2017

Captions
With that, let's welcome David.

Thank you, Satinder, it's really a great pleasure to be here. Today I'm going to talk about AlphaZero, which is DeepMind's latest deep reinforcement learning architecture. It has now achieved superhuman level in many games without human knowledge beyond the game rules, and this is joint work with many great collaborators at DeepMind, listed at the bottom of the slide.

I'm going to start with the game of Go, which is the oldest and most deeply studied game in human history. It's around 3,000 years old and is currently played by around 40 million active players around the world, so in some sense it represents the pinnacle of human knowledge in any game: it has been studied by professional players who have built an enormous pyramid of knowledge over many, many years. That makes it a perfect testbed for understanding how far our machine learning and artificial intelligence approaches have come relative to the best of human knowledge. Go has also been viewed as a grand challenge for artificial intelligence, largely because of the enormous size of the state space: the oft-cited figure of around 10^170 states makes the game very challenging for traditional search-based methods.

Let me start by visualizing what that search space looks like. In Go the branching factor is around 200, so from each position there are roughly 200 possible actions, each leading to a new position where again around 200 decisions can be made, and so on, and this lasts for several hundred moves. So there are something like 200^200 possible sequences in the game tree, which renders Go completely intractable to traditional search methods and meant that something new was required to make progress.

I'm going to begin with the original version of AlphaGo, which became the first computer program to defeat a human professional player, and then the first to defeat a human world champion. At the heart of AlphaGo are two convolutional neural networks that represent knowledge about the game of Go.

The first is the policy network, denoted by a green P in this talk, which represents AlphaGo's recommendations of moves to play. In each position it is fed a representation of the board as different planes, such as where the black stones are, where the white stones are and where the empty intersections are. This is processed through many convolutional layers into increasingly abstract features representing knowledge about the game, and finally brought together into a probability distribution over moves. In the example on the slide, the tall bar indicates one particular move that is considered very good and is recommended with high probability, whereas many other moves, such as playing in the corner, have near-zero probability. This is how AlphaGo recommends which move to play.

The second neural network is called the value network, and it represents AlphaGo's positional evaluation knowledge: essentially, it predicts the winner of the game. Again it takes a representation of the position as input and processes it through a convolutional network, but this time it aggregates all of that knowledge into a single scalar value between -1 and +1, where +1 means a prediction of a certain win for AlphaGo and -1 means a prediction of a certain loss.
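As a rough illustration (not from the talk itself), here is a minimal Python/PyTorch sketch of the two kinds of network just described. The layer counts, channel widths and the three input planes are placeholder choices, not the published AlphaGo architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    BOARD = 19  # Go board size

    def conv_trunk(in_planes, channels, blocks):
        """A small stack of convolutional layers; the real networks are much deeper."""
        layers = [nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU()]
        for _ in range(blocks):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        return nn.Sequential(*layers)

    class PolicyNet(nn.Module):
        """Maps board planes (e.g. black stones, white stones, empty points) to a
        probability distribution over the 361 intersections plus a pass move."""
        def __init__(self, in_planes=3, channels=64, blocks=4):
            super().__init__()
            self.trunk = conv_trunk(in_planes, channels, blocks)
            self.head = nn.Linear(channels * BOARD * BOARD, BOARD * BOARD + 1)

        def forward(self, planes):                       # planes: [batch, 3, 19, 19]
            x = self.trunk(planes).flatten(1)
            return F.softmax(self.head(x), dim=-1)       # move probabilities

    class ValueNet(nn.Module):
        """Same kind of trunk, but everything is aggregated into a single scalar in
        [-1, +1]: the predicted winner from this position."""
        def __init__(self, in_planes=3, channels=64, blocks=4):
            super().__init__()
            self.trunk = conv_trunk(in_planes, channels, blocks)
            self.head = nn.Linear(channels * BOARD * BOARD, 1)

        def forward(self, planes):
            x = self.trunk(planes).flatten(1)
            return torch.tanh(self.head(x)).squeeze(-1)  # +1 = certain win, -1 = certain loss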
These neural networks are trained in AlphaGo, and I'll give a slightly simplified picture of the training pipeline, which is based on a pipeline of classical machine learning approaches. At the beginning of the pipeline we start with a human dataset: a dataset of positions, where each position is labelled with the move chosen by a strong human player. We first train our policy network by supervised learning to classify that human move, that is, to predict in each position the same move the human played. Once we have this policy network P, we can play games against itself, with the policy network picking the move for both players, black and white, all the way to the end of the game, at which point we have a winner. We then train our value network, the pink V on the slide, by reinforcement learning to predict the winner of the game from each of those positions: from every position, the value network should correctly predict which of the two players, black or white, will win when the policy network plays against itself. That was the original AlphaGo training pipeline.

The main idea of AlphaGo is to use the policy network and value network to make search tractable. Again, here is a cartoon of the enormous search space we have in Go: in the cartoon only two moves are possible from each state, but in reality there would be several hundred, and the tree would be vast. So how can we make this search tree more tractable using our neural networks?

The first idea is to use the policy network to reduce the breadth of the search tree. Recall that the policy network suggests moves in each position, so we can reduce the breadth of the search by only considering moves recommended by the policy network. Instead of searching 200 or so moves from each state, we might consider just a handful, and only expand that handful in the search tree, dramatically narrowing the set of possible sequences we need to consider.

The second idea is to use the value network to reduce the depth of the search. The value network predicts the winner of the game from any position, so we can use it to replace any subtree of our search tree with a single number. Instead of searching all the way to the end of the game, we can truncate the sequence at a leaf node and replace the whole subtree that we would otherwise have had to search systematically to the end of the game with a single evaluation of our neural network. This dramatically reduces the size of the search space we need to consider.
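To make the two-stage training pipeline described above concrete, here is a hedged sketch (my own illustration, reusing the networks sketched earlier) of the two training signals: supervised classification of the expert move, and regression of the self-play outcome.

    import torch
    import torch.nn.functional as F

    def supervised_policy_loss(policy_net, planes, human_moves):
        """Stage 1: classify the expert move, i.e. cross-entropy between the
        network's move distribution and the move a strong human actually played."""
        probs = policy_net(planes)                       # [batch, 362]
        return F.nll_loss(torch.log(probs + 1e-8), human_moves)

    def value_regression_loss(value_net, planes, winners):
        """Stage 2: regress the self-play outcome. `winners` is +1 where the
        player to move went on to win the policy-vs-policy game, -1 otherwise."""
        return F.mse_loss(value_net(planes), winners)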
In practice we use a more sophisticated search algorithm than that cartoon suggests, based on Monte Carlo tree search, and I'll just briefly illustrate how it works; if you'd like more detail you can read the papers. The algorithm proceeds in three stages. In the first stage, it traverses the tree from root to leaf, and at each node it picks a child based on a kind of upper-confidence rule. This rule selects actions that have been evaluated highly in past simulations, which is the Q value at each edge, and it also prefers actions that the policy network likes: the U term is a bonus that incorporates what the policy network thinks of the move. In the second stage, once the simulation has reached a leaf node, we expand the leaf and evaluate the new node with both the policy and the value networks. Finally, in the third stage, we back up the evaluation: whatever the value network said at the leaf, we back it up through the search tree so that every edge maintains the mean evaluation seen from that point onwards in the search; that is what each Q value stores. This lets Monte Carlo tree search expand a very large search tree effectively, because only the most important parts of the tree get expanded: the search systematically gets deeper and deeper, but in a very selective fashion.
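A minimal sketch of the selection and back-up steps just described, assuming a hypothetical tree-node object with prior, visit_count, value_sum and children fields (none of which come from the talk); the constant c_puct and the exact form of the bonus are illustrative.

    import math

    def select_child(node, c_puct=1.5):
        """Tree-descent step: pick the child maximizing Q + U, where Q is the mean
        evaluation seen below that edge and U is a bonus proportional to the policy
        network's prior that decays as the edge gets visited more often."""
        total_visits = sum(c.visit_count for c in node.children.values())
        best_move, best_score = None, -float("inf")
        for move, child in node.children.items():
            q = child.value_sum / child.visit_count if child.visit_count else 0.0
            u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
            if q + u > best_score:
                best_move, best_score = move, q + u
        return best_move

    def backup(path, leaf_value):
        """Back-up step: propagate the leaf evaluation along the visited path,
        keeping running sums so each Q stays the mean of everything seen below it."""
        value = leaf_value
        for node in reversed(path):
            node.visit_count += 1
            node.value_sum += value
            value = -value   # alternate perspective between the two players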
We applied AlphaGo in a match against the winner of 18 world titles, Lee Sedol, who is widely considered the greatest player of recent decades. The match was played in 2016, and AlphaGo won by four games to one. But the one game we lost highlighted some of AlphaGo's deficiencies. In particular, we found that in some kinds of positions AlphaGo could systematically misevaluate or misunderstand what was going on; we used to call these delusions. It would enter a type of position it failed to understand, and that misunderstanding could persist across many positions throughout the game, for 10, 20 or 30 moves. We really wanted to understand this, and it led us to push AlphaGo towards something that could play much more strongly without exhibiting these weaknesses.

That led to a subsequent version which we called AlphaGo Master. For AlphaGo Master we trained new policy and value networks using much deeper, state-of-the-art residual networks, many more iterations of reinforcement learning, and more precise kinds of evaluation. Earlier this year AlphaGo Master played the world's number-one-ranked player, Ke Jie, and won three games to zero, and perhaps even more encouragingly it played a series of 60 online games against the top-ranked human players in the world and won 60 games to zero. I think this shows that principled reinforcement learning combined with principled deep learning can address these kinds of systematic delusions: just by sticking to our principles and doing things further and better, we could address things that people at the time believed would require new kinds of research. Really, all the principles were already there; we just needed to execute them very well.

I'm now going to move on to more recent work, which we call AlphaGo Zero. The goal of AlphaGo Zero is to take every form of human knowledge that went into the previous versions of AlphaGo and remove it from the training process, except for the rules themselves. The idea is to learn to play Go tabula rasa, starting from first principles, discovering knowledge about the game entirely for itself, without any human telling it what to do and without any human data to guide it.

The differences between AlphaGo Zero and the previous AlphaGo are listed on this slide. The first and most important is that it uses no human data whatsoever: it learns solely by self-play, playing games against itself and training by reinforcement learning, starting from completely random games. In other words, we initialize the neural network with random weights and it plays games against itself from there; it knows nothing about the game, and everything is discovered from that point onwards. The second difference is that AlphaGo Zero uses no handcrafted features at all; there is nothing beyond the raw board. The only input the neural network receives is a representation of the raw board: a plane saying where the black stones are and a plane saying where the white stones are, and that's it.

There are some additional differences. One of the most notable is that we unified the policy network and the value network into a single neural network, again based on a state-of-the-art residual network, and this helped to regularize the network and make it much more robust against the kind of overfitting you can see if you train a value network by itself. In addition, AlphaGo Zero uses a simpler search. Previous Monte Carlo tree searches, and to an extent the previous version of AlphaGo, used randomized Monte Carlo rollouts: when a simulation reached a leaf node, a random game was played all the way to the end, and the node was evaluated by how well that random rollout performed, so if random games from a position tended to win more often, you would start to believe it was an effective position. In AlphaGo Zero we removed those rollouts altogether; we really wanted the simplest possible approach with the minimal level of knowledge, so we use only the neural network to evaluate positions.
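A minimal sketch (illustrative sizes, not the published residual architecture) of what a unified policy-and-value network with a shared trunk and two heads might look like:

    import torch
    import torch.nn as nn

    class PolicyValueNet(nn.Module):
        """One network, two heads: a shared convolutional trunk feeding a policy
        head (move logits) and a value head (scalar in [-1, +1])."""
        def __init__(self, in_planes=2, channels=64, board=19):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            )
            self.policy_head = nn.Linear(channels * board * board, board * board + 1)
            self.value_head = nn.Linear(channels * board * board, 1)

        def forward(self, planes):                        # planes: [batch, 2, 19, 19]
            x = self.trunk(planes).flatten(1)
            return self.policy_head(x), torch.tanh(self.value_head(x)).squeeze(-1)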
Perhaps one of the principles we're after in AlphaGo Zero is the idea that by doing less, by removing complexity from the algorithm, we become more general. We don't just want something that applies to the game of Go, which after all is a single domain; we want to get at the heart of what it means to learn from first principles, potentially in any domain. Of course, Go is only a first step, but it's a challenging first step, and if we can understand how to learn from first principles, to discover knowledge and reach the highest levels of human capability there, then maybe it has more capacity to transfer to other domains. So that's the goal: less complexity, more generality.

Now let me describe the algorithm actually used in AlphaGo Zero. The main idea is that AlphaGo becomes its own teacher. We want the highest-quality data to train our neural networks on: really high-quality move selections and really high-quality outcomes, and the best data we can get our hands on comes from playing AlphaGo against itself. So the idea is the following. From each position, we execute a Monte Carlo tree search, guided by the current neural network, the combined policy and value network P and V. At the end of that search a move is suggested, and we play it; then we run another search, play another move, and so on until the game is complete, at which point it is scored and we determine the winner.

The next stage is to train a new neural network. The policy part of the network is updated to predict the move that was actually played by AlphaGo itself in each position: from each position reached in the game, we want the new policy P', the raw neural network, to directly predict the action that was chosen by the entire MCTS. In other words, we want to compress everything that was achieved using lookahead into the new neural network, so that it can distil all of that knowledge into its direct behaviour. At the same time, the value part of the network is trained to predict the winner of this self-play game: from each position reached during the game, we would like to predict who went on to win. So we simply train the network to predict both the winner and the move that was played; we have a joint training rule with just those two terms plus a simple regularization, and that's it, that's the whole of AlphaGo Zero.

Finally, we iterate the procedure. We now have a new policy and value network, which can be used in the next iteration of AlphaGo Zero. The key idea is that each time we iterate this process we end up with an even stronger player, and because we have a stronger player it generates even higher-quality data. Each time we improve the quality of the search, the outcomes become more indicative of optimal play, and we can train the next generation of neural networks on those higher-quality searches, move selections and outcomes. This process can be repeated almost indefinitely and tends to lead to stronger and stronger play.
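The joint training rule just described (predict the winner, predict the searched move, plus regularization) could be written roughly as the following loss sketch, assuming a combined network like the one sketched earlier that returns policy logits and a scalar value; the constant c and the exact reductions are illustrative choices, not values given in the talk.

    import torch
    import torch.nn.functional as F

    def alphago_zero_style_loss(net, planes, search_pi, z, c=1e-4):
        """Value head regresses the game outcome z; policy head matches the move
        distribution produced by the MCTS (search_pi); plus L2 regularization."""
        logits, v = net(planes)
        value_loss = F.mse_loss(v, z)
        policy_loss = -(search_pi * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
        l2 = sum((w ** 2).sum() for w in net.parameters())
        return value_loss + policy_loss + c * l2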
If we relate this to what has been done before in reinforcement learning, it corresponds to a very well-known idea: the classic policy iteration algorithm. The twist here is that we incorporate AlphaGo's search inside the policy iteration, so we call it search-based policy iteration. In policy iteration we interleave two phases: policy improvement, where the action selection is somehow made stronger, and policy evaluation, where the quality of the selected actions is evaluated. The whole idea is that if you alternate between these two stages you end up with a stronger and stronger policy that ideally ends up approximating the optimal policy.

What we do in AlphaGo Zero is use search-based policy improvement: instead of, say, greedy policy improvement as traditionally done in RL, we use the whole of our search as the policy improvement. We start with the raw neural network, which picks an action, and we massively improve that policy in one shot by running a search; the action picked by the search is typically much, much stronger than the action selected by the raw network, because it incorporates all of that lookahead. We also incorporate search into the policy evaluation, so we use search-based policy evaluation: we play our self-play games with AlphaGo actually running its search at every step. That means we are evaluating the improved policy, not the raw neural network, and it gives us very high-quality outcomes, which in turn provide very precise training signals for the neural networks at every step.

This relates to prior work, and also to recent work elsewhere in RL, starting with Michail Lagoudakis in 2003, who had a paper on classification-based reinforcement learning. But in that line of work the focus has, in some sense, been on search-based policy improvement; the idea of search-based policy evaluation wasn't there, and it turned out to be absolutely critical in AlphaGo Zero. If you had to pick out the one step that led to the huge improvement in performance, it is that second one. The overall algorithm is very effective and seems to be rather robust and reliable, as we'll see.
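As a sketch of the outer loop only, where self_play_game and train are hypothetical placeholders standing in for the search and training pieces sketched above, not functions from any published code:

    def search_based_policy_iteration(net, self_play_game, train, iterations=100, games=1000):
        """Self-play with MCTS provides both the policy improvement (the searched
        move) and the policy evaluation (the outcome of games played *with* search),
        and the network is then retrained on that data each iteration."""
        for _ in range(iterations):
            replay = []
            for _ in range(games):
                # each record is (board planes, MCTS move distribution, game outcome)
                replay.extend(self_play_game(net))
            net = train(net, replay)   # distil the search back into the raw network
        return net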
If we look at a learning curve, this one starts from zero knowledge and runs for 40 days; time is on the x-axis. That corresponds, if you want to know, to around 30 million games of self-play, but for us as developers time is the thing that matters, because it determines how many iterations of the algorithm we can run, and as researchers that's what you ultimately care about. On the y-axis we have the Elo rating; Elo ratings are the most common form of evaluation in games like chess and Go. What we see is that it started off playing completely randomly, with completely randomized initial weights, but after three days it had already surpassed the version of AlphaGo we call AlphaGo Lee, the one that defeated Lee Sedol in 2016; that happened in just 72 hours. After 21 days, AlphaGo Zero surpassed AlphaGo Master, the previous best version, which had defeated human players 60 games to nil. And after 40 days we ended up with our strongest version of AlphaGo Zero, although it was still continuing to improve, so potentially it could have been left to climb even further.

If we try to quantify what these improvements really mean, we started out with the previous state of the art on this Elo scale, the previous Go programs. One way to understand the differences, which may be more intuitive than Elo ratings, is the number of handicap stones between programs: roughly, how many extra moves one program could give another and still win. Between the original version of AlphaGo that we published in our first Nature article, the version that defeated the European champion five games to nil, and the previous state of the art, there was around a four-stone difference. We subsequently developed AlphaGo Lee, which defeated the world champion Lee Sedol four games to one, and that was around three stones stronger than the previous version, although you have to take these stone evaluations with a pinch of salt: because we train by self-play, we are very good at defeating weaker versions of ourselves, better than we would necessarily be at beating some held-out other program. From there we went on to AlphaGo Master, another three stones stronger, which beat the top professionals 60 games to nil, and finally we ended up at around 5,000 Elo with AlphaGo Zero, the version trained completely from first principles, with all of the human knowledge removed.
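For reference (this formula is standard Elo methodology, not something given in the talk), a rating difference translates into an expected score as follows:

    def elo_expected_score(rating_a, rating_b):
        """Standard Elo expectation: the score player A is expected to take
        against player B, given only the rating difference."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # A 200-point gap corresponds to roughly a 76% expected score for the stronger player.
    print(round(elo_expected_score(5000, 4800), 2))   # 0.76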
One of the nice things about AlphaGo, and AlphaGo Zero in particular, is that it can discover human knowledge for itself. It starts from random weights, which means that everything it learns, it learns for a reason: because it has found some pattern that matters in its games of self-play. Over time it really starts to understand the game, and we can think of this as a form of understanding: it has developed the intuitions needed to evaluate positions and play the game effectively. In particular, we analyzed the opening patterns known as joseki in Go, the corner sequences that strong human players use in different situations, and all of these known joseki were discovered as training proceeded. It started from nothing, but along the timeline from zero to 72 hours we see it picking out well-known joseki that humans play. What was interesting is that some of them were later discarded: it starts to play them, but after a while it finds better variations and prefers new patterns and new ideas that human players didn't necessarily know about. So it is now adding to the knowledge base and enabling humans to play in new ways.

Finally, I'd like to talk about the most recent work, published just a couple of days ago, which we call AlphaZero. This comes from the idea that an algorithm can't really be claimed to be general unless it's applied to multiple domains. Of course the motivation for AlphaGo Zero was to be general by using less knowledge, but we wanted to demonstrate that fact, so we applied the same algorithm to three different games: not just Go, but also chess and shogi.

Chess has a lower complexity than Go, around 10^48 states, but it is still a considerable challenge, and it is interesting to us for quite different reasons. Computer chess is arguably the most studied domain in the history of artificial intelligence. The pioneers of computer science, people like Babbage, Turing, Shannon and von Neumann, were fascinated by computer chess and developed, sometimes only on paper, the very first algorithms for playing chess with a computer. Chess subsequently became the Drosophila of artificial intelligence, the fruit fly, the platform everyone used to conduct their experiments on, across decades of research. That research produced enormous progress: huge numbers of ideas were thrown in, programs got stronger and stronger, culminating in highly specialized systems that were very successful, and ultimately the program Deep Blue from IBM defeated Garry Kasparov in 1997. In the 20 years since then the state of the art has continued to progress to a level that is now indisputably superhuman; I think the top programs are something like 600 Elo points stronger than Deep Blue was. So it gives us a fantastic case study of a domain where human expertise has been leveraged and specialized to produce the best that can possibly be done in that one domain.

We also looked at shogi, Japanese chess. Shogi is interesting to us because it is both computationally harder than chess, using a larger board, and has a much larger and more interesting action space: when you capture a piece in shogi, it flips around and becomes one of your own pieces, and on a later move you can place it back down on the board. This blows up the action space and leads to enormously complicated tactical sequences, and it means that in shogi the strongest computer programs, despite a lot of effort, have only recently reached human world champion level. So it becomes an interesting case study for us.

In both chess and shogi the state-of-the-art engines are based on alpha-beta search, a form of minimax search where the minimax-optimal solution is found by a search tree, and that alpha-beta search has been augmented by enormous amounts of knowledge. Typically that includes a handcrafted evaluation function tuned by human grandmasters over many, many years; the whole history of computer chess has led to these grandmasters creating very complicated and specific features that enable chess programs to evaluate positions effectively and efficiently. In addition, they have a huge number of search extensions that are highly optimized for the particular game of chess.
I think it's useful to make this concrete by dissecting one of the top programs, so let's look at Stockfish, which was the 2016 world computer chess champion. It's a beautifully engineered program with a vast array of specialized components. The list on this slide shows its major components, most of which you would also find, or find equivalents of, in other top chess programs, and each item corresponds to potentially years of research in the computer chess community to get that idea or that particular search extension right. Some of them are very general purpose, but the majority use a lot of specialized knowledge that is specific to the game of chess. We would like to replace all of this with a system that learns for itself and can therefore be applied to other domains. If we look at the equivalent anatomy of AlphaZero, we essentially remove every single one of those components and replace them with principled ideas based on self-play reinforcement learning and self-play Monte Carlo tree search only. Literally none of those components are there: no transposition table, no opening book, no endgame tablebases, no handcrafted search extensions.

So the question is how well an approach that is really very pure can do compared to these traditional approaches that have been optimized for decades. To answer that, we need to understand some of the properties of chess and shogi compared to Go, because one of the reasons we wanted to study this was to see whether the kind of success we saw in Go was specific to Go, or whether it would generalize to games with very different properties. For example, chess, unlike Go, does not have perfect translational invariance: pawns move differently in different parts of the board, they promote on the eighth rank and can move two squares from the second rank, and so on. Chess also lacks the perfect locality you have in Go: the rules of Go are very local, defined in terms of adjacencies between stones, whereas the rules of chess include things like queen moves that can jump from one side of the board to the other. These first two properties are things that may make training more amenable to convolutional neural networks in Go. The third point is symmetry: the original AlphaGo exploited the fact that Go is perfectly symmetric, with an eightfold dihedral symmetry, whereas chess is not symmetric; there are symmetry-breaking elements, such as pawns that only move forward and castling rules that are different on the two sides of the board. So we threw out all of the symmetry augmentations that were used in AlphaGo; we wanted this to be as general as possible. The action space is also more interesting in chess and shogi, because actions are compound: you pick up a piece and then move it somewhere else, rather than just pointing at one location on the board, so there is a compound structure to the moves.
Finally, in the game of Go there are no draws, which means the outcome is binary and the value function has a very nice interpretation as a probability of winning; it may be that this led to particularly nice learning dynamics for that game. In chess we have draws, and in fact there are so many draws that they tend to dominate high-level chess play, particularly between computers, and it is believed that the optimal outcome of chess is a draw. So we needed an evaluation that takes account of all possible outcomes, and we chose to use the expected value under a standard value function, with -1 for a loss, +1 for a win and 0 for a draw.

The question, then, is whether we need to do anything different to handle all of these properties. Despite all of these differences, we applied exactly the same algorithm I described earlier in the talk, using the same network architecture and the same settings, for all three games: Go, chess and shogi.
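A tiny sketch (my own illustration of the point, with hypothetical "white"/"black" labels) of the value targets once draws are possible:

    def outcome_targets(winner):
        """Value targets for the two players: +1 for the winner, -1 for the loser,
        and 0 for both sides when the game is drawn, so the value head learns an
        expected outcome in [-1, +1] rather than a pure win probability."""
        if winner is None:            # draw
            return {"white": 0.0, "black": 0.0}
        loser = "black" if winner == "white" else "white"
        return {winner: 1.0, loser: -1.0}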
If we look at the learning curves for the three games, the x-axis again represents a measure of time, now the number of training steps, in thousands of mini-batches processed by the training algorithm, so 100,000 training mini-batches, 200,000 mini-batches and so forth. The y-axis is again an Elo rating, which should be interpreted differently for each game: in shogi and Go the scale seems more elongated, whereas in chess it is more compressed because of the prevalence of draws. What we see is that in chess, starting from scratch, from random weights, within just four hours, around 200,000 steps, we outperformed the previous world champion Stockfish. In shogi it took around 100,000 steps, or two hours, to surpass the world champion of the Computer Shogi Association. And in Go, as a control, we compared against the previous version of AlphaGo Zero; despite not using symmetries we learned to outperform it, and not only that, we surpassed AlphaGo Lee, the version that defeated Lee Sedol, in just a few hundred thousand steps, around eight hours of training.

Once these networks were fully trained, we also played matches against the previous state of the art in chess, shogi and Go. In a hundred-game match against the world champion Stockfish we did not lose a single game; against the current world champion program in shogi we won around 90 percent of the games; and we were able to defeat the previous version of AlphaGo Zero using an equivalent network architecture.

We also wanted to look at the scalability of AlphaZero's Monte Carlo tree search, and to compare it with the alpha-beta search engines used by previous programs. This is interesting because alpha-beta search has dominated these domains for 40 years, and there have been many studies suggesting that MCTS, or any algorithm other than alpha-beta, could not be competitive. Yet what we see here is that not only did we outperform these other programs, but MCTS actually scaled up more effectively than alpha-beta: the more thinking time we give the program, the more it outperforms Stockfish in chess and Elmo in shogi. And not only that, it used around a thousand times fewer evaluations per second than those programs: Stockfish evaluates around 70 million positions per second, whereas AlphaZero looks at something like 80 thousand positions per second.

This is interesting because, when Shannon was looking into computer chess, he proposed two ways to approach planning, which he called type A and type B. Type A algorithms would essentially do brute-force search, systematically evaluating every position they come across, while type B algorithms would use a more brain-like approach, carefully deciding which positions should be expanded next based on an understanding of what is going on. For decades the type A, more systematic searches have dominated, but it looks like what is happening here is perhaps something closer to the original spirit of type B: despite evaluating orders of magnitude fewer positions, if you evaluate them in the right way and are very careful about what you evaluate, you can get better performance and better scalability.

We think one reason why MCTS is so effective compared to alpha-beta, once you start using function approximators like neural networks, is that a neural network inevitably has some approximation error. Alpha-beta search is a kind of minimax search, so you can think of it as a great big glorified max operator, alternating with mins, and it will pick out the biggest errors in your function approximator and propagate them right to the root. MCTS, by contrast, averages over the evaluations it sees, which tends to cancel out those errors. We are only speculating at this point, but this may be one of the reasons why MCTS can be so effective when combined with deep function approximators.
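A toy numerical illustration of that speculation (not from the talk): the maximum of many noisy estimates is biased towards the largest error, while their average is close to the truth.

    import random

    random.seed(0)

    def noisy_eval(true_value, noise=0.2):
        """Stand-in for a neural-network evaluation with some approximation error."""
        return true_value + random.uniform(-noise, noise)

    # Fifty equally good continuations, each evaluated with independent error.
    estimates = [noisy_eval(0.0) for _ in range(50)]

    print(max(estimates))                     # minimax-style backup: surfaces the largest error
    print(sum(estimates) / len(estimates))    # averaging backup: the errors largely cancel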
As in Go, AlphaZero also discovered human chess knowledge for itself. We picked out the twelve most common human openings, all the openings played more than a hundred thousand times in an online database, and checked how often AlphaZero played them during self-play; that is the plot shown here, the frequency with which each opening was played in self-play games. AlphaZero played most of these variations extensively, but it ended up dismissing some of them, like the French Defence, later in training, as it discovered and started to prefer different variations. In addition, we played a match against Stockfish from each of these openings, to make sure we could perform robustly and hadn't just discovered a single exploit against Stockfish, and we beat Stockfish robustly from all twelve positions.

I'd like to conclude by talking about what happens when we move beyond games. The idea of AlphaZero is of course to take us beyond Go, into the realm of general-purpose reinforcement learning. At the moment these are still early steps, and I would of course hesitate to say that the real world is anything like the game of Go or the game of chess, but nevertheless one thing we are very interested in is progressing these methods towards general-purpose deep reinforcement learning. What we have been looking at, at DeepMind in general, is not necessarily the kind of planning methods we saw in AlphaZero, but a whole array of different deep reinforcement learning approaches. For example, one of our state-of-the-art algorithms, called UNREAL, is based on a much more model-free approach and learns to play in these DeepMind Lab domains by learning about a whole host of different targets; I'll actually be talking about that particular example in a couple of days at the hierarchical reinforcement learning workshop. What we'd like to do is keep progressing beyond these games, and the key idea, the thing I want to emphasize from this talk more than anything else, is that the more we take out specialized knowledge, the more we enable our algorithms to be general purpose, and the more we can hope, and start to believe, that they will transfer into something helpful and successful in domains beyond the ones for which they were designed and tested. I think there is a lesson there: every time you specialize something, you hurt its ability to generalize. So that's really it; I'll just conclude by asking for questions. I'll put this online, and there are some links if you're interested in further resources and games. Thank you very much.

Moderator: Okay, so there are microphones scattered in the aisles; if you'd like to come up and ask questions we can have a few now.

Audience: Hello, thank you for the presentation, it was wonderful. One of the things I had seen worked on in chess was that even though computers reached superhuman level, when the analysis the computer was doing was given to a human, the human plus computer could outperform the computer by itself. So I wonder whether you have done any analysis of that sort, to see if, with this type of reinforcement learning solution, humans still have a perspective, a way human intelligence works, that could add to the solution and actually outperform AlphaZero itself?

David Silver: That's a great question. I should say that for chess in particular we literally just put up these results a couple of days ago, so they're brand new and we haven't really had a chance to perform those kinds of extensive investigations. I think the question of these centaur programs, which combine a human and a machine together, is very interesting. I don't want to speculate on what would happen, because I genuinely don't know, but one comment I can pass on is the reaction of the chess community to the style of AlphaZero and the way it plays chess. The reaction has been that it plays in a much more human way than previous chess programs: it tends to play quite aggressively and freely, in the kind of open style that humans tend to favour, but of course with the additional precision of a computer. And in addition to that, it has found a new style that hadn't been seen before; it can be quite free in its ideas about when material is worthwhile.
It will, for example, give up a knight not because it sees that it can win that material back within its search tree, but because it has understood, within its deep understanding of what is going on in the position, that it can exert an advantage over its opponent that is worth more than a knight. We have an example in the games we released where it takes more than 50 moves before it wins back the material from such a knight sacrifice. But yes, an interesting question, thank you.

Moderator: Okay, two more questions.

Audience: You mentioned that the only human knowledge embedded in AlphaGo Zero was the rules of the game, so I have a more basic question: how are the output moves represented, and how do you prevent the network from making an invalid move?

David Silver: First of all, I didn't have a chance to explain this, but when I say we use the rules of the game, that includes some basic encoding of those rules into the inputs, and some basic decoding of the action back into an actual move in the game. Specifically in chess we tried a couple of things. We tried a flat encoding, which also worked, by the way, where we just have a flat vector distribution over moves, and we also had a spatial representation where the piece you pick up is spatially encoded and the square you put it down on corresponds to a particular plane. Is that knowledge? Well, it's a way to encode the rules; you need some way to interface between what the machine does and what's going on in the game, and to me that becomes part of the rule definitions. The fact that it also works with a flat representation suggests we're not too sensitive to that, and it's something we substitute between the different games. For the legal moves, yes, that's also included in the knowledge of the rules, so we do exclude illegal moves, and that's helpful in chess because you have a very large action space.

Audience: Nice talk. I'd like to know more about your experiments. The curves you show don't have error bars; is that because the error bars are too small to see, or are we seeing just one run chosen at random?

David Silver: These runs are very expensive, so this is just one run per game; it uses a large number of CPUs on Google compute.

Audience: So you've never run two experiments on this? You ran only one experiment in the history of the project?

David Silver: We are reporting a single experiment here, but we have ourselves run multiple experiments, and it's very reproducible: we get essentially the same results every time.

Audience: Thank you.

Moderator: Okay, thank you very much. [Applause]
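As a footnote to the question above about excluding illegal moves: a minimal sketch (my own illustration, not code from the talk) of masking a policy to the legal moves before renormalizing.

    import torch
    import torch.nn.functional as F

    def masked_policy(logits, legal_mask):
        """Renormalize the network's move distribution over legal moves only.
        `legal_mask` is a boolean tensor with the same shape as `logits`, derived
        from the rules of the game; illegal moves get zero probability."""
        masked = logits.masked_fill(~legal_mask, float("-inf"))
        return F.softmax(masked, dim=-1)

    # e.g. three of five candidate moves are legal
    logits = torch.zeros(5)
    legal = torch.tensor([True, False, True, True, False])
    print(masked_policy(logits, legal))   # uniform over the three legal moves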
Info
Channel: The Artificial Intelligence Channel
Views: 154,903
Rating: 4.9205747 out of 5
Keywords: singularity, ai, artificial intelligence, deep learning, machine learning, deepmind, robots, robotics, self-driving cars, driverless cars, AlphaZero, alphago, lee sedol, 2017 NIPS
Id: Wujy7OzvdJk
Length: 42min 29sec (2549 seconds)
Published: Mon Jan 29 2018