Pascal Poupart, Deep Reinforcement Learning, Oct 19 2017

Captions
Okay, good afternoon everyone, we're going to get started. It's my pleasure to give a guest lecture, and today's topic is going to be deep reinforcement learning. I'll first give an introduction to reinforcement learning and then illustrate it with AlphaGo, the famous computer program that DeepMind designed to master the game of Go and, in fact, beat some of the best human players from around the world. Some of the work I'm going to talk about comes from a paper published in Nature; you've got the reference here, so later on, if you want to check it out, that's what I would recommend.

First, let's go back to what machine learning is. Since this is a course about deep learning, I'm assuming you've already seen plenty of techniques, and most of them probably fall under supervised learning. What this means is that there is a teacher that essentially tells the learner what it needs to learn: you have some data, typically the inputs and also the outputs, both the X and the y. Because the answer is already provided, you can learn from it and it's much easier. At the other extreme we have unsupervised learning, where we only have the inputs; we don't know what the output should be, we don't have the y. This typically happens in the context of clustering or some form of data modelling, for instance in natural language processing or computer vision when you want to come up with generative models.

In between, we have reinforcement learning. Here the environment does not provide the answer as in supervised learning, but it still provides something, so it is not completely unsupervised. The information comes in the form of hints that we're going to call rewards or, depending on the literature, costs. The idea is that these hints provide feedback that tells the learner whether what it's doing might be good or bad, but not what the actual answer is. That's why it's a weaker form of feedback, in between the two, but as a result it's quite flexible and you can tackle some really complex tasks.

More precisely, we can define reinforcement learning as a process by which we learn to maximize a numerical reward signal. Again, this numerical signal is simply a piece of feedback; it does not tell us what the correct action is, it just gives us some information that we can still leverage. Where does the name come from? Reinforcement learning actually comes from animal psychology. It has been well known for a long time that you can train animals to perform various tricks, to do certain tasks, and you often get them to do those things through reinforcement. In the case of animals, negative reinforcement might be some form of pain or hunger, and positive reinforcement might be some form of pleasure or food. You can get a mouse to go
into a maze and reach a location where it can find some food, as opposed to another location where it might get an electric shock, for instance. More generally, even if we're not talking about mice, you might have a pet at home that you train to do certain things, and again you might use some form of reinforcement. The idea here is that we're going to do something similar with a computer. In the case of a computer we're obviously not going to provide food or pain, but we can use numerical signals: internally, the computer solves an optimization problem where it tries to maximize the sum of those numerical signals, and that's what drives the learning. Any questions so far? Okay, good, so this should be fairly straightforward.

Now, there are lots of examples where reinforcement learning can be used and has been used. Today I'm going to talk about the game of Go; for playing games, where there's a sequence of actions and some uncertainty, reinforcement learning is very useful. But there are also many industrial applications, starting in operations research. In fact, reinforcement learning is a framework derived from Markov decision processes, which were proposed in the 1950s in operations research, and it has been used in all kinds of applications. For instance, pricing techniques on the web, or at airlines, may use some form of reinforcement learning. Vehicle routing is another one, as well as elevator scheduling and control of vehicles; there was a famous example many years ago where Andrew Ng was able to get a helicopter to perform certain maneuvers, entered that helicopter into a contest, and essentially outperformed human pilots at those maneuvers, all done through reinforcement learning. A topic that is very important today is dialogue systems: if you've got an iPhone, then Siri is one of the classic examples where various forms of learning are used to get the computer to understand and even respond. I'm not entirely sure to what extent reinforcement learning is used within Siri, but in general it is a framework that diverse companies have used for dialogue systems. Another example is data center energy optimization: recently I was talking with some researchers at Google who told me that internally they are optimizing their data centers, because energy consumption is one of their major sources of cost, and this is now done with reinforcement learning. There is also a lot of interest from networking companies in developing self-managing network systems. Networks historically were mostly hardware, but with software-defined networks a lot of the control has moved out of the hardware into a software layer; you have lots of sensors that report statistics about the performance of your network, you can optimize the parameters, and reinforcement learning is one of the techniques being investigated for this. In any case there is lots of potential, but today I'm going to focus mostly on the first application; if time permitted, we could do one lecture on each of those applications.

Okay, so if we want to define reinforcement learning more concretely, the classic diagram works as follows:
you have an agent, a computer program that is going to select actions. In the case of a game, an action might be the next move; in a data center, it might be various controls on the parts of the system that regulate temperature, ventilation, and so on. The environment is affected by this choice of action: in a game, when you select a move, the board configuration changes as a result, and, as part of the environment, there is an opponent who plays and changes it as well. You then get to observe the current state of the environment, and along with that observation there is typically a numerical signal that tells us how good that state is.

The idea in this framework is that the agent tries to select actions that influence the environment so that it moves into states that are good for the agent. The rewards are numerical signals that simply tell us how good those states are, and the agent tries to select actions that get the environment into good states overall. There are many ways we could measure overall goodness, but one classic approach is to add up the rewards at every time step while multiplying them by a discount factor: gamma is a number between 0 and 1 that discounts future rewards, so that we pay more attention to immediate rewards and less attention to rewards further in the future (the objective is written out below). This is one framework, not the only one; there are other ways of combining rewards, but this is a very popular one.

To summarize, there is a feedback loop: the agent selects actions that influence the environment, the environment reveals its current state and provides a numerical measure indicating how good that was, the agent selects an action at every step, which influences the next state and yields a reward, and we simply want to maximize the sum of those rewards, perhaps with a discount factor. Any questions regarding this? Yes? That's a good point. The discount factor is quite common, and the intuition behind it is this: say we're in an industrial setting where the rewards are money, but there's an inflation rate. Ten dollars that you receive today is worth more than ten dollars you receive, say, five years from now, because in five years ten dollars just can't buy as much, so there is a natural discounting that happens, and there are lots of industrial problems where it makes sense to include a discount factor. But you're right that in some other situations, like games, there is no notion of discounting: the ultimate goal is to win, and if you can sacrifice something now so that you win later, then by all means that's a very good approach. We'll see that for the game of Go we will not use a discount factor.
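For reference, the objective described above, maximizing the (possibly discounted) sum of rewards, can be written as follows; setting gamma to 1 corresponds to the undiscounted case used later for Go.

```latex
% Discounted-return objective: choose a policy \pi (a way of selecting actions)
% that maximizes the expected sum of rewards, each weighted by gamma^t.
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right],
\qquad 0 \le \gamma \le 1 .
```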
Another question here: is gamma constrained to be less than one? Yes, it is, although I guess it depends how we interpret it. The gamma here is not strictly speaking an inflation rate; it is called a discount rate, and it reflects the fact that, due to inflation, the same amount of money in the future just doesn't allow you to buy as much, so in terms of actual value we need to discount to reflect the correct value in the future. A discount rate is also useful from a computational perspective whenever we consider infinite-horizon domains: if I add up an infinite number of rewards, I'm presumably going to get a sum that is infinitely large, and it's difficult to do computation with that. But if the discount factor is less than one, then even though I have an infinite number of terms, the sum is a geometric series and therefore finite, which is a nice property. So beyond inflation there is a purely practical, computational reason why people often use a discount factor. If instead you have a finite horizon, so there is an end, a maximum number of steps, then nothing can become infinite and there is no practical reason for a discount factor; you would only use one if it made sense for the problem. In games there is usually an end; the number of moves might be large, but it won't be infinite, and that's why a discount factor is sometimes not used.

Yes, the powers of gamma keep increasing, so the later terms keep shrinking; that's a good point too. Whenever you use a discount factor like this, you could say there is an effective horizon: after a certain number of steps, the rewards are discounted so much that they are effectively zero. It's interesting, because the horizon is still infinite, but implicitly we have kind of bounded it. It's not a hard bound, but effectively anything after a certain number of steps is so small that it's negligible and you could ignore it. And just for completeness, for the infinite-horizon case another common approach is to take the average reward, which also avoids any problem of infinite sums, but that's beyond the scope of this lecture. Any other questions? Yes?

Okay, so if the information provided by the environment reveals the complete state of the environment, then we can show mathematically that this is a Markov decision process satisfying the Markovian property, and as a result the last state is sufficient to predict everything that will happen in the future; you can choose your next action based on that last state alone, and that's sufficient. More generally, the environment will not always reveal the entire state, so the state will be partially observable, and then it is useful to consider many past observations to obtain more information than just the last observation. That becomes a lot more complex; these are known as partially observable Markov decision processes, but we won't
cover them in this lecture. All right, these were a good set of questions, but let's keep going.

So, with what I've defined now: reinforcement learning, as I mentioned, is essentially based on the theory of Markov decision processes; underlying a reinforcement learning problem there is always a Markov decision process. I'm not going to go into the details of Markov decision processes, but what is important to know for today's lecture is that there is a set of states, which can be discrete or continuous; a set of actions, again discrete or continuous, and furthermore we allow the agent to select actions in a stochastic fashion, so that when it makes a choice it essentially randomizes, perhaps by using a die or some other random number generator; and there are rewards, which are numerical signals, real numbers in general. Those rewards might not occur right after each action: sometimes they are delayed, or at least the reward associated with an action might only reveal itself much later. So these are the three main components: states, actions, and rewards.

To make things concrete, here is a toy example from robotics that is quite interesting. Say we have a cart on a plane, and what the cart needs to do is balance a pole. The stick here is a pole that can fall down; it is attached to the cart with a hinge, and due to gravity, unless the pole is well balanced, it will naturally fall. So the cart needs to move right or left to keep the pole in a standing position. This is a simple toy problem from robotics, and historically the way it was solved is that you would work out all the physics behind the problem, come up with equations, derive some controls, and eventually arrive at a policy that determines when to move right or left. But as humans, you may at some point have played at putting a stick in your palm and simply trying to balance it: you are not solving any mathematical equations, you are not doing physics, at least not explicitly, and somehow you eventually learn how to balance the pole. So an interesting question is: can we get a computer to do the same, without understanding all the laws of physics, simply by trial and error, just like a human would? The answer is that we can use reinforcement learning to get a cart like this to balance the pole. We can formulate the problem by saying that at any point in time the cart receives information indicating the position and velocity of the cart and the angle and angular velocity of the pole; it then selects an action, a force in one direction or the other; and it receives a reward of one whenever the pole is still standing and hasn't fallen yet. Once the pole falls, the episode is over. So the cart will try to select actions that keep the pole balanced for as long as possible (a minimal interaction loop for this kind of environment is sketched below).
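To make the setup concrete, here is a minimal interaction-loop sketch, assuming the Gymnasium CartPole-v1 environment as a stand-in for the cart in the video; the environment name, the random action choice, and the step limit are illustrative assumptions, not the code used in the lecture.

```python
# Minimal cart-pole interaction loop with a random policy (a sketch, assuming Gymnasium).
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)   # obs = [cart position, cart velocity, pole angle, pole angular velocity]
total_reward = 0.0

for t in range(500):
    action = env.action_space.sample()                      # 0 = push left, 1 = push right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                                  # +1 for every step the pole stays up
    if terminated or truncated:                             # episode ends when the pole falls (or time limit)
        break

print(f"Episode lasted {t + 1} steps, return = {total_reward}")
env.close()
```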
Now, can we get this cart to learn a good policy over time? A policy is essentially a mapping from states to actions. Our states are as defined above: position, velocity, angle, and angular velocity. This is a continuous space, and we map it to actions, which here can simply be binary, move right or move left, implemented by a force in one direction or the other. Let me show you a video. This is a YouTube video where somebody used reinforcement learning, in fact a neural network, to solve this problem. In the video you see the pole; each time the code moves the cart, the pole often falls, but eventually the controller succeeds in balancing it, and even once balanced it still has to keep moving, because gravity keeps pulling the pole down one way or the other. Okay, let's stop it there; this gives you a concrete idea of what's going on. Now the question is how the cart can learn a good policy like the one we saw in the video.

To do policy optimization there are lots of techniques. The most classic are value-based techniques. Since we are trying to maximize the sum of rewards, we can define a value function: a mapping from a state (or a state and action) to the expected discounted sum of rewards, where those rewards depend on the policy and on the transition function of the environment that determines how the state evolves (the value function is written out at the end of this part). The objective is to find a policy that gives the best value function, and once we have a good value function there are ways to extract a policy from it. One very common algorithm of this kind that you might have heard of is Q-learning, but there are many other algorithms, and there is a whole literature on value-based techniques going back at least 30 years.

For this lecture I'm going to talk about another family, known as policy search techniques. In this case we search directly in the space of policies for a representation of the policy that directly maximizes the value function. The difference is that in policy search we work directly with a representation of the policy and adjust it until we attain a policy that gives high rewards, whereas in value-based techniques we don't have an explicit representation of the policy; we simply try to find a good value function and then recover a policy from it. In the interest of time, I will focus only on policy search in this lecture, and in particular on the technique known as policy gradient. The reason policy gradient is a good place to start is simply that, if you are already familiar with supervised learning, we'll see in a moment that it is just one small tweak compared to supervised learning.
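For reference, the value function described in words above, the expected discounted sum of rewards obtained from a state when following a policy, is usually written as:

```latex
% Value of policy \pi starting from state s. Value-based methods estimate this quantity;
% policy-search methods optimize it directly over a parametrized family of policies.
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_t \;\middle|\; s_0 = s \right].
```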
Suppose first that we could somehow get the correct action at any point in time; then we could use supervised learning. Consider a space of stochastic policies defined as a probability of actions given each state, parametrized by some weights W. This function can be represented in any way; for the purpose of this lecture, let's assume it is represented by a deep neural network. The input is the state, then there is a network, and at the top there is a layer that produces the probability of action 1, the probability of action 2, and so on. So we have a neural network that takes the state as input, which can be a convolutional network or any other type of network, and at the top a softmax layer gives us a distribution over possible actions.

Now suppose we have data where somebody tells us the optimal action for every state; this is the supervised setting, which we'll relax soon. Then we could train our policy simply by maximizing the likelihood of the data, that is, the likelihood of selecting the correct action in each state, which we can do by maximizing the sum of the log probabilities of each optimal action given each state. With a deep neural network representation you can use gradient descent and automatic differentiation to compute the gradient and adjust the weights so that, over time, the network converges towards giving high probability to the correct action in each state. The gradient update is simply this equation, which platforms like TensorFlow or PyTorch can compute for you automatically.

Yes, here I just took a step back to look again at how supervised learning works. In supervised learning there is no notion of a value function, because we are told the correct action; this is what we would do if we could somehow find out the correct action in every state. A minimal sketch of this supervised update is shown below.
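Here is a minimal sketch of that supervised (imitation) step, assuming a PyTorch softmax policy over a small discrete action space. The network architecture, dimensions, and all names are illustrative assumptions, not the lecture's actual code.

```python
# Supervised training of a softmax policy: maximize log p_W(a* | s) for demonstrated actions a*.
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 2            # e.g. cart-pole; any dimensions work

policy = nn.Sequential(                  # p_W(a | s): state in, action logits out
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),          # a softmax over these logits defines the policy
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def supervised_update(states, optimal_actions):
    """states: (B, STATE_DIM) float tensor; optimal_actions: (B,) long tensor of a*."""
    logits = policy(states)
    # Cross-entropy is the negative log-likelihood of the correct action, so minimizing it
    # is exactly maximizing sum_t log p_W(a*_t | s_t).
    loss = nn.functional.cross_entropy(logits, optimal_actions)
    optimizer.zero_grad()
    loss.backward()                      # autodiff computes d log p / d W for us
    optimizer.step()
    return loss.item()
```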
Now let's consider reinforcement learning, where again we have a stochastic policy represented by a deep neural network in the same way. The difference is that now we get triples of state, action, and reward, and even though we get an action, it is not necessarily the optimal action: on the previous slide the star meant the optimal action, whereas here it is just some action, and with it comes a reward that indicates how good that action was. This is a relaxation of the problem, and it is more general, because now we can use state-action pairs with any actions, together with a corresponding signal.

With this we can also do a form of gradient ascent. The optimization problem is now to find the set of weights that maximizes the expected discounted sum of rewards; there is an expectation over the rewards because there is some stochasticity, either due to the policy or due to the environment. If we treat the weights W as the variable and compute the gradient of this objective, we can show that the gradient corresponds to this equation, and if you look carefully, it is not very different from the supervised equation: the main difference is the factor gamma raised to the power t times R_t, where R_t is the sum of the rewards along one trajectory. If we execute the policy and take a sequence of actions until the end of the horizon, which might be infinite (hence the infinite sum), we can add up those rewards, and that is essentially one sample from one execution of the policy. So computing the gradient for a policy update is very similar to supervised learning; the main difference is this factor gamma^t times R_t.

How do we explain this? There is a theorem known as the policy gradient theorem, which says that the gradient of the value function of a policy parametrized by W decomposes as follows: the stationary distribution of the states when we execute the policy for a long time, times the gradient of the probability of choosing each action in each state, times the value obtained when we are in a certain state, execute that action, and follow the policy thereafter. The two quantities introduced in this theorem that were not defined previously are the stationary state distribution and the Q function, the discounted sum of rewards when we start in state s, execute action a, and follow the policy parametrized by W thereafter. To be honest, I would normally need more than one lecture to really explain in detail where the stationary distribution and the Q function come from; for now I've given you a definition in words, and we'll simply use those terms in the derivation.

The theorem can be derived with this set of equations. I'm not going to explain all the steps; they are essentially rewritings that use the definitions of each of those terms, and at the end we arrive at the statement of the theorem, which shows that the gradient decomposes in this fashion. We can then rewrite the gradient in the form of an expectation and sample just one term of that expectation, which gives us a sample gradient for doing stochastic gradient ascent. So in the previous slide, when I explained how to do a gradient update with respect to a policy, I was using this term, and this is the derivation needed to obtain it. The key is that we have a definition in terms of the Q function and the stationary state distribution, and then we take one sample from the expectation, which gives us the term that we can use to do an update based on one trajectory when we execute the policy.
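Written out, the policy gradient theorem described above takes roughly the following form, where d^{pi_W} denotes the stationary state distribution under the policy and Q^{pi_W} the state-action value; the expression on the right is the single-trajectory sample estimate used on the slides.

```latex
% Policy gradient theorem and its one-sample (REINFORCE) estimate.
\nabla_{W} V^{\pi_W}
  \;=\; \sum_{s} d^{\pi_W}(s) \sum_{a} \nabla_{W}\, \pi_W(a \mid s)\; Q^{\pi_W}(s,a)
  \;\approx\; \gamma^{t}\, R_t \,\nabla_{W} \log \pi_W(a_t \mid s_t),
\qquad R_t \;=\; \sum_{k \ge t} \gamma^{\,k-t}\, r_k .
```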
This is a beautiful result, and it led to an algorithm that is actually called REINFORCE. It is not the only reinforcement learning algorithm, but when it was proposed, back in 1992, there were not many algorithms for reinforcement learning, and I guess at the time the author figured he might as well call his algorithm REINFORCE, so it would be quite clear that it is for reinforcement learning. It is still one of the most popular policy gradient algorithms today; it is often the basis of most policy gradient techniques, there are lots of variants, but this is essentially the base and the most common approach.

Can it only be applied at the end of an episode? Sure, you can adapt it. Here R_t is the sum of all the rewards we would get until the end of the horizon, so as stated this works in the context of episodic problems: in the case of games, for instance, it would be the sum of the rewards until the end of the game, so you can think of every game as one episode and learn from that. To learn within one game you would need a variant, and there are ways to do updates after every single action, but for the purpose of this lecture we're just going to go with this basic form. A sketch of the basic per-episode update is shown below.
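Here is a minimal sketch of the per-episode REINFORCE update, reusing the kind of softmax policy network sketched earlier. The function name, discount value, and looping structure are illustrative assumptions that follow the update on the slide as described, not DeepMind's or the lecture's code.

```python
# REINFORCE: after one episode, push up log pi(a_t|s_t) in proportion to gamma^t * R_t.
import torch

GAMMA = 0.99

def reinforce_update(policy, optimizer, states, actions, rewards):
    """One episode: states is a list of (STATE_DIM,) tensors, actions a list of ints,
    rewards a list of floats, all of the same length."""
    returns, g = [], 0.0
    for r in reversed(rewards):              # R_t = sum_{k >= t} gamma^(k-t) * r_k
        g = r + GAMMA * g
        returns.insert(0, g)

    loss = 0.0
    for t, (s, a, R_t) in enumerate(zip(states, actions, returns)):
        logp = torch.log_softmax(policy(s), dim=-1)[a]
        # Gradient ascent on gamma^t * R_t * grad log pi(a_t|s_t) == minimize the negative.
        loss = loss - (GAMMA ** t) * R_t * logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```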
All right, now let me go on with the example of AlphaGo, and we'll see that this REINFORCE algorithm is indeed used in AlphaGo and was a key component. Let me first introduce the game of Go. It is a popular game, especially in Asia, and a very complex game, although the rules are simple to describe. There are two players, black and white, who alternate placing a stone of their color at one of the intersections on the board. After each move we check whether any stones have no liberty, where a liberty simply means an empty intersection adjacent to that stone or to its connected neighbors. If a set of connected stones of the same color has no liberty, no adjacent free intersection, then those stones are surrounded and captured, and they are removed from the board. That is how the game goes: the players put down one stone after another, whenever some stones are surrounded they are captured and removed, and this continues until both players agree on which region is controlled by which player. For instance, here we might see a big region controlled by black, because black essentially surrounds it, and then we count the stones of each player together with the regions they control to determine the score. That's the high-level picture; there are a few additional rules that matter in the game, but these are the main ones, and that's really all you need to know at this point. Any questions about this? Okay, good.

Despite the fact that the rules are very simple, this game is actually very complex, much more complex than chess, for instance, and there were decades of work on getting a computer to play it; it was only in 2015 that there was a major breakthrough in achieving a level of play comparable to humans. In 2015 the state of the art consisted of computer Go programs such as GNU Go, Fuego, Pachi, and Crazy Stone, which achieved a performance that was okay in the sense that they could beat beginners and match some amateurs. Here is a scale for ranking players, where higher is better, and in Europe there was an expert player named Fan Hui, a professional player of roughly 3 dan according to this scale: that's the green column. In 2015, Google DeepMind designed AlphaGo, and AlphaGo essentially matched and eventually exceeded Fan Hui; several matches were played between AlphaGo and Fan Hui, and I believe AlphaGo won four of those five matches.

So what happened? How did AlphaGo become better than one of the top players, or at least the top player in Europe, and also completely outclass these other computer programs? Those previous programs were essentially using Monte Carlo tree search, another technique that is also used within reinforcement learning, and we'll see that AlphaGo uses Monte Carlo tree search too, but not only that: it also uses the policy gradient update that we just saw. Monte Carlo tree search alone was good, but not sufficient to achieve professional play; with deep reinforcement learning, AlphaGo was able to really break out and beat Fan Hui. That was October 2015. Then in March 2016 AlphaGo beat another top player, Lee Sedol, who was ranked 9 dan; if I go back, you can see that 9 dan is at the top of this scale, so he was one of the top players, though not the top player. At the time the world champion, Ke Jie, claimed on social media that AlphaGo could not beat him, but in 2017 there was a match between him and AlphaGo, and sure enough AlphaGo defeated him. At that point he wrote: "Last year AlphaGo was still quite human-like when it played, but this year it became like a god of Go." So at this point AlphaGo is the uncontested program for playing Go, and it has defeated the top human players of the game.

How did it do this? The strategy had four steps. First, supervised learning was used to train a policy network to imitate professional players; but just imitating professional players does not mean you will beat them. You can get good by imitating, but at some point, if you want to surpass them, you have to come up with your own strategy. This is where reinforcement learning came in: it was used to learn both policy networks and value networks, which were then combined with a Monte Carlo tree search technique. I'm going to explain those four steps. First, the policy network was a deep convolutional neural network whose input is a state, where the state is essentially a board configuration. As we saw earlier, a board configuration is a
19-by-19 grid, where each intersection can be occupied by a white stone, a black stone, or be free. So the network takes a board configuration as input, applies convolutions through multiple layers to recognize patterns, and at the top produces a distribution over the next action, where the next action is placing the next stone at some free intersection: every free intersection is a possible action, and the network outputs a distribution over those (a toy version of this architecture is sketched at the end of this part). That is the policy network, and it was trained initially to imitate Go experts based on a database containing 30 million board configurations; there is a server called the KGS Go Server that holds this database, and it was used to train the network.

A question about recurrent networks: certainly. What I'm describing now is how AlphaGo was designed, and there was no recurrent neural network used in AlphaGo, at least not that I'm aware of. What I'm describing is the implementation that was used in 2015 to defeat Fan Hui; I do not know the details of what changed for 2016 and 2017, and in the current state of the art in reinforcement learning many techniques do use recurrent neural networks, but since we only have one lecture we won't have time to go through those details, and we're just going to stick to what was implemented in AlphaGo.

So the network was trained with the KGS Go Server data; it essentially imitated the experts and obtained a reasonable policy. This training is a form of supervised learning, because in the database the moves chosen by Go experts effectively provide the optimal action at each of those board configurations, so we don't need reinforcement learning here, just supervised learning: we define the gradient as the gradient of the log probability of that best action in each state and update the weights of the neural network with it.

Yes, there are lots of states: the board is 19 by 19, so 361 intersections, and each intersection can hold a white stone, a black stone, or be free, which means there could be roughly 3 to the power 361 configurations. That is a huge number, and even though they had 30 million board configurations, this is just a tiny fraction of the entire state space. This is where machine learning is very useful, because it can generalize from a set of examples: from this tiny fraction it will presumably learn some strategies and extrapolate, so that when it is given a new board configuration it selects an action that is hopefully good and that corresponds more or less to a strategy that can be inferred from the database.

That's a good point, but we'll see in a moment that they didn't stop there. This gave them a policy that was in fact not that great; it was still a decent policy, but reinforcement learning had to be used afterwards. The supervised step was essentially a way to start with a reasonable policy so that the system did not have to learn from scratch; it helped cut down the amount of learning time.
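To give a concrete picture of what such a policy network looks like, here is a toy, much smaller stand-in in PyTorch: convolutions over an encoded 19-by-19 board, ending in a softmax over the 361 intersections. The number of input planes, the layer sizes, and all names are illustrative assumptions; the real network was deeper, used many more input feature planes, and masked illegal moves.

```python
# Toy AlphaGo-style policy network: board planes in, distribution over 361 moves out.
import torch
import torch.nn as nn

BOARD = 19
IN_PLANES = 3          # e.g. my stones / opponent stones / empty (AlphaGo used many more planes)

class PolicyNet(nn.Module):
    def __init__(self, channels=64, blocks=4):
        super().__init__()
        layers = [nn.Conv2d(IN_PLANES, channels, 3, padding=1), nn.ReLU()]
        for _ in range(blocks):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 1)]              # one logit per intersection
        self.body = nn.Sequential(*layers)

    def forward(self, board):                              # board: (B, IN_PLANES, 19, 19)
        logits = self.body(board).flatten(1)               # (B, 361)
        return torch.log_softmax(logits, dim=-1)           # log-distribution over the next move

net = PolicyNet()
probs = net(torch.zeros(1, IN_PLANES, BOARD, BOARD)).exp()  # sums to 1 over the 361 moves
```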
Now, once they obtained the policy network by supervised learning, in order to surpass those experts the network needed to learn its own policy, and this was done through reinforcement learning, using the REINFORCE algorithm I explained earlier. Again you compute a gradient; the main difference is that the gradient is not just the partial derivative of the log probability of the action given the current state, it is also multiplied by the discount factor and by the sum of the rewards for that particular game. The intuition is this: if you play a game and you're not told what the optimal action is, you simply choose an action at every step, and there is no reason to compute a gradient that increases the probability of choosing that action again unless the action leads to a high reward. By multiplying by R_t we scale the gradient so that it becomes large if the action leads to a win with a high reward, small if it leads to a loss, and in fact negative if you define a loss as a negative reward. So it adjusts the weights, by taking the rewards into account, in the right direction. In this case, for each game, R_t is +1 for a win and -1 for a loss. If it's a win, this action was presumably very good; we don't know for sure, since perhaps you could have won by a bigger margin (in Go you can count your stones and territory against the opponent's and win by a larger or smaller margin), but if we only care about winning, then we can just use +1 or -1 and assume that if it's a win, the action was as good as any other, essentially optimal, which gives the same update as supervised learning. If it's a loss, we take a step in the opposite direction of the gradient, because we multiply by -1, which decreases the probability of that action. As for the discount factor, it was set to one, because the game eventually ends, so based on our earlier discussion there is no need for discounting; this was effectively the gradient that was used.

Right, I didn't explain how it actually played its games: essentially it played games against itself, so we are optimizing one policy against an opponent that uses the same policy, where that opponent's policy is, say, fixed; it plays against itself and tries to improve with respect to its current policy. Now, there is a danger when you play against yourself like this: you might have a policy, then improve it in a certain way, but then after
that you might get into a cycle, where you keep producing sequences of policies, each better than the previous one in some matchup, without ever really improving overall. To counter this, when they get the computer to play against itself, they store all the previous policies, so that it tries to improve with respect to all of them; that makes it much harder to get stuck in one of those vicious circles, since it has to improve with respect to everything it has come up with. Exactly, yes.

So far I have talked about the policy network, but on the next slide we'll see a value network as well; these are essentially two deep neural networks, one corresponding to a policy and one to a value function, and in both cases I believe they used some form of convolutional neural network. And yes, that's right: deep reinforcement learning is just reinforcement learning where the function approximator is a deep neural network, as opposed to some other function approximator.

The next thing they did was to train a value network, which tries to predict, from a certain board configuration, whether following the policy will lead to a win or a loss. The rewards are defined as +1 for winning and -1 for losing, so the value network produces a single number between -1 and +1: close to +1 means win, close to -1 means lose, and in this fashion it tries to predict who is going to win. It was trained by minimizing the Euclidean (squared-error) distance between what the value network currently outputs and the actual outcome of the game: again the network plays against previous versions of itself, the outcome is +1 for a win and -1 for a loss, and we minimize the distance between prediction and outcome. So this is just applying a deep neural network trained to minimize a Euclidean distance; a small sketch of this regression step is shown below.
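A minimal sketch of that regression step, assuming a `value_net` in the same convolutional style as the policy network but ending in a single tanh output in (-1, 1); the function name and shapes are illustrative assumptions.

```python
# Value-network regression: predict the game outcome z in {+1, -1} from a board position.
import torch

def value_update(value_net, optimizer, boards, outcomes):
    """boards: (B, planes, 19, 19) float tensor; outcomes: (B,) tensor of +1 (win) / -1 (loss)."""
    v = value_net(boards).squeeze(-1)          # predicted value in (-1, 1), e.g. via a tanh head
    loss = torch.mean((v - outcomes) ** 2)     # Euclidean / squared-error loss to the outcome
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```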
Finally, they combined the value network and the policy network through a Monte Carlo tree search technique. The way this works is as follows. With the policy network, at any point in time you could simply follow the policy and choose an action; or you could use the value network, consider every possible action, see what state you would end up in and whether you would win or lose from there, and select the action that leads to a predicted win. You could use either one, but with the value network you don't have to simulate just one step: you can simulate multiple steps, and the deeper your simulation goes, the more accurate this becomes, because you only have to make a prediction much later, after many steps. The traditional approach for getting a computer to play a game is in fact to do such a simulation: you are in a certain state, you simulate what would happen for every possible move up to a certain depth, and then either you go all the way to the end or you stop early and make an approximation, a guess about what will happen in the rest of the game. Here they use this approach to improve the policy: if you can delay the predictions you need to make until after a few steps, then at least those few steps are computed correctly, and you obtain a better policy overall.

Monte Carlo tree search is an approach that expands the search tree in a way that focuses on the parts of the tree that are likely to lead to a good policy. The problem with Go is that there are lots of possible actions. Say this is our current state and we consider the possible actions; in this diagram I have only drawn two, placing a stone here or placing a stone up there, but if you consider the entire board there are 19 by 19, so 361, intersections, which means there can be hundreds of possible moves. This first layer could therefore have hundreds of possible configurations, and every layer you go deeper multiplies the number of configurations by a factor in the hundreds, so you clearly cannot expand everything exhaustively: the tree is exponential in the depth, with a branching factor in the hundreds. Monte Carlo tree search instead says: let's consider only the most promising actions, and do simulations that go deeper along those. Say there are really just two actions that look good; we expand them, keep going, and go deeper, spending most of our effort on the promising actions rather than being exhaustive. The drawback is that, unless you already have a correct way of pruning the bad actions, there is a danger that you believe these two actions are the most promising when they are not, and you ignore the best action, in which case this would not work. So Monte Carlo tree search is greedy and tries to go with the most promising actions, but there is some stochasticity in how it chooses actions, so that in the long run it will consider every possible action. With this trade-off it has formal guarantees: given enough simulations, in the limit it will investigate all possible moves and find the optimal one, but in practice it does not do that; in practice it is more opportunistic, and that tends to work well. So the tree is not fixed; it all depends. We're going to see in a moment how the search works, and the idea is that it starts with a tiny tree and grows it both in depth and in width as it does more simulations.

In this tree, what should be clear is that every node is a state and every edge is an action. With every edge we store the Q function, which is an estimate of the sum of rewards; the probability of choosing that action according to our current policy; and a count of how many times we have chosen this action in this state.
We store these three things, and every time we do a simulation we update the counts, incrementing the count of every state-action pair that we visit. The choice of action is then done as follows: whenever we are in a state and looking at the possible children, meaning the possible actions, we select the action that maximizes the Q function plus a bonus u(s, a). This is an exploration bonus that ensures we eventually explore lots of actions rather than getting stuck on the most promising ones, which might be wrong if we don't have a good estimate to start with. The exploration bonus is proportional to the probability of choosing that action (I think there is a typo here: I believe this should be the probability of a given s rather than s given a), so the probability of the action in that state under our current policy, divided by one plus the number of times we have chosen that action. The idea is that at the beginning the search focuses on the actions that are most likely under our current policy, but once it has tried an action many times, the bonus shrinks because of that denominator, so over time it is gradually forced to explore.

The Q function here is not defined in the traditional way: it is a weighted combination of the estimate from the value network and the estimate that comes from simulating all the way to the end of the game, where the rollout outcome is again +1 or -1 depending on whether we win or lose. We take a weighted combination of those two, which is really a heuristic: lambda is a number between 0 and 1, so it is a convex combination, and the sum means we average over all the previous iterations, all the previous trajectories that went through that state-action pair, looking at their value estimates or their actual win/loss outcomes. That gives us an estimate of the value of choosing a certain action in a certain state, and we combine it with the exploration bonus (the selection rule and leaf evaluation are summarized below).

Concretely, we start in a certain board configuration and need to choose an action, say among these two; we select one of them according to the formula up there. If it is an action we have chosen in the past, we simply go down that branch; if it is a new action we have never chosen before, we create a new branch in the tree, which is how the tree gets wider. We keep going like this until we eventually reach a leaf, and at that leaf we create a new child, so the tree also grows in depth. This is one very specific variant of Monte Carlo tree search; there is a whole literature on Monte Carlo tree search, this is the variant AlphaGo used, and presumably some of the other variants could have worked fine too. The main thing is that this variant uses the value network and the policy network inside the search, and that is how it combines the policy network and the value network.
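In symbols, the selection rule and the leaf evaluation just described look roughly like this, following the AlphaGo-style search as presented in the lecture; z denotes the +1/-1 rollout outcome and v(s) the value-network prediction.

```latex
% Tree-policy action selection: exploit Q, plus an exploration bonus u that shrinks with visits.
a_t \;=\; \arg\max_{a}\;\Big( Q(s_t, a) + u(s_t, a) \Big),
\qquad
u(s, a) \;\propto\; \frac{P(a \mid s)}{1 + N(s, a)} .

% Leaf evaluation: convex combination of the value network and the rollout outcome.
V(s_L) \;=\; (1 - \lambda)\, v(s_L) \;+\; \lambda\, z_L .
```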
Yes, they were both learned by self-play, and after that they were combined to obtain an improved policy through these kinds of simulations. What I'm not sure about is this: I think they were trained initially through self-play, and then, when Monte Carlo tree search was used afterwards, they were fixed and there was no further training, but I'm not entirely certain. The Monte Carlo tree search was mostly used to combine them and obtain an improved policy.

The results of applying this in the games AlphaGo played against Fan Hui, one of the top players in Europe, were the following. There were formal and informal games; I believe the main difference is that the informal games had shorter time controls, meaning the players had less time to select each move, and only the formal games really counted in this match. You can see here who won each game, and in the end AlphaGo won four of the five formal games.

This completes what I wanted to present today. To summarize, we talked about one technique, policy gradient. It is quite natural, especially for people with a good background in supervised learning, because we saw that computing the gradient is just one small tweak: we compute the gradient as usual, but multiply it by the sum of the rewards along one trajectory. This was used effectively in AlphaGo; there are lots of other techniques that also use it, and many others that do not. If you want to learn more about deep reinforcement learning, here is a plug for a course I will be teaching: CS 885, Deep Reinforcement Learning, in spring 2018. We will also cover deep Q-networks, which have been quite important; recurrent deep RL, which answers the question asked earlier; memory networks; and, in terms of applications, how all of this can be used in conversational systems as well as robotics.

That's a very good point. One of the keys to AlphaGo's success is that you can look at the board and think of the black and white stones as black and white pixels in an image, so it made total sense to use a convolutional neural network, which had been successful in all kinds of image recognition tasks. The question Ali is raising is: if we take another game where the board does not look like a regular image, will this still work? In theory the answer is yes: the deep neural network takes some state as input and produces actions, and deep reinforcement learning has been used for all kinds of problems, including problems in robotics where the inputs are measurements from accelerometers and other sensors on the robot, not just camera images, and it does work there as well. What happens is that the neural network has to generalize in a different way: it won't necessarily use convolutions to extract patterns the way it does with images, but if it can automatically come up with a good intermediate representation, features that make sense for the problem, then it will eventually be able
to select the right actions. The problem is that there is still a gap in our understanding of what deep neural networks do when they come up with these intermediate features, and when this will work. We know at least that for images, convolutional neural networks do a very good job and generalize well, but for other types of problems that are not images it is not always clear; still, it has been used successfully in contexts like robotics.

Yes, we could still have an image-like representation for chess if we give a numerical score to each piece, and in fact that's often done, because there are heuristics for determining the value of each piece; you could use that, and it would be interesting to see what happens. I don't know for certain, but I'm ready to bet that somebody has tried a similar approach in the context of chess, and even if somebody did, there wouldn't be a big press release about it doing well, simply because chess is considered solved in practice: just by doing an extensive search with classic techniques like alpha-beta pruning, you can already beat some of the best humans, so the game has effectively been conquered with classic techniques. A student suggests that instead of a vanilla convolutional network, one could use something like a variational autoencoder for chess, because that way you are learning a representation: yes, there are plenty of things one could try.

I know we don't have a lot of time, but let me show you one more video, in the context of robotics; it's a video by DeepMind that went viral, and some of you have probably seen it. We've got a humanoid that was trained to go under and over some obstacles. (These are just ads from Google, that's pretty annoying; okay, I got rid of them.) It's interesting how the arm just stays up, and there are some interesting movements it makes with it. [Laughter] Okay, let's stop it there. I guess this is it for today, and thank you very much. [Applause]
Info
Channel: Data Science Courses
Views: 2,098
Rating: 5 out of 5
Keywords: Deep Reinforcement Learning
Id: HEc16L58wDc
Length: 80min 8sec (4808 seconds)
Published: Sat Oct 28 2017