TensorFlow and deep reinforcement learning, without a PhD, by Martin Gorner

Video Statistics and Information

Captions
And welcome back for this last sequence, on reinforcement learning. Who has heard something about reinforcement learning? A couple of raised hands, but everyone here has heard about reinforcement learning, because almost everything that comes out of DeepMind, including AlphaGo and the Go championship against Lee Sedol, is using reinforcement learning, which is really the hot area of research right now.

In the previous two hours I was sharing with you common architectures that you can reuse in your projects; I wanted to share the good practices for assembling the different Lego pieces into neural networks that do useful things. Now let's be a little more forward-looking and try to see how this reinforcement learning principle works. For that we will be using Pong. The game of Pong is from the 70s, and what we will try to do is to train a neural network to play Pong from just the pixels of Pong. Initially this neural network not only doesn't know how to play Pong, it doesn't even know what it's doing here; it doesn't know that it's playing, it doesn't know what the rules are. If this was a person, you would teach it by first explaining the rules, then having them play so that they get reflexes, then maybe explaining some pieces of strategy. That's one approach. But we have around us examples of learning that happen in a completely different way. For instance, in the animal kingdom, a cat catching a mouse: the cat learns that somehow, but nobody explains to it what a mouse is, that you're supposed to catch it, and that to catch it you're supposed to run, and so on. This reinforcement learning algorithm actually explores a different way of learning things, and Pong here is just an excuse, just an example, to see how this can work. Again, we will be learning to play Pong from just looking at the pixels, and of course having the information about when a point was scored by us or against us.

So we take the pixels and run them through a dense neural network, which I have represented here. We know the drill by now: we want, at the end, to produce three probabilities, so three neurons in the last layer, activated by softmax. All of that is just as previously. We basically want to classify this board position into: your next move should be to move the paddle up, move the paddle down, or leave the paddle where it is.

We will use the cross-entropy loss, and this time you will not be spared the formula. The cross-entropy loss that I have been advertising since the beginning is simply the correct answer multiplied by the logarithm of the predicted probability, summed over the classes and with a minus sign in front: loss = -sum_i( y_i * log(p_i) ). So if my network predicts a series of three probabilities, P_up, P_still and P_down, I take the logarithm of each and multiply it by what I know to be the correct answer. But how do I know that? How do I know what the correct move is at this point? If I knew it, yes, I could do this multiplication, that would be my loss, and I could continue my algorithm, compute a gradient, train my network and so on. But I don't know what the correct move is supposed to be here; I have no idea. So we will do it slightly differently. Once we have those probabilities, we will sample from them. What does it mean, sampling from probabilities? It means rolling a loaded die that will produce a random output, but not completely random: random according to the probabilities that were computed.
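As a concrete picture of what sampling from probabilities means, here is a minimal sketch of my own (not the talk's code), with made-up probability values:

```python
import numpy as np

# Hypothetical output of the softmax layer for one board position:
# probabilities for the three possible moves (made-up values).
probs = np.array([0.7, 0.2, 0.1])     # P_up, P_still, P_down
actions = ["up", "still", "down"]

# "Rolling the loaded die": pick one action at random, but respect the
# probabilities, so "up" comes out roughly 70% of the time here.
move = np.random.choice(actions, p=probs)
print(move)
```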
So if P_up was super probable, it's up that will come up most of the time, and if P_down was very, very improbable, but not completely, maybe down will come up only once in a hundred times. That's called sampling from those probabilities: we take one possible answer, but we respect the probabilities of those answers. So let's say we sampled and we obtained up: we move the paddle up by one notch and we keep playing.

Just a little technicality that has to do with Pong specifically: in Pong, if you feed in the board state, there is not enough information to see where the ball is going; if the image is static, you can't see whether the ball is moving forward or backward. So to give that information to the system, you actually feed the difference between two consecutive frames. In the difference between two consecutive frames you see the trajectory of the ball. That is just a wrinkle that is specific to Pong; all the rest is generic for all kinds of games where you have moves to play.

After you play a sequence of moves, at some point a point is scored, and there you know that there is at least something good in the sequence of moves if you managed to score. So this time, once you have scored your point, for each move you compute this cross-entropy loss slightly differently. At each move you know what your network predicted; those probabilities P_up, P_still and P_down are known, so you can compute the logarithms. In place of the correct answers, you put the move that you actually played. And the final difference is that you multiply all of that by the reward you got at the end: if you won, the reward is +1 and you are adding this positively; if you lost, the reward is -1 and you are adding negative values to this loss.

And now we have enough; now we can actually start training this system. We can compute this loss; it is computed from the probabilities, those probabilities are computed by our neural network, which has weights, so we can differentiate the loss with respect to the weights, which gives us a gradient, and yada yada yada, we know how to move the weights so as to penalize the moves that ultimately made us lose, and to make more probable the moves that ultimately led to us scoring a point.

Two little technicalities. The first one is that in this sequence of moves, what you did at the end is probably where you did something correct; the fact that you bounced the ball three minutes earlier is probably less relevant. It's not always the case, but it is customary in reinforcement learning to apply a decay factor to your reward across the time steps. And here I simply simplified this loss: it's exactly the same formula, but you realize that the sampled move is a vector that contains one 1 and two 0s, so the formula always simplifies; you always keep only one of the three terms in the sum, and it ends up being the reward at some step multiplied by the logarithm of the computed probability, at that step, of up, still or down, whichever you chose at that step. The second trick that you usually have to apply is to normalize all those losses, because, well, I won't dwell too much on this, but usually you have many steps that led to you losing and only a few where, initially by chance, you hit the ball and scored a point. When you normalize them, that assigns roughly equal weight to those two outcomes, instead of having a training data set that is heavily imbalanced towards you losing.
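For concreteness, here is a small sketch of those two tricks, the decay factor and the normalization. It is my own illustration with a made-up reward sequence, and the discount value of 0.99 is an assumption, not a number from the talk:

```python
import numpy as np

def discount_and_normalize(rewards, gamma=0.99):
    """Propagate each scored point backwards with a decay factor,
    then normalize so wins and losses carry comparable weight."""
    discounted = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:          # a point was scored: restart the discount
            running = 0.0
        running = running * gamma + rewards[t]
        discounted[t] = running
    # Normalization: zero mean, unit variance across the batch of moves.
    discounted -= discounted.mean()
    discounted /= discounted.std() + 1e-8
    return discounted

# Made-up example: five moves with no reward, then a lost point (-1).
print(discount_and_normalize(np.array([0, 0, 0, 0, 0, -1.0])))
```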
All right, so this is how your training data set will look. For each move you know what your network predicted, you know what you actually did, and you know what the discounted reward for this move is, because you played until the end, computed the reward and discounted it across the whole point; and of course you restart your discount every time a point is scored. So the data set looks like this: it's just a big table of numbers, and on this data set the loss is simply the multiplication of everything in one line.

A couple of hyperparameters, because you don't train these things exactly as usual. For instance the batch size: we are using 20,000 moves at once, so we play many games until we have accumulated 20,000 moves, along with their rewards and their computed probabilities, and then we apply the loss and back-propagation once on that batch. It uses RMSProp as the optimizer; there is a whole collection of optimizers, we used this one and it worked. And the architecture really is as simple as a one-hidden-layer neural network. This was to test whether this really needed complex architectures, or whether it could work with something as naive as that.

So how would you write this in TensorFlow? Again, we won't go through the entire code, and it's missing bits and pieces, but I just want you to spot where the big pieces are. Initially we have a first dense layer that reads the board observation, which, as I said, is the difference between two frames of the board; then a second dense layer after it that implements the softmax. So you have two dense layers, and in the code you see them twice. This produces a set of probabilities, and I sample from those probabilities; in TensorFlow you do this using the tf.multinomial function, and if you are surprised not to see the softmax here, it's because it is done automatically inside multinomial. You obtain a sampling operation, in the same way as when you ask an optimizer to minimize your loss you obtain a training operation. We haven't touched on this too much yet, but we talked about it a little: what I want to stress again is that TensorFlow uses a deferred execution model. Everything you are building, TensorFlow builds as a graph, so when I call tf.multinomial I don't sample at that step; at that step I create a node in my graph that does the sampling, and it will sit somewhere in the graph. When I actually want to sample a move, I will have to do something extra, which I'll show you on the next slide.

At the top here I have placeholders which will receive values: the placeholder that will receive the observations, the one that will receive the actions that were actually sampled, and the one that will receive the rewards when we finish our game and have all of those rows with their discounted rewards. Why do we need all those things? We need the observations because that's what we run through our neural network to produce the probabilities of the next move. We need the actions because that's how we compute our cross-entropy loss, which usually is the difference between what the network predicted and what we knew to be true, and in this case, as we said, is the difference between what the network predicted and the move we actually made. And finally we need the rewards, because the reward is what allows us to compute the loss in the end; you remember the formula, the reward goes in there.
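The slides' code is not reproduced in these captions, but a minimal sketch of the graph just described, written against the TensorFlow 1.x API of the time, could look like this. The 80x80 frame size, the ReLU activation, the learning rate and the exact loss reduction are my assumptions; the 200 hidden neurons are mentioned later in the talk:

```python
import tensorflow as tf   # TensorFlow 1.x style, as used in the talk

OBSERVATION_SIZE = 80 * 80    # assumed size of one flattened difference frame

# Placeholders: they receive actual values later, when the graph is run.
observations = tf.placeholder(tf.float32, [None, OBSERVATION_SIZE])
actions = tf.placeholder(tf.int32, [None])     # moves actually sampled (0=up, 1=still, 2=down)
rewards = tf.placeholder(tf.float32, [None])   # discounted, normalized rewards

# Two dense layers: one hidden layer, then three output neurons (up/still/down).
hidden = tf.layers.dense(observations, 200, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 3)

# Sampling node: tf.multinomial applies the softmax internally,
# which is why no explicit softmax appears here.
sample_op = tf.multinomial(logits, num_samples=1)

# Cross-entropy between what the network predicted and the move actually
# played, weighted by the discounted reward that move eventually earned.
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)
loss = tf.reduce_sum(rewards * cross_entropy)

# Asking the optimizer to minimize the loss gives us the training operation.
train_op = tf.train.RMSPropOptimizer(learning_rate=1e-3).minimize(loss)
```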
So now, how do we run it? Up to now we have seen only high-level TensorFlow using the Estimator API, where you call train and evaluate, and I want to show you low-level TensorFlow, where you implement your training loop yourself. In TensorFlow, as I said, everything is a graph; when you want to actually obtain a value by running this graph and computing something, you have to use a TensorFlow session, which is what I define here. Then you see I have two session.run operations, one here, one there. The first one runs the sampling op: that is what will actually run my neural network, actually sample from the resulting probabilities, and actually give me a move to play. The second one, at the end, is my training operation; this we have seen before, the training operation is what actually modifies the weights and biases in the system to train it.

The rest is pretty standard. This here is one game: we initialize the game state, we read the pixels to compute the difference between two frames, then we use our neural network to predict the next move, using the sampling operation to sample from the predicted probabilities. We call our Pong simulator with that action, so as to obtain both the next board state and also the reward, the information about whether the game is finished or not, and so on, and we stack all of this information for later. Once a game is finished (it finishes at 21 points), we compute all the discounted rewards for all the points scored in the game, and that is one batch of our training data set. The two last lines run one training step on this data set. You remember that in our model we have defined placeholders; I said that if you want to actually run something, you use session.run, but it will not run unless you provide actual values for the placeholders, and in TensorFlow that is done using this feed dictionary, feed_dict, where I fill in my placeholders: observations is my placeholder and here is its value, actions is my placeholder and here is its value. So that's about it.
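Continuing the sketch above (it reuses the graph tensors and the discount_and_normalize helper from the earlier blocks), a training loop in this style might look as follows. The OpenAI-Gym-style Pong simulator, the preprocessing and the action mapping are my assumptions rather than the talk's exact code, and for brevity it trains after every game instead of accumulating 20,000 moves:

```python
import gym
import numpy as np

env = gym.make("Pong-v0")           # assumed Pong simulator (classic Gym API)

def preprocess(frame):
    """Crude stand-in preprocessing: crop, downsample, erase background, flatten."""
    frame = frame[35:195:2, ::2, 0]                    # keep the playing field, every 2nd pixel
    frame = np.where((frame == 144) | (frame == 109), 0, frame)   # erase background colours
    return (frame != 0).astype(np.float32).ravel()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    while True:
        obs_batch, action_batch, reward_batch = [], [], []
        previous = np.zeros(OBSERVATION_SIZE, dtype=np.float32)
        frame = env.reset()
        done = False
        while not done:                       # one game, played to 21 points
            current = preprocess(frame)
            difference = current - previous   # difference of two consecutive frames
            previous = current
            # First session.run: sample a move from the network's probabilities.
            action = sess.run(sample_op, feed_dict={observations: [difference]})[0][0]
            frame, reward, done, _ = env.step((2, 0, 3)[action])  # map up/still/down to controls (assumed)
            obs_batch.append(difference)
            action_batch.append(action)
            reward_batch.append(reward)
        # Second session.run: one training step, feeding actual values
        # into the placeholders through feed_dict.
        sess.run(train_op, feed_dict={
            observations: np.stack(obs_batch),
            actions: np.array(action_batch),
            rewards: discount_and_normalize(np.array(reward_batch)),
        })
```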
Do you want to see it play? Let's see it playing. Here we go. In brown we have the computer, a highly programmed artificial intelligence which uses the highly advanced algorithm of always keeping the paddle at the same height as the ball. It cannot be beaten; its only weakness is that its paddle speed is limited, so if you manage to hit the ball at a very steep angle, the ball might go faster than the paddle, and that is the only way you can score a point against it. On the green side we have our neural network. Remember, initially it is not trained: it doesn't know how to play, it doesn't even know what game it's playing, it doesn't even know what it's doing here. But we give it a knock on the head every time it loses a point and a piece of candy every time it wins one, and using this reinforcement learning mechanism, well, it's not doing too badly; it's only two points behind, and look, those seven points, it scored them the hard way. The only way you can win against this AI is to hit the ball on the very edge of the paddle to send it flying at a very steep angle, like here. So using this completely brute-force algorithm, and I really don't find it clever at all, it's totally brute force, we actually managed, look, they are even now, we actually managed to teach the neural network to play this game and beat an opponent that is really hard to beat if you play it manually.

Some things you can notice: I don't know if you see it, but from time to time it jitters a lot when the ball is far away, and when the ball is closer it sometimes just stays put and waits for the ball to come. That means it has learned to predict the trajectory of the ball; it has enough neurons in this mini neural network to predict that the ball will bounce on the wall and come towards the paddle, so it can just stay where it is. And it got to 21 points. So this is reinforcement learning.

You can actually visualize what is going on by visualizing the weights. Remember, in this neural network one neuron in the hidden layer has a weight for every pixel in the frame, and if you put those weights on a picture, you see how much attention it pays to each pixel in the image. If you look at some of those (of course we have 200 of them, so I just picked eight examples), you see that some of them don't do anything useful: this one is random noise, so this neuron probably doesn't do anything useful. And we are using a very, very primitive neural network here, really just two dense layers, so we are not expecting this to learn well; this experiment was just to see if it would learn at all using this rather primitive, but I think also brilliant, reinforcement learning algorithm. But in other cases you see the trace of the ball, so this neuron is paying a lot of attention to a ball that comes hurtling this way, or here you see a trace, and another one here. You also see that for some reason it is very sensitive to the movements of the opponent's paddle, which it shouldn't be, because you don't get much useful information out of that. But well, I guess it has trained to win; we are not saying that it has trained to the optimum.
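The talk shows these weight pictures on a slide; a minimal way to produce something similar (my own sketch, reusing the session, the 80x80 assumption and the default tf.layers.dense variable names from the code above) could be:

```python
import matplotlib.pyplot as plt

# Fetch the first dense layer's kernel: one column of weights per hidden
# neuron, one row per input pixel (assumed shape [80*80, 200]).
kernel = [v for v in tf.trainable_variables() if v.name == "dense/kernel:0"][0]
weights = sess.run(kernel)

# Reshape one neuron's weights back into an 80x80 picture: bright and dark
# areas show which pixels that neuron pays attention to.
plt.imshow(weights[:, 0].reshape(80, 80), cmap="gray")
plt.title("weights of hidden neuron 0")
plt.show()
```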
So why is this significant? It's significant because it is the first mechanism we have found for learning automatically, for having a system that learns just by playing. In a sense it's not completely unsupervised learning, because you are providing the information about whether you won or lost, but it is fairly unsupervised, essentially barely supervised. It has lots of applications; there are lots of ways you can use this to learn useful behaviors, and I have a couple of examples here that may inspire you.

One is this pancake-flipping robot. Initially they primed it by hand, because here the network is sending currents to all the actuators of the robot, and sending random currents might be dangerous for a robot; that's why they prime it by doing it by hand. But then they let the robot play. Of course the pancake itself is instrumented, so the robot has the information about whether the pancake fell back in the pan when it flipped, or fell on the ground. Well, it's learning; it takes some time, and you will see that after 11 trials it's still not very good. It is really an end-to-end learning system: it has information from the pancake at one end and this neural network producing currents for the actuators at the other end. And now it's flipping pancakes; look, it's actually quite good at it.

In the same type of learning, we have this example from DeepMind. DeepMind built a model where they took various skeleton structures and taught them to walk. Here the reward is simply: if it falls down, a ding on the head; if it keeps going, if you make progress in the X direction, you get a reward. And the model is predicting, within the physical constraints of the skeletal model that they put in, how much power to inject into each muscle. Look what they got. Yes, it does funky arm movements, but it has learned how to jump over obstacles; it has even learned how to crouch under very low obstacles; look how it handles this. They also did a quadruped model, which was actually very good at jumping. This is amazing: it found out how you can use your hind legs to push on the side of the platform and jump across the gap. This one is running up steps, and the jumps are pretty convincing. This is what it tries to do: an obstacle race over rough terrain as well. And this, I think, is amazing; have you seen this jump? None of this was programmed. It had no information whatsoever apart from the physics of the skeleton and the fact that it will receive a ding on the head if it falls, and from just that information, using reinforcement learning, you get a complex behavior like this fantastic jump: it uses its arm to gain momentum, then stretches out, then cushions the fall at the other end of the jump. And this one I find really impressive: here it has even learned how to jump across an obstacle. Now they are being goofy, they are pushing it around. And this is the training: it does lots of experiments, and when it falls down, that's a negative reward; here as well it learns from its mistakes, and now it's up. And this one is very good at jumping. Yes, this is really nice: you can run like this, you can roll like this as well, and it found out that you can run like this too, and it works, again, completely automatically.

Of course you have all heard about AlphaGo as well. Here I'm showing you move 78 in game four, and normally when I say move 78, game four, everyone goes "wow", because that was the good move, the "God move", that Lee Sedol invented, the move that allowed him to be the only human on Earth to have ever beaten AlphaGo. Other than that, AlphaGo was trained, well, it's a little more complicated than just reinforcement learning, but there is a lot of reinforcement learning in the process. In a nutshell, in a game like Go the algorithm for winning is actually pretty simple: you play all the moves until the end, and then you continue down the path where you win most often. The only problem is that this is computationally too expensive. So they built a neural network that can look at the board and say, well, this position is good for white, or this position is good for black, and that neural network was built using reinforcement learning. That network allows them to search the move trees to a much shallower depth, because instead of playing, I don't know, 200 moves to the very end to see who wins, they can play only three or four moves and apply this neural network, which will tell them this is very bad for black or this is super good for white, and they can stop there.
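To make that last idea concrete, here is a toy illustration, not AlphaGo's actual algorithm or code: a depth-limited search where leaf positions are scored by a learned evaluation instead of being played out to the end. The value_net and the game rules below are invented stand-ins purely so the sketch runs:

```python
def value_net(position):
    # Stand-in for the learned evaluation network: a score in [-1, 1]
    # from the point of view of the player about to move.
    return (sum(position) % 7 - 3) / 3.0

def legal_moves(position):
    # Stand-in rules: a "position" is just the list of moves played so far.
    return [position + [m] for m in (1, 2, 3)]

def search(position, depth):
    """Depth-limited negamax: explore only a few plies,
    then ask the value network instead of playing to the end."""
    if depth == 0:
        return value_net(position)
    # The score of a position is the best reply for the side to move,
    # so the opponent's evaluation is negated.
    return max(-search(child, depth - 1) for child in legal_moves(position))

# Pick the first move whose resulting position is worst for the opponent.
best_child = max(legal_moves([]), key=lambda child: -search(child, 3))
print("best first move:", best_child[0])
```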
And finally, I want to share this with you. We talked for almost three hours about neural network architectures. What is a neural network architecture? It's a layer like this with these parameters, another layer like that with those parameters, layer, layer, layer: it's a sequence of characters. Haven't we seen neural networks that produce sequences of characters? Yes, recurrent neural networks produce sequences of characters. So what if we have a problem, and we build a neural network that produces a sequence of characters which happens to be a neural network architecture that solves our problem? Initially it will probably not solve it very well, but we can train that generated network on our problem, let's say detecting airplanes, and at the end it will have some accuracy. We can take this accuracy, make it into a reward, and use our policy-gradient reinforcement learning technique to influence the weights and biases of the generating network so that it produces a better architecture. This is reinforcement learning used to generate neural network architectures that solve a given problem in an optimal way. It has actually been built at Google; it's called AutoML. It is still a tool: you still need a human to program the search space, to say that it will be searching over certain types of architectures, so you still need someone to specify that, but in the end you have a system that can devise neural network architectures.

Here I'm giving you two examples. This is the graph of a standard LSTM cell, built by humans; it is a recurrent neural network cell like the ones we have seen previously (not the variant that is used most often, but another one). And when the neural network was given the problem of producing a recurrent cell that would solve some task very efficiently, it found this architecture, which has common elements with the human one, you see there are additions here and so on, but otherwise is very different from something a human would have invented, and it does solve the problem in a more optimal way. So I think it is quite intriguing to see that, by combining the recurrent neural networks that we have seen and the reinforcement learning techniques that we have seen as well, you can build a circular system that will generate neural architectures for you. This is very advanced; it's not something that I recommend for your next project, but I kind of like where this is going.

And with that, I thank you very much. We have a couple of minutes open for questions, and after that I will still be here to answer your questions. Thank you. Any questions? Yes?

OK, so the question was: I said that watching the opponent's paddle is probably not so important, and you think that it might give it good hints. Yeah, you're probably right. Maybe as a human that's something you don't see immediately, but for a computer it might give you good hints indeed. Other questions? Yes?

So, for the translation example, the question was how do you train it. Yes, you need pairs of sentences. There are actually, today, quite big academic corpora that contain pairs of sentences that have been handcrafted specifically so that researchers could conduct research on translation algorithms, so quite sizable datasets exist for a lot of languages. And yes, you need a sentence and its translation; the algorithm does the alignment inside the sentence, so the attention mechanism finds out automatically which word corresponds to which word
and handles difficult cases like one word being translated into two words in the other language, or inversions, and so on. But at least for the research here, you translate sentence to sentence. Yes?

Yes, that's a very good question: for binary classification, in the plane example, why would I use two neurons? Yes, you can use only one neuron, and actually in the final version of the plane detector, the one with boxes, you have seen that I produce a single confidence number C which tells me plane or no plane. And that's a pain, frankly, because, well, I won't explain it mathematically, although mathematically it can be explained, but practically it's a pain: when you have one number, you need a threshold, you need to say what a plane is, above this threshold or under this threshold, and that's another parameter to tune, and you never know where to put it. When you train two outputs with softmax, one of them is always higher than the other, and that's your threshold; you don't need to set one, and mathematically it solves the classification problem in a purer way. So that's why, for a classifier, even a two-way classifier, if you have two classes, that's two neurons, softmax and cross-entropy. That was a very good question.

Maybe a last question? No, you're hungry, you want to go to lunch, I see. I will see you at lunch. Thank you. [Applause] [Music]
Info
Channel: Devoxx
Views: 15,003
Rating: 4.9650655 out of 5
Keywords: DV17, Devoxx
Id: aRKOJHRbXeo
Length: 36min 45sec (2205 seconds)
Published: Mon Nov 13 2017