TensorFlow and deep reinforcement learning, without a PhD (Google I/O '18)

Captions
Hello, and welcome to this TensorFlow session about deep reinforcement learning. This is Yu-Han, I'm Martin, and today we would like to build a neural network with you. Do you know about neural networks? A quick question: if I say "softmax" or "cross-entropy", raise your hand if you know what that is. About half the audience — okay, here is a very quick primer.

This is a neural network: layers of neurons. Those neurons always do the same thing: a weighted sum of all of their inputs. You stack them in layers. The neurons in the first layer compute a weighted sum of, let's say, pixels if we are analyzing images; the neurons in the second layer compute weighted sums of the outputs of the first layer, and so on. If you are building a classifier — say one that classifies little square images into "airplane" and "not airplane" — you end up with a last layer that has as many neurons as you have classes. If those weights are configured correctly (we'll get to that), one of those neurons will have a strong output on certain images and tell you "this is an airplane", and on other images the other neuron will have the stronger output and tell you "no, this is not an airplane".

There is one more thing: activation functions. For those who want a bit more detail about what a neuron does, the full transfer function is a weighted sum, plus something called a bias (just an additional degree of freedom), fed through an activation function. In neural networks that is always a non-linear function. To simplify, only two activation functions matter for us here. For all the intermediate layers — the white layers — we use ReLU, really the simplest non-linear function you can imagine; you have it on the graphic. It's non-linear, we love it, everyone uses it, let's not go any further. On the last layer, though, if you are building a classifier, you typically use the softmax activation function, which is an exponential followed by a normalization. Say I have a classifier with ten classes: I have weighted sums coming out of my ten final neurons, and in softmax I raise each of them to the exponential, then compute the norm of that vector of ten elements and divide everything by it. The effect — since the exponential is a very steeply increasing function — is that it pulls the winner apart. I made a little animation to show you: this is after softmax, this is before softmax, and after softmax you see much more clearly which of those neurons is indicating the winning class. That's why it's called softmax: it pulls the winner apart like a max, but it doesn't completely destroy the rest of the information, so it is a "soft" version of max. Just those two activation functions are enough to build what we want to build. Coming out of our neural network we now have, say, ten final neurons producing values that have been normalized between zero and one, so we can call those values probabilities: the probability of this image being in each class.
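(For reference, here is a minimal sketch of the two activation functions just described, in plain NumPy; the ten example logit values are made up for illustration.)

```python
import numpy as np

def relu(x):
    # ReLU: keep positive values, zero out negative ones
    return np.maximum(0.0, x)

def softmax(logits):
    # Exponentiate, then normalize so the outputs sum to 1
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / np.sum(exps)

logits = np.array([1.2, -0.3, 0.5, 2.8, 0.0, -1.1, 0.7, 1.9, -0.4, 0.2])
probs = softmax(logits)
print(probs, probs.sum())  # ten values between 0 and 1 that sum to 1: "probabilities"
```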
How are we going to determine the weights in those weighted sums? Initially they are all random, so initially our network doesn't do anything useful. We fix that through supervised learning: you provide images that you have labeled beforehand, so you know what they are, and for all those images the network outputs its set of probabilities. Because you are doing supervised learning, you know the correct answer, and you encode it in a format that looks like what the network is producing. The simplest encoding you can think of is called one-hot encoding: a vector of zeros with a single one at the index of the class you want. Here, to represent the class "6", I have a vector of zeros with a 1 in the sixth position. Now those two vectors look very similar, so I can compute a distance between them. The people who have studied classifiers tell us: don't use just any distance, use the cross-entropy distance. Why? They are smarter than me, I just follow. The cross-entropy distance is computed like this: you multiply, element by element, the known-answer vector on top by the logarithm of the probabilities you got from your neural network, and you sum that up across the vector. This is the distance between what the network has predicted and the correct answer, and that is exactly what you need to train a neural network: it is called an error function, or a loss function. From there on, TensorFlow can take over and do the training for you; you just need an error function.

In a nutshell, these are the ingredients I want you to be aware of: you have neurons doing weighted sums; you have only two activation functions, ReLU on the intermediate layers and, if you are building a classifier, softmax on the last layer; and the error function we are going to use is the cross-entropy error function from the previous slide.

This bit of code is how you would write it in TensorFlow. There is a high-level API, tf.layers, where you can instantiate an entire layer at once. The first layer I instantiate here has 200 neurons and is activated by the ReLU activation function; this call also instantiates the weights and biases of the layer in the background, you just don't see that. I have a second layer of just 20 neurons, again ReLU-activated, and then the little final layer: a dense layer with two neurons. Even if you don't see it on the screen, it is activated by softmax; you don't see it because I use its output in the cross-entropy loss function, which has softmax built in. So it is a softmax layer, it's just that the word "softmax" does not appear in the code. Finally, this is my error function: the cross-entropy distance between the correct answer, one-hot encoded, and the output of my neural network. Once I have that, I can hand it to TensorFlow: take an optimizer, ask it to minimize this loss, and the magic happens. What is this magic? TensorFlow takes the error function and differentiates it with respect to all the weights and all the biases — all the trainable variables in the system. That gives it something mathematically called a gradient, and by following this gradient it can figure out how to adjust the weights and biases in a way that makes the error smaller, that makes the difference between what the network predicts and what we know to be true smaller. That is supervised learning. So that was the primer.
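(A minimal TensorFlow 1.x sketch of the classifier just described; the layer sizes follow the talk, while the input size, placeholder names and optimizer choice are illustrative assumptions rather than the exact code shown on the slide.)

```python
import tensorflow as tf  # TensorFlow 1.x style, as used in the talk

# Assumed input: flattened small grayscale images, and one-hot labels for 2 classes
images = tf.placeholder(tf.float32, [None, 400])  # e.g. 20x20 pixels, flattened
labels = tf.placeholder(tf.float32, [None, 2])    # one-hot: airplane / not airplane

# Two ReLU-activated intermediate layers, then a 2-neuron output layer
h1 = tf.layers.dense(images, 200, activation=tf.nn.relu)
h2 = tf.layers.dense(h1, 20, activation=tf.nn.relu)
logits = tf.layers.dense(h2, 2)  # no explicit activation: softmax is folded into the loss

# Cross-entropy loss with softmax built in
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)

# Pick an optimizer and ask it to minimize the loss
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
```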
Now, what do we want to build today? We would like to build, together with you, a neural network that plays the game of Pong — but just from the pixels of the game. It is notoriously difficult to explain to a computer the rules of a game, the strategies and all that, so we want to do away with all of it, just give it the pixels, and find some learning algorithm through which it will learn to play the game. Of course that is not a goal in itself: figuring out how to program a paddle to win at Pong is super easy — you just always stay in front of the ball and you win all the time — and that is in fact exactly the computer-controlled agent we will train against. The goal here is to explore learning algorithms, because this has applications, hopefully, way beyond Pong.

This looks like a classifier, and we just learned how to build a classifier. We have the pixels, we build a neural network, and there are three possible outcomes: from this position you want to go up, stay still, or go down. Let's try to do this by the book, as a classifier: a single intermediate layer of neurons, activated with ReLU of course, and then a last layer of three neurons activated by softmax. We use the cross-entropy loss — you have the function here — which is the distance between the probabilities this network (it is called the policy network) predicts for going up, staying still, or going down, and the correct move. But what is the correct move? I don't know. Do you know, Yu-Han?

You're exactly right, Martin: unlike in the supervised learning problem, here we actually don't know what the correct move to play is. However, the environment requires that we keep making moves — going up, staying still, or going down — for the game to progress. So what we are going to do is sample the move, which means picking one of the three possible moves (moving the paddle up, staying still, or going down) randomly. Not completely randomly, though: we pick the move based on the output of the network. For example, if the network says the output probability of going up is 0.8, then we make it so that there is an 80% chance we choose to move up as our next move.
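(A minimal sketch of that weighted random pick, assuming the network's output probabilities for one position are already available as a NumPy array; the numbers are illustrative.)

```python
import numpy as np

# Hypothetical output of the policy network for one game position:
# probabilities for [move up, stay still, move down]
probs = np.array([0.8, 0.15, 0.05])

# Sample the next move according to those probabilities:
# over many positions, "up" would be chosen roughly 80% of the time here.
move = np.random.choice([0, 1, 2], p=probs)
print(move)
```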
All right, so now we know how to play the game: the policy network gives us probabilities, we roll a loaded dice based on them, and we know which move to play next. But initially this network has random weights, so it will be playing random moves. How does that inform us about the correct move to play? I need to put the correct move in my formula.

That's right. In this case we really want "correct move" to mean moves that lead to winning, and we don't know that until someone has scored a point. So that's what we are going to do: only when someone — either our paddle or the opponent's paddle — has scored a point do we know whether we have played well or not. Whenever somebody scores, we give ourselves a reward: if our paddle scores the point, we give ourselves a +1 reward, and if the opponent's paddle scores, we give ourselves a −1 reward. Then we structure our loss function slightly differently than before. Here you see a loss function very much like the cross-entropy function we saw earlier. The main difference is in the middle: instead of the correct label from the supervised learning problem, we simply put the sampled move in there — the move we happened to play. Some of those moves will not actually be helpful moves, and that is why each per-move loss value is multiplied by the reward out in front. This way, moves that eventually lead to a winning point get encouraged, and moves that lead to a losing point get discouraged over time.

Okay, so you do this for every move. I can see how, with this little modification, it could lead to some learning. But putting my mathematician's hat back on, I see a big problem: this sampled move comes from a picking operation, a sampling operation — you pick one out of three — and that is not differentiable. To apply minimization, gradient descent and all that, the loss function must be differentiable.

You're right, Martin, I see the hard question there. The sampled move does depend on the model's weights and biases, but the sampling operation is not differentiable, so we are not going to differentiate through it. Instead, we look at the sampled moves as if they were constants. We play many games, across many, many moves, to gather a lot of data, treating all of those sampled moves as constant labels, and we differentiate only the probabilities output by the model — in blue here on the screen. Those probabilities directly depend on the model's weights and biases, so we can differentiate them with respect to the weights and biases. This way we still get a valid gradient, and we can apply gradient-descent techniques.

Oh, okay, so you kind of cheat: the part that is problematic, you simply regard as constant. You play many games with the same neural network, accumulate the played moves, accumulate the rewards whenever you know a point has been scored, then plug all of that in and differentiate only with respect to the predicted probabilities. And yes, you're right, that still gives a gradient that depends on the weights and biases, so we should be able to do this. I get it — this is clever.

This will actually train, although probably very slowly. We want to show you the minimum amount of work needed to get it to train, but even in that minimum there are two little improvements you always want to do. The first one is to discount the rewards. If you lost a point, you probably did something wrong in the three, five, seven moves right before you lost it; but before that you probably bounced the ball back correctly a couple of times, and that was correct — you don't want to discourage it. So it is customary to discount the rewards backwards through time with some exponential discount factor, so that the moves played closest to the scoring point are the ones that count the most. Here, for instance, we discounted them with a factor of 1/2. And then there is this normalization step — what is that?

Yes, that's very interesting. In experiments we noticed that putting in the normalization really helps make learning faster. There are multiple ways to think about it; the way I like to think about it is in the context of playing a game. At the beginning, the model only has randomized weights and biases, so it makes random moves. Most of the time it will not make the right move; only once in a while will it score a point, and that is really by chance, by accident — most of the time it loses points. So we need a way to put more weight on those very rare winning moves, so the model can learn what the correct moves are, and this reward-normalization step naturally gives a very nice boost to the rare winning moves, so they do not get lost in a ton of losing moves.
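(A minimal sketch of those two improvements — discounting and then normalizing the rewards; the helper name, the default discount factor and the example reward sequence are illustrative assumptions.)

```python
import numpy as np

def discount_and_normalize(rewards, gamma=0.99):
    """Propagate each scoring reward backwards with an exponential discount,
    then normalize so rare winning moves are not drowned out by losing ones."""
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:          # a point was scored here: restart the running sum
            running = 0.0
        running = running * gamma + rewards[t]
        discounted[t] = running
    # Normalization step: zero mean, unit variance across the batch of moves
    discounted -= discounted.mean()
    discounted /= (discounted.std() + 1e-8)
    return discounted

# Example: five moves with no reward, then our paddle scores (+1), discounted by 1/2
print(discount_and_normalize(np.array([0, 0, 0, 0, 0, 1.0]), gamma=0.5))
```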
Okay, so let's train this. But first we need to play the game and accumulate enough data to compute everything this loss function needs: the sampled moves, the rewards, and the probabilities — which the network can compute for us from the pixels, so we collect the pixels during gameplay. This is what the collected dataset looks like: one column with the move you actually played; one column where you would like to see the probabilities predicted at that point, but where you actually store just the game board and run the network later to get the probabilities; and a last column with the rewards. There is a +1 or −1 reward on the move that scored a point, and on all the other moves you discount that reward backwards in time with some exponential discount. Once you have this, notice that in the formula you just multiply those three columns together and sum everything up — that is how the loss is computed. Let's build it. You implemented this demo, so can you walk us through the code?

Let's do that. Up here you see a few TensorFlow placeholders; you should really think of them as the function arguments required to compute output values from our model. We have three placeholders. One is for the observations — remember, this really means the difference between two consecutive frames of gameplay.

Yes — we didn't say that yet. In the game of Pong you don't really train from the raw pixels, because you can't see the direction of the ball from a single picture; you train from the delta between two frames, because there you can see the direction of the ball. That is the only wrinkle here that is specific to Pong; all the rest is vanilla reinforcement learning that applies to many other problems.

That's right. The actions placeholder will hold all the sampled moves — the moves the model happened to decide to play — and the rewards placeholder will collect all the discounted rewards. With those as inputs, we are ready to build the model. The network is really simple, like the one Martin showed before: a single dense hidden layer of 200 neurons with ReLU activation, followed by a softmax layer. You don't see me calling the softmax function here, and that is because the next step, the sampling operation, takes in logits directly — the values you get before applying the softmax function — and performs multinomial sampling, which just means it outputs a random number, 0, 1 or 2 for the three classes we need, based on the probabilities specified by those logits. The sampling operation has the softmax built in, so what we are building really is a softmax layer.

Okay, we don't see it in the code, but it is one. And a parenthesis for those not familiar with TensorFlow: TensorFlow builds a graph of operations in memory, which is why what we get out of tf.multinomial is an operation; there is an additional step of running it to actually get the predictions out, and placeholders are the data you need to feed in when you run a node to get a numerical result. So now we have everything we need to play the game — but still nothing to train.
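(A minimal TensorFlow 1.x sketch of the placeholders, the policy network and the sampling operation just described; the observation size and variable names are assumptions, not the exact code from the talk.)

```python
import tensorflow as tf  # TensorFlow 1.x style APIs

OBSERVATION_DIM = 80 * 80  # assumed size of the flattened frame difference

# Placeholders: the data we feed in when running graph nodes
observations = tf.placeholder(tf.float32, [None, OBSERVATION_DIM])  # frame deltas
actions = tf.placeholder(tf.int32, [None])      # the moves we happened to sample
rewards = tf.placeholder(tf.float32, [None])    # discounted, normalized rewards

# Policy network: one hidden dense layer of 200 ReLU neurons, then 3 logits
hidden = tf.layers.dense(observations, 200, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 3)  # up / still / down; softmax stays implicit

# Sampling operation: multinomial sampling takes logits directly
# (softmax is effectively built in) and returns 0, 1 or 2 per observation.
sample_op = tf.multinomial(logits, num_samples=1)
```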
Let's do the training part. For training we need a loss function: our beloved cross-entropy loss function, computing the distance between our actions — the moves we actually played — and the logits from the previous screen, which are what the network predicts from the pixels. Then we modify it using the reinforcement learning paradigm: we multiply this per-move loss by the per-move rewards. With those rewards in place, moves leading to a scoring point will be encouraged and moves leading to a losing point will be discouraged. Now that we have our error function, TensorFlow can take over: we pick one of the optimizers in the library and simply ask it to minimize our loss, which gives us a training operation. On the next slide we will run this training operation, feeding in all the data we collected during gameplay; that is where the gradient gets computed, and that is the operation which, when run, modifies the weights and biases of our policy network.

So let's play this game. This is what you need to do to play one game to 21 points. The technical wrinkle is that in TensorFlow, if you want to actually execute one of those operations, you need a session; so we define a session, and then in a loop we play a game. First we get the pixels from the game state and compute the delta between two frames — that's technical — and then we call session.run on the sample operation. Remember, the sample operation is what we got from tf.multinomial; it decides the next move to play. Then we use a Pong simulator, here from OpenAI Gym: we give it the move to play, it plays the move and gives us back a new game state, a reward if a point was scored, and information on whether this game to 21 points is finished. That is all we need. Then we simply log everything: we log the pixels, we log the move we played, and we log the reward if we got one. So we know how to play one game, and we will go on and play many of those games to collect a large backlog of moves.

That's right: playing games, in reinforcement learning — or at least in our experiment — is really a way of collecting data. Once we have collected a lot of data, playing not just one game but, say, ten games, why not, we can start processing the rewards. As planned, we discount the rewards, so that the moves which did not get any reward during gameplay now get a discounted reward based on whether they eventually led to winning or losing a point, and on how far they are from the move that actually won or lost it. Then we normalize those rewards, as explained before. Now we have all the data in place: the observations, that is, the differences between game frames; the actions we happened to play; and the rewards that tell us whether those actions were good or bad. We are ready to call the training op. The training op is what we saw a couple of slides ago, where we had the optimizer already initialized; it does the heavy lifting, computes the gradients for us and modifies the weights slightly. Say we played ten games: we modify the weights slightly, then go back and play more games to get even more data, and we repeat the process. The expectation — or the hope — is that the model will play a little better every time.

I'm a bit skeptical. Do you really think this is going to work?

Well, sometimes we see the model actually forget and play a little bit worse, so we'll see.

Okay, a live demo then — let's go, let's run this game.
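(Continuing the sketch above, here is one possible shape of the reward-weighted loss, the training op and the play-then-train loop, reusing the placeholders, logits, sample_op and the discount_and_normalize helper from the earlier sketches. The preprocess helper, the mapping onto Gym's Pong actions, the optimizer and the loop sizes are all illustrative assumptions — the code released for the talk may differ — and the Gym Atari environments are assumed to be installed.)

```python
import gym
import numpy as np
import tensorflow as tf

# Reward-weighted cross-entropy loss and training operation
cross_entropies = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)               # per-move distance: sampled move vs. prediction
loss = tf.reduce_sum(rewards * cross_entropies)  # encourage winning moves, discourage losing ones
train_op = tf.train.RMSPropOptimizer(learning_rate=1e-3).minimize(loss)

ACTION_MAP = {0: 2, 1: 0, 2: 3}  # assumed mapping of up / still / down onto Gym's Pong actions

def preprocess(frame, prev):
    """Crude frame-difference preprocessing (an assumption, not the talk's exact pipeline)."""
    crop = lambda f: f[35:195:2, ::2, 0].astype(np.float32) / 255.0  # 80x80 slice of the Atari frame
    small = crop(frame)
    return (small - (crop(prev) if prev is not None else np.zeros_like(small))).ravel()

env = gym.make("Pong-v0")  # OpenAI Gym Pong simulator
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for iteration in range(1000):                # repeat: play, train, play again
        obs_log, action_log, reward_log = [], [], []
        for _ in range(10):                      # play ten games before each weight update
            frame, prev, done = env.reset(), None, False
            while not done:
                delta = preprocess(frame, prev)
                prev = frame
                move = int(sess.run(sample_op, {observations: [delta]})[0][0])
                frame, reward, done, _ = env.step(ACTION_MAP[move])
                obs_log.append(delta); action_log.append(move); reward_log.append(reward)
        sess.run(train_op, {observations: obs_log,
                            actions: action_log,
                            rewards: discount_and_normalize(np.array(reward_log, dtype=np.float32))})
```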
This is a live demo; I'm not completely sure we are going to win, but we shall see. Brown, on this side, is the computer-controlled paddle: a very simple algorithm that just stays in front of the ball at all times. There is only one way to win against it: its vertical velocity is limited, so you have to hit the ball with the very edge of the paddle to send it off at a steep angle, and then you can overcome the vertical velocity of the opponent. That is the only way to score. On the right, in green, we have our neural-network-controlled agent. It's slightly behind, so we'll see if it wins. If you like, I want this side of the room to cheer for brown and this side of the room to cheer for the AI. It's very even right now — go, go, go — the AI is winning! I'm happy, because this is a live demo and there is no guarantee that the AI will actually win.

One thing that is interesting here is that this is learning from just the pixels. Initially the AI had no idea what game it was playing or what the rules were; it had no idea even which paddle it was controlling. We didn't have to explain any of that: we just give it the pixels, and on scoring points we give it a positive or negative reward, and that's it — from that, it learns. And you see those emerging strategies: as I said, hitting the ball with the edge of the paddle and sending it back at a very steep angle is the only way of winning, and it picked that up. We never explained it; it is an emerging strategy. This is looking good — pretty close, yes, it's a bit close, but it's looking good. On the next point I want a loud cheer when the AI wins, because I hope this is going to work. Yes — 20! Okay, one more, one more... yeah! Good job, Yu-Han, this is fantastic.

All right, so what was going on during gameplay? Remember how this network was built: the neurons in the very first layer have a connection to every pixel of the board — they do a weighted sum of all the pixels of the board — so they have a weight for every pixel, and it is fairly easy to represent those weights on the board and see what those neurons are "seeing". Let's try to do that. We picked eight of those 200 neurons, and here we visualize, superimposed on the board, the weights that have been trained. To the untrained eye there isn't much to see here; maybe you can enlighten us.

We are looking at what the model sees — or rather what the model cares about — when it sees the game pixels we give it. You can see that one of the neurons has apparently learned pretty much nothing: it still looks like the white noise it was initialized with, and it doesn't contribute much to the gameplay. But in the other neurons you see some interesting patterns. There are a few things the model seems to put a lot of weight on: it cares a lot about where the opponent's paddle is and where it is moving; it also cares about the ball's trajectory across the game board; and, interestingly, it also cares a whole lot about where its own paddle is, on the right. Because, as Martin pointed out, at the beginning of learning the model couldn't even know which paddle it was playing, so it had to learn that this is an important piece of information. When you think about it, this is really consistent with the information we human beings would consider important for playing the game well.
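(One possible way to produce that kind of visualization, assuming the trained first-layer kernel has been fetched from the session as a NumPy array of shape [number of pixels, 200]; the 80×80 board size and the way the kernel is fetched are assumptions.)

```python
import matplotlib.pyplot as plt
import numpy as np

# Assume `kernel` is the trained first-layer weight matrix fetched from the session,
# for example kernel = sess.run(tf.trainable_variables()[0]) in the sketch above,
# with shape [OBSERVATION_DIM, 200]: one column of 80*80 weights per hidden neuron.
def show_neuron_weights(kernel, neuron_indices, board_shape=(80, 80)):
    fig, axes = plt.subplots(1, len(neuron_indices), figsize=(3 * len(neuron_indices), 3))
    for ax, i in zip(axes, neuron_indices):
        ax.imshow(kernel[:, i].reshape(board_shape), cmap="gray")  # one neuron's weights as an image
        ax.set_title("neuron %d" % i)
        ax.axis("off")
    plt.show()

# e.g. pick eight of the 200 neurons:
# show_neuron_weights(kernel, [0, 25, 50, 75, 100, 125, 150, 175])
```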
So, we wanted to show you this not to demonstrate our prowess at Pong — although it worked — but mostly to explore training algorithms. Mostly, with neural networks, you do supervised training, and in nature, well, sometimes when we teach people it does look like supervised training: in class the teacher says "this is the Eiffel Tower" and the pupils say "okay, that's the Eiffel Tower", and so on. But think about a kitten jumping at a ball of fur, missing it, and jumping again until it catches it: there is no teacher there. It has to figure out a sequence of moves, and it gets a reward from catching the ball or not catching it. So it looks like in nature there are multiple ways of training our own neural networks, and one of them is probably quite close to this reinforcement learning way — the kitten's way. Yu-Han, you built this model; did it inspire other thoughts?

Yes, I have a technical insight, maybe as a takeaway message for you. In our experiments we saw that there are some steps which are not differentiable — in this case, sampling a move from the probabilities output by the network, and playing a game to get a reward back. Those factors really do depend on the model itself, but in a non-differentiable way, so even with powerful tools like TensorFlow you wouldn't be able to just naively build the loss function and run gradient-descent training. But there are ways to get around it, and that is exactly the technique we showed today.

So what you're saying is that reinforcement learning can solve many more problems than just Pong: it is a way of getting around non-differentiable steps that you find in your problem. Great. So where is this going? We want to show you a couple of things from the lab, because so far this has had mostly lab applications, and then one last thing.

What is this? This is very interesting, everyone. What we are witnessing here is a human expert in pancake flipping trying to teach a robotic arm to do the same thing. There is a model behind the robotic arm whose output controls the movement of the joints — which angle and speed to move the motors toward — and the goal is to flip the pancake in the pan. It's not just any regular pancake: it is instrumented with sensors, so it knows whether it has been flipped, and whether it landed on the floor, on the table, or back in the frying pan. It doesn't seem to be working yet — it's trying, it's trying.

And what is the reward here? The reward, presumably, is that if you flip the pancake correctly you get a positive reward, and otherwise a negative reward. Speaking of reward functions: in another experiment — I was going to say — the experimenters set the reward to be a small positive amount for every moment the pancake is not on the floor. The machine did learn, but what it learned was to fling the pancake as high as possible, to maximize the pancake's airborne time — I never thought I would say something like that. That way it maximizes the reward, simply because it takes a long time before the pancake ends up lying on the floor. That's kind of funny, but it is actually a nice illustration of the fact that you can change the learned behavior by changing your loss function.

Exactly.

Cool. So we know how to play Pong and flip pancakes — that's significant progress. DeepMind also published this video. They use reinforcement learning, and they built these skeletal models; here the neural network is predicting the power to send to the simulated muscles and joints of the models, and the reward is basically a positive reward whenever you manage to move forward, and a negative reward when you move backward, fall through a hole, or just crumple to the ground.
The rest is just reinforcement learning, as we have shown you today. All of these behaviors are emerging behaviors — nobody taught these models — and look, you get some wonderful emerging behaviors. It's coming in a couple of seconds: look at this jump. Those are nice jumps, but there is a much nicer one in a second: you will see a proper athletic jump, with the model swinging its arms to get momentum, then lifting one leg and cushioning the landing. Right here — look at this, a fantastic athletic jump, it looks like something from the Olympics, and it is a completely emergent behavior, an optimized way of moving around.

Why do I not see people running like this? I mean, I could, but... Well, you know, there are multiple ways of running; probably the loss function didn't have any term discouraging useless movements. So again, by modifying the loss function you get different behaviors. And one last one — not this one, this one is kind of funny, it's still playing around — but I like the look of this one: it figured out how to run sideways. Fantastic. Yes, there are multiple ways of running, and it did figure out how to move sideways.

This one you have probably seen: this is move 78 in game 4 of AlphaGo versus Lee Sedol, the one move played by Lee Sedol that was called the "God move". He is world-famous for just that: he played one perfect move and managed to win one game against AlphaGo — he lost the match 4 to 1, which is still fantastic. AlphaGo also uses reinforcement learning, although not exactly in the same way: it wasn't built entirely out of reinforcement learning. For turn-based games the algorithm for winning is actually quite easy: you just play all possible moves to the end and pick the ones that lead to positive outcomes. The only problem is that you can't compute that — there are far too many of them. So you use what is called a value function: you unroll only a couple of moves and then use something that looks at the board and tells you "this is good for white" or "this is good for black", and that is what they built using reinforcement learning. I find it interesting because it kind of emulates the way we humans solve this problem. Go is a very visual game and we have a very powerful visual cortex; Go is a game of influence, and we see that in this region black has a strong influence and in that region white has a strong presence — we can process that at a glance. What they built is a value function that does much the same thing and allows them to unroll the moves to a much shallower depth, because after just a couple of moves their value function, built using reinforcement learning, tells them "this is good for white" or "this is good for black".

So those were results from the lab; now let's try to do something real, an actual application. What if we build a recurrent neural network? That is a different architecture of neural network, but for what you need to know here, it still has weights and biases in the middle. Recurrent neural networks are good at producing sequences, so let's say we build one that produces sequences of characters, and we structure it so that those characters actually represent a neural network. A neural network is a sequence of layers, so you can define a syntax for saying "this is my first layer, this is how big it is", and so on — so why not produce a sequence of characters that represents a neural network?
What if we then train the network described by that sequence on some problem we care about — say, recognizing airplanes in pictures? It will train to a given accuracy. What if we now take this accuracy and make it the reward in a reinforcement learning algorithm? The accuracy becomes the reward, and we apply reinforcement learning, which lets us modify the weights and biases of our original, generating neural network so that it produces a better neural network architecture. This is not just tuning parameters: we are changing the shape of the network so that it works better for our problem, the problem we care about. We get a neural network that generates a neural network for our specific problem. It is called neural architecture search, we actually published a paper on it, and I find it a very nice application of a technology initially designed to play Pong.

So, Martin, you're saying we now have neural networks that learn to build other neural networks? Yep.

So, to finish: you built this demo — can you tell us a word about the tools you used?

Definitely. We used TensorFlow for the model itself and for its support for tracking metrics during training. I really didn't want to run the training on my laptop — I probably could have — so I used Cloud ML Engine for the training. The model playing the game you saw before took maybe about one day of training.

And with ML Engine you have this job-based view: you can launch twenty jobs with different parameters and just let them run. It's just practical — it tears down the whole cluster when every job is done, and so on.

That's right, I used ML Engine a lot for that as well. There are many other tools in Cloud for doing machine learning; one we just launched is AutoML Vision. With that one you don't program at all: you just put in your labeled data and it figures out the model for you — and now you know how it works. This also uses lots of CPU and GPU cycles, so Cloud GPUs are useful when you are doing neural architecture search, and they are available to you as well.

Thank you — that's all we wanted to show you today. Please give us feedback. We just released this code on GitHub, so you have the GitHub URL if you want to train a Pong agent yourself; go and dig into it — you can take a picture, the URL is still on the screen. And if you want to learn machine learning, I'm not going to say it's easy, but I'm not going to say it's impossible either: we have a relatively short series of videos, code samples and codelabs called "TensorFlow without a PhD" that is designed to give you the keys to the machine learning kingdom. Go through those videos — this talk is one of them — and they give you all the vocabulary and all the concepts, explained in a language that developers understand, because we are developers. Thank you very much.
Info
Channel: TensorFlow
Views: 74,665
Keywords: type: Conference Talk (Full production); pr_pr: Google I/O; purpose: Educate
Id: t1A3NTttvBA
Length: 40min 47sec (2447 seconds)
Published: Thu May 10 2018