Neural Network Learns to Play Snake using Deep Reinforcement Learning

Video Statistics and Information

Captions
I was watching this video by Code Bullet where he was trying to train a neural net to play Snake by just looking at the pixels, like a human would. Now, this stuff is really exciting to me, as you can tell by my last few videos, but Code Bullet kind of quickly abandoned the idea of training the AI on the full image, and then later abandoned the idea of using deep Q-learning entirely in favor of pathfinding-based techniques. Don't get me wrong, I love Code Bullet's videos and those videos were amazing as usual, but I was still disappointed, and decided that if deep reinforcement learning can play Dota and Go, then it should be able to play Snake. So I was like, fine, I'll do it myself.

This video is a distillation of me trying, repeatedly failing, and then somewhat succeeding at getting a neural net to play Snake perfectly on at least 6x6 and 10x10 games, and training some networks that could in theory finish a 20x20 game as well, if I had enough time and money. I'm also going to discuss why normal deep Q-learning is a bad choice for this problem, what some of the alternatives are, and where we can go next to beat larger games of Snake in this fashion without needing insane computing power. If you're only interested in watching the AI train with some nice copyright-free music in the background, skip to the timestamp shown now. So let's go.

Things started pretty easy: I coded a game of Snake in a livestream (link in the description) and I was like, hey, half of the work done, right? Little did I know.

Now, for those unfamiliar with Q-learning, I do have a few videos on the subject, ranging all the way from "explain it like I'm one" to "let's code this in more detail than anyone ever asked." The broad summary of Q-learning, and really all reinforcement learning algorithms, is: you have some environment, in our case the game of Snake; some ability to change that environment, which for the snake would be something like going up, down, left, or right; and some consequences of causing that change, like the snake eating or dying. We can attach a reward or penalty to these actions, so that if the snake eats we're all like "you done good, son" and give him one point, but if the snake dies we're like "get good" and take one point away. In some cases we might also slowly leech points away for every step, because life is pain and also to avoid loops. This reward can then be used to train the neural network using the fancy equation I'm showing on screen. I'm not going to explain this equation here; it's called the Bellman equation, and you can check out my other videos for a deeper explanation.

Now, if you're at all familiar with neural nets, you might be like, "but neural nets need a lot of data," and you would be right (here, take this cookie). To generate this data we could just get a bunch of humans to play Snake for many hours, record all of that data, and feed it to the algorithm, but this video isn't called "Mechanical Turk Teaches an AI Snake," it's called "AI Learns to Play Snake," which is where we run into our first major issue: to learn the game you need to play it, and to play the game you need to learn it. To get around this chicken-and-egg problem we can sort of mimic human behavior. If you're new to a game and know nothing about it, you start experimenting with it by just doing random stuff, pushing buttons and such. As you do these random things and look at how the game changes, you can come to conclusions like "oh, this makes the snake go left" or "oh, if it hits the wall it dies." After some time you will learn the game and will mostly be playing it based on your experience, but as you get into newer situations you might still take some random moves to see what happens, like taking a chance on a new strategy.
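As a rough illustration (not the actual code from the video), the reward scheme and the Bellman update described above can be sketched like this; the reward values, the discount factor, and the toy Q-values are all assumptions made up for the example:

```python
import numpy as np

# Illustrative reward scheme (values assumed): +1 for eating a fruit,
# -1 for dying, and a small penalty every step to discourage loops.
REWARD_FRUIT = 1.0
REWARD_DEATH = -1.0
REWARD_STEP = -0.01
GAMMA = 0.9  # discount factor, an assumed value

def bellman_target(reward, done, next_q_values):
    """Bellman target for Q(s, a): r + gamma * max_a' Q(s', a'),
    or just r if the episode ended (the snake died or won)."""
    if done:
        return reward
    return reward + GAMMA * np.max(next_q_values)

# Toy example: the snake just ate a fruit, and the best action in the
# next state is currently estimated to be worth 0.5.
print(bellman_target(REWARD_FRUIT, done=False,
                     next_q_values=np.array([0.1, 0.5, -0.2, 0.0])))  # 1.45
```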
The simplest version of this kind of exploration in reinforcement learning is something called epsilon-greedy, and it is the first exploration strategy that I tried with the game of Snake. You start by acting completely randomly, and then, as the network trains more and more, you slowly decrease the amount of randomness and mostly follow what the network says, but sometimes pick any of the available actions. It's actually very easy to solve a 4x4 game of Snake using this approach; it's not pictured here because I'm an idiot and didn't record it. But this simple idea does not work well for even a 6x6 game of Snake. Here's the issue: even if the probability of taking a random action is really, really small, like one tenth of a percent, about halfway through a game of 6x6 Snake, which is roughly a thousand steps in, the probability of a completely random action happening at least once (and note that a random action is very likely to kill the snake) is 63 percent. A 63% likelihood that the snake dies halfway through because it did something new. This only gets worse as the game progresses, and it is much, much worse for larger games of Snake.
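Here's a quick sketch (again, an illustration rather than the code used in the video) of epsilon-greedy action selection, together with the back-of-the-envelope calculation behind that 63% figure: with a 0.1% chance of a random move per step, the chance of at least one random move over a thousand steps is 1 - 0.999^1000 ≈ 63%.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_values[a])

def prob_at_least_one_random_move(epsilon, steps):
    """Chance that at least one of `steps` moves is random."""
    return 1.0 - (1.0 - epsilon) ** steps

print(prob_at_least_one_random_move(0.001, 1000))  # ~0.63, even with a tiny epsilon
```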
Now, I don't know what Code Bullet was doing exactly, but if he was using epsilon-greedy, then the result he got actually makes a lot of sense. Here I'm showing an example of this technique not quite working on a 10x10 game of Snake; the best it could do was a score of about 50. Pathetic.

So at this point I decided to completely throw away the Q-learning, look for a new strategy, and start from scratch. I randomly landed on this excellent video by Alex Petrenko, who's a PhD student, where he uses the advantage actor-critic (A2C) algorithm to solve a 6x6 game of Snake perfectly. The more I read about this technique, the more I liked it. I'm not going to get too much into the math or implementation of A2C in this video, though I will link my code in the description and I might make a video on it in the future. The part that's important to the game of Snake is that this algorithm outputs a probability distribution over all possible actions. Basically, if the snake is in this situation, the network will eventually learn that going up or right is good and give those actions high probabilities, and that going left or down is bad because the snake will die, and will output really low probabilities to reflect that. For exploration's sake, we can take the probability distribution and draw a sample from it. This means that most of the time we will take the options that the network currently thinks are better, but sometimes we'll also take what appear to be the wrong options, which creates some semblance of exploration and creates more data for the snake to learn from. Okay, I feel like that was a lot of new stuff; let's look at this cat picture for a few seconds to decompress. Alright, we good? Good.

So we can apply this algorithm to a 6x6 game of Snake and, hey, it works! Whew. I mean, we knew it was going to work since Alex's example worked, but let me celebrate this, okay? Alright, before we tackle a 10x10 game I wanted to give the snake a little bit of personality, so I made some sprites for him. Oh, look at that beautiful smile. Don't judge me, this is the first time I'm making sprites, alright?

And this is where the training montage begins. What I'm showing here are 4 out of 64 snakes that are running simultaneously to help the neural net learn. It's important to have many copies of Snake running at the same time, otherwise the neural net might not generalize. The actions of each of these snakes are decided by sampling the probability distribution that the neural net outputs when it's given an image of the game. Once each of these games takes 16 steps, we stop, train the neural network on all the data they collected, and then start all over again; that's why you see this periodic hitch in these images.

In the beginning the network knows absolutely nothing. It doesn't even know how to see, so it actually spends a fair amount of time training the convolutional layers, which are essentially the network's eyes. Until those layers get to some good weights, the network can't really learn to do anything, so it just keeps doing random stuff. Then comes a point when suddenly the network starts being able to reach the fruit consistently. This is a heavy exploration phase, and the snake slowly gets longer and longer. As it learns to get better and better, it's able to explore more, because the probabilities become more and more disproportionate between the good and the bad outcomes. And then, finally, we get to a point where the exploration is actually able to beat the game sometimes. Yay! Now, this is great, but it doesn't quite mean we're done. We're on the right track, though. So after the snake was able to win repeatedly during exploration, I decided to stop the experiment and run some tests. This network does a really good job of winning the game most of the time, but it can still get itself stuck, so it could have used a little bit more training.

So if this was so easily done on a 10x10 game, then a 20x20 game should be easy, right? It's the same principle, right? Wrong. I started training a similar network on a 20x20 game. Whereas the 10x10 network started to win after about 4 hours, I had to stop the 20x20 game's training after 18 hours because it was improving really, really slowly. By this point the neural net could play the game competently up to a score of about 80, and by now I'm so tired of trying to get Snake to work that I'll take that as a win. I'm taking that as a win. I estimated that if I had left the training running for about eight days it might have gotten close to the max score of 400, but I didn't want to commit my desktop for that, so I stopped. I also did a fun experiment with a 40x40 game, trying a method where the number of available fruit starts out really high and then slowly decreases to 1. I was doing this in order to fight what's called the sparse reward problem: you have to help the snake out a little bit in the beginning, and then you can slowly stop helping it. This was actually working quite okay, and it was able to reach a score of about 100, and it looked really fun while training, but again, it was not improving above that at all.

So why does this happen? Why does the 20x20 game seem to take so much longer to train? I think there are two reasons for that. One of them is that 20x20 isn't really just twice as complicated as 10x10; I wouldn't even say it's four times as complicated; it's probably much more complicated than that. The number of possibilities just becomes really, really large. The second issue has to do with the data inefficiency of the algorithm I chose, A2C, and how we can't reasonably train on longer games altogether; instead, each training step only considers a few moves, in this case sixteen. When your average game can have tens of thousands of moves, sixteen is a very, very small number.
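To make that rollout scheme concrete, here is a minimal sketch of sampling actions from the policy's probability distribution across 64 parallel games for 16 steps. It uses PyTorch, stands in a random-logits function for the real convolutional policy network, and omits the environment stepping, so it is an assumption-laden illustration rather than the author's code:

```python
import torch
from torch.distributions import Categorical

NUM_ENVS = 64     # parallel copies of the game, as described above
ROLLOUT_LEN = 16  # steps collected before every training update
NUM_ACTIONS = 4   # up, down, left, right

def policy_logits(observations):
    """Placeholder for the convolutional policy network: here it just
    returns random logits so the sketch runs standalone."""
    return torch.randn(observations.shape[0], NUM_ACTIONS)

# Dummy batch of 10x10 board images, one per parallel game.
observations = torch.zeros(NUM_ENVS, 1, 10, 10)

for step in range(ROLLOUT_LEN):
    logits = policy_logits(observations)
    dist = Categorical(logits=logits)   # probability distribution over the 4 moves
    actions = dist.sample()             # sample rather than argmax: built-in exploration
    log_probs = dist.log_prob(actions)  # saved for the actor-critic loss later
    # ...here each game would be stepped with its sampled action and the rewards
    # recorded; after 16 steps the network trains on roughly 16 * 64 transitions.
```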
Now, there are two possible ways to fix this data inefficiency, which I'll try the next time I get the courage to train Snake; I've been at this for over a month now, so it kind of burnt me out. There's an algorithm called proximal policy optimization, or PPO, which can get around the single-training-step issue: we can take training data from full games, including all the later stages, in a single batch of updates. The PPO algorithm is often able to converge much faster than A2C and is one of the de facto algorithms used these days for reinforcement learning; if you've ever seen that OpenAI hide-and-seek video, that was trained using PPO. The other solution is to use a model. So far, all the algorithms I've described are what's considered model-free. Snake is a pretty simple game, so we can either get another neural network to learn what the next few frames are going to look like and use that to make a better decision, or go the route that AlphaGo went and run a bunch of copies of the game for a few future steps before making a decision about the current step. I think the second approach is going to be the better solution, but I guess we'll find out.

Alright, if you made it all the way to the end (and even if you didn't), thank you so much for watching. This video is a different style than my normal videos, so please leave a comment telling me what you thought of it, and remember to leave a like and subscribe to my channel. The next reinforcement learning video will probably be about teaching a robot to dock itself, and until then I'll have a few programming-related videos coming as well. See you next time, bye.
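One last reference point for the PPO mention above: its key ingredient is a clipped surrogate objective that limits how far an update can move the policy away from the one that collected the data, which is what lets it reuse data from full games. A minimal PyTorch sketch (illustrative only, with made-up numbers, not the author's implementation):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective (to be minimised)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # how much the policy changed
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy example with made-up values for two transitions.
loss = ppo_clipped_loss(
    new_log_probs=torch.tensor([-0.9, -1.1]),
    old_log_probs=torch.tensor([-1.0, -1.0]),
    advantages=torch.tensor([0.5, -0.3]),
)
print(loss)
```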
Info
Channel: Jack of Some
Views: 12,268
Keywords: snake ai, ai learns to play snake, ai play snake, ai plays snake, deep rl snake, reinforcment learning snake, perfect snake ai, code bullet, codebullet, codebullet snake response, codebullet snake, educational ai, qlearning, policy gradients, a2c, actor critic, creature creator, neural network learns to play snake, neural network plays, neural network plays snake, deep learning, machine learning, reinforcement learning
Id: i0Pkgtbh1xw
Length: 10min 44sec (644 seconds)
Published: Fri Mar 13 2020