MarIQ -- Q-Learning Neural Network for Mario Kart -- 2M Sub Special

Video Statistics and Information

Captions

Welcome back, SethBling here. After four years hovering between 1.9 million and 2 million subscribers, I finally hit 2 mil. To celebrate, I decided to release MarIQ, my newest neural network project, which teaches itself how to play video games through trial and error without any human supervision. I'll talk about the 2 million subscriber milestone at the end of the video, but for now let me introduce you to MarIQ.

MarIQ gets its name from a technique called deep Q-learning. It starts out without a clue how to play Super Mario Kart, just trying random things. When it makes forward progress through the course, it receives a reward signal. MarIQ's entire goal is to accurately predict how much reward it will receive by taking different actions, and the better it gets at predicting that, the better it gets at choosing the action that will take it through the course the fastest. The use of reward signals to train a machine learning model is called reinforcement learning. It took 80 hours to train this neural network, and here I've rendered one driving sample from each hour of the training process. We'll also take a closer look at this later in the video.

To help you understand how MarIQ works, I've added a bunch of graphical overlays. Here in the bottom left you can see the neural network inputs. This is the simplified view of the course, converted to 15 by 15 pixels of grayscale input. You can imagine it's a bit more difficult trying to drive a kart while only being able to see this. This is the same set of inputs made available to MariFlow, my previous neural network, which tried to mimic a human driver. The inputs feed into a recurrent neural network. This type of neural network has memory cells capable of retaining recent information in order to better inform the decision-making process. The neural network has two layers of LSTM cells, the first layer with 125 neurons and the second with 75. This is also the same type of neural network I used in MariFlow. Six times each second,
the neural network outputs three scores: one score for driving straight ahead, one for driving while turning right, and one for driving while turning left. It doesn't have the option to release the accelerator button. Below those scores you can see the corresponding set of controller buttons with the highest score. These outputs are represented above the kart itself as these yellow arrows. You'll notice that in this footage, near the beginning of training, the arrows are relatively short, since it's not confident it can obtain very high scores, whereas here, later in the training, the arrows get quite a bit longer.

Periodically you'll see red and green numbers coming out of the kart. These are the reward and punishment signals, which are the only means by which MarIQ can learn. The game divides each course up into several dozen checkpoints. These checkpoints are used to determine which kart is in first place, second place, etc. At this point I need to thank one of my Twitch mods, MrL314, for putting this graphic together for me. My training code gives MarIQ plus 100 points when it moves forward a checkpoint and minus 100 points when it goes backward a checkpoint. When the neural network outputs scores for going straight, left, and right, it's trying to predict how much reward it'll be able to get in the near future.

As a quick aside, MrL314 also put together this graphic, which shows you how Mario Kart's built-in AI works. It contains the directions each AI kart should travel for any given spot in the course. This doesn't have any relation to MarIQ, I just thought it was pretty neat.

Up here is MarIQ's face. This face is really just a representation of how long the longest of the three arrows is. If the longest arrow is pretty long, the face smiles, because MarIQ thinks it'll be able to get a lot of reward in the near future. If it's frowning, it's because MarIQ doesn't feel very good about its reward prospects coming up.

Over here is something I call the tryhard percentage. In order for MarIQ to learn, it needs to understand two things. First, it needs to know how good its current strategies are at generating rewards, and second, it needs to experiment to see if other strategies might be even better. When the tryhard percent is at 100%, its only goal is to understand how good its current strategies are at getting a reward: 100% of the time, the course of action recommended by the neural network is the action taken. If the tryhard percent drops lower, say down to 50%, then half the time the driver will listen to the neural network and half the time it'll just try pressing random buttons. In this way, MarIQ can evaluate new courses of action to test whether they're better than the current way of doing things.

And finally, here you can see the actual button presses being sent to the game. The yellow accelerator button is always pressed; the left and right buttons are the only ones that vary. Sometimes these button presses come directly from the neural network, and if the tryhard percent is below 100, then sometimes it's just random.

Now let's take a look at the training process. Here's a sample kart taken from each hour of MarIQ's 80 hours of learning. Notice that earlier in the training, karts tend to get stuck on the walls more often. The neural network has special difficulty with thin walls because it can also
see the road beyond them. Do remember, however, that some of the bad driving exhibited here is merely the result of a low tryhard percentage. As the old saying goes, you can't make an omelet without breaking a few karts, or something.

Here's a graph of MarIQ's average reward per second through the training process. And again, remember these scores aren't as good as they could be, because the tryhard percentage isn't always 100%. Here's what it looks like when MarIQ's tryhard is held to 100% and it always takes what it thinks is the optimal action. In this mode its average reward per second is about 145, whereas when I drive manually it's about 170.

All of the footage so far has been on courses in Mushroom Cup, which is the only set of races that MarIQ ever saw while learning. What shocked me is what happened once I put it on courses it had never seen before. Here we can see MarIQ racing on Choco Island 1. It has never seen this course before, and yet it is still able to navigate its way through the course with relative ease. In the world of machine learning this is called generalization, and to be honest, it wasn't even one of my goals. I was going to be happy if it could just drive okay in Mushroom Cup. I was actually pretty shocked when I started putting it on Flower and Star Cup courses and it still held its own. Well, some of the time at least.

So how does Q-learning actually work? How does MarIQ get better at driving over time? To answer these questions fully, I'd have to get pretty bogged down in mathematical details. If you're interested in those details, there are a lot of good resources out there on Q-learning. Instead, I'll try and give you an intuitive view of what's going on. Neural networks are really good at predicting things. In this case, the neural network is trying to predict how much reward it can get if it takes each of the three available actions. As it builds up experience driving, it'll get better and better at predicting how much reward each of the actions will yield in
any given game state. If it thinks too highly of a particular action and ends up driving off a cliff, over time it'll learn to predict a lower score for that action. If it's estimating too low of a reward, then over time its experiments with random actions will eventually allow it to learn that and predict a higher reward value. The better it gets at accurately predicting the rewards for the three actions, the better it'll drive when it picks the action with the highest predicted reward.

Before I wrap up the video, I want to share with you an interesting anecdote about Mario Circuit 2. In this course there's a jump where you cross over another segment of the track. If you don't take this jump straight on, it's pretty easy to miss the full jump and wind up going back half a lap. When this happens to MarIQ, it receives a massive punishment of minus 1000 points. However, if it makes the jump, it just receives the normal reward of plus 100 points. Later on, after 80 hours of training, we can see the impact. If it's not confident it can make the jump, it'll usually end up swerving and then ramming into this corner over and over. If it thinks it might miss the jump, it's not willing to risk losing 1000 points, and once it's in the corner there is simply no easy way to recover and get over the jump. However, if it's on the straight path, it'll have no trouble making the jump and usually won't try and swerve out of the way. It's obvious to any human that it should just always go for the jump. This behavior is the result of the reward system I've imposed, and when you look at it that way, there's a certain logic to it.

All of the code that runs MarIQ is available for download from the video description. There's also a manual that'll help you get up and running and help you run your own experiments to try and make an even better version of MarIQ.

Now that I've got through all that, I want to take a moment to thank my subscribers. In 2013, as my channel turned 2 years old, I posted my 1
million subscriber special, Blocks vs. Zombies. In late 2015 I hit 1.9 million subscribers and thought that the big 2 mil was inevitable. What actually happened was that Minecraft's popularity died down a lot, and I also started losing interest in the game. My pace of posting videos slowed down, and basically I was losing as many subscribers as I was gaining, and that's how it stayed for quite a long time. Now, four years later, I've started posting Minecraft content again with a renewed love for the game, and it's definitely showing in the subscriber numbers. You guys have been super enthusiastic in the comments, and I feel as motivated as ever to keep making the best videos I can. You've been really supportive, and I just want to express how thankful I am for all the excitement you guys share with me.

Seven years ago, when I quit my job at Microsoft to pursue YouTubing as a full-time career, I felt a huge amount of uncertainty about the long-term prospects. But seven years later, I feel it's still one of the best choices I've ever made, and it's you guys that have allowed me to keep going for this long. With all that said, that's about it, and I truly mean it when I say thanks for watching. [Music]
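
Editor's sketch (not from the video): the checkpoint reward and the tryhard percentage described in the captions could look roughly like the Python below. This is only an illustration under stated assumptions, not SethBling's actual code. The names `checkpoint_reward`, `choose_action`, and `ACTIONS` are hypothetical, and scaling the reward by the number of checkpoints crossed is an assumption; the captions only give plus 100 for moving forward a checkpoint and minus 100 for moving backward one.

```python
import random

# The three outputs named in the captions; the accelerator is always held.
ACTIONS = ("straight", "left", "right")

def checkpoint_reward(prev_checkpoint, new_checkpoint):
    """+100 per checkpoint gained, -100 per checkpoint lost.

    Assumption: the reward scales with the number of checkpoints crossed,
    and lap wraparound is ignored for simplicity.
    """
    return 100 * (new_checkpoint - prev_checkpoint)

def choose_action(scores, tryhard_percent, rng=random):
    """Pick the network's best-scoring action tryhard_percent% of the time,
    otherwise pick one of the three actions at random."""
    if rng.random() * 100 < tryhard_percent:
        return max(ACTIONS, key=lambda action: scores[action])
    return rng.choice(ACTIONS)
```

With `tryhard_percent=100` the highest-scoring action is always taken; at 50, roughly half of the decisions are random, which is the exploration behavior the captions describe.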

Info

Channel: SethBling

Views: 359,175

Keywords: Neural Network, Machine Learning, MarIQ, SethBling, Q Learning, Mario Kart

Id: Tnu4O_xEmVk

Length: 10min 3sec (603 seconds)

Published: Sat Jun 29 2019