Build a Mario AI Model with Python | Gaming Reinforcement Learning

Captions
[Music] what's happening guys in this video we're going to be taking a look at how we can build an ai model but specifically a reinforcement learning model to be able to play mario as per usual we're going to be treating it as though we're performing a developer or data scientist and client relationship so you'll be able to see the back and forth as we go and build up this model ready to do it let's go take a deeper look at what we'll be going through so in this video we're going to be focused on a couple of key things but the main goal is to be able to train a machine learning model or specifically a reinforcement learning model don't fret if you don't know what that is i'm going to talk a little bit about it later on but we're going to build a model to be able to go on ahead and play mario so ideally we wanted to smash through levels and effectively be able to collect coins and get to the end of each world now how are we going to go about doing this well first of what we're going to need to do is set up a mario environment so this means that we're going to need to be able to get mario to be able to interact with some python code so i'm going to show you how to do that then we're going to pre-process it for a ifinger so this means that we're effectively going to be able to pre-treat our environment with a couple of pre-processing steps just to make it a little bit easier for our ai to learn how to play the game then we're going to use a technique called reinforcement learning to be able to go on ahead and teach it to play super mario and then last but not least we're going to be taking a look at the final results and as per usual we're going to be treating it as a data scientist developer client relationship so you're going to be able to see some of the back and forth as we go on ahead and build our model ready to do it let's get to it yo nick you're into python right yeah why you reckon you can use it to play some games say no more jimmy so jimmy wants us to get a game running with python well as you might have guessed we're doing super mario in order to do this we're going to be coding with python in jupiter lab and we're going to be using open ai gym this is a common framework that makes it easy to train ai to play games and interact with other simulated environments think alphago but on a smaller scale for our mission we can use the gym super mario brothers environment that's been built on top of the nes emulator for python let's do it alrighty guys so welcome to the mario ai tutorials so in this video we're going to be focused on building a full-blown reinforcement learning model or ai model for mario now in order to be able to do that we are going to need to go through four key steps so first up we're going to set up mario then we're going to take a look at how we can pre-process our environment where they're going to train our reinforcement learning model and then we're going to test it out now the first thing that our mate jimmy asked us to do is just to see whether or not we can even play mario using python given the fact that we're python focused here so that's exactly what we're going to do in this section up here now i'm going to be coding all of this in python and the environment that i'm working in in case you haven't seen something like this before is called jupiter lab this is great whenever you're building machine learning models or doing data science and all that type of good stuff so the first thing that we're going to go ahead and do is set up mario now in order to do 
that we're going to be installing two key libraries we're going to be using gym_super_mario_bros, specifically version 7.3.0, and i'm going to show you how to use it now before i jump into that too much this basically allows you to play mario using python so we've got a whole bunch of different types of environments and this is what we're going to call our game a game environment we've got a bunch of different environments that we can effectively use and the great thing about this is that we can easily set it up to be able to train deep learning models or reinforcement learning models later on so it's pretty cool when it comes to using it for ai so we're going to be using and installing this and we're also going to need nes_py which effectively allows us to build a virtual joypad for python to be able to play around with our mario game now all of this is based on top of a framework called open ai gym this is a really popular framework so if you've seen any of the ai that google builds to play alphago or any of the ai that openai builds they're using this a lot for their reinforcement learning applications and you can actually see some of the different environments they've got available it's really cool so if you get a chance by all means go and check it out we've also got some other videos on the channel that talk a little bit about it but for now enough blabber let's go ahead and install some dependencies so that's the first thing i'm going to do i'm going to write it and then explain what we've done okay so that is our first line of code written it's just a one-liner and what i've written is an exclamation mark then pip install gym_super_mario_bros==7.3.0.
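As a single notebook cell, the install step described here might look like the following sketch (the 7.3.0 pin is the one quoted above, and nes_py is installed alongside it in the same line):

```python
# Install the Mario game environment and the NES emulator bindings (run inside Jupyter)
!pip install gym_super_mario_bros==7.3.0 nes_py
```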
so that is effectively installing this and it's going to be what allows us to play mario in python then i've also gone and installed ness underscore pi so this is no different to how you might go about installing a regular package whenever you're writing any of your python code so we've gone and installed that now that is all well and good what we now need to do is actually import it into our notebook to be able to start setting this up so let's go ahead and import our dependencies and then we'll be able to set up our environment okay so those are our dependencies are now imported so i've gone and imported three key things there so first up we've gone and imported the game now in order to do that written import gym underscore super underscore mario underscore bros then i've gone and imported the joypad wrapper so i've in order to do that i've written from ness underscore pi dot wrappers import joypad space i'll talk about this a little bit more in a second and then the last thing that i've gone and done is i've gone and implemented or imported the simplified controls i've written from super underscore not from gym underscore super underscore mario underscore bros dot actions import simple movement this is really important so a key thing whenever you're building ai for games is to try to simplify the environment as much as possible because the more complex it is the harder it's going to be for your ai to learn how to actually play that game so what we can actually do or what we're doing in this particular case is we're simplifying the amount of actions that our mario character can take so this simple movement variable over there simplifies what our mario character is going to be able to do and what our ai is going to learn to do so what we're doing is we're simplifying it down to one two three four five six seven so there's only going to be seven different types of actions that mario can actually take using our ai and this is common practice right if you actually take a look on the official pie torch repository they actually simplify this even further to just two actions we are going to be taking these actions so we can either take no option or no operation so that means no key we can hit right right plus a right plus b right plus a b a and then left so these are going to be the different actions that our ai is eventually going to be able to take um and then what we need to do is we set up our game using gym underscore super underscore mario underscore bros and then we wrap it inside of this joypad space wrapper so you're going to see the wrappers a little bit more once we go to pre-processed environment but for now just know that we've gone and imported the game we've gone and brought in some wrappers and we're going to simplify our movement using this so now that that's done we can actually go on ahead and set up our game so this is akin to effectively turning your nintendo on so let's go ahead and do it okay so that is our game now set up so i've gone and written two lines of code there so i've written emv equals gym underscore super underscore mario underscore bros.make and then to that we've passed through this option here or this parameter so i've written super mario bros v0 so if you take a look inside of the documentation you've got a bunch of different types of environments now i keep saying environments but just think of your environment as your game so where our python code is going to be running against and effectively it's an emulator so we are going to be using this 
environment which is just super mario brothers and we're going to be using the standard version you can also play the down sampled version so if you take a look this is much more pixelated versus this to this we can take it one step further and effectively go to pure pixelated and you can even go down to one which is just a rectangle so that is mario there uh that little red block um you can even go play super mario brothers 2 so there's a whole bunch of different versions and effectively whatever you go and pass through into here is from here so we can choose which version we actually want to go and play so that's that the second line of code is remember i said we're going to be wrapping our environment to be able to use those simple movement actions so if we actually let me actually just hold that off so i can show you the different spaces so if i go to emv dot action space dot um let's go dot sample you can see here that there are 200 let me actually just show you the number the fact that this has come back and said discrete and 256 means there's 256 different button combinations that you can actually go and play which is going to take our ai a ton of time to actually go on ahead and learn now by wrapping it inside of this joypad space wrapper over here we can simplify that down so right now it's 256 which is way too high it's going to be way too hard for our ai to learn how to play this if we go and wrap it you can see we're dropping it down to seven now so this simplifies how our ai is going to be able to learn how to actually play super mario cool so that is now done so we've gotten written those two lines so the second line is e and b equals joypad space which is what we brought in from up here and then to that we're passing through uh two keyword arguments two positional arguments the first one is env and the second one is simple movements and what we got from over there cool so uh let's go ahead and play it now so we've gone i also showed you um the action space as well so whenever you're working with these environments or specifically open ai gym environments there's two key tips that i can recommend if you go and take a look at the observation space this tells you what you're actually going to be getting back from your environment in this case your game so we're going to get 240 pixels by 256x3 which is effectively a frame from the game and then we can also take a look at our action space which in uh we don't need the dot shape which in this particular case tells us that we've got seven discrete actions so for example up down left right a b so on and so forth um but really in this case there's these simple movement actions let me show you that again they're these actions okay so that is a little bit about our environment let me actually show you how to actually go and play it so let's go ahead and write this loop so we're effectively going to loop through and take some random actions it's just going to be pretty random at this stage so let's go ahead and do this okay before i run this let's actually go and comment it together so the first thing that i've gone and done is created a flag and i've set this to true and this tells us whether or not to restart the game or not restart or not so by setting it to true i'm effectively telling the environment or the game that we need to start a game to begin with because right now we don't have one started out now to start a game we can run the command e and v dot reset so this just resets our game it's like turning your uh i don't know 
your gamecube or your switch off and on so env.reset allows us to restart the game so that is what we're doing here so basically we're then looping through every single frame of the game so think about your game as being a set of frames you do something with each frame so something updates on your screen you press a key something changes on that screen you press another key this is what we're doing here so for step this could just as easily be frame for step in range one hundred thousand so we're going to take a hundred thousand frames and then colon if done because keep in mind our game hasn't started yet we've set the flag to true so we are going to reset our game so this is effectively start the game let's add another comment here loop through each frame in the game and start the game to begin with and then what we're going to do is take a step in the game so we can use the method env.step and this is like passing an action through to your game it's like pressing a button or moving a joypad or changing a control so env.step allows us to pass an action through to our game it could be jump move right move left and so on so in order to do that we can type in env.step and then we're just going to take random steps so this line here allows us to take random steps so env.action_space.sample allows us to take a random action so let me show you this before we get into it env.action_space.sample is just getting random actions from this simple movement list here let me show that right so we're taking a random action random action let me zoom in so you can see it a bit better random action random action random action so you can see that as we go through we're going to just take random actions because right now we don't have any smarts we don't have any ai to actually go ahead and play mario so we're just going to do that to begin with okay so we're at the env.step line so this is just going to take random actions or rather do random actions and then what we're going to get back from taking a step is the state which i'll explain a little bit more in a sec we're going to get back something called a reward we're going to get back whether or not we're dead or the game is done and we're also going to get some info then we run env.render so this allows us to show the game on the screen and we are also able to close the game using env.close right so if i run this now this should effectively allow us to see python randomly playing mario and you can see that there all right so that looks a little bit better so you can see that that is mario now effectively playing but he's just getting up to the pipe and he's not able to jump over because we haven't actually embedded any smarts in this and as soon as that timer gets to zero it's going to restart the game so let's let it run out right so it's restarting it's jumping over one pipe getting stuck because it doesn't know how to jump over the next pipe we haven't added any smarts yet but keep in mind jimmy just asked us to get our python environment up and running so for now we've at least fulfilled that initial first mission we've got python playing mario effectively but we definitely want
to take this further a little bit further on and you can see he's clearing that first one and you know what he's only clearing that first one because he's taking just random actions and one of those random actions just happens to be jumping which you can see there um he's jumping to the right okay so again that's just going to keep on going until we stop it so we've gone and run it for a hundred thousand steps which is a ton so we can just go back into our notebook hit stop and that is going to stop the game um we can also run env.close if it's still popping up down there you can see uh let me zoom in so you can see that so you can see we've got a little python icon running down there we can actually go and close that down just running enb close it's going to shut it down cool okay so that is our environment now running what else did i say i was going to let me just take quickly take you through nba.step before we go back to jimmy so if i type in env.step and pass through action one what's happened there we've got all right so if you get this os error exception access violation blah blah blah this basically means that there's a little bit of a mismatch between your notebook and the simulator or the emulator so all you need to do is just go to kernel restart the kernel and you can get through that um so then what you need to do is just unfortunately go and run rerun these imports so if i run rerun that import and then go through that is effectively back up and running so now if i go and run enb dot reset or amd dot step let's run reset first so you can see that we're getting these values back now these values that we get back are what we call a state now our state in this particular case is just the frame from the game so let me zoom in on that is so it is going to be 240 pixels wide by 256 pixels high by three channels so it's a color image so we're going to be able to give this color image to our ai later on to try to learn how to play mario uh what else do we get so that's emv.reset now if i go and take a step we get back some more information so if i run in b dot step and let's just take a random action for now we're actually going to get four things back so if i show you the length of this right we actually get four values back when we run e and b dot step so if we go and run this and grab the first value so the first value is getting our state again so it returns a new state after we've taken an action so imagine you jump what you're going to get back is the frame after you've jumped so ideally mario should be up in the air then if we go and run so what is our second value our second value is our reward now this is i'll talk about this a little bit more once we get to actually training our reinforcement learning model but just think about this as whether or not you've done something right in the game so whether or not you've got a point or not i'll explain a actually let's actually take a look at how this works so the way that the environment works is using this reward function this is really really important so what's going to happen is or in this particular case the reward function is based on or it has a main goal and that main goal is to move mario as far right as possible as fast as possible without dying so every time we get him to go to the right we get some sort of reward now in this particular case you can see that that reward was zero and that's because mario's probably hit that pipe and is no longer able to go further right hence we're getting a zero back now we effectively 
train our ai to maximize that reward that's what we're going to be doing in a second so in this particular case we've got zero but that's fine eventually ai should be able to get some more reward the third thing that we get out of this is whether or not we're dead or not or whether or not the game is done or not and the fourth thing is we just get some info back so the number of coins we've collected whether or not we're at a flag at the number of lives we've got our score a stage a status and you can find more about this inside of the documentation so this tells you a little bit about the info that you're getting back and i'll link to this as well as all the code for this tutorial in the comments below uh what else do we get uh what world we're on our x position our x position in respect to the screen and our wired position cool okay so that is uh python playing mario again it's pretty basic at the moment but we've successfully let me expand this again we've successfully been able to get our python environment playing mario let's jump back over to jimmy and see what he thinks uh yeah that that's cool but can't you like ai it build some sick model to actually play the game yeah well you kind of got to pre-process the game first before we can get to that well chop-chop then get pre-processing mate you know what they say about data big data big uh wrong quote i think what i meant is rubbish data rubbish ai in this case it's no different we need to pre-process our mario game data before we ai fight we're going to apply two key pre-processing steps grayscaling and frame stacking our ai is going to be taking in images of the mario game to learn a color image has three times as many pixels to process so converting it to grayscale cuts down the data it has to learn from frame stacking helps our ai have context by stacking consecutive frames we're effectively giving our ai model memory it'll be able to see mario and his enemies movements time to pre-process okay so jimmy was kind of impressed but not super impressed right but we also know that from here we actually need to start pre-processing our environment to be able to start building ai to work with mario and so that's exactly what we're going to do we're going to start pre-processing our environment so we're now up to step two before we do that let's actually clean this up a little bit so we don't actually need um this observation space bit we don't need that we don't need that and we can get rid of this and this so all that we're left with at the moment is our imports setting up our game and then actually testing it out but again all this code as well as all the final code is going to be available in the description below so you can actually take that and run with it cool so the next thing that we're going to do is pre-process our environment as per usual we've got to import a couple of things before we can actually go on ahead and do this and the majority of these things are going to be wrappers so keep in mind we wrapped our environment once already inside of this joypad space environment or this joypad space wrapper we're going to now bring in a couple more wrappers so let's go ahead bring these in and then we'll take a look okay so i've gone and written a bunch of lines of code there but i realize that we haven't actually gone and installed something that we're gonna need so i'm gonna come back to that in a second before we run this but let's actually take a look at what we've written so far so what i've written is from jim.rappers 
import frame stack comma grayscale observation this is going to do two things so frame stack is going to allow us to capture a couple of frames while we're actually playing mario now this means that our ai model is eventually going to be able to see what happened in let's say for example the last four frames so we'll actually be able to see what direction mario was moving in what direction our enemies were moving in and how we're actually interacting with the environment because otherwise say for example we just pass through one frame to our ai it's only going to know what's happened right then and there it doesn't have any concept of where our enemies are moving or what position or velocity mario is actually moving at so we're actually going to use this to stack some frames together so think of them exactly as stacking them together to be able to train our ai then the second thing that we've brought in is grayscale observation so this actually allows us to convert our colored game into a grayscale version this has a really great effect of actually cutting down the amount of information our ai model needs to produce because a color image is effectively whatever the image size is so by the height and the width multiplied by three channels because you need one channel per color to represent red green and blue if we make a grayscale we effectively shave it down by a third so rather than dealing with 100 of data we're only dealing with 33 that means that we can deal or we've got less data to process to actually get our ai to work and this means that our ai is going to be faster when it comes to actually interacting with it as well then we've brought in some vectorization wrappers and this is where i realized we haven't actually gone and installed some stuff so when we actually go and implement our reinforcement learning model so our ai model we're going to need to vectorize it in order to be able to actually use it without ai um and the ai library that we're going to be using is called stable baselines it was originally built by openai and then a bunch of open source guys went and cleaned it up and made it a bunch better and implemented a ton of extra stuff so there's some great vectorization or great wrappers inside of that so let me read you out that whole line so it's from stable underscore baselines three dot common dot vect underscore env import vec frame stack and dummy vec environment so our vect frame stack allows us to work with our stacked environments and our dummy vectorized environment it just wraps our base environment inside of a vectorization wrapper so just think of it as um how you need to transform your model to be able to pass it to your ai model later on so we're gonna i'm gonna show you how to do all of this and then last but not least i've brought in matplotlib i'm gonna use this um to show the impact of frame stacking you'll actually be able to see that all right but before we go on we've actually got to install some stuff because this right now without actually going and installing some stuff is actually going to throw an error so i've got it installed in my environment but i want to show you how to do it so the first thing that we need to do is install pytorch torch and to do that we just need to go to pytorch.org and again i'll include this link in the description below and then we just go over to here hit install scroll on down and you've got a bunch of options right depending on the type of machine that you're working on so i'm just going to choose stable hit 
windows um the way that i want to ins and so if you're on a mac hit mac if you're on a linux machine hit linux we're going to use windows then we need to choose how we want to install the package so we're going to use pip go to old pip and then the language that we want to install for so we're going to choose python and then the compute platform so this is uh where things can get a little tricky so if you have a gpu on your machine which i recommend for this uh or recommend using for this if you've got a gpu then you need to be el and you want to use it to train your ai model we need to have cuda running or some sort of gpu acceleration running so if you're on linux you can also use rock m uh in this particular case we're running on windows so we're gonna choose windows and the cuda version that i've currently got installed is cuda 11.3 so if you you're not too sure how to install cuda i've actually got a video on how to do that as well but i'll link to that uh below uh where was i going with this so you choose the compute platform and this is purely so you want to choose one of the cuda ones if you want to use a gpu um if you don't have a gpu that's perfectly fine you can still do this just hit cpu over there and then to actually gonna head and install you just need to copy the command down the bottom so i'm gonna choose the cuda 11.3 one because that's what i've got and i'm going to copy that so copy this command and we can go back into our notebook paste that in there and then delete the three and then run that we've got some error in our syntax what have i gone done add an exclamation mark cool so in this particular case we've got pi torch already pre-installed so it's installed relatively quickly now the next thing that we need to go ahead and do is install stable baseline so this is going to be the let me actually show you or stable baselines three so this is a reinforcement library so if you've ever done any machine learning or data science or ai type stuff before this actually gives you a whole bunch of different algorithms that you can use to train your ai model now this particular case the type of deep learning that we're going to be using is called reinforcement learning it's great because it allows you to work with different types of open environments now these are the algorithms that you've got available so a2c ddpg dqn hdr ppo sac and td3 we're going to go into ton of detail we're going to be using ppo so it stands for proximal policy optimization don't ask me to explain that in great detail but basically think of it as an algorithm we can use to train to play our game so we need to install this so let's go back and let's take a look at our installation steps so in order to install it we just need to run this line here so pip install stable dash baselines extra so we're going to copy that and we are going to paste that in let's just double check do we need anything else nope that's all so let's go and install stable baselines now the reason that i went and installed pytorch before running this is this stable baseline package will actually install pi torch but if you want to use a gpu you need to force or run it that installation the custom installation for a gpu to begin with so just keep that in mind if you want to use your gpu if you don't perfectly fine so i'm going to include an exclamation mark then i'm just going to add a comment for good practice so install stable baselines for rl stuff cool all right so that is all installed doesn't look like we've got any errors 
there so that is all good to go now so we've gone in installed pi torch and we've also gone in installed stable baseline so the line so i'm not going to read out the whole line for the pi torch that's just from the pi torch site the line for stable baselines is exclamation mark let's zoom in on that exclamation mark pip install stable dash baselines three and then inside of square brackets will pass through extra that just installs some nice little extras for us okay cool that is that installed sort of installed high torch of installed stable baselines we can now go on ahead and run our imports okay so that's imported successfully so i just went and ran this cell you can see no errors there we've gone and imported our frame stacking wrappers our vectorization wrappers and we've also gone in imported matplotlib as well cool all right so the next thing that we actually want to go ahead and do is wrap our environment inside of these additional wrappers so we're actually going to start off with this so we can actually just copy this and this was how we set up our game to begin with now what we're going to do is we're going to apply some pre-processing so let's actually comment this a little bit better so this is um create the base environment and i'm actually going to number this so you can see the different pre-processing steps so that's the first one then the second one is uh simplify the controls because remember we had something like 200 different controls before we went and applied our simple movement wrapper over the top of it the next two that we're going to do are we are first up going to grayscale our environment so we're going to grayscale we are then going to uh what's the first one we're going to wrap it inside the dummy environment and then last but not least we are going to uh stack the frames okay so let's go on ahead and do this okay so that is our environment now grayscale now i've only done one line so you can see it we're actually going to visualize this one at a time so if we go now and run state equals frame emv dot reset what we should get back is a frame so if i type in state dot shape so what we're getting back is so remember cast your mind back to actually let me just run this so you can see it so this is our first environment right so if i comment out our grayscale and run reset you can see that we've got three channels so it's 240 pixels by 256 pixels by three channels this is a color image so if we now go in and let's actually visualize it so if i type in plot dot i'm show and run state so that is our super mario environment in its color form right so you can start to see that there and so what i've basically just gone and done there is i've grabbed the frame from env dot reset and then we can pass that to plot dot i am show to actually be able to visualize it this is using matplotlib so use map plot lib to show the game frame okay so you can see that right now we haven't actually gone and done any pre-processing it's just a it's still a color frame now what we want to do is actually go and grayscale this observation so if i go and run emv equals grayscale observation which is what we brought in from our gym wrappers and then we pass through emv and then comma keep underscore dim equals true and i'll explain that a little bit more so we need the keep underscore dim to be able to use or stack our frames later on otherwise it just throws let me actually show you it without keep dm equals true so you can see it gets rid of the well let's actually hold on that's color 
too so this is with keep_dim=True so you see that we retain this last channel if i delete it you can see it gets rid of that last channel we need that last dimension to be able to do the frame stacking which i'll explain a little bit more okay so now if we go and visualize this so we've now gone and applied our grayscale if we go and take a look at it you can see it's now gray so in this case matplotlib plays around with the colors but this is actually a gray frame that we're getting from our environment now so this effectively cuts down the amount of information now by how much so the number of pixels that we've got to process is 240 pixels by 256 by one so this means that we have 61 440 pixels to process when we're grayscaled now keep in mind when we're not grayscaled we have three channels at the end so it means that we're going to be processing 184 320 pixels so that's significantly more data that our ai model would eventually have to process but we are going to grayscale it so we don't need to worry about that so we can delete those cells and leave that for now okay so the next thing that we need to do is wrap it inside the dummy environment so let's go ahead and do this i actually just realized we're not going to use this frame stack wrapper we're going to use the stable baselines one so we can drop it out of our import so let me just quickly go through that so remember from gym.wrappers we imported frame stack and grayscale observation we don't actually need this frame stack so we can get rid of that okay so we're going to wrap it inside of the dummy environment so let's go ahead and do this okay so that is our environment now wrapped up inside of our dummy stable baselines vectorization environment so if we now go and run it again what you can see there is that we are now inside of another array so we've now gone and placed it inside of another set of arrays or list so if we go and visualize now it should throw an error that's fine if we go and just pass through index zero again we can still get our frame out so you can see that each time we do one of these processing steps the shape of our data changes slightly all right now the last thing that we want to do is let me actually take you through this line of code i've written env equals dummy vec env so what we imported from over here and then to that i've passed through a set of square brackets so this is a list and then we're specifying lambda colon env so it actually returns an environment back to our dummy vectorized environment last thing that we need to do is stack our frames so let's go ahead and stack our frames okay before i go and run that let's take a look at what we've written so i've written env equals vec frame stack and then to that what we're passing through is our environment so remember we're just taking these environments and passing each one through to each of our pre-processing steps by passing our environment through the vec frame stack we can specify how many frames we want to stack so in this case i'm choosing four you can choose more you can choose less i found four works and then we'll also specify where our channels order is in this particular case our channels order is last so i've passed through last so if i go and run this now all things holding equal you can see that our state shape now has four at the end and this is because we actually have four channels at the end to represent each one of our different frames now i haven't exactly explained this all that
well because it doesn't actually visualize it that well here but what we've actually got is we've actually got four different images now stacked together so if i go through and grab the first uh wait why are we not getting this back to zero one let's take a look at our shape state oh so it's the last channel okay so that's fine so right now it's visualizing that so we haven't actually passed through any additional data to our state so let's actually do that now so if we go and run emv.step and we're going to pass through uh so let's unpack this again i'm just going to copy this down to here actually let's just copy that whole line so what we want to do so right now when we run our state initially we're only going to have one frame back so when when we actually go to stack our frames there's only going to be one frame to stack hence why we're only seeing our gray image there if we go and run this again uh what's happened so let's run and b dot reset now we're getting uh object is not subscriptable let's put this inside of an array because we are now vectorizing our environment okay so you can see that we're getting different colors so if i go and run that again different colors run that again different colors and what we're actually getting back is a stacked frame now let me actually show you what this actually looks like so i'm just going to write some additional code to be able to show you this visualization a little bit better okay there we go that's a little bit better so you can see there that this represents each one of our different stacked frames so if i actually go back let's reset our state first so i'm just running this cell up here and if i go and run this you can see that right now we only have one frame and this is because we've only gone and run one we've actually gone and seen one frame from our environment because we've only gone and run or we've only just gone and started our environment let's just delete this so we can see this a little bit better now if we go and take a step in our environment so if i go and run this line and run this again you can see that we now have two frames if we go and run another line or if we go and take another action again we're now up to our third frame go and run it again another action let's actually make this a bit bigger so 20 by 16. 
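For reference, here is a sketch of the whole pre-processing pipeline and the stacked-frame visualisation being described, assuming the wrapper names already mentioned (GrayScaleObservation from gym.wrappers, DummyVecEnv and VecFrameStack from stable_baselines3); the 20 by 16 figure size is the one used above:

```python
from matplotlib import pyplot as plt
from gym.wrappers import GrayScaleObservation
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# 1. Create the base environment
env = gym_super_mario_bros.make('SuperMarioBros-v0')
# 2. Simplify the controls
env = JoypadSpace(env, SIMPLE_MOVEMENT)
# 3. Grayscale (keep_dim=True keeps the trailing channel needed for frame stacking)
env = GrayScaleObservation(env, keep_dim=True)
# 4. Wrap inside the dummy vectorised environment
env = DummyVecEnv([lambda: env])
# 5. Stack four consecutive frames so the model gets some "memory"
env = VecFrameStack(env, 4, channels_order='last')

state = env.reset()   # shape (1, 240, 256, 4): one env, four stacked grayscale frames
# Vectorised environments expect a list of actions, one per environment
state, reward, done, info = env.step([env.action_space.sample()])

# Show the four stacked frames side by side
plt.figure(figsize=(20, 16))
for idx in range(state.shape[3]):
    plt.subplot(1, 4, idx + 1)
    plt.imshow(state[0][:, :, idx])
plt.show()
```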
that's a little bit better so if we now let's reset our environment again right so we've only got one frame let's go and take let's actually uh what's the right action so env dot what's simple movement our right is number zero one so if we go and just keep going right let's see what that looks like so if we go right actually let's the better action is jump because then we'll actually be able to see that's a simple movement um what is jump right a so let's try zero one two three four five so if we go and pass through five we should effectively see him jump run that mario's not jumping um let's go and run it again okay so there i've just run it a ton of times and you can actually see that mario is jumping so it looks like uh it's just moving really really slow so let's uh let's reset this and then let's keep and then let's keep hitting it a bunch of times right so that's mario down the bottom and then if we go and run it again me actually get rid of this just so we can see what we're doing so all right so let's reset our environment right so that's mario standing down the bottom so now if we keep running this let's go back up so you can start to see that mario is moving up so down here is let me zoom in so you can see that mario's down the bottom here and as we scroll up he's moving up he's moving up he's moving up to the top so this actually allows us to stack each one of these frames so i had to run it a bunch of times for you to actually see it but you can now actually see that our ai model is going to have some sort of memory so it'll actually be able to see the velocity or the movement that is actually being enacted when we're actually going on ahead and playing mario alright so now that we've actually got that running we can go back to jimmy and at least show him that we've got a pre-processed environment so let's do it all right man it's all pre-processed you end up checking out the code on discord yeah yeah cool but are you gonna build the ai models now definitely yeah i was um i was just showing you progress right it's about the journey not just the destination yeah nice try go build it i want to see the destination want to learn a quick way to 10x your startup just add ai to the domain name i mean we've talked big game about ai what exactly is it really and how are we going to use it for gaming well the type of ai we're going to be using is called reinforcement learning it has four key elements an easy way to remember these is to think of area 51 agent reward environment in action and uh 51 mario in this case is our agent he can take some action for example jumping moving right moving left and so on inside of the game environment then depending on the results of his action he might get a reward or might get a penalty the ai controlling mario learns what actions to take inside of the environment in order to maximize that reward the specific reinforcement learning algorithm that we're going to be using for this is called ppo which stands for proximal policy optimization which was originally created by a team of researchers at openai in 2017. 
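To make the agent/reward/environment/action idea concrete before building the model, here is a tiny illustrative loop on a fresh, unwrapped Mario environment (demo_env is just a throwaway name, separate from the pre-processed env above, with random actions standing in for the agent); the cumulative reward it prints is exactly the quantity PPO will learn to maximise:

```python
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# A throwaway environment just for this illustration
demo_env = JoypadSpace(gym_super_mario_bros.make('SuperMarioBros-v0'), SIMPLE_MOVEMENT)

state = demo_env.reset()
total_reward = 0
for _ in range(1000):
    action = demo_env.action_space.sample()              # the agent takes an action (random for now)
    state, reward, done, info = demo_env.step(action)    # the environment returns a new state and a reward
    total_reward += reward                                # reinforcement learning maximises this over time
    if done:
        state = demo_env.reset()
demo_env.close()
print(total_reward)
```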
okay so we're now up to the good bit actually building our ai model so up until now what we've gone and done is we've gone and built up our environment we've gone and pre-processed it now comes the good bit so the first thing that we need to do here is import some dependencies and we've already talked a little bit about the dependencies that we're going to need we are going to be using stable baselines and specifically we're going to be using the algorithm called the proximal policy optimization algorithm so this is exactly what we're going to use here now first up let's go ahead and import those dependencies and then i'm going to give you some sample code that actually allows you to save models as you're training them so you can actually go back through and see the different models that you've actually gone and built so let's go ahead and do this and we'll be able to import our dependencies okay those are our three dependencies that we actually need to build our ai model so first up i've gone and imported os so this just makes it a little bit easier when it comes to actually determining where we want to save our models now i'm gonna give you a call back and i'll explain this in a second i'm gonna give you a call back which actually allows you to save a model every say 10 000 games or 10 000 steps so this means that you can actually have a backup so you don't go through and train your model and then you lose all your progress um you've actually got the ability to go on ahead and reload that okay so we've gone and imported os and then we've imported our main algorithm so from stable underscore baselines got an extra space here from stable underscore baselines three import ppo and so this is just importing the algorithm that we're going to be using to train our ai model or our reinforcement learning model and then we are going to be and then i've imported the base callback so from stable underscore baselines three dot common callbacks import base callback this bit is kind of optional and it's because we are going to be using my train and logging callback i remember where i actually got this from but i think it was inspired from someone else um but i'll include this code inside of the description below or inside of the github repo where i've got this code but this is kind of optional because you don't need to go and save your model every 10 000 steps it just makes it a little bit easier when um and ensures that you've at least got some sort of backup as well so that is our train or those are our dependencies now imported and this is the callback that is going to allow us to save our model every x number of steps so through this callback we're going to be passing through the check frequency so this is how frequently we want to save our model we're also going to be passing through a save path and this is where we want to save our model 2. 
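The imports and the periodic-save callback being described might look like the sketch below; it follows the standard stable_baselines3 BaseCallback pattern with the check_freq and save_path parameters mentioned above (the exact class used in the video is linked in the description, so treat this as one reasonable version of it):

```python
# Assumes stable-baselines3 is installed (e.g. `pip install stable-baselines3[extra]`,
# with PyTorch installed first from pytorch.org if you want GPU/CUDA support).
import os
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback


class TrainAndLoggingCallback(BaseCallback):
    """Save the model every `check_freq` steps into `save_path`."""

    def __init__(self, check_freq, save_path, verbose=1):
        super(TrainAndLoggingCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.save_path = save_path

    def _init_callback(self):
        # Make sure the save directory exists before training starts
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self):
        # Every check_freq environment steps, write a checkpoint such as best_model_10000
        if self.n_calls % self.check_freq == 0:
            model_path = os.path.join(self.save_path, 'best_model_{}'.format(self.n_calls))
            self.model.save(model_path)
        return True
```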
now again this is purely optional because you can save the model manually but i found it good practice to effectively have this that being said the models can be big so i did this a while ago each model is around about 282 megs let me just go back so if you don't have a ton of disk space just be careful how frequently you save it so you can see 282 megs um they get quite big and i saved it every ten thousand steps so you can imagine i think i trained for four million steps so it can take up quite a fair bit of space but that is our base callback and those are our dependencies now imported cool now what we want to do is we actually want to go on ahead and set up this callback so first up we need to set up some directories or where we're going to save our stuff so let's go ahead and do this first okay so those are our two directories so i've gone and set one up called checkpoint uh checkpoint underscore dear and one called log underscore dr and i think i've already got these created but uh that's fine they they create themselves uh where are we mario rl super mario here all right so inside of here i've got two additional folders and we've got one called logs and one called train so inside of train all of our saved models are going to be saved and inside of logs every time we run the algorithm once we're going to create this new log file so you can see i've run ppo a ton of times when testing this inside of there you're actually going to get a tensorflow log file and we're going to set this up in a second but we can actually open up tensorboard to actually be able to go and see the progress of our model in a whole bunch of different metrics i'll show you how to do that as well okay so that is those two directories now set up now what we need to do set up our callback so let's go ahead and do that okay so that is our callback now created so this is just an instance of this big train and logging callback so what we're effectively going to be able to do and again this is purely optional i'll show you where to disable it if you don't want to use it if you don't want to use this what this is going to do is it's going to save our model every 10 000 steps so this means that rather than ensuring that our model stays up and we don't have any errors we can actually just have it automatically do its saving for us so every 10 000 steps we're going to get a new model saved so let's walk through this entire line of code so i've written callback equals train and logging callback and then to that we've passed through two keyword parameters the first one is check frequency and i've set that equal to ten thousand if you wanted to only save every hundred thousand steps you could just add in another zero there so you can actually tune this depending on how much disk space you've actually got and then what i've also gone and passed through is the save path so in this particular case i've called it checkpoint dr and this means that everything is going to be saved into that train folder so that is our callback now done now the next thing that we need to do so it's pretty straightforward from here on out we just need to go and set up our model and then kick off our training so once that starts we'll actually be able to let it run and we'll be able to come back and test it out so first up let's go on ahead and set up our model now remember the model that we're going to be using or the ai whenever you hear me say model i'm talking about the machine learning deep learning or reinforcement learning model or our ai model 
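The directory setup and the callback instance just described, gathered into one cell (the folder names match the train and logs folders mentioned above):

```python
CHECKPOINT_DIR = './train/'   # saved model checkpoints (e.g. best_model_10000) go here
LOG_DIR = './logs/'           # tensorboard log files go here

# Save a checkpoint every 10,000 steps; raise check_freq if disk space is tight
callback = TrainAndLoggingCallback(check_freq=10000, save_path=CHECKPOINT_DIR)
```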
so here it's going to be our ppo model so let's go ahead and create our ppo model okay so that is our model creator let me just run that okay that looks fine that is our model created so i've gone and written a single line there so this is the ai so you can start to see a recurring theme in a large number of machine learning and deep learning projects the hardest bit is actually getting the data into the right format this is actually creating ai right now so that one line has created a temporary ai model we just need to go and train it but the hardest bit is actually getting all the data in the right format ensuring that it's it's ready for training so um so we've gone and created that so let me just write this is the ai model started and what i've gone and written is i've gone and created a new variable called model and i've set that equal to ppo which is this algorithm from up here to that i'm passing through the type of policy so when we actually go and create a reinforcement learning model behind the scenes what we've got is something called a policy network now if you've ever heard of deep learning this is where the term uh actually comes in so in deep learning we've got something called a neural network and think of this as a basically a computer-based brain so you've got a bunch of neurons and these learn the relationship between different variables and target variables so think of this as the brain of our ai now by passing through cnn policy we are using a specific type of neural network behind the scenes that is really fast when it comes to processing images because remember our game is just a set of frames it's the frames from our images so this is why we're going to be using cnn policy you can use another one there is another one called mlp policy this one stands for multi-layer perceptron policy and it's great for tabular data or i think excel spreadsheets think i'm trying to think of other stuff excel spreadsheets csvs um json based data so that's where mlp is good um cnn stands for convolutional neural network policy so we're going to be using that type of policy and let me actually show you where this is so if you go into stable baselines and cnn so down here so this is inside of the documentation so you can actually see what types of policies you can use with different types of algorithms now just as a little bit of background and i know i'm probably diving into this too much um but inside of the older version of stable baseline so stable baselines two you've actually got even more policies and even more algorithms that being said stable baselines three is a little bit more stable so if we go into ppo2 inside of stable baselines um what do they have there so here you can see that they've got mlp policy cnn policy cnn lstm policy so this one um lstm actually stands for long short term memory and it's great for for sequence-based data so our frame stacking data it's great for that as well um so you could definitely try using that in this case we've chosen cnn okay so two ppo will pass through that we've chosen the policy that we want to use we also pass through our environment this is going to be our data remember our pre-processed data from over here and then to that let me just double check we've gone and done that right oh we've actually removed something from here so this should actually say um channels order equals last let's just double check that's working yep cool all right so just remember that we need to have channels underscore order equals last in outside of our 
very frame stack i think i removed it when i was messing around um okay so what we've gone and done so we've gone and specified our policy we've gone and specified our environment setting verbose equals one this means that you get a whole bunch of information when you start training i've gone and set up the tensorboard log so this means that we can actually see a whole bunch of metrics on how our training is actually performing as we're actually going and running our model uh then i've gone and specified my learning rate this is super important so if you want your model to learn faster then um you can drop this down but just keep in mind there's a trade-off here so if you learn too fast it might not actually converge to a good ai model if you learn too slow then it might you might be training for years um so i've set it to 0.000001 and then the next parameter that i've gone and set is n underscore steps so i believe this is uh so the number of steps to run for each environment per update so basically it's how many uh frames we're going to wait per game before we go and update our neural network so i found that a good parameter for this is about 512. and again you can play around with this i just found that that actually gives us a model that tends to perform relatively well relatively quickly cool so that is our algorithm now set up now the next thing that we need to do is actually go on ahead and train our model so let's go on ahead and run this and then we i'll show you how to actually view the log directories then we're going to go back to jimmy show him that we've actually kicked off our training and say where to from there so let's actually go and kick off this training and then we'll wait 10 000 steps because that's no we'll be able to go and see our parameters straight away so let's go on ahead and learn alright so we can go and run model dot learn and then we want to pass through total time steps and we're going to set that to i'm going to go a million for now you don't need to let it train for all that time it's just gonna it's basically dependent on how long you want your model to train for um and then i'm gonna specify callback equals callback all right so let's go and take a look at what we've written there so i've written so this is actually train that ai model this is where the ai model starts to learn okay so what we've written is model dot learn and then to that we've passed through total underscore time steps and we'll set that equals to a million this is effectively how many frames our ai is going to get to c so this means that for every single game imagine every single move is a frame it's gonna get to say a million frames um you can amplify this say for example you're doing multi-processing this is where you could actually run it way faster add in a whole ton more frames but there's gpu limitations and whatnot and then remember how i said if you don't want to use this big callback over here what we set up over here you can just remove this so if you don't want to use that then you can get rid of this and you can get rid of this right so that's all you need to write to actually go and train your models so it's actually not that much when you actually take a look at it given how sophisticated this stuff actually is it's actually pretty cool okay so if we didn't want to run callback but we are so that is the full line there so we're going to actually leave that we want to run a callback uh let's write him again call back equals callback okay so this is uh where everything 
comes to fruition so let's actually go and try to run this also another key thing to note if you are using a gpu it should say use coded using kudo device up here if not it'll say using cpu device so let's kick this off and see how this goes this looks like it's running you can ignore this error okay so this is a good sign so the fact that we've got this little log uh appearing or this little bit of information this shows that our model is successfully kicking off training you can see it's running so what you're actually going to get is a whole bunch of information out of this so you're going to get the number of frames per second the number of iterations the time and lapse so how long it's been training for uh the total number of time steps so this is effectively how many frames your model has gone through and then you get a whole bunch of training metrics so you get i can't remember what kl is again you get clip fraction clip range entropy loss i normally pay attention to entropy loss uh this loss metric down here and then this value loss down here um you can also see your learning rates ideally you want to see your loss going down as you're going through you also want to see your explained variance going up uh as you're going through so those are two good signs that your model is actually converging um but this is just going to keep training for a million steps now once you hit 10 000 steps our callback should effectively save a model so let's wait for 10 000 steps so you can see it's going pretty fast up to 5 000 now up to six thousand where are we up to now seven thousand i'll fast forward this and see when we get to 10 000. a little longer than a few minutes later okay that's 10 240 so you can see that our loss is progressively going down so we're now at 0.125 our value loss is going down and that entropy loss is going down as well okay so those are all good signs and again it might bouncing up up and down because the environment is consistently changing but for now we're going to let that run and a good sign that you want to see is if you go into that training folder where we set up where we're going to save our model so in this case for me it's inside of youtube and then super mario and then train you can see that we've got our first model now effectively saved down there right so that's a good sign so we've got best underscore model underscore 10 000. 
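Gathering the model setup and the training call from the last few paragraphs into one place; the hyperparameters (CnnPolicy, learning_rate=0.000001, n_steps=512, one million total timesteps) are the ones quoted above:

```python
# This is the AI model: PPO with a CNN policy, since the observations are stacked image frames
model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=LOG_DIR,
            learning_rate=0.000001, n_steps=512)

# This is where the AI model starts to learn; the callback saves a checkpoint every 10,000 steps
model.learn(total_timesteps=1000000, callback=callback)
```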
So we are looking good there. The other thing I wanted to show you is how to view these logs. If you want to take a look at what these TensorBoard logs look like, let me show you how; this is purely optional. I'm going to create a new command prompt and navigate to that log folder. So I go into my D drive, then into youtube, or wherever you're working from (where's my 17-1 folder), and activate my virtual environment. That's optional too; if you don't have a virtual environment you can just use your base Python. You can see I've got a virtual environment created here. Then I go into my log directory, then into the last PPO folder, the latest one, number 23, and run TensorBoard. The main command is tensorboard --logdir=. and that is what lets you actually view the logs, assuming TensorBoard is installed in this environment (I can't remember if it is, but we'll see). Okay, that looks like it's running successfully, so we can copy the link it prints and open it up to view your training progress. Pretty cool, right? I'll include some more information on each of these metrics, and what you should be looking for, in the description below, but basically this allows you to monitor how your training is going. You can see our fps and our approx_kl; there's not a ton of documentation on what each of these values actually means, and I'll see if I can find some for you, but you want to see explained variance going up, and I normally look at that and at the loss, which ideally should be going down.
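If you want to follow that log-viewing step yourself, the terminal commands are roughly the following; the paths are placeholders for wherever your environment and PPO run folder actually live.

```
# activate your Python environment first, if you're using one
cd logs/PPO_23          # the latest PPO run folder inside your log directory
tensorboard --logdir=.  # then open the localhost URL it prints in a browser
```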
Okay, so that is our AI model now training. Let's jump back over to Jimmy and have a chat. Well, it's built, ain't nothing to do but check it out now, right? Yeah, let's see it. Alright, so I'm actually going to show you the AI at different stages of the process, not just right at the end; that way you'll actually be able to see what it's learned, right Jimmy? You'll be able to watch our AI model learning. Cool, cool. Alright, we made it guys, time to bring it home. Now that we've trained our reinforcement learning model, we can use it on our Super Mario environment. This is pretty straightforward: in mission 1 we set up a model to take random actions in each game; all we have to do is sub out those random actions for the predictions from our PPO model, using a method called model.predict. Let's go.

Alright, so our model has finished training, we've got to the million steps, so what we can now do is load our models. How many do we have? We've got a hundred different models there. I'll see if I can share some of the best ones via GitHub; they're pretty big, and I don't know if GitHub's got a file upload limit, but that's fine. We've got a ton of models, so we can now dazzle Jimmy. We'll take a look at some of the best ones here and then skip through to a montage where you can see the results compiled, or at least some of the better ones. But how do we actually use this model to play Super Mario? So far we haven't actually applied our AI to the game. The first thing we need to do is load our model from a file: say we stop this training or shut down this notebook, we want to be able to reload the model from disk rather than relying on the one we've got in memory. To do that we can use the PPO.load method. So I'm going to write model = PPO.load() and pass through the file path to the model we want to load. Remember that the models we've trained are inside a folder called train, and they're named best_model followed by the number of steps they were saved at. Let's try loading one at 50,000, then maybe 500,000, and then we'll probably take a look at a million; and then I'm going to do the montage and show you a whole bunch of different models as well. First up, loading a model: the path starts with ./train/, and if we go into youtube, then super mario, then train, let's actually start off with the hundred-thousand-step one. Copy the name, paste it in, and load the model. Cool, that's loaded successfully with no errors, so let's just add a comment: load model. Also, if you just want to save the existing in-memory model, you can type model.save() and give it a name, say 'test_model'. That's a manual save; rather than using the callback, that's how you'd go about saving a model yourself, and if you go into the base folder you can see that model is now saved there. But if you do use the callback, it effectively does that all for you, so you don't need to do this.

Alright, so we've loaded up our 100,000-step model; let's just run that again to make sure it is that one, and now we can try to play Mario with it. This should be the final swan song: we'll test out the 100k model, then try 500k and maybe a million. That's all we need to test out our model, and some of it will look pretty familiar. First we start the game, then we loop through the game, and here's the bit that's new: previously, to get our action, we were just grabbing a random action using env.action_space.sample(); now we're actually making a prediction, passing our state through model.predict and getting our action out of that. What do we actually get out of it? Let me just write state = env.reset(), then call model.predict(state). This gives us our action back, and in this case we get action five. What is that action actually saying? If we go to SIMPLE_MOVEMENT, it's basically saying that in this particular environment the keys to press are right, A and B. Reset and run it again, and this time we get right; run it again and we get right plus A. So you sort of get the idea: this is how our model makes predictions and determines the next best action. It might seem a bit weird that we're getting different actions back for the same state, but that's because model.predict samples from the policy by default (it's stochastic unless you pass deterministic=True), so that's fine.
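Put together, loading a checkpoint and letting it drive the game looks something like this sketch. The checkpoint path is just an example, and env is assumed to be the same wrapped, frame-stacked environment from the setup steps.

```python
from stable_baselines3 import PPO

# Load a saved checkpoint from disk (any of the best_model_* files works)
model = PPO.load('./train/best_model_100000')

# model.save('test_model')  # manual save of the in-memory model, if you're not using the callback

# Start the game and let the trained model pick every action
state = env.reset()
while True:
    action, _ = model.predict(state)              # predicted action for the current frames
    state, reward, done, info = env.step(action)  # apply it and get the next observation
    env.render()
```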
So let's take a look at how this actually performs. If I delete that exploration code, what we do is take the two values we get back from model.predict and pass the action to env.step. If we run this, you can see the game popping up, and remember this is our 100k model. Let's make it a bit bigger. Right, so it's still getting stuck at the pipe, and it doesn't look like it's making too much progress. Kind of average; it's not that great at the moment. We can shut that down, because it looks like it's just going to stop there. What we can do now is, rather than loading up the hundred-thousand-step model, load up the 500,000-step model; all I've changed there is which model we're going to load. Reset the state and see if this runs. Okay, take a look at that: it's clearing a bunch of stuff, it's able to jump over one of the bigger pipes (although it did kill a mushroom), and it looks like it's having a little bit of trouble jumping over that second hole, but you can see it's definitely playing better. Keep in mind our reward function was all to do with how fast and how far it can go, hence why it pushes so hard to the right. It's still having trouble with that second hole, but you can definitely see it's playing better than our random-action environment. Pretty cool, right? That is our AI model successfully running. The last thing I want to show you is the million-step model, and then we'll do a bit of a demo, or a bit of a montage, for Jimmy. So let's stop this and try loading up our one-million-step model; that's one, two, three, one, two, three zeros. Load up the million model, reset the environment, rerun. This is the million model running now, and it still doesn't like that hole, does it? Occasionally it's going to run well, and anybody who says these models work perfectly every single time is probably lying to you, unless you're part of the DeepMind team and you've trained for god knows how many steps. It still doesn't like that second hole, so in this particular case our one-million-step model might not even be performing as well as our 500,000-step model. It's kind of crazy how fast it runs as well; I can't play Mario this fast. I could keep watching this for ages, but what I'm going to do now is kill it off, do a quick review, and then we're going to jump over to a montage and see a whole bunch of the different progressions.
I've only shown you three different models, but some might actually perform better than others, so let's take stock. Let's stop this. What we've done is a ton of stuff: first we set up Mario; then we pre-processed our environment, and remember we did a couple of processing steps there, simplifying the controls, grayscaling the frames, and frame stacking them to give the model some sense of motion; then we created an AI model using a method called reinforcement learning, with the PPO algorithm; we trained it for a million steps; we were able to load our model back up using PPO.load; and this little script down here lets us test the model out. It's pretty similar to what we had originally right up at the top, except that rather than using env.action_space.sample() we're now using model.predict to choose the next best action to take. On that note, let's jump over to the montage and see how this model actually performed across time. [Music] First up is the model trained for ten thousand steps; this one was pretty sucky at this stage. A hundred thousand steps: starting to get a little bit better; it took a while, but it eventually got over the second pipe. [Music] And 500,000 steps: this is where the game started to change, and you can see that Mario is going for a hell of a lot longer. So I ended up training the model for a bunch longer. The models from one million to around three million steps were pretty average, to say the least, and then it happened: we got our dream run. At four million steps, Mario went and smashed out a level. Every now and then he gets stuck, but he does make it through; wait until the end to see how he finally performs, and take a look at that final score up in the top-left corner. [Music] So what do you think? Not bad, man. How do you think you might go about improving performance? Well, there are a couple of ways. You could try training it for longer with a slower learning rate: it would take longer to teach our AI model how to perform in the game, but ideally it should end up at a better outcome; not always the case, but it can definitely help. We could also try multi-processing, which means our AI model plays multiple games at the same time, so it should effectively learn faster (there's a rough sketch of that below). The last thing we could try is a different algorithm. There's a whole bunch of things we could definitely do. Sweet, not bad for a first crack. So what game's next? Who knows; let me know what games you'd like to see in the comments below. And that's a wrap. Thanks so much for tuning in, guys; hopefully you enjoyed this video. If you did, be sure to give it a big thumbs up, hit subscribe and tick that bell. And again, thank you so much; we just hit 50,000, and I appreciate every single one of you, thank you for getting us here. If there's anything you need, by all means hit me up in the comments below. Thanks again for tuning in, guys. Peace.
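As a footnote on the multi-processing idea from the wrap-up: with stable-baselines3 the usual approach is to vectorise the environment so several copies of the game run in parallel. Here's a rough sketch, assuming the same gym-super-mario-bros environment and wrappers used earlier; the number of copies and the helper name are just illustrative.

```python
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace
from gym.wrappers import GrayScaleObservation
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack

def make_env():
    # One copy of the wrapped Mario environment
    env = gym_super_mario_bros.make('SuperMarioBros-v0')
    env = JoypadSpace(env, SIMPLE_MOVEMENT)
    env = GrayScaleObservation(env, keep_dim=True)
    return env

if __name__ == '__main__':
    # Run, say, 4 games in parallel in separate processes, then stack frames
    env = SubprocVecEnv([make_env for _ in range(4)])
    env = VecFrameStack(env, 4, channels_order='last')
```

The PPO model can then be created around this vectorised env exactly as before, and each update draws experience from four games instead of one.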
Info
Channel: Nicholas Renotte
Views: 141,808
Keywords: ai, python
Id: 2eeYqJ0uBKE
Length: 77min 5sec (4625 seconds)
Published: Thu Dec 23 2021