Reinforcement learning - How to train an AI to play Snake using the stable-baselines3 framework

Captions
Hello everyone, today we will be talking about reinforcement learning, which is one of the machine learning techniques used to train AIs to play games. For our example we're going to use the stable-baselines3 framework, which is in my opinion one of the best reinforcement learning frameworks for teaching AIs. The game we are going to teach the AI to play is the Snake game, and we are going to use Python throughout this video.

Before we start teaching the AI to play Snake, let's go ahead and play the game ourselves. For that I will start by creating a new project, and I will use PyCharm as my IDE. While the PyCharm IDE is initializing: the Snake game I'm going to play is from a GitHub repository, which I'll share in the description, by someone called Rajat. I will create a new project, call it "Snake AI project", and it uses pygame, which is a library for creating games with Python. Creating the virtual environment will take some time, so in the meantime let me talk a little bit about the stable-baselines3 framework and what reinforcement learning is.

Reinforcement learning is based on giving actions to the AI. In our example the actions would be going up, down, right, or left, and you reward or punish the AI based on the action it makes: if it made a good action you reward it, if it made a bad action you punish it. I will keep talking about this, but let me go ahead and install some dependencies that we are going to need. We will need pygame, and we will need gym, which is where we are going to create our environment; we will see what that means throughout the video. As I was saying, the AI picks an action, and if it picks the right action you reward it, if it picks the wrong action you punish it. The AI keeps iterating through each step and basically tries to make the action that will return the best reward. That's how reinforcement learning works.

There are a lot of great examples of how this has been used, and very good ones come from OpenAI, DeepMind, etc. Two favorite examples that I am aware of are AlphaZero, an AI that has beaten, if not crushed, Stockfish, which is the best chess engine, and Dota 2. I've been playing that game for over 10 years, it's a very complicated game, and the AI even beat the world champions, which was astonishing at the time.

That being said, I'm still installing dependencies. So far I've installed pygame, which is the library we are going to use to play our game, gym, which is going to be used to create our environment, and stable-baselines3, which is our framework for applying reinforcement learning. After that I will install another library called pyglet, which is used by many of the built-in gym environments to render some of their examples. The reason I want it is that before we train our AI to play Snake, I want to show you a sample example and then build on top of it, so that you understand what each part does and what it actually means.

So we will start by playing the Snake game and understanding its rules, seeing how it's played. Step number two, I will show you a sample example from stable-baselines3. And the last step, the one we are waiting for, is to create our own custom Snake environment in gym and then teach the AI, using that environment, to play Snake on its own. This is the last dependency that I want to install, and of course, even though I'm installing these through the IDE and a virtual environment, you can also install these dependencies using pip install.
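From a terminal, installing the same four dependencies named above would look roughly like this:

```
pip install pygame gym stable-baselines3 pyglet
```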
Okay, it has installed all the dependencies we need for this video. I will create a new file to play Snake in, copy the game code, and paste it as is. Let's play the game now. You'll see I don't have an option to run the game yet because there are some background tasks happening in PyCharm; it should only take a few seconds, I hope. Okay, now I can run the game, and there's something very important that I want to mention.

Let's start with the moves, or the actions, that we can make: we can go up, right, down, or left. To get a better score in the game you reach the food, and as you can see on the top left, the score keeps increasing. There are two ways, or two conditions, to lose the game. One condition is to eat yourself, so I will try to eat myself... and I died, and I can see this "You Died" screen. The second condition, which is not usually present in many Snake games, is that if you hit the wall you die; in most Snake games, when you hit the wall you come out from the other side. One last thing I want to mention is that you cannot turn 180 degrees: if you are already going right you cannot go left, and if you are going up you cannot immediately go down; you have to turn, for example to the right, and then go down. So these are basically the rules of the game. How do you win? There's no realistic way to win; you just want to achieve the highest score by eating food, and you lose the game in two conditions: by touching yourself or by hitting one of the four walls on the edges.

Let's go to main.py and on to step number two, which is to use the sample example they have, and we are going to explain what each line does. Let's go. First, I'm importing gym at line number 1. gym is used to create environments, and an environment is a Python class with a list of functions, which we will go into in a lot of detail. For example, it has a step function: for each action you make, you go through the step function. There's a reset function, which resets all your variables. Then you have the render function, which renders your results, or your model's predictions, on the screen; in our Snake example, rendering means the window, the snake, and the food, and we can see how they interact in the pygame window. We have a close function, which closes your environment; in our pygame example, that would be quitting the game. And it has a constructor, and inside that constructor you can add all your initialization logic, such as setting the initial snake position, setting up the game window, setting the initial food position, and so on.
Line number 3 is where we import our reinforcement learning algorithm, and I recommend that you go to the documentation and go through some of the algorithms you can use with stable-baselines3: there's A2C (actor-critic), there's DDPG, DQN, PPO, etc. My advice is, when you train your model, try training it with different algorithms and see which one gives you the best performance.

Line number 5 is where we assign the environment, the gym environment, to a variable. gym has many pre-built environments, and I recommend that you actually go to the gym GitHub repository and look at the code for many of them. In this example we're using the CartPole environment: you have a cart, which we will see, and a pole on top of that cart, and the cart's job is to balance the pole and prevent it from falling down. If the pole goes below a certain angle, you lose the game.

Line number 7 is where you initialize your model: you provide your reinforcement learning algorithm and a policy; some of them have different policies, which you can read about in the documentation, and some don't even take a policy. After that you train your model and provide how many iterations you want to train for. The bigger the number of iterations, the better, but at the same time it will be more time consuming.

Line number 10 is where you reset the environment, because when you train your model, after the training is done some of the variables will not be at their initial state, so you want to make sure you reset them before you test your model and see how it works in action. Here we are iterating for 1,000 steps. What is a step? A step is basically an action: for each action, you move one step in your environment. And please pay attention to what I'm saying right now if you don't yet know what a step is or what rendering is; you will see all of that in much more detail. At line 12 we take the action from our model, then we pass it to the environment, and then we render it and see the result of the action our model made, how it looks in the render function, in the window where we see our logic in action. That's basically it. And this is very important: when the game is over, or "done", we want to make sure that we reset everything to its initial state so that we keep playing again and again and again. That's the purpose of the if statement.

That being said, I will go ahead and run this example. If our model was trained correctly, the pole should stay balanced on the cart. 10,000 iterations usually takes around maybe 30 seconds, so it shouldn't take that long, but if you are training for a hundred thousand or for millions, it's very normal for training to take days if not weeks. As you can see, our model is doing a very good job at balancing that pole by moving right and left to make sure the pole stays balanced, so this is a good indication that we have trained our model correctly. Of course it was trained correctly, because this is a pre-built environment and the logic behind it was written correctly. There are over 10 examples you can see in gym besides this one, and believe it or not, many of them are so complicated that even the Snake environment I will show you is easier than many of them.
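Putting the lines just described together, the sample looks roughly like this. This is a minimal sketch assuming the CartPole-v1 environment id and the A2C algorithm (the one used in the main stable-baselines3 example), with the older gym API where step returns four values; the line numbers in the comments follow the narration, not necessarily your own file:

```python
import gym
from stable_baselines3 import A2C

# Line 5: assign one of gym's pre-built environments.
env = gym.make("CartPole-v1")

# Line 7: initialize the model with an algorithm and a policy, then train it.
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

# Line 10: reset the environment before testing the trained model.
obs = env.reset()
for _ in range(1000):
    # Line 12: ask the model for an action, apply it, and render the result.
    action, _states = model.predict(obs)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        # Reset everything to its initial state so a new episode starts.
        obs = env.reset()

env.close()
```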
Now that we've seen how this works and what the game is about, let's implement the logic to train our AI to play the Snake game. The first step is to create our own gym environment instead of using the pre-built environment we saw at line number 5. How can we achieve that? Let's go to the documentation, to the "using custom environments" section. I will just copy this template, paste it, and then we will explore what that custom environment looks like. I will put it in the same file so that we can navigate more easily.

I will start with the metadata line, where we can provide generic information about this environment. The metadata provided here is the render modes, and there are many modes, but I want to introduce you to two of them: one known as "human" and one known as "console". Human is usually used when you are rendering to a window; the CartPole example actually uses the pyglet library, while in our Snake game we are using the pygame window. When we render, we see the window, we see the pole, we see the cart, we see things visibly in some sort of window; that is mostly what the human mode is for. The console mode is used when you are not rendering to a window but instead printing statements, printing things to the console, which is where the word console comes from. Since we are using pygame, we are going to keep this as human.

Then there is the constructor of the environment. We can pass arguments just like in any Python class; however, we are not passing anything in our case, so we will keep just the self keyword, which is similar to the this keyword in Java and JavaScript. I will call the class SnakeEnv just to be consistent with our game. There are three very important lines here which we will keep: the call to the super constructor, which is for the gym.Env class, something known as the action space, and something known as the observation space.

What do they mean? Let's start with the action space. The action is what action we are making: for each step the AI makes, what are the actions, the inputs, it can choose? In the CartPole example there were only two inputs, you can either move the cart to the right or move it to the left, but for Snake we have four: we can go right, left, up, or down. There are two types of action spaces: discrete actions and continuous actions. A discrete action is when you make only one discrete, single value of action for each step. In our Snake game it's only up, only down, only right, or only left; you're not doing multiple things per step, you're only hitting one key for each step, and therefore our action space is going to be discrete. Inside this spaces.Discrete function we provide the maximum number of inputs that our model can have, so in our case this is going to be four, and to illustrate that I will put it in a variable and pass it here, so that when I push this to GitHub you can have a better understanding of what we are doing.

Line number 20 is for the observation space: what are the things that we are observing in our environment that affect our action, or our AI?
Now let's think about this logically: how should we determine whether to move right, left, up, or down? Well, we need to know where our snake is and we need to know where our food is, or in other words, we need to look at the position of the snake and the position of the food. Since this is a 2D game, each position has an x value and a y value, so we have the snake position in x, the snake position in y, the food position in x, and the food position in y. Therefore we have a total of four observations, so let's write that down: number of observations equals four. You can set limits for the observations, a minimum and a maximum value; in many of the gym pre-built examples what I've seen is that they set this from minus infinity to plus infinity, and for the shape we provide the number of observations. We can ignore these other two parameters, those are for something more complicated, but in our simple case what I recommend is to set the range from minus infinity to plus infinity, set the dtype to float, and pass the number of things you are observing. What else? We need to import the numpy library, so let's go ahead and do that. I'll just try to put this on the same line so we can read it more easily. That's it for the initialization function, the constructor; we will add more stuff to it later, so stay tuned.

Our step function is the function that gets called for each action made by the AI, whether right, left, up, or down. For example, when we move, the snake position changes, and if it ate some food, that food disappears and moves to a new position; we will see how that looks in a second. You can see that it returns four values: observation, reward, done, and info. The observation, as I mentioned, is the things we are observing, so in our case that would be the snake position in x, the snake position in y, the food position in x, and the food position in y, as a numpy array. The reward is the most important thing I want you to focus on: it's a numeric value, anywhere from minus infinity to plus infinity; if our AI did something good we want to reward it with a positive value, and if our AI made the wrong action we want to reward it with a negative value. The bigger the magnitude of that reward, the higher the impact, so a +10 is better than a +1, and a -10 is much worse than a -1. Done indicates when the game is actually done; in the CartPole example that would be when the pole falls down or goes below a certain angle, and for Snake it would be either the snake eating itself or hitting one of the four edges or walls. The info is an optional object that you can return from your step function; in our example we will make it an empty object, but if you want to look into some variables or debug some things, you can put them in the info object.

The reset function is the function responsible for resetting everything to its initial state. For example, in our Snake game, we want to initialize the score to zero, we want to initialize the snake position to where it was, which is usually near the top left, we want to initialize the food position to some random position, and so on. We also return the observation, and the observation of the reset function is the same as the observation of the step function. The render function is where you put all your rendering logic, so for example displaying the window, updating the score; all that stuff goes inside the render function. The close function is how you would close your game; in our example, since we are using pygame, the function is pygame.quit().
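Before filling anything in, here is roughly what the custom environment template looks like once the action space and observation space are wired up. This is a sketch under my own assumptions (the class and constant names are mine, and it uses the older gym API where step returns four values, which is what stable-baselines3 expected at the time):

```python
import gym
import numpy as np
from gym import spaces

NUM_ACTIONS = 4        # up, down, left, right
NUM_OBSERVATIONS = 4   # snake x, snake y, food x, food y

class SnakeEnv(gym.Env):
    metadata = {"render.modes": ["human"]}

    def __init__(self):
        super().__init__()
        # One discrete action per step: an integer in 0..NUM_ACTIONS-1.
        self.action_space = spaces.Discrete(NUM_ACTIONS)
        # Observations are unbounded floats: snake x/y and food x/y.
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(NUM_OBSERVATIONS,), dtype=np.float32)

    def step(self, action):
        observation = np.zeros(NUM_OBSERVATIONS, dtype=np.float32)  # placeholder
        reward, done, info = 0, False, {}
        return observation, reward, done, info

    def reset(self):
        return np.zeros(NUM_OBSERVATIONS, dtype=np.float32)  # placeholder

    def render(self, mode="human"):
        pass

    def close(self):
        pass
```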
So let's fill out these functions based on the game that we already have. First I will start with the difficulty: this difficulty value is used for the frames-per-second controller, so the higher this value is, the faster the snake, or the game, goes. I will copy that, and for any constant values I will put them directly under the class, but feel free to use Python in whatever way you think fits you best. For the difficulty I will pick 20, because I'm trying to make the game a little bit easier so I can explain to you how to train the AI. Of course, the bigger the difficulty and, for example, the bigger the window size, the harder the game is, but in return you can do more iterations and your AI's performance will be better. I don't want to take hours training the model in this video, so for that same reason I will also divide the window size by two.

That being said, let's scroll down. Here we are initializing pygame, and the result is assigned so we can check for errors: if there's an error it will print it, but I know for a fact that we should not expect any errors here, so I will just initialize it. As we mentioned, any initialization logic should be inside the constructor of our gym environment, so I will put that here and import pygame. Let's continue. Again, these are things that happen when the game is initialized; he even wrote a comment about that, so let's put them here too. Now I want to explain something very important, especially for people who are not familiar with Python: you want to make sure to prefix your object variables with the self keyword so that they can be accessed in other functions. For example, this game_window variable will be used in other functions, as you can see, so I will prefix it with the self keyword to make it an object variable, and this is something you will see me do a lot throughout the video. These are the colors; they are constants, so I will just put them at the top. This is something that only happens at initialization, and the truth is, anything before the while loop is something that happens at initialization; the reason we have a while loop in the pygame code is that we want to keep iterating, because there is state that keeps changing while the game is being played. So let's copy all these lines together and put them here, and again, this is something you will see me do a lot, which is to prefix our variables with the self keyword. Now you might be wondering why I'm not prefixing random: that's because it isn't an object variable, it's the library that we get from Python. And now all the errors that we had are gone.
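As a rough sketch, the constructor from the skeleton above, expanded with the initialization logic just described, might look something like this. The window size, caption, colors, and variable names are assumptions based on the typical pygame Snake game rather than the exact code from the repository:

```python
import gym
import numpy as np
import pygame
from gym import spaces

class SnakeEnv(gym.Env):
    # Constants sit directly under the class: game speed and a window half the original size.
    DIFFICULTY = 20
    WINDOW_X, WINDOW_Y = 360, 240          # assumed half-size window
    BLACK = pygame.Color(0, 0, 0)
    WHITE = pygame.Color(255, 255, 255)
    RED = pygame.Color(255, 0, 0)
    GREEN = pygame.Color(0, 255, 0)

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)

        # pygame only needs to be initialized once.
        pygame.init()
        pygame.display.set_caption("Snake AI")
        # Object variables get the self prefix so other methods can use them.
        self.game_window = pygame.display.set_mode((self.WINDOW_X, self.WINDOW_Y))
        self.fps_controller = pygame.time.Clock()
        # The mutable game state (snake position, body, food, direction, score,
        # game_over) is also set up here; the reset() sketch further down
        # re-creates exactly the same values.
```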
Now we have this game_over function. If you remember, when I died we saw the red "You Died" text on the screen and we saw the score in red at the bottom. When we test our model we don't want to see that every time the model loses; instead we want to do something similar to the CartPole example, which is that when the game is done, we reset everything to its initial state and the game keeps playing automatically so we don't keep wasting time. So for that I'm going to turn game_over into a variable, or to be more precise, a boolean variable, and we will not run all this display logic, so that the game keeps being played even after we lose. I will create this variable and assign it to False, because at the beginning it's obviously not game over.

This is the show_score function, which shows the score in the top left. I will put this function at the bottom, and since it's a function inside the class, I have to add self as the first parameter, and as usual we prefix all of the variables with self, as you can see here. Everything looks good so far, let's go down. Now, this is the main logic, as you can see here, and essentially all the logic that is not related to rendering, the logic of the game itself, should be inside the step function. Here, as you can see, if we try to quit the game, the game gets quit, so let's put this inside our close function, and let's import the sys package.

Let's go back. When I was playing the game I was using the keyboard, and there are two ways to play: either use the up, down, right, and left keys, or use WASD, just as in most games. In our case, the input is the action itself that is passed to the step function, as we can see here. But before I tell you what that value is, let's copy this logic for now and see how it looks in action, and just before I continue, I will make sure this function is at the same indentation.

So, as we said, this is the input, this is what the AI provides as an action. What is the value of that action? Since we are using a discrete space, as we've shown earlier, it's actually going to be an integer, and that integer depends on the number of actions that we have. Since the number of actions is four, the action value ranges from zero to number of actions minus one, so in our case, since num_actions equals 4, the action value is from 0 to 3 inclusive. Now, we could deal with raw numbers in our code and that would be totally fine, but if we want to make the code more readable, it's highly recommended to replace the numeric values with constants. So I will create a constant for each action that our environment, or our model, can make; I will call these the action constants. Just before I continue, I also noticed that this fps_controller should be created after pygame is initialized, if I'm not mistaken; let me scroll up; yes, it has to come after you initialize pygame. Okay, back to the action constants. I will create four constants, left, right, up, and down, and I name them this way to be consistent with the snake game file, as we will see in a second: up equals 2 and down equals 3. So now we want to say: if the action is up, this is our up direction; if it's down, this is down; if the action is left, this is left; if the action is right, this is right. We are replacing the keyboard events with the action value, so let's do this, and in the second case, sorry, this is up, this is down, this is left, and this is right. Again, make sure all these variables are always prefixed with the self keyword, since they are object variables. Let's continue.
They have a small event here for when you hit Escape, but we are not going to hit Escape, obviously. I will copy these two blocks at the same time just to save you some time, take this, and put it here; let me tidy things up, and I will explain what this logic does in a second. Sorry, this is the snake position.

Okay, so for this code: remember when I told you that the snake cannot turn 180 degrees? If the snake was already heading right, it cannot go left, and the same holds for all the directions; this is the logic that enforces that. And this next part is the logic that changes the snake position when you are moving in a certain direction: if you're heading up, the snake position changes accordingly, and the same goes for all the directions. Okay, let's continue, and as I did previously, I will just prefix all of these with the self keyword, and then we will explain briefly what this logic does.

So what does this logic do? Every time the snake moves, its body size increases, but then why does the snake body size stay the same when you don't eat food? Because if you don't eat food, the snake body decreases by this pop function, but if you ate the food, the body increases and doesn't decrease, and that's how the snake body only grows when you eat food. And obviously, if you eat the food, that is, if the position of the food is the same as the position of the snake, you increase your score and then you set food_spawn to False, because you want the food to disappear and then appear in a completely different position; that is the logic here. So if food_spawn is False, or in other words if you ate the food, make sure you put the food in a different place and mark the food as spawned again. Let's scroll down.

The gfx stuff, the game_window fill, drawing rectangles, all of these pygame calls, what are they? All of these are related to rendering, not to the logic itself, which means we need to put them in the render function. I just have to explain this part to you: this is the function responsible for drawing the snake, the snake body, the food, etc. And these are the conditions for when the game is over. So when is the game over? We have two conditions: one is if we hit one of the edges, one of the walls, and the second condition is when the snake eats itself. This one is for the walls, and this one is for when it touches itself, and here, as I mentioned, I set game_over to a boolean value of False and True, so in this case it would be set to True; let me do the same thing here, and here. Okay, everything looks good so far, let's scroll down. Showing the score, updating the display, and ticking the frames-per-second controller are all related to rendering, so let's take them and put them in the rendering logic. Okay, I apologize, I just noticed something: this one should not have the self keyword, and that's why it was complaining; pygame is initialized once and that's it, you don't need to prefix it with self.

Okay, that's it, you've now done one of the most important steps, which is to take the logic of your game and split it into the functions that you have. We are done with the close function completely, we are done with the render function completely; the only two functions left are the reset function and the step function.
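For reference, the game logic that just moved into step() looks roughly like the following. This is a sketch rather than the exact code from the repository: the action constants, the "RIGHT"/"LEFT"/"UP"/"DOWN" direction strings, the 10-pixel grid, and the window constants are assumptions based on the typical pygame Snake game.

```python
import random

# Hypothetical action constants, in the order mentioned above.
LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3

def step(self, action):
    # Turn according to the model's action, but never allow a 180-degree reverse.
    if action == UP and self.direction != "DOWN":
        self.direction = "UP"
    elif action == DOWN and self.direction != "UP":
        self.direction = "DOWN"
    elif action == LEFT and self.direction != "RIGHT":
        self.direction = "LEFT"
    elif action == RIGHT and self.direction != "LEFT":
        self.direction = "RIGHT"

    # Move the head one cell (10 pixels) in the current direction.
    if self.direction == "UP":
        self.snake_position[1] -= 10
    elif self.direction == "DOWN":
        self.snake_position[1] += 10
    elif self.direction == "LEFT":
        self.snake_position[0] -= 10
    else:
        self.snake_position[0] += 10

    # The body grows at the head every step; pop the tail unless food was eaten,
    # so the body only gets longer when the snake actually eats.
    self.snake_body.insert(0, list(self.snake_position))
    ate_food = (self.snake_position == self.food_position)
    if ate_food:
        self.score += 1
        self.food_spawn = False   # make the food disappear...
    else:
        self.snake_body.pop()

    # ...and respawn it somewhere else on the grid.
    if not self.food_spawn:
        self.food_position = [random.randrange(1, self.WINDOW_X // 10) * 10,
                              random.randrange(1, self.WINDOW_Y // 10) * 10]
    self.food_spawn = True

    # Game-over conditions: hitting one of the four walls, or touching the body.
    if (self.snake_position[0] < 0 or self.snake_position[0] > self.WINDOW_X - 10
            or self.snake_position[1] < 0 or self.snake_position[1] > self.WINDOW_Y - 10):
        self.game_over = True
    for block in self.snake_body[1:]:
        if self.snake_position == block:
            self.game_over = True
    # ...the observation, reward, done, and info come next (see the sketch further down).
```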
So what are we supposed to do with them? Let's start with the reset function, which is the easier one. As I mentioned, the reset function is used to set your object variables back to their initial state. What is the initial state of these variables? It's the values that we see in the constructor. Some of these things always stay the same, and if you scroll all the way down, the action space and observation space you don't touch: you just put them in the constructor and never look at them again. The values that change as you play the game are the ones in the middle: the snake position, the snake body, the food position, whether the food is there or not, the snake direction and the change-of-direction variable, the score, and the game_over variable. So let's take these and put them in the reset function, as you can see here.

Then we return the observation, and as I mentioned, we are observing the snake position and the food position. So let's do that: self.snake_position[0], self.snake_position[1], self.food_position[0], and self.food_position[1], the y, which is the second element of the array. Here there's something very important that I want to highlight; let me just fix the indentation first. This is something that I read in many places, which is to make sure this is a numpy array, so I will wrap it in a numpy array, and we want to make sure the dtype of that numpy array is the same dtype that we used at the top in the observation space. The things we are observing are numeric, they are numbers, but just make sure it's the same type, so let's put that in there, and I think this should be good. Let's look... okay, I don't think it matters, but let me just fix this. So far we are doing very well, and now we are done with the reset function.

So what about the step function? We are returning four things: the observation, the reward, the done flag, and the info. Let's start with the observation; we mentioned it's the same thing, so let's copy and paste it. What about the reward? I will keep the reward until the end, because it's the most important thing, the one I want you to pay attention to. The done flag is for when the game is done, so let's ask ourselves: when is the game done? The game is done when the game is over, or in other words, when the game_over boolean variable is set. What about the info? It's something extra that you can add to your step function; in our case it's going to be an empty object. Now, last but not least, the reward. I usually prefer to initialize it to zero, which is neutral, not bad and not good, and let's think for a second about how we can reward or punish our AI. Thinking about it simply, when can we say that our snake did a good job and when can we say it did a bad job? Well, the snake did a bad job if it lost the game, or in other words, the snake did a bad job if the game is over, right? So let's start by implementing this: if self.game_over, that means we want to make this bad, so we assign it a negative value. And when did it do a good job? It did a good job if it reached a piece of food because of that action, right? So when does it eat the food? It's when the score is incremented, and if you look at the top here you can see where the score is incremented, or in other words, when the food position and the snake position are exactly the same. So let's copy that condition, put it in, and assign the reward to 1.
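A sketch of what the reset function and the shared observation might look like at this point. The starting positions and the _get_observation helper name are mine, not from the video; the idea is simply that reset() re-creates the same values the constructor set and returns the same observation that step() returns:

```python
import random
import numpy as np

def _get_observation(self):
    # Snake head (x, y) and food (x, y), as a numpy array with the same dtype
    # as the observation space defined in the constructor.
    return np.array([self.snake_position[0], self.snake_position[1],
                     self.food_position[0], self.food_position[1]],
                    dtype=np.float32)

def reset(self):
    # Everything the game mutates goes back to its starting value.
    self.snake_position = [100, 50]
    self.snake_body = [[100, 50], [90, 50], [80, 50]]
    self.food_position = [random.randrange(1, self.WINDOW_X // 10) * 10,
                          random.randrange(1, self.WINDOW_Y // 10) * 10]
    self.food_spawn = True
    self.direction = "RIGHT"
    self.change_to = self.direction
    self.score = 0
    self.game_over = False
    return self._get_observation()
```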
Now, what do you think will happen if we train the model like that? Do you think our AI is going to be trained very well? The answer is no. How do I know the answer is no? First, because I tried it before the video, and there's something I want you to take from this: if you forget everything else, just remember this one word: experimentation. Always experiment with different things as you train your AI and you'll keep discovering more and more amazing things, and one of the most important aspects to experiment on is the reward, how you are rewarding your system, and then watching its behavior under different reward values.

So there is one problem with this reward system, and the main problem is that if you run this AI for, say, a thousand or two thousand iterations, it will be rendered useless. Why? Because the chance of the snake eating the food under these conditions is extremely slim, and I would say with confidence that well over 90 percent of the time it's going to lose the game, because it doesn't understand how it should reach the food. So we need to teach our AI how to get to that food, or we need to motivate the AI to reach it. In order to achieve that, we need to look at the position of the food and the position of the snake: if, after you run a specific action, a specific step, you notice that the snake is closer to the food, it means the snake is doing the right thing, but if you notice that the snake is moving away from the food, it means your AI is probably making a wrong decision. So what we are going to do is compare the current position of the snake with the previous position of the snake, that is, compare the distance between the current snake position and the food position with the distance between the previous snake position and the food position. And since we are concerned with distance, we are going to take the absolute value of the difference between the position of the snake and the position of the food. Let's see how that looks in action.

So, else if... now, since we want to compare with the previous position, before I write this I will create a new variable called the previous snake position; let me put it here, and I will assign it, initially, the same value as the current snake position. And something very important before we even continue: anything we put in the constructor like this, we need to make sure is also in the reset function. Extremely, extremely important, otherwise when the game is done and the AI tries again, it will think that the previous snake position is somewhere far, far away from the current snake position and it will be extremely confused. You do not want that. Okay, so I put this in the reset function; let's scroll to the step function and add our logic to compare the distances. Else if the absolute value of the difference between the current snake position in x and the food position in x, plus the absolute difference between the current snake position and the food position in the y axis, is larger than, and let me copy the same thing and say this is the previous snake position... so what does this mean? It means that if the distance from the current position is larger than the distance from the previous position, we are going farther and farther away from the food, which is a bad thing, and therefore we want to punish our AI by assigning a negative value to our reward, so let's assign it -1. Else if...
...instead of writing everything from scratch, let's do it like this: we want to reward the system if the current position is closer than the previous position, as you can see here. And something of the utmost importance: after we do that, we want to assign the previous position to the current position, because the current position will be the previous position of the next iteration, the next step, so let's do that. Now, there is one tiny problem if we do it like that, and the problem is here: the snake position is being assigned by reference to the previous snake position, so if the snake position changes in the next iteration, the previous snake position is going to change as well, and we do not want that. Therefore we are going to use Python's copy function to make sure the previous snake position is assigned a copy of the current snake position, so that if the current snake position changes in the next iteration, the previous snake position stays the same.

Okay, now let's go back and look at this reward system. This reward system is definitely better than the previous one, but there is just one tiny thing about it, which is that we are saying, or claiming, that getting closer to the food is as important as eating the food itself, and that moving away from the food is as bad as losing the game. But this is not the truth: these are just motivations to make the AI do the right thing. The truth is, eating the food is way more important than merely getting closer to it, and losing the game is way worse than merely moving away from the food, therefore the magnitude for losing the game or eating the food should be much higher. So for that I will put in some arbitrary value of 100.

Okay, so now that we've done this, are we good? We are getting better, but the truth is there is something even bigger than this. Why, and how did I know? I knew because I was experimenting before the video, as I explained, and this is something you would not think much about when you play the Snake game yourself, but once I tell you about it you'll say, oh, that actually makes a lot of sense for the AI. You know what is going to happen if you try to train it like this? On many occasions you will see that the snake just keeps rotating around itself forever. Now you might be wondering why it does that: it's because it's trying to avoid any chance of losing the game; it looks at this reward, sees that it's -100, and says, oh, that's extremely bad, let's avoid it at all costs. Let me show you what I mean; I will just run the snake game again. This is what I mean: you'll see the snake just keeps rotating like this forever, which makes sense from its perspective, because it thinks it's doing fine as long as it's not losing the game. However, I want to avoid that in our video, so I'm going to do two things. First, I will ignore this game-over penalty. You'll tell me, okay, but that will make the snake more likely to lose the game, and I will tell you, yes, that's true; however, if the snake is trained properly it will actually try to go to the food, and since it's trying to go to the food, it will inadvertently avoid hitting the walls. Now of course, going back to this logic, removing the penalty makes hitting the wall more likely, but in return it gives me a higher chance of getting a higher score.
The chance of hitting the wall with this logic is higher, but at the same time the chance of getting a higher score, believe it or not, is much better. And for that same reason, since I don't want the game to keep stalling because of the chance of the snake rotating forever, I will create a new variable, call it counter, and give the snake a time limit to eat the food: if for 100 steps the snake does not eat the food, the snake loses. This is an additional rule I just created so that we can see how the game looks, or rather looks better, in action, and of course I recommend that you take the final file, try changing some of this stuff yourself, and explore; it's extremely fun. So, if you put something here in the constructor, very important, you also put it in the reset function. Then I will go to the step function, increment the counter by one, and say: if self.counter is bigger than 100, so if for 100 steps you did not eat the food, you lose immediately. For that I will just return from the step function; I will copy this code and modify it slightly: the info is an empty object, done is True (now, in reality, in the original game it isn't, but with my modification it's going to be True so that we don't see the snake rotating forever), and I will assign the reward -100, telling it: whatever happens, do not keep rotating around yourself. You'll tell me that this way it will try very hard not to rotate around itself, which is true. Anything you do in your reward system has advantages and disadvantages, and you need to balance your thoughts on how you are going to do it. I'm doing this based on my experience, but again, you can probably achieve better results, or you might achieve worse results; that is the fun of training an AI. I love to put things on the same line, and one very important thing: if the snake ate the food, I want to reset the counter, so I will take this and put it there.

There's one more thing I want you to pay attention to before we finish creating our gym environment: since I didn't penalize losing the game, there is a high chance that the snake will eat itself, because I did not tell the snake that it's bad to eat itself, which actually makes sense, and this is something you will see. So here is something I recommend you do: think about how you could implement logic to prevent the snake from touching itself, and whether doing that would make your training better or worse. That will be your assignment. Okay, I think everything looks good now, and we are done with creating our environment.

With that being said, let's scroll down to the bottom. Since we created our environment, what should we do here at the bottom? Very simple: we create a new environment object, just like we would in any Python program, and at line 174, where we assign the model, we pick the reinforcement learning algorithm. Based on my experience with this game, I have tried many algorithms, and the one that worked best for me is PPO; this is based on experimentation, not on theory. That being said, there are some algorithms that only work with discrete actions, some that only work with continuous actions, and so on. Okay, and that's it.
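Pulling the whole reward discussion together, the end of step() might look roughly like this. This is a sketch of the scheme described above, not the exact code; the ate_food flag comes from the earlier step sketch, and the prev_snake_position and counter variables are the ones mentioned in the narration (both also have to be re-initialized in reset(), as stressed above):

```python
    # (continuing inside the same step() method, after the game logic above)
    reward = 0
    self.counter += 1

    # Time limit: 100 steps without eating ends the episode with a big penalty,
    # which discourages the snake from rotating in place forever.
    if self.counter > 100:
        return self._get_observation(), -100, True, {}

    # Manhattan distance to the food, now and on the previous step.
    dist_now = (abs(self.snake_position[0] - self.food_position[0])
                + abs(self.snake_position[1] - self.food_position[1]))
    dist_before = (abs(self.prev_snake_position[0] - self.food_position[0])
                   + abs(self.prev_snake_position[1] - self.food_position[1]))

    if ate_food:
        reward = 100          # eating matters far more than merely approaching
        self.counter = 0
    elif dist_now > dist_before:
        reward = -1           # moved farther from the food: punish
    elif dist_now < dist_before:
        reward = 1            # moved closer to the food: reward

    # Copy, rather than assign by reference, so that moving the snake next step
    # does not also move the stored "previous" position.
    self.prev_snake_position = self.snake_position.copy()

    # No explicit game-over penalty, but losing still ends the episode.
    done = self.game_over
    return self._get_observation(), reward, done, {}
```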
Now, before we test it, I will just run the program once, with a single iteration; I just want to make sure there are no syntax mistakes and no issues in the code, and then I will train the model and we will be able to see the final results. Okay... okay, this is the Snake game, my bad, I meant to run our AI. Okay, there are no exceptions, so that's a good thing. Now I'm going to train our AI for one hundred thousand iterations. You'll ask me, why one hundred thousand? I noticed that after a hundred thousand it performs generally well, at least for the purposes of the video. Of course it will take some minutes, maybe three to four; I will add timestamps if you don't want to wait that long. The second thing is, to avoid re-teaching the program every time, because you don't want to waste minutes every run, you want to save the model. So I'm actually going to save the model, and I will save it under a file called snake_ai_model for now; this will create a zip file that contains our model, and then we can load the model without training it every time. Let me also put something like 1,000 iterations for the test loop, and let's run this and see how it goes, fingers crossed.

Of course you can skip ahead in the video, but in the meanwhile, for people who want to stay, I will tell you some funny stories about making this video. This is, I think, my third or fourth attempt at making it, and I've spent hours and hours; every time I make some small mistake that ruins the experiment entirely, or sometimes I do something and then discover there was a better way to do it, and I kept doing this over and over until I found the final result that you are hopefully about to see. One of the main mistakes I made was in the reward system, which is why I wanted to explain it thoroughly in the video. Also the algorithm: I always used A2C, the actor-critic, because that was in the main example, but then I thought to myself, maybe I can use different algorithms, and surprisingly there was a noticeable difference between PPO and A2C. So pay attention: if you make a mistake in just one variable, in one line, your logic will carry it all the way through and your model will not work properly, so make sure you pay a lot of attention to what you are writing. And yes, this is the third time I've recorded this video; the first attempt took me an hour and a half, the second the same; hopefully this one will not take that long, assuming this works the first time.

Okay, I think we are about halfway through, because if you look at the console at the bottom left, you can see the number of updates is around 300 right now and it keeps incrementing by 10, and if I'm not mistaken, once it reaches 500 we will see the Snake game in action. I will not spoil the snake's performance, but again, the purpose of this video is not to provide the perfect model; it's really about learning how to implement reinforcement learning, learning from our mistakes, and showing you how to build a custom environment.
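The bottom of the file, where the pre-built environment gets swapped for the custom one, would then look roughly like this; a sketch assuming the SnakeEnv class defined above, with the model filename and timestep counts taken from the narration:

```python
from stable_baselines3 import PPO

# Use our custom environment instead of gym.make("CartPole-v1").
env = SnakeEnv()

# PPO is the algorithm that worked best here, found by experimentation.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Save the trained model so we don't have to retrain every run; this writes a zip file.
model.save("snake_ai_model")

# Later, load it back instead of training again.
model = PPO.load("snake_ai_model")

obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()

env.close()
```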
This is something I didn't see much of in other videos: every time you see something about reinforcement learning, you see the CartPole example or some of the pre-built examples, but you don't see much about custom environments, and that's when it hit me that I wanted to do something with a custom environment, which is what you would probably want if you actually use reinforcement learning. Okay, so we are about to be done. Please work, please work... okay, please do well, my AI.

Okay, let's see. So it's doing not so bad, as you can see. It doesn't care much about hitting the wall, because I removed the game-over penalty from the rewards, but overall it's going to the food fairly well, if you want my opinion, and as you can see, this is just from training the model for about five minutes; imagine how accurate it would be if you trained it far longer. And this game, by the way, is extremely difficult, even after I made the window size smaller. Why is it difficult? Because you have to take into consideration that there are a lot of pixels in this window. Actually, let me run this again with a range of 2,000 so that we can have some fun; I hope you are enjoying this as much as I do. Okay, I wanted to stop this because, as I said, I already saved the model, so we don't have to iterate 100,000 times. What I'm going to do is skip all of this and load the model immediately: for that I will use the PPO load function and pass the file name, which is this one, and let's run it. Yeah, so sometimes it doesn't eat the food, but even when it doesn't, you will notice that it gets very close to the food, like in this spot and this one as well. Now, if you added a negative reward for the game-over condition, so that it doesn't hit the wall, then instead of going straight into the wall like it's doing right now it would keep rotating around itself, but I did not want to do that or show that in the video, so that we could see better how the AI behaves instead of it constantly resetting.

I hope that gave you a very good idea of how to use reinforcement learning for this game, and this game, I don't need to mention again, is actually much harder than the CartPole example, because in CartPole you only have two inputs, the logic isn't really that complicated, and the observation space is not that large, while here there are a lot of possible positions for the snake, etc. But overall I think it's doing a good job for such complexity and for a higher number of inputs. And that's it. I hope you really enjoyed the video; I will go ahead and push this to the GitHub repository right now, and see you next week. Thank you and goodbye.
Info
Channel: Mustafa Abusharkh
Views: 3,025
Keywords: coding, programming, Mustafa Abusharkh, mufasa11037, python, stable-baselines, stable-baselines3, machine learning, reinforcement learning, snake, snake game, gym, gym environment, custom gym environment, reinforcement learning algorithms, AI, train AI, openAI, artificial intelligence, computer programming, pygame
Id: 5dxJEXCjruE
Length: 73min 36sec (4416 seconds)
Published: Tue Sep 21 2021