Training AI to Play Pokemon with Reinforcement Learning

Captions
Right now we're watching 20,000 games played by an AI as it explores the world of Pokémon Red. In the beginning it starts with no knowledge whatsoever and is only capable of pressing random buttons, but over five years of simulated game time it gains many capabilities by learning from its experiences. Eventually the AI is able to catch Pokémon, evolve them, and defeat a gym leader. At one point it even manages to exploit the game's random number generator. But perhaps more fascinating than its successes are the ways that it fails, which are surprisingly relatable to our own human experiences. It turns out that studying the behavior of an algorithm can actually teach us a lot about ourselves. In this video I'll tell the story of this AI's development and analyze the strategies it learns. At the end I'll go deeper into some technical details and show you how to download and run the program yourself.

Let's start by asking: how does it work? The AI interacts with the game much like a human: it takes in images from the screen and chooses which buttons to press. It optimizes its choices using something called reinforcement learning. With this method we don't have to tell the AI explicitly which buttons to press; we only need to give it high-level feedback on how well it's playing the game. What this means is that if we assign rewards to the objectives we want it to complete, the AI can learn on its own through trial and error.

So what does this look like? The AI will start with no knowledge or skills whatsoever and will only be able to press random buttons, so for it to get any useful feedback we need to create a gentle curriculum of rewards which will guide it towards learning the difficult objectives. Perhaps the most basic objective to start with is exploring the map. We'd like a way to reward the AI when it reaches new locations; more generally, we'd like to encourage curiosity. One way to do this is by keeping a record of every screen the AI has seen. As the game is being played, we can compare the current screen against all screens in the record to see if there are any close matches. If no matches are found, this means the AI has discovered something new, so we give it a reward and add the new screen to the record. Rewarding it for unique screens should encourage it to find new parts of the game and seek out novelty.

All right, now that we have an objective, let's start the learning process and see how it goes. At this stage the AI is essentially pressing random buttons just to see what happens. To gather experience more quickly, we'll have it play 40 games simultaneously, and each will play for two hours. Then the AI reviews all of the games and updates itself based on the rewards it earned. If it goes well, we should see incremental improvement, and the whole process can repeat.

After a few iterations of training, the AI finds its way out of the starting room noticeably faster than when its behavior was random. For some reason, though, instead of exploring Route 1 it becomes fixated on a particular area of Pallet Town. Why is this? It turns out that when you're seeking novelty, it's easy to become distracted. Notably, the area it's stuck in has animated water, grass, and NPCs walking around, and this animation is actually enough to trigger the novelty reward many times. So according to our own objective, just hanging out and admiring the scenery is more rewarding than exploring the rest of the world. This is a paradox that we encounter in real life: curiosity leads us to our most important discoveries, but at the same time it makes us vulnerable to distractions and gets us into trouble. As humans, we can reflect on our own sources of intrinsic motivation, but we can't easily change them. In contrast, it's straightforward to change them for the AI.
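To make this concrete, here is a minimal sketch of what a screen-novelty reward like the one described above might look like. It is an illustration only, not the project's exact implementation: the frame format, the pixel threshold, and the reward value are all assumptions.

```python
import numpy as np

class ExplorationReward:
    """Reward screens that don't closely match anything seen before (illustrative sketch)."""

    def __init__(self, pixel_threshold=30, reward_value=1.0):
        self.seen_screens = []                   # record of every sufficiently novel screen
        self.pixel_threshold = pixel_threshold   # how many pixels must differ to count as "new"
        self.reward_value = reward_value

    def update(self, screen):
        """Compare the current screen against the record and reward genuine novelty."""
        screen = np.asarray(screen)              # e.g. a 144x160 grayscale Game Boy frame
        for seen in self.seen_screens:
            # a close match means nothing new was discovered, so no reward
            if np.count_nonzero(screen != seen) < self.pixel_threshold:
                return 0.0
        self.seen_screens.append(screen)          # remember the new screen for future comparisons
        return self.reward_value
```

Comparing against every stored screen is linear in the size of the record, so a real implementation would need something faster (hashing, downsampling, or an embedding-based similarity check), but the basic idea is the same; raising pixel_threshold is exactly the knob discussed next.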
Returning to our current scenario: a reward is triggered if more than just a few pixels differ from previously seen screens, but if we raise this threshold to several hundred pixels, the animations won't be enough to trigger the reward anymore. This means the AI will no longer get any satisfaction from watching them and will only develop an interest in more novel locations. Note that each time we modify the rewards, we restart the AI's learning from scratch, because we want the full process to be reproducible. After this change, the AI starts to explore Route 1, and eventually it makes its way up to Viridian City.

This is great progress, but now there's another problem. In most battles the screen looks pretty much the same, so there's not much exploration reward to be gained during battles. This in turn causes the AI to simply run away from them, but ultimately it can't progress without battling. To address this, let's add an additional reward based on the combined levels of all of its Pokémon. Now, with this new incentive to gain levels, the AI gradually starts winning battles, catching Pokémon, and leveling them up. Eventually its Pokémon reach a high enough level that they start to evolve; it actually tends to cancel the evolutions at first before finally deciding that they're beneficial. It does, however, seem to get stuck when a Pokémon battles long enough for its default move to be depleted. Despite this, after many more training iterations, by version 45 there's substantial improvement. At this point we see a Pidgeotto being evolved for the first time, and the AI has also finally figured out what to do when a move becomes depleted and is able to switch to an alternative one. This will turn out to be super important for something that happens later on.

At version 60, the AI starts entering Viridian Forest and begins its first trainer battles. By this point it actually has enough experience that it succeeds on its very first encounter. After this it slowly starts figuring out how to navigate the forest, and finally, at version 65, it has found out how to get through the forest and makes its way up to Pewter City.

But something still wasn't right. Despite making a lot of progress, the AI was charging straight into battles, even ones it wasn't able to win. To make things worse, it never visits a Pokémon Center to heal, which means that when it loses, it's taken all the way back to the very beginning of the game. We can try to fix this by subtracting reward when it loses a battle, but this doesn't work quite like we would hope: instead of avoiding difficult battles, when it's about to lose, the AI simply refuses to press the button to continue, stalling indefinitely. This technically satisfies the objective, but it isn't what we intended.
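As a rough illustration of the level-based reward added above, the naive version might look something like the sketch below. The read_party_levels helper and its address list are hypothetical placeholders (the real addresses come from a memory map of the game), and the memory-indexing call assumes a recent PyBoy release; older versions expose a get_memory_value method instead.

```python
def read_party_levels(pyboy, level_addresses):
    """Hypothetical helper: read each party Pokémon's level out of emulator memory.
    level_addresses is a placeholder list of six addresses taken from a memory map."""
    return [pyboy.memory[addr] for addr in level_addresses]

def level_reward(pyboy, level_addresses, scale=1.0):
    """Naive version: the reward is simply proportional to the sum of all party levels."""
    return scale * sum(read_party_levels(pyboy, level_addresses))
```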
While studying the effects of this loss penalty, however, we noticed something else: on very rare occasions, a huge amount of reward is subtracted, just once per training run. There's consistently a single game where a reward ten times larger than anything we intended is deducted. Replaying the moment just before this happens, we see the AI walking into a Pokémon Center and wandering up to a computer in the corner. After logging in and aimlessly pressing buttons for a while, it deposits a Pokémon into the PC system, and immediately a huge amount of reward is lost. This is because reward is assigned as the sum of the Pokémon's levels, so depositing a level 13 Pokémon means losing 13 levels instantly. This sends such a strong negative signal that it actually causes something like a traumatic experience for the AI. It doesn't have emotions like a human does, but a single event with an extreme reward value can still leave a lasting impact on its behavior. In this case, losing its Pokémon only one time is enough to form a negative association with the whole Pokémon Center, and the AI will avoid it entirely in all future games. The root cause is that we never accounted for the unexpected scenario where the total levels can decrease. To address this, we'll modify the reward function so that it only gives reward when the levels increase. This seems to fix the issue: after restarting the training, the AI starts making visits to the Pokémon Center.

Finally, we see it starting to challenge Brock in the Pewter gym. This battle is much harder than the other ones and poses a significant challenge, because the AI's previous experience is now working against it. Up to this point it has had great success using just its primary moves and has learned to rely on them exclusively; now it needs to use something else. This issue might seem trivial, but even humans struggle with the same fundamental problem: our experience and biases help us make decisions and solve problems more quickly, but they also limit our thinking and hamper our ability to approach a problem from a new angle. To make matters even worse, if the AI loses the battle too many times, it will actually learn to avoid it altogether. But after carefully tuning our rewards and countless failed attempts, the AI finally has a stroke of luck. In one of the games we see it starting the gym battle with just one usable Pokémon that has only a fraction of its hit points and available moves. Normally this would be a pretty terrible strategy. It doesn't know that a Water-type move will be super effective against the Rock-type Pokémon, but because Tackle is fully depleted, it sees it needs to use an alternative move and switches to Bubble instead. And finally, after over 300 days of simulated play time and 100 iterations of learning, the AI defeats Brock for the first time. This accidental breakthrough helps it learn to choose Bubble as its default move, and in future games it wins this battle more consistently. This honestly exceeds my expectations of what I thought would be possible when I started this project. It did take a huge number of experiments to get here, but I still felt amazed each time I checked in on a training run and discovered that the AI had reached a new area. I would go as far as to say it was a very rewarding experience.

So this seems like a reasonable stopping point, but just out of curiosity, let's see how far the AI will go if we let it continue. After the gym battle it starts encountering trainers on Route 3, and eventually it reaches the entrance of Mount Moon. Inside the Pokémon Center here, a man will sell you a Magikarp for $500. Magikarp isn't helpful at all in the short term, so you might expect that the AI won't be interested in it. However, buying it is a super easy way to gain five levels, so the AI buys it every time. Over all of the games, it buys a total of over 10,000 Magikarps.
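Incidentally, this is a direct consequence of how the level reward works: buying a Magikarp bumps the total level sum, and the increase-only fix described earlier happily pays out for it. That fix, building on the hypothetical helper from the earlier sketch, might look roughly like this:

```python
class LevelReward:
    """Only reward increases in total party level, so deposits and losses can't subtract reward."""

    def __init__(self, level_addresses, scale=1.0):
        self.level_addresses = level_addresses   # hypothetical memory addresses, as before
        self.best_level_sum = 0                  # highest total level seen so far in this game
        self.scale = scale

    def update(self, pyboy):
        level_sum = sum(read_party_levels(pyboy, self.level_addresses))
        gained = max(0, level_sum - self.best_level_sum)    # ignore any decrease
        self.best_level_sum = max(self.best_level_sum, level_sum)
        return self.scale * gained
```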
The AI's behavior might seem silly when its objective has become misaligned from its intended purpose, but there are times when the same thing happens to humans. In order for the AI to progress in the game, we configured it to find ways to increase its levels; similarly, evolution has selected humans for survival as hunters and gatherers, which naturally configured us to find ways to acquire scarce foods. When the AI reaches an area with a cheap but pointless Magikarp, it buys the Pokémon every time because this increases its total levels; and as humans have reached our modern age of abundance, we instinctively buy unhealthy foods because they contain those historically scarce nutrients. Each of these proxy objectives arose from very different circumstances, but when their environments changed, they both became misaligned and no longer supported their original objectives.

Next, the AI starts to explore the cave inside Mount Moon. Up to this point it has fought every wild Pokémon that it's encountered, but Magikarp is so ineffective that it eventually learns a special behavior just for this Pokémon: if Magikarp is sent into battle, in any situation, it will try to run away no matter what, sometimes even while fighting a trainer. Despite exploring a lot of the caves, it seems to get stuck in this passage, possibly because the area is too visually uniform to trigger the exploration reward. If we extend the games to run longer than two hours, it's able to evolve Blastoise and Pidgeot, but it never makes it all the way through Mount Moon. We could keep trying to improve our reward function, but instead we're going to use this as a stopping point and start to analyze what the AI has learned.

This visualization shows how the AI navigates around the map; each arrow indicates the average direction that it moved while standing on that particular spot. One fascinating pattern is that it seems to prefer to walk counterclockwise along almost all edges of the map. This means that when standing with an edge to the right, it prefers to walk up, which is shown in blue; when there's an edge above it, it prefers to walk left, shown in pink; when there's an edge to the left, it prefers to walk down, shown in orange; and when there's an edge below it, it prefers to walk right, shown in green. It's hard to know for sure why it developed this behavior. One explanation is that this heuristic helps it navigate with limited memory and planning: if you walk around the perimeter of a two-dimensional space, you're guaranteed to pass all entry and exit points, so by picking a direction and following an edge, you can reach all possible junctions. The AI doesn't always follow this pattern, but it does seem to help it recover after it gets off track. However, there are locations where its navigation fails and the AI becomes physically stuck. For example, at the bottom of Route 22 there's a long area with nothing useful. The one-way ledge at the top means that there are many ways to enter the area, but only one spot where it's possible to leave. This acts like a fly trap, and the AI's stochastic movement causes it to get stuck and spend a disproportionate amount of time here. We can also visualize how the AI's behavior changed over the course of its training: here, the early training iterations are shown in blue, the middle iterations in green, and the later iterations in red. We can clearly see that in the middle of its training the AI was taking one path through Viridian Forest, but later it switched to a different one.
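Both of these map visualizations are built from player coordinates recorded at every step. As a minimal sketch (with the data format assumed for illustration), the per-tile direction averaging behind the arrow map could be computed like this:

```python
import numpy as np
from collections import defaultdict

def movement_flow(trajectories):
    """trajectories: one list of (x, y) global map tiles per recorded game.
    Returns, for each tile, the average direction the player moved when leaving it."""
    totals = defaultdict(lambda: np.zeros(2))
    counts = defaultdict(int)
    for positions in trajectories:
        for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
            step = np.array([x1 - x0, y1 - y0])
            if step.any():                        # skip steps where the player stood still
                totals[(x0, y0)] += step
                counts[(x0, y0)] += 1
    return {tile: totals[tile] / counts[tile] for tile in totals}
```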
Another interesting behavior, which developed partway through the training, happens in the first few moments of the game: for some reason, it was starting every game with the exact same sequence of button presses. This was confusing, particularly because the movement didn't even follow an optimal path. Watching a little further, however, something interesting happens: it throws a Poké Ball immediately on its first encounter and succeeds on the first try. What this made me realize is that even though the game contains many randomized elements, it still behaves deterministically with respect to the player's input. This is something well known to speedrunners, and it seems that the AI has taken advantage of the fact that it starts from the same state in every game to reliably catch a Pokémon on its very first try.

We can also visualize other statistics to understand what happened in all of the games that the AI played. On the left we can see every Pokémon that the AI caught at least once; on the right we can see at what point in training those Pokémon were caught. The height of each region is scaled logarithmically, so the Pokémon representing the thicker regions were caught thousands of times, whereas the Pokémon representing the thinner regions may have only been caught a handful of times. Reflecting on all this, it's incredible that relatable experiences can emerge from an algorithm playing a video game. So much has happened during these tens of thousands of hours that there simply isn't enough time to discover all the interesting stories, much less document them. So here we conclude the main part of the video.

In this next section we'll dive into some technical details, explore strategies for running experiments efficiently, consider future improvements, and go over how to run this program yourself. Though I've tried to avoid it as much as possible so far, this part will inevitably contain much more technical terminology.

First off, the specific reinforcement learning algorithm used to train this AI is called proximal policy optimization. It's a pretty standard modern reinforcement learning algorithm, and even though it was originally designed in the context of games and robotics, it has also been used as the final step in creating useful large language models. However, while reinforcement learning does sometimes feel like magic, it can actually be an incredibly difficult tool to apply in practice. The fundamental challenge of machine learning is getting a program to do something without explicitly telling it how to do it. This means that if your model isn't behaving the way you intended, you have to figure out how to improve it indirectly, in terms of its learning algorithm or training data. Online reinforcement learning adds an additional layer of indirection on top of this: the training data being fed into your model is no longer stationary and in your control, but is itself a product of the model's behavior at an earlier point in time. This feedback loop leads to emergent behavior that can be impossible to predict. So here are some strategies for engaging these challenges without institutional-scale resources.

First, it may be necessary to simplify your problem to work around the limitations of the tools. You might have noticed earlier in the video that the AI wasn't actually starting from the very beginning of the game. Here you can see an early version of the AI which does start from the very beginning. It doesn't have any issue choosing a Pokémon and winning its first battle; the problem comes a bit later, when it's necessary to backtrack from Viridian City to Pallet Town. This is because the exploration reward doesn't give it any incentive to return to an area it has already visited. It might be possible to solve this by hacking in special rewards for just this scenario, but I decided that simply skipping past it was a better use of time. The modified start point is still very close to the original, and by giving the AI Squirtle by default, it has a better chance of success later on.
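For anyone curious how a modified start point like this can be set up: emulators generally support save states, and with PyBoy it might look roughly like the sketch below. The calls assume a recent PyBoy release and the filenames are illustrative, so treat this as the general idea rather than the project's exact code.

```python
from pyboy import PyBoy

# One-time setup: play (or script) your way to the desired starting point, then snapshot it.
pyboy = PyBoy("PokemonRed.gb", window="null")      # headless emulator, no window
# ... advance the game to the chosen start point here ...
with open("start_point.state", "wb") as f:
    pyboy.save_state(f)

# During training, every episode begins by restoring that exact snapshot.
def reset_to_start(pyboy):
    with open("start_point.state", "rb") as f:
        pyboy.load_state(f)
```

Because every game restores the same snapshot, this is also what makes the deterministic first-catch trick described above possible.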
Next, it's important to find a setup that allows you to iterate on experiments within a reasonable amount of time and cost. In many cases the bottleneck will be simulating or operating the environment that the AI is interacting with. In our case, the environment is Pokémon Red running on the Game Boy emulator PyBoy. This emulator runs in the ballpark of 20 times normal speed on a single modern CPU core, and running many games in parallel on a large server with many cores allows us to gather interactions with the environment at over 1,000 times normal speed. This means each learning iteration, with a batch of 40 games each lasting two hours, will complete in around six minutes. If we use a small model as the policy, its inference and training time will be negligible, especially if we use a GPU. So we can get results from small experiments in minutes to hours, and a full training run will take a few days. This can easily get quite expensive: using the cheapest possible cloud options, a single full training run costs around $50, and all of the experiments run in this project combined cost a total of around $1,000. However, if you're not careful how you choose and manage these resources, it's easy to spend many times more.

Next, you'll want to carefully consider how the AI interacts with the environment and how your reward function is designed. The choices I made in this project certainly are not all optimal, but I'll describe at least some of the considerations that went into them. For example, the AI observes the screen and chooses an action once every 24 frames. This is just enough time for the player to move one grid space in the world, so whenever the AI observes the screen, it will always be aligned perfectly on one of the grid cells. This makes the exploration reward much more effective, because it significantly limits the number of possible screens that can be seen. The component that makes decisions, called the policy, is represented by a small convolutional neural network. Notably, it's non-recurrent, which means that it has no internal memory of the past; this was done to improve training stability, convergence speed, and simplicity. So how does it make decisions with no memory of the past? First, the three most recent screens are stacked together to create a simple form of short-term memory. Second, some basic information about the game state is encoded as visual status bars, which display hit points, total levels, and exploration progress. A more conventional approach would be to encode these as abstract vectors injected directly into the model; I chose this method instead because it's both human- and machine-interpretable, which makes it much easier to tell what's happening when debugging recorded games. Together, these encode enough information that the model can make decisions without any other form of memory.

Now let's talk a bit about the reward function. Exploration and total levels were discussed earlier in the video; these were by far the most important rewards, but there were actually seven used in total, and even more that were tested but not ultimately used. There isn't enough time to cover them all in depth, but the general criteria for all of them are that they should broadly encourage playing the game well rather than focusing on a specific moment, and they shouldn't be easy to cheat.
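Putting a few of those choices together, the observe-act loop might look roughly like the sketch below: choose an action, advance the game 24 frames so the player lands exactly on a grid cell, and keep the three most recent screens stacked as the policy input. Again, the PyBoy calls assume a recent release and the policy is a stand-in; this is not the project's exact code.

```python
import numpy as np
from collections import deque

ACTIONS = ["up", "down", "left", "right", "a", "b"]
FRAMES_PER_ACTION = 24                             # enough frames to move exactly one grid cell
MAX_STEPS = 2 * 3600 * 60 // FRAMES_PER_ACTION     # two in-game hours at ~60 fps ≈ 18,000 decisions

def run_episode(pyboy, policy):
    first = pyboy.screen.ndarray.copy()            # current screen image from the emulator
    frames = deque([first] * 3, maxlen=3)          # three most recent screens = short-term memory
    for _ in range(MAX_STEPS):
        obs = np.concatenate(list(frames), axis=-1)  # stacked frames fed to the non-recurrent policy
        action = policy(obs)                         # stand-in policy: returns an index into ACTIONS
        pyboy.button(ACTIONS[action])                # press and release the chosen button
        pyboy.tick(FRAMES_PER_ACTION)                # advance the game 24 frames before deciding again
        frames.append(pyboy.screen.ndarray.copy())
```

For scale: a batch of 40 such games is about 80 game-hours of experience, which at the 1,000-times-real-time throughput mentioned above works out to roughly five minutes of wall-clock simulation per iteration, consistent with the six-minute figure.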
All of this information about the game state is obtained by reading values from the Game Boy emulator's memory. The game doesn't have any kind of proper API, but most of its memory is statically allocated, so variables can always be found at the same memory address. The pret project has done an amazing job of reverse engineering the game's source code and mapping out its memory. As a tangent, it's honestly mind-blowing that all of the logic, graphics, and audio for this whole game is stored in less than one megabyte; that's smaller than a single photo you take on your phone.

Anyway, understanding the AI's behavior is essential for realizing its full potential, and one of the best ways to do this is through visualization. So how are the visualizations used in this project made? First, key information like player coordinates and Pokémon stats is recorded at every step in every game. Next, all of the games need to be rendered onto a single map. The game itself doesn't contain the concept of a single map with a global coordinate system; rather, the world is divided into chunks of 256 by 256 tiles. The game tracks the player's local coordinates within a chunk and which chunk the player is currently in, so after deriving a mapping from local chunk coordinates to global coordinates, all games can be rendered together. The actual rendering is a rather slow NumPy program which places sprites at the players' coordinates, choosing the appropriate sprite based on the direction they're moving and interpolating between each tile, to render the giant game grid. A Python script is used to generate an FFmpeg command to tile together thousands of videos using a rather powerful server. The flow visualization was made using the same player data, by aggregating all movements on every tile. Combining all of these strategies, it's possible to train a reinforcement learning agent to play a complex game using only modest resources.

But what else could be done to further improve this process? Let's take a moment to consider how this could get easier, cheaper, and faster in the future. First, as mentioned earlier, the AI in this project started learning from scratch with no prior experience at all. In the future it would likely be possible to apply something called transfer learning, where a model is pre-trained on a large, broad dataset which can then be leveraged for new tasks very efficiently. In the past this has revolutionized the fields of computer vision and natural language processing. There has been some interesting early work applying this to RL, but it hasn't really landed yet, partly due to a lack of large, diverse datasets for these types of tasks. However, it seems like it should be feasible to extract a useful world model from a big enough dataset. Here's a quick experiment I did using CLIP for zero-shot classification of game states. It's easy to imagine how this could be used to create reward functions for new environments without special access to their internal state, and it's probably just a matter of time before large multimodal models start to have a huge impact in this area. A second method of interest is directly learning environment models; a couple of notable works in this area are MuZero and Dreamer. These approaches offer a huge improvement in data efficiency by learning a model of the environment itself and optimizing the policy against the learned model. I recently ran some experiments with DreamerV3 and was very impressed with the results. A third method worth mentioning is hierarchical RL, which decouples low-level control from high-level planning; this allows fine-grained movements and long-term strategy to be handled by separate mechanisms.
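As a pointer for the CLIP experiment mentioned above, zero-shot classification of game screens can be sketched in a few lines with the Hugging Face transformers library. The model choice, the label text, and the screenshot file here are just examples of the general approach, not necessarily what was used in the video.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a Pokémon battle", "walking around a town", "inside a dark cave", "a dialogue box"]
screen = Image.open("screenshot.png")   # hypothetical captured game screen

inputs = processor(text=labels, images=screen, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)   # how well the screen matches each label
print(dict(zip(labels, probs[0].tolist())))
```

A classifier like this could, in principle, be turned into a reward signal (did a battle start, is the player somewhere new) without reading the game's memory at all, which is the idea hinted at above.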
So those are some ways this technology could be improved in the future. To wrap up, let's walk through how to run this AI on your own computer.

All right, so this is the repository linked in the video description. The first step is to download it; you can do this using git or by downloading the zip. The next step is to legally obtain your Pokémon Red ROM. You can find this using Google; it should be a 1-megabyte file ending in .gb. I already have mine here, so we're just going to move into the root directory of the repository, copy it in here, and rename it to PokemonRed.gb. The next step is optional: we can create a conda environment, name it whatever you want, accept that, and then activate it. The step after that is the same whether you created the conda environment or not: we're just going to install the requirements from the requirements file. When that's done, we'll make sure we're in the baselines directory, and then we can run the pre-trained model script, run_pretrained_interactive. It will take a few moments to start up; then the game should open and the AI will start playing right away. In this mode it won't do any additional learning and will only play based on the experience that it already has. You can actually interact with the game at the same time that the AI is playing it, so for example I can interfere with it by using the arrow keys and forcing it to go into this corner. This mode is very fun because you can put it into different scenarios and see how it handles them. If you want to disable the AI entirely, you can edit the text file agent_enabled.txt: if we change this to "no" and save it, the AI will stop taking any actions and we can take full control of the emulator, and if we edit this file back to "yes", it will reactivate and continue playing. Now, if you want to train the model from scratch, you're going to want a lot of CPU cores and memory, but assuming you have that, the script you want to run is run_baseline_parallel. This will start running multiple emulators headless, without any UI, and it will take quite a long time; you might need to wait many hours or even days to see positive results. If you want to change any of the basic settings of the emulator or the games, the configuration found in any of the run files allows you to do this, and other files in the repository let you make changes to the reward function or modify the visualizations. If you have any questions regarding the code, feel free to open an issue on GitHub.

And with that, we've reached the end. Thanks for watching; I hope that you gained some insight into reinforcement learning, or maybe even your own psychology. I really believe that these ideas are as useful for understanding our own behavior as they are for advancing machine learning. If you'd like to support this work, see the link in the description. Bye for now.
Info
Channel: Peter Whidden
Views: 6,214,489
Keywords: reinforcement learning, machine learning, ai, pokemon, gameboy, emulator, ppo, learning, computer science, cs, programming, hacking, python, pytorch, ml, rl
Id: DcYLT37ImBY
Length: 33min 52sec (2032 seconds)
Published: Mon Oct 09 2023