Reinforcement Learning in 3 Hours | Full Course using Python

Captions
Hey Nick, what have you been working on? Oh man, I've been working on some awesome stuff. I'm actually using reinforcement learning to train a race car to race around a track. Oh really, how's that going? Yeah, it's going great. Yup, great at doing burnouts. [Music] I promise guys, it does get better than this. Let's get to it.

What's happening guys, my name is Nicholas Renotte and welcome to the reinforcement learning course. The core goal of this video is to take you from absolute beginner to being able to go and leverage reinforcement learning. We're going to cover a ton of stuff: how to set up your environment, how to work with different algorithms, and we'll test things out on some pre-built environments using OpenAI Gym, so you'll be able to balance a cart pole and build your own self-driving car. Last but not least, we'll look at how to build custom environments, something which is so important when it comes to leveraging reinforcement learning for a use case that's relevant to you. By the end of this video you should be able to take those skill sets away and leverage reinforcement learning in a practical manner. Ready to do it? Let's get to it.

Alrighty guys, welcome to the full-blown reinforcement learning course. This course is intended to be a practical guide to getting up and running with reinforcement learning, so ideally it bridges the gap between a lot of the theory you see out there and practical implementation. We're covering a ton of stuff, so let's take a look at our game plan. First up, we'll look at RL in a nutshell: how reinforcement learning works and learns, some of the applications around RL, as well as some of the limitations. Then we'll look at how to set up your environment to work with reinforcement learning, using a library called Stable Baselines. Under step 2 we'll take a look at environments. Environments are one half of the equation when it comes to reinforcement learning, so we need to be able to set one up, and specifically OpenAI Gym environments. Then we'll kick off our training: there's a whole bunch of different algorithms available inside of Stable Baselines, so we'll see how to set up an algorithm to train a reinforcement learning agent. Under step 4, once we've trained our model, we'll test it out and evaluate it. This is easier than it sounds: you can set up an environment, test it out, and see what your agent actually looks like. We'll also cover evaluation, how to look at different metrics, how to understand them, and how to open them up inside of TensorBoard, something I really, really like. Then we'll take it one step further: in step 5 we'll look at how to leverage callbacks to stop our model training once we hit a certain reward threshold, and how to use different algorithms. There's a whole bunch of algorithms available in reinforcement learning and you don't need to write them yourself;
there's a whole bunch already written for you that you can use, and we'll take a look at how to use those. We'll also look at different architectures, so if you wanted to change the neural network that sits behind a particular agent, you can do that as well.

But this wouldn't be a full-blown course unless we had some projects, so we're going to take a look at three different projects. We'll solve the Breakout environment, which is an Atari game (it's sort of like Pong, a little bit, but not really). We'll also solve a self-driving environment, a car racing environment where the model only gets an image as input and has to learn to drive a car along a racetrack, which I think is pretty awesome. And then we'll look at custom environments, something which is so often overlooked; this will give you a better understanding of how to build an environment to work with reinforcement learning. The framework we'll use to build our custom environment is OpenAI Gym, so I'll show you all the different types of spaces. Don't worry if you don't understand that yet or aren't too sure what I'm talking about; we'll go through it in great detail.

Okay, that's the game plan in a nutshell. Now it's time to take a look at RL in a nutshell. I wanted to include this section to give you a bit of context about what reinforcement learning is, how it's meant to be used, and some of its applications and limitations. This is not a full deep dive into the theory and the maths behind it; it's a high-level overview so you get an idea of where RL fits in the big world of machine learning and data science.

So first up, what is reinforcement learning? Reinforcement learning focuses on teaching agents through trial and error. That's a really high-level statement, and I know there are probably a lot of hardcore deep learning engineers who will go, "Nick, that's not quite right," but it gives you an idea of how reinforcement learning learns. Ideally you've got an agent, and it learns based on the reward it gets: it tries something out, and if it doesn't get a reward it tries something else; if it gets a bigger reward it might try doing that multiple times. There's also this thing called the exploration-exploitation trade-off, which I'll talk about a little later, but you get the idea: reinforcement learning learns by actively engaging with an environment.

That brings us to how the framework fits together. There are four fundamental concepts to consider whenever you're working with reinforcement learning: the agent, the environment, the action, and the reward plus observations. Think of your agent as something operating within an environment; this might be a machine learning model, or a person or a player if you're working in a game environment. Your environment is where that agent operates, so if we take a game, your player is operating within the game environment and getting reward based on what it actually does there. Your agent will also see what's happening within that environment, so if we're looking at a game,
your player will be able to see what's around it — that's the observation, so it sees what the game environment actually looks like — and it will also see what reward it accrues based on the actions it takes. Your agent might walk around the environment, do something and accumulate a point, do something else and not accumulate a point, or even lose a life, which might be a negative reward.

A really good way to get your head around this is to think of how you might train a dog, say you wanted to teach it to sit or lay down. Your agent in this case is your dog, because you're trying to train it to take the right action. The reward is you giving your dog a treat every time it does the right thing, and the environment it's working in is the environment with you in it — it's trying to earn a treat for doing a particular thing. Initially you might say "sit" and the dog might not actually do anything, so it's taken the action of doing nothing; it will see that it gets no reward because it didn't sit down, so it might try something else. You say "sit" again, it sits, and it gets a reward, so ideally it starts to learn what action to take in response to the environment in order to maximize its reward — it's observing the command you're giving in order to take the right action. That, in a nutshell, is how reinforcement learning works: your agent tries to take actions that maximize its reward in response to the observations from the environment. Again, I just wanted to give you a little bit of theory; we're not going to delve into it too much, but you get the idea. It's a little different from working with tabular deep learning and machine learning, because your agent is actively engaging with a simulated or a real environment. In this case we'll be dealing with simulated environments, but I'll talk about that a little later.

So what are some practical applications of reinforcement learning? There's a whole heap out there, and more all the time — reinforcement learning is really popular right now because there are lots of open-world environments that people are trying to solve using machine learning and deep learning. One is autonomous driving: the picture up here is actually from an environment called CARLA, a really popular driving simulation that lets you train autonomous agents, or perform reinforcement learning, inside it. You can train a car to navigate through an open world using reinforcement learning, which is pretty cool. Another great application is securities trading: your agent in this case is like an autonomous trader, your environment is the securities trading environment, and you're trying to train your agent to make trades that generate profit, so it wants to buy low and sell high — or sell high and buy low if it's short selling. Again, this is really
popular at the moment; there's a heap of stuff happening in that space. Another one I'm personally fascinated by is neural network architecture search: you can use reinforcement learning to build up a neural network for you and find an optimal architecture, which I think is absolutely crazy. Say you're trying to build a deep neural network for a particular use case and you don't know the best architecture in terms of layers, number of units, or activations — you could use reinforcement learning to try to solve that problem for you. This is obviously super advanced, but it gives you an idea of what's possible with the tech.

Another place reinforcement learning is super popular right now is robotics. Training robots in real life can often be quite expensive — say you've only got one robot, it can be hard to train it on a lot of tasks — so what you can do is build a simulated environment of that robot and train it there. In this case the agent is the autonomous model controlling the robot; the environment it's operating in — I believe this agent is trying to move a ball to the correct position — is based on a simulation tool called MuJoCo. I'll show you that a little later; we're not going to solve that one today, but you get the idea: we can train the robot, the task is moving the ball to the right place, and the reward is based on how close or far that ball is from its optimal position. There's a whole heap of applications — I've only shown four here, but there are a ton out there. Another place where it's really popular is gaming; gaming is an open-world environment, so the reward function can be really different each and every time. You can see how it starts to apply to different environments.

Okay, so what about some limitations and considerations? Reinforcement learning is absolutely amazing and I'm fascinated by it, but there are some limitations. For simple problems, reinforcement learning can sometimes be overkill — say we're looking at hyperparameter optimization, there are already really powerful methods for that, particularly when you're dealing with simple models, although for super advanced problems reinforcement learning could help you out in that space. Another thing is that it assumes the environment is Markovian: the future states of your environment are based only on your current observations, with no random external events. But we know that in real life random events happen and influence our model — say you were training your MuJoCo robot, your environment might not cater for people walking past the robot or knocking it. You never really know what's going to happen in real life; you can only train on your best-case scenario. We can sort of deal with this because in our reinforcement learning model we're going to isolate our environment, but it's something to keep in mind when somebody asks you that question. Another thing to note is that training can take a long time and is not always stable. We've got this concept called the
exploration-exploitation trade-off: ideally your model explores the environment when it's starting out and then exploits it to get the best possible rewards. But sometimes your model might not have enough time to explore and starts exploiting too early, so we need to tune hyperparameters to get it to truly explore and understand the environment. Because we don't always get this quite right, our model might not be all that stable — we might hit a cap in terms of our maximum reward. And training can simply take a long time: if you've got a really open environment, say you're trying to train a reinforcement learning model for Grand Theft Auto, it's such a huge environment that training a model to work out what to do is going to take a long, long time. Not to be a downer — I just wanted to bring up those limitations and considerations.

Now, on that note, let's get on to our setup. Step one is setup, and the first thing we're going to do is install our required dependencies. It's really simple to get up and running — just a single pip install: you run !pip install stable-baselines3, and then inside square brackets pass through extra. Stable Baselines is a reinforcement learning library that lets you work with model-free algorithms (we'll talk about that later), so we can use it to build a reinforcement learning agent and train it against a specific environment. The cool thing about Stable Baselines is that it's based on an original library called Baselines, which was built by OpenAI, and it comes with a whole heap of really useful helpers. I've got the documentation on the screen — this happens to be the migration page, but I'll include the Stable Baselines links in the description below, along with all the code you're seeing here. There's a whole heap of guides and it's a really well-supported library, with plenty of getting-started information — a single reinforcement learning environment and training run in something like 20 lines of code. We'll go through all of this in great detail as we walk through it.

So let's kick things off by installing Stable Baselines. I'm going to work inside a Jupyter notebook, and I'll give you the starter code as well as the completed code in the GitHub repo in the description below, so you can pick all of this up and work with it at your own pace. The install itself is the one-liner sketched below.
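As a quick reference, this is the install command described above, written as a notebook cell (drop the leading ! if you're running it from a terminal instead):

```python
# Install Stable Baselines 3 along with its optional extras
!pip install stable-baselines3[extra]
```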
First things first: we're going to have ten different steps in our main tutorial, and then our projects as well, so let's take a look at those ten steps. First up we'll import our dependencies, then we'll load up our environment — in this case a reasonably simple environment called CartPole, which I'll show you in a sec. We'll take a look at how to understand an environment, because that is so important, then train a reinforcement learning model, and I'll show you how to save it down to disk and reload it, so if you wanted to move it elsewhere or deploy it you could. We'll look at how to evaluate it, how to test it, how to view our logs inside of TensorBoard, how to add a callback to the training stage (which lets you stop training at a certain point once you're happy with it), how to change policies, and how to use an alternate algorithm. We're covering quite a fair bit, but take it at your own pace, and if you get stuck or have any questions hit me up in the comments below or join the Discord server — the link is in the description, and I'm always happy to chat there as well.

All right, enough of that, let's write some code. The first thing to do is install our dependencies and import them. We're installing stable-baselines3, so remember, it's !pip install stable-baselines3[extra]. Let's run that. Alrighty, it looks like it all installed successfully — there's a warning about upgrading pip, but that's fine, don't worry about it. Again, it's really simple to get started with Stable Baselines, a single pip install, but there's so much you can do with it, which makes it pretty cool. Let's take another look at that line: !pip install stable-baselines3[extra]. The reason we pass through the 3 is that Stable Baselines has gone through a number of iterations — there was a Stable Baselines 1, then a Stable Baselines 2, and we're now up to Stable Baselines 3. That's the latest package; the earlier versions ran on TensorFlow, and Stable Baselines 3 runs on PyTorch, which is what we'll be using — just something to keep in mind.

All right, that's our installation done. The next thing to do is import some dependencies, and then I'll talk you through each one. We've written five lines of code there. First up is import os — os is just the operating system library, which will make things a little easier later on when we define the paths for saving our model and logging out. Then we've imported gym — that's OpenAI Gym, and I'll talk about it a little more once we get to the environments section of the slides, but gym lets us build environments and work with pre-existing environments really easily. Then we've imported our first algorithm, PPO: from stable_baselines3 import PPO. The full dependency cell looks roughly like the sketch below.
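For reference, here are the five import lines being walked through here and in the next section:

```python
import os   # path handling for saving models and logs
import gym  # OpenAI Gym: pre-built and custom environments

from stable_baselines3 import PPO                                   # the algorithm we'll train with
from stable_baselines3.common.vec_env import DummyVecEnv            # wrapper for a non-vectorized environment
from stable_baselines3.common.evaluation import evaluate_policy     # helper to test a trained model
```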
There's a whole heap of different algorithms inside the stable_baselines3 package (note it's stable_baselines3, not stable_baselines): A2C, DDPG, DQN, HER, PPO, SAC and TD3. I'm going to talk about when to use which algorithm and under which circumstances, so don't fret if you've seen this list and thought "oh my god, there's so much" — we'll go through it and I'll give you at least some guide rails. In this case we're going to be using PPO, and if you want to see the documentation it's all there, including the performance of that particular algorithm.

So that's this line: from stable_baselines3 import PPO. The next one is from stable_baselines3.common.vec_env import DummyVecEnv. I'll talk about this more in the Breakout tutorial, but basically Stable Baselines allows you to vectorize environments — it lets you train your reinforcement learning agent on multiple environments at the same time, which can give you a huge boost in training speed. In this case we're not going to vectorize our environment, so we'll use this DummyVecEnv wrapper instead. You'll see a real vectorized environment in the Breakout project; for now, just think of it as a wrapper around your environment that makes it easier to work with Stable Baselines. The last line is from stable_baselines3.common.evaluation import evaluate_policy. evaluate_policy makes it easier to test how our model is performing: when we run it we get the average reward over a certain number of episodes (more on that later) and the standard deviation for the agent we're training. So again, five lines of code: import os, import gym, import our algorithm (PPO), import our DummyVecEnv wrapper, and import our evaluate_policy helper, which gets used a little further down. That's pretty much it for our dependencies.

Now, on to step two: environments. A key thing to call out is the difference between simulated and real environments, and this is why we're using OpenAI Gym: it allows you to build simulated environments really easily, there's a whole heap of helpers, and it's a really well-supported library, which is particularly why it's so popular for reinforcement learning. One of the benefits of simulated environments is that we can reduce cost and produce better models a whole heap faster. Say you're working for an engineering company that wants to build an autonomous agent to train this robot over here to move a ball to a certain position. This robot is called a Fetch robot, and it's a real robot, so you can actually
take a look at what it looks like. The company might only be able to afford a single robot, which limits how fast they can train it, and there are obvious costs involved in training a physical robot: you're wearing down the joints, using electricity, and it takes a lot more time and money to train in a real environment. One of the amazing things about reinforcement learning is that you can simulate that environment and train the agent in the same way. Over here you can see a replica of this robot, built inside a simulation tool called MuJoCo — I mentioned it a little earlier. This makes it a whole heap easier and more cost-effective to train your agent, which is pretty cool, because it improves people's ability to leverage reinforcement learning: rather than training in real time on the physical robot, they can do it in a simulated environment. Ultimately, though, while we may train on a simulated environment, the end goal is often to take that agent and deploy it onto a production-like environment, which would be the real robot. Likewise, if you're working on a game, you might train on a testing version of the game and deploy on the real version. So you get the idea: this one is simulated, this one is real.

Now, this is where OpenAI Gym comes in. OpenAI Gym gives you a really lightweight but feature-packed framework for building out a reinforcement learning environment. You can take a look at the docs at https://gym.openai.com/docs — if we go to that link, there's a whole heap of documentation on how to use OpenAI Gym. The nice thing, and particularly why I've chosen this framework, is that it's really well supported: a lot of people are using OpenAI Gym, so when you're looking at cutting-edge work, or at what skills to learn if you wanted to do this for your career, OpenAI Gym tends to be the standard in this space.

There's also a whole heap of pre-built environments inside OpenAI Gym. Remember I was talking about MuJoCo for that robot — it might actually be listed under Robotics — and you can see we've got our Fetch robot over here, plus this Shadow Hand robot. These are based on real robots: if you google "Fetch robot" you'll see it's a real robot that this environment exactly mimics, and the hand one is called a Shadow Hand robot. So people are training real-world robots using OpenAI Gym. There's also a bunch of environments around algorithms,
around Atari (which we'll get to a little later), Box2D, classic control (which is where the environment we're about to test lives), MuJoCo, Robotics, toy text, and so on — there's a whole heap. There are also plenty of third-party environments, so if you wanted to do something really hardcore you could take a look at those as well; remember I was talking about CARLA — I believe there's a wrapper over here that lets you leverage CARLA as part of OpenAI Gym.

In this case we're going to start with classic control and keep things relatively simple by trying to solve the CartPole environment. The goal is to get this little cart down here to balance this beam. You can see right now it's bumping around side to side and the beam keeps falling over. There are two actions we can take: move the cart to the left or move the cart to the right — I'll delve into this a little more — and what we're going to do is train a reinforcement learning agent to solve that problem.

Next up we'll take a look at what that environment actually looks like. A key thing to note about OpenAI Gym environments is that they're represented by something called spaces, and there are a number of different types of spaces that OpenAI Gym supports. The names can be a little tricky at first, so let me walk you through them. The first is Box: this is a range of continuous values, so if you want a continuous value you'll want a Box space. You instantiate one as Box(low, high, shape), passing through the low value, the high value, and the shape of the space. I'll go into this a whole heap more when we look at our environment, and we'll use some of these spaces to build our own custom environment in project 3. The next type of space is Discrete, which is just a set of items: if I write Discrete(3), the space I get back covers the values 0, 1 and 2.
So it gives you discrete numbers that represent specific mappings to something; typically you'll see Discrete spaces used for actions, so action 0 will be one thing, action 1 another, and action 2 something else. You've also got Tuple, which lets you combine spaces together — you can pass a Discrete and a Box into it and it just joins them. A key thing to note is that Stable Baselines doesn't support Tuple spaces, so it's good to know about but you won't use it all that much. There are also Dict spaces, which are just a dictionary of spaces — really similar to Tuple, but here we declare Dict and pass through a dictionary of spaces. The other two types are ones I haven't dealt with too much, but it's important to know they exist. MultiBinary is a one-hot-encoded set of binary values: if you pass MultiBinary(4) you get a list with four positions, each holding a binary flag, so ones or zeros in those positions. And MultiDiscrete is very similar to Discrete but lets you have multiple sets of values: if I pass through [5, 2, 2], I get back a range of 0 to 4 for the first position, 0 to 1 for the second, and 0 to 1 for the third. You can start to see how these spaces play together — a compact sketch of each is below, and then we'll start building our environment.
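Here's a minimal sketch of instantiating and sampling each of those space types (the bounds, sizes and dictionary keys here are just illustrative):

```python
from gym.spaces import Box, Discrete, Tuple, Dict, MultiBinary, MultiDiscrete

Box(low=0, high=1, shape=(3,)).sample()                   # e.g. three floats between 0 and 1
Discrete(3).sample()                                      # 0, 1 or 2
Tuple((Discrete(3), Box(0, 1, shape=(3,)))).sample()      # combined spaces (not supported by Stable Baselines)
Dict({'height': Discrete(2), 'speed': Box(0, 100, shape=(1,))}).sample()  # dictionary of spaces
MultiBinary(4).sample()                                   # e.g. array([0, 1, 1, 0])
MultiDiscrete([5, 2, 2]).sample()                         # e.g. array([3, 0, 1])
```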
Back in our notebook, what we're going to do now is load up our environment: first we'll use OpenAI Gym to instantiate it, and then we'll test it out. It's two lines of code — I've split it into two, but you could make it one. The first line is environment_name = 'CartPole-v0'. This is case sensitive, so make sure you get the case right: capital C, capital P, dash v0. The environment name is just a mapping to the pre-installed OpenAI Gym environments. Then we make our environment: env = gym.make(environment_name), passing through the environment name variable. If we print out environment_name, it's just a string — nothing crazy there.

Now we can test out this environment. Initially we're just going to take random actions, but eventually we'll have our reinforcement learning agent take the right actions in that environment to maximize our reward — that's what reinforcement learning is all about. What we want to do first is get an understanding of the environment. It's really important to understand what's actually happening in the environment before you try anything, because otherwise you might be trying the wrong algorithms or doing a bunch of random stuff; understanding the environment will make your life so much easier, trust me on this.

So let's write a bit of a loop to test out our environment. I've written a fair bit of code there — twelve lines — but I'll take it step by step with you, and all of this code, the starter as well as the completed notebooks, will be available via the GitHub link in the description below so you can compare as you go. We're going to test out the CartPole environment five times. We've created a variable called episodes and set it to 5, meaning we loop through our environment five times to see how we can operate within it: episodes = 5, then for episode in range(1, episodes + 1) — effectively just looping through episodes one to five; if you print the episode variable inside that loop you'll see one through five. Then we reset our environment: running env.reset() gives us an initial set of observations. Remember those key components of any environment: the agent, the action, the environment, and the observations plus the rewards. If I run env.reset() you can see the observations for this particular environment — I'll explain what these four values mean in a second, but for now just understand that these are the observations we get for our cart and pole. Later on we'll pass these observations to our reinforcement learning agent to determine the best action: the agent sees these values and goes, "hey, given these values, what action should I take to maximize my reward and get that pole into the straightest possible position?"

Then over here we set up some temporary variables: whether or not the episode is done (there's a maximum number of steps in this environment) and a running score counter for the episode. Then we've got a while loop: while not done, we render the environment — the render function lets us view the graphical representation of the environment. A key thing to call out: if you're running this inside Colab, the render function won't work like this out of the box;
you've got to do a little bit of extra work, so hit me up in the comments below if you want a hand with that. Next we generate a random action: rather than taking our observations and generating a genuinely useful action, we just take a random one, which is akin to running env.action_space.sample(). This is good to note as well: if I take off .sample(), remember the different types of action spaces — in this case we've got Discrete(2), which means there are two possible actions, 0 or 1. Calling .sample() gives 1 this time, then 1, then 1, then 0, so you can see our action space just has those two actions; that's what Discrete(2) represents.

We can take a look at our observation space as well, and this is a key thing to call out: any environment has two spaces — the action space, the actions you can take on the environment, and the observation space, what your observations (a partial view of the environment) look like. If we type env.observation_space you can see we've got a Box space: these values are the lower bound, these are the upper bound, the (4,) means there are four values, and the dtype is float32. You can start to see how the environment is built up. We can sample this too, and it looks almost identical to what we get from env.reset() above.

Then we pass our random action through to the environment — the next line — using env.step. If we run env.step(1) you can see we get an observation back, and we can keep doing this; it's really just us passing through an action. What we get back is really interesting: the next set of observations, a reward — whether we're getting a positive or a negative value, so 1 is an increment and 0 or -1 is a decrement — and then a boolean specifying whether the episode is done. Remember that done flag in our loop: once the episode is done, we stop. So the full line is n_state, reward, done, info = env.step(action) — just unpacking the values we get back from env.step. The next line accumulates our reward, score += reward, and then we print out the results for that episode: print('Episode:{} Score:{}'.format(episode, score)), passing through the episode and the score (I call those curly braces "squiggly brackets"). Last but not least, we close our environment with env.close(). Putting it all together, the loop looks roughly like the sketch below.
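Here's roughly what that test loop looks like end to end, assuming the imports from the dependency cell earlier (the first two lines are the environment creation described above):

```python
environment_name = 'CartPole-v0'
env = gym.make(environment_name)

episodes = 5
for episode in range(1, episodes + 1):
    state = env.reset()     # initial observations for this episode
    done = False
    score = 0

    while not done:
        env.render()                                      # pop up a window showing the environment
        action = env.action_space.sample()                # random action: 0 = left, 1 = right
        n_state, reward, done, info = env.step(action)    # apply the action to the environment
        score += reward                                   # accumulate the episode reward
    print('Episode:{} Score:{}'.format(episode, score))

env.close()
```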
When we use env.render() you'll get a Python pop-up; to close it down you just run env.close(). So that's all well and good — let's actually test it out. If we run this now, you should see our environment testing itself out down the bottom. If we don't want it to close, we can comment out that last line so we can keep watching it. It runs really quickly and just moves the cart around; when we test our trained agent later it'll run a little more slowly. If we keep running it — looks like we've screwed it up, let's close it with env.close(), run it again, close it again — you can start to see what's happening: our actions are moving this black cart to the left and to the right, and the pole sways in response. Ideally the goal is to hold that pole as straight as possible for as long as possible.

All right, that's a whole bunch of stuff done. We've seen how to sample our environment, so let's look at it in a little more detail. Remember, there are two parts to our environment: env.action_space gives us our actions, and env.observation_space gives us our observations. You're probably thinking, "well Nick, what are these values we get from these spaces?" If we add .sample() to the end of each of them we get example values, so let's duplicate the cells so we have both: the space itself describes the type of space, and .sample() gives an actual example.

Now we can look at what these values represent — I've got a link here from the OpenAI Gym documentation which gives a little more detail. In terms of our observation space, remember we've got Box(4,): the first position is the cart position, with a minimum of -4.8 and a maximum of 4.8; the second is the cart velocity; the third is the pole angle; and the fourth is the pole angular velocity, which I'm guessing is the speed at which the pole is tipping over. So those four observation values map to that. Not every environment is going to be documented this well, but I wanted to give you the idea: cart position, cart velocity, pole angle, pole angular velocity. In terms of our action space, remember we've got two possible actions, 0 or 1: action 0 pushes the cart to the left and action 1 pushes it to the right. A quick annotated sketch of that inspection is below.
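A small annotated sketch of those inspection cells (the comments just summarize the mappings described above; .sample() returns something different each time):

```python
env.action_space            # Discrete(2): 0 = push cart left, 1 = push cart right
env.action_space.sample()   # e.g. 1

env.observation_space           # Box(4,): [cart position, cart velocity, pole angle, pole angular velocity]
env.observation_space.sample()  # a random draw from that box, same shape as env.reset()
```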
So you can see how these actions and observations play together — and that's our environment in a nutshell. We defined it, we tested it out, and then we took a granular look to understand how it fits together. I think this is really important because it gives you an idea of what the hell you're trying to solve. Keep in mind that whenever you're solving one of these environments you're typically going to have an action space and an observation space, and it's a good idea to understand what each of them means. On that note, our environment is set up, so we can close this out and look at what's next.

This brings us to step three: training. A key thing to call out is that there are a heap of different reinforcement learning algorithms, typically grouped into model-based RL and model-free RL. We're mainly going to focus on model-free reinforcement learning algorithms, because that's where a lot of development is happening, but that's not to say model-based reinforcement learning isn't useful. The core idea behind model-free RL is that it only uses the current state values to make a prediction; with model-based reinforcement learning, the agent tries to predict the future state of the environment in order to generate the best possible action. There are advantages to each — I'm not going to go through them in detail, but there's a really good explanation of model-free versus model-based RL on the OpenAI Spinning Up site at https://spinningup.openai.com. A key thing to call out is that Stable Baselines really only deals with model-free RL; there are other libraries that cover model-based RL as well — I believe RLlib is one of them. So we'll focus on model-free RL, specifically the A2C algorithm and PPO — we'll probably only use those two to begin with — and we'll also use DQN later on, so you can see what a few different algorithms look like. That gives you an idea of what's out there: a bunch of algorithms, broadly grouped into model-free versus model-based reinforcement learning.

A core thing to note is choosing the best algorithm for your use case. We've talked a little about different types of action and observation spaces, and the algorithm I'd suggest you use should ideally be one that maps appropriately to your particular type of action space. In Stable Baselines there's a whole bunch of algorithms — A2C, DDPG, DQN, HER, PPO, SAC and TD3 — and I'll show you how to use different ones a little later, but a key thing to note is that certain algorithms only work on certain types of action spaces. For example, A2C works on Box, Discrete, MultiDiscrete and MultiBinary spaces, DDPG works on Box spaces only, and DQN works on Discrete spaces only — and this is in reference to the action space, not so much the observation space. So remember, if we go back to our main tutorial, type env.action_space and see that it's Discrete, then —
scrolling back, you know you can use any of the models down here that has a green tick under Discrete, so we could solve this with A2C, DQN, HER or PPO. If instead your action space had the shape of a Box, you'd be looking at A2C, DDPG, HER, PPO, SAC or TD3. There's also a little bit of a guide down here:

- Discrete actions, single process: DQN
- Discrete actions, multi-process: PPO or A2C
- Continuous actions, single process: SAC or TD3
- Continuous actions, multi-process: PPO or A2C

A key thing to call out, guys, is to treat these algorithms as commodities: you can choose whichever one you want for your use case, and some will perform better than others. It's good to know how they work, and better to know in detail how they're put together, but you don't need that level of detail to try your hand at this; it's just important to know which algorithm suits which type of action space. All of them are in the Stable Baselines documentation — for PPO, for example, that's stable-baselines3.readthedocs.io/en/master/modules/ppo.html — and all the links will be in the description below.

So we've talked about the different algorithms and when to use which one. Another thing to note is that you need to understand your training metrics. Which algorithm you use determines exactly which metrics you get during training, but broadly you'll see something like this once we kick off training, and we can break it into evaluation metrics, time metrics, loss metrics, and other metrics. The evaluation metrics cover the episode length mean and the episode reward mean — these are averages, and the length is how long an episode went for: if you're playing a game, think of one episode as a single game, and when we're balancing our cart pole an episode is capped at the maximum number of steps we're allowed to take. The time metrics include frames per second (how fast you're processing), iterations (how many times you've gone through), time elapsed (how long it's been running), and total timesteps (how many steps you've taken overall). There are loss metrics — entropy loss, policy loss, value loss; if you want a more detailed explanation of those, hit me up in the comments below — and some other metrics as well: explained variance (how much of the variance in the environment your agent is able to explain), the learning rate (how fast our policy is updating), and n_updates (how many updates we've made to our agent). Swapping between those algorithms later on looks roughly like the sketch below.
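As a preview of what swapping algorithms looks like, here's a minimal sketch — A2C and DQN live in the same package and are instantiated the same way as PPO (DQN is valid here because CartPole has a Discrete action space; this assumes the env defined earlier):

```python
from stable_baselines3 import A2C, DQN

# Same interface as PPO: pick whichever algorithm suits your action space
model = A2C('MlpPolicy', env, verbose=1)
# model = DQN('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=20000)
```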
A core thing to call out is that by default, when we install Stable Baselines with the pip install command, we get it without GPU acceleration. If you want GPU acceleration, all you need to do is install the appropriate PyTorch version. Say I wanted to leverage GPU acceleration on my machine, which I'll show you in a sec: I'd go to pytorch.org, hit the install page, and scroll down to where it gives you the steps. I can choose the stable build; I'm on a Windows machine, but if I were on a Mac or Linux I'd choose those instead. Then I choose how I want to install — pip in this case — the language (Java is an option, but we're working in Python, so Python), and then the compute platform. This is the really important bit: CUDA and cuDNN are only supported on NVIDIA GPUs, so to leverage GPU acceleration via CUDA you have to have an NVIDIA GPU. There's also ROCm, the beta package for AMD GPUs, but I believe that's only available on Linux at the moment — you can see it's not available on Windows — so in my case I'd use CUDA 10.2 or CUDA 11.1. This is only needed if you want GPU acceleration, and to be honest, with reinforcement learning you won't see as much of a training speed-up from a GPU as you would with traditional deep learning, so if you don't have a GPU, don't stress — you can skip this step. I just wanted to call it out for people who do have one; we'll probably set it up in one of the later projects.

Alrighty, on that note, let's go ahead and train our agent. Skipping back into our notebook, we're going to start training our reinforcement learning model. First up I'll define a log path, which is where we'll save our TensorBoard logs — if we want to monitor training, we can look inside this log directory and see how our model is performing; I'll show you how further down. A key thing to call out is that this path needs to exist. We could create it from code, but I've just done it manually because it's reasonably straightforward: inside the folder we're working in I've created a folder called Training, and inside that two more folders, one called Logs and one called Saved Models (both had old files in them, so I've deleted those since we don't need them). If you'd rather create those folders from code, a quick sketch is below.
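If you'd prefer to create the directories from code rather than by hand, something like this would do it (a small sketch; the folder names just mirror the ones used in the video):

```python
import os

# Create Training/Logs and Training/Saved Models if they don't already exist
os.makedirs(os.path.join('Training', 'Logs'), exist_ok=True)
os.makedirs(os.path.join('Training', 'Saved Models'), exist_ok=True)
```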
Our Logs folder is where we'll save our training logs, and Saved Models is where we'll save our trained model; we'll use both a little later. Okay, that's done — and when you're doing this yourself, I'll add a comment: make your directories first. So we define our log path: log_path = os.path.join('Training', 'Logs'). Because I'm on a Windows machine the path prints with double backslashes; on a Mac or Linux machine I believe it's a forward slash. Cool, that's our log path defined.

The next thing to do is instantiate our algorithm, and specifically our agent. Remember when we imported our dependencies we imported PPO — that's the algorithm we'll use for this environment — so let's define it. That's our algorithm set up. You can see it printed "using cuda device"; that's because I currently have GPU acceleration set up for this environment. If you're not using the PyTorch CUDA build you'd see "using cpu device" instead — no need to stress, I'll show you how to set up GPU acceleration later, and if it says cpu device you're still good to go.

To do this we've written three lines of code. First we've recreated our environment, just to keep it all encapsulated: env = gym.make(environment_name), which is no different from the line earlier. Then we've wrapped our environment inside that DummyVecEnv wrapper we imported up top: env = DummyVecEnv([lambda: env]), where the lambda inside the square brackets is an environment-creation function. This lets us work with our environment wrapped inside a dummy vectorized environment — again, just think of it as a wrapper for a non-vectorized environment; I'll show you a real vectorized environment in project one. Then we've defined our model — think of this as defining our agent: model = PPO(...), where PPO is the algorithm we imported, and we've passed through two arguments and two keyword arguments. The first argument is the policy, in this case 'MlpPolicy', which stands for multi-layer perceptron policy — a neural network built from standard fully connected units, with no LSTM layers and no CNN layers. In projects 1 and 2 we'll use a CnnPolicy instead. One thing to call out: Stable Baselines 2 actually had an advantage over Stable Baselines 3 here, in that it had an MlpLstmPolicy; if you wanted to use windowed data sets, which are particularly useful for trading and finance as well as certain gaming applications, that particular policy unfortunately isn't available in
baselines 3 as far as i know so again if that changes i'll mention it in the pinned comment below for now it supports mlp policy and i believe cnn policy you'll see cnn policy inside of project one the next argument that we've passed through is our environment so this is going to be this vec dummy vectorized environment here we specified verbose equals one because we want it well we want to log out the results for that particular model and then we'll specified our tensorboard log path so tensorboard underscore log and we've specified it as this log path here so if we actually go and take a look at this algorithm ppo there are a whole heap of arguments that we can actually pass through here so you can see that we can pass through the policy the environment the learning rate the number of steps the batch size the number of epochs gamma gae lambdas so there's a whole bunch of different types of piper parameters that you can actually train on here as well and a whole bunch of different things that you can actually train so again if there's a whole bunch of documentation on this particular environment and you can see all of that there so again we're keeping this pretty simple and we're using the standard hyper parameters in this particular case so now that our agent is now set up the next thing that we need to do is just go on ahead and train it so again this is pretty simple from here on out so we just need to use model.learn to be able to go and train it so let's do it so if we type in model.learn and then we just need to pass through the number of time steps that we want to train it for so again i'm just going to pass that through and initially i'm just going to set that to 20 000. so the full line is model dot learn and then we're passing through a keyword argument so total underscore time steps equals 20 000. 
now you can play around with this number and in terms of how long you want to train so for a simple environment you're probably going to be able to get away with a lower number of total time steps for a sophisticated environment say for example breakout or the self-driving environment you're probably gonna need a heap more so for example for cardpole i've managed to solve it more often than not in under 20 000 steps for the breakout and the self-driving tutorial well breakout doesn't actually have a end goal per se but that actually took around about 300 400 000 time steps as did the self-driving tutorial so again the complexity in the environment is going to define how many time steps you need to train for so in this case we're pretty happy with 20 000 so we can go on ahead and kick that off and what you'll see eventually is once this model starts training it looks like we've got a bit of an error there okay it looks like it might have just been a warning okay so you can see our model is now starting to train so we're getting our time metrics and we're also getting a whole bunch of additional training metrics so let's let this go on ahead and run and then as soon as it's done we'll be able to test it out okay so we can see that our model has finished training and if we take a look so it looks like we've got an explained variance of 0.231 we've got an entropy loss of minus 0.599 a learning rate of 0.000 loss of 57.6 looks like it wasn't all that stable to the end but that's fine let's test it out and see what this actually looks like so that's our model now trained or at least trained for 20 000 steps now if we wanted to we could go and train this for longer so all we need to do is go and run it again it's going to start training again so you can see it's kicking off training and it's going on so again if you wanted to train it for longer all you need to do is go and run that again now now that we've kicked it off let's let that finish and then we'll actually test it out okay so that is our next round of training now done looks like our explained variance is a little bit higher learning rate's still the same we've gone through a total of 20 480 time steps so again this is just for this latest run now more often than not what you're going to want to do is you're going to want to save down this model and move it around if you wanted to go and deploy it you'd want to be able to save it so let's take a look at how we might save and reload our model first up and then we'll go and evaluate it so we're going to define a path and we're just going to call it a ppo path similar to what we did for our log path cool so that's our path defined so i've just written ppo underscore path so that's going to be our path variable equals os dot path dot join and then we're going to be saving it inside of our effectively our saved models folder so training and then saved models and then our file name is actually going to be ppo underscore model underscore cart poll so this is going to save our model inside of this folder so reinforcement learning course well this is my current folder training and then saved models so it's going to be saved in here so if we go and save it now so you can see that our model is now saved there so ppo underscore model underscore carpol so again to save it all i've written is model dot save and then i've passed through this ppo path now if we wanted to we could actually go and delete this model and reload it so let's go ahead and do that because this sort of simulates deployment 
right you're going to be reloading from your saved model each time so let's do it so i'll just write del model to delete our model and then what we can do is we can reload this model back into memory so if i type in model so we're just going to define a variable called model and then we can actually reload it so to do that we're just writing ppo dot load and then we just pass through our path or the path to your actual model so if you save it somewhere else you're going to make sure or make sure that you pass through the full path to the models and then we're going to pass through our environment as well so the full line is model equals ppo dot load and then to that we pass through the path where our model is actually saved so remember ppo underscore path is just going to be where our model is so in this case it's in training saved models ppo model carpol it's exactly this so training saved models ppo underscore model underscore carpal so that's the same file that we're working with so let's load it so right now so before i run this cell so you can see if i type in model dot learn for example total time steps equals a thousand so this would be our training step so you can see that we've got name model is not defined because remember we deleted our model over here now if we actually went and loaded it you can see that we've now loaded our model and if we go and run this you can see that we're now training again right so you sort of get the idea so you can go and train your model you can save it using model.save and then you can reload it using ppo.load so remember model.save and then you actually use the algorithm.load to be able to load it back up cool that is our training now done so in a nutshell we've done quite a fair bit there so we've ridden our so we've actually created our algorithm or our agent so ppo and then we'll pass through our parameters we've used model.learn to trade our model then use model.save to save our model and then ppo or whatever algorithm you're using dot load to be able to go and reload it into memory so again those four key components are really really important so use the algorithm to find the hyper parameters model.learn to train it model.save to save it and then whatever the algorithm is dot load to reload it on to step four testing and evaluation so so far what we've gone and done is we've set up our environment we've gone and trained it but we haven't actually done anything with our trained model as of yet well what we're going to want to do is we're actually going to want to train our model to see how it's actually performing now you would have noticed that when we actually went and trained that model using the ppo algorithm so let's actually go back and take a look we didn't actually get those training metrics now the training metrics that i was sort of showing up here or the rollout metrics that you can see there are very much dependent on the algorithm that you're using so with the a2c algorithm i believe you get these rollout metrics but with ppo you don't so what you're going to want to do is you're going to want to evaluate the model itself to see what the performance is actually like now we can actually use the evaluate policy method that we imported right up at the start to be able to see what that actually looks like but a key thing to call that is if you do get these metrics it's a great thing so the two key ones that you need to pay attention to are the episode mean length or episode or ep underscore len underscore mean so this is how long each 
episode actually lasted on average so say for example you're playing breakout it's how many times your model was able to play or hit the ball or how many frames it was able to go through before the model eventually died so with gaming this is particularly important the reward mean is effectively the your average reward so remember think back to our dog environment so it's how many times or it's on average how many times your dog got a trade or your average reward in this particular case now we can actually get metrics similar to these by using that evaluate policy method and we can also monitor those training metrics inside of tensorboard so remember when we actually set up our model we actually pass through tensorboard underscore log and we specified our log path so if we go back to that so you can see over here when we defined ppo we actually specified this tensorboard log path so we can then go and actually take a look at those metrics and those are going to be our training metrics cool so let's go on ahead and do that and this is how you start tensorboard but again i'm going to show you how to do this in a second so let's do it so we are now up to evaluation so let's go ahead and do this so we're going to be using the evaluate underscore policy method from up here so remember this is going to be a method that allows us to test how well a model is actually performing now the ppo model in this particular case is considered solved if you get on average a score of 200 and higher so ideally we want to see that our model is scoring 200 on average to determine whether or not the environment is actually being solved now certain environments are going to sort of have a cap as to where it's considered solved others are just going to be continuous whatever highest score you get is the best so breakout and the self-driving tutorial i don't believe have caps but in this case the carpol environment does so let's go ahead and test this out okay so we've gone and written our line to go and test out our policy or evaluate it and the line that we've written is evaluate underscore policy and then to that we've gone and passed through two arguments and two keyword arguments so we've gone and passed through our model our environment how many episodes we want to test it for so in this case we've passed through n underscore eval underscore episodes equals 10 and then we've gone and specified render equals true so passing through render equals true determines whether or not we actually visualize it in real time so if you're evaluating this policy on colab then you want to specify render equals false because you don't want to visualize it it's not going to work at least with the default evaluation so let's go ahead and test this out and see what our model actually looks like so you can see it's way more stable this time so remember when testing it out at the pa at the start it was sort of falling down and going sort of all over the place now it's perfectly stable right so you can see that it's balancing it almost exactly and so it's going to do this 10 times so it's going to go through 10 different episodes and take the average reward and we'll actually see that in a second pretty cool right so in just a couple of lines of code you've been able to build a reinforcement learning agent now again the training speed is going to be pretty much very similar when you're training on a gpu or not on a gpu it's going to be very very similar so don't fret if you don't have a gp on your machine test this out regardless 
cool so that's now done and you can see on average our reward is 200 so this environment is now considered solved so we're good so these two values that you get out of evaluate policy are the average reward over the number of episodes and the standard deviation in that reward so in this case we're getting a average score of 200 with a standard deviation of zero so we're perfect we're absolutely bang on in this particular case now the next thing that we want to do is actually close down that environment so again we have it over here so if we wanted to close it we can just type in emv.close and that is going to close it down now right now we've gone and evaluated it but if we actually wanted to go and deploy this how would we actually go about doing it so this is sort of just testing out our model in an encapsulated function what would happen if we actually wanted to go and rather than doing it like this we actually wanted to do it sort of similar to what we did up here well we can actually do that the core difference is that rather than using env.actionspace.sample we're actually going to pass through our environment observations to our agent now to try to predict the best type of action because that's what reinforcement learning is all about remember we're going to take our observations pass it to our agent our agent is going to try to determine the best type of action to take to our environment to maximize our reward so the flow is going to be very much similar so we'll take a look at our environment so we'll use env.reset to reset our environment and get our observations we're then going to use model.predict on those observations to try to get the best possible type of action and then we're actually going to take that step so we can actually copy this entire block of code here and right down here let's go ahead and test out our model now to this we're going to make a few key changes so rather than using env.actionspace.sample we're going to change this to model dot predict and then to our model.predict function we need to pass through our observations now right now we've got our observations named two different things so we want to change this so env.reset we're going to change that variable to be equal to obs and then down here in env.step.action rather than having n underscore state we're going to change this to obs as well so ideally what we're going to change is this line over here so before it was state equals e and b dot reset we're going to change our action line so rather than having env.actionspace.sample we're going to change it to model.predict and to that we're going to pass through our observations and then in our env.step line which is where our action is taken we're going to change that first parameter to obs as well so now if we run this rather than taking random steps we're actually going to be using a model to take steps so remember we've now subbed out our model so we're now now using model here so if we go and run this looks like we've got a bit of an error let's take a look and we might need to oh we're actually going to get two parts from our model so if we actually uh let's actually take a look at this we use model.predict obs we actually get back two values so we get this array and we get this none value so our action is actually the first bit and the second component are our states we actually don't need that second component so we just make that underscore so if we actually do that now and we take a look at our action that's looking better all right cool so 
we're just going to make this one change so we're going to unpack this value and our environment is still open so let's go ahead and close it it's closed all right let's try this again there you go so you can see performing way better than before it's now balancing that that pole way better than what we had initially when we were just taking random steps and you can see the score being printed down below so we're getting 200 pretty much every single time which means we're smack bang on solving that model and there you go so you can see that we've now gone and done that and again we can close this now if you wanted to run this continuously you could as well but in this case we're doing it in a nice sort of loop and then and again we can go and do this again so let's try running it pretty cool right so we're now actively using our reinforcement learning agent to be able to go and interact with our open ai gym environment so it's now balancing the poll a whole heap better now let's actually take a look at what we did there so we went and let's actually delete this so if you cast your mind right back up to section two where we're loading our environment and we're playing around with how we can actually play with it now what we actually did is we had one really important function which was emv dot reset now remember when we type in env dot reset we're going to get the observations for our observation space what we can actually do is we can take these observations or what we're actually doing is we're taking these observations here and we're passing them to our model so model.predict obs now what you're actually getting back here is two values so let's actually take a look at what we're getting back to do so we are going to get back the model action and the next state so that's used in recurrent policy so again because we're not using our current policy we're not getting that next state so what we actually get back in this particular case that is relevant to us is this first value here which is our array now remember in terms of our action space space remember there were two different types of action so zero let's see if we get one zero and one now what we're basically getting here is rather than just getting a random action we're using our model dot predict function on our observations from our environment to generate this action here so you can see that rather than getting env.actionspace.sample our model is actually predicting that based on the observations of our current environment right now you should take action one in order to get the best possible reward so this is effectively what reinforcement learning is all about so if you cast your mind back to that diagram so we've got our agent we've got our environment we've got our action and we've got our reward let's actually go back to that slide right so we've got our agent so in this case our agent is this model we've got our action oh that's actually so we've got our agent we've got our environment so remember our environment is emv this variable here we've got our action in this particular case which is what we're just generating here so this one and we've also got our observations which is this value here so remember our observation so we can print that out is these four values now if you cast your mind back we actually took a look at what each one of these observations meant over here so our observations are our car position our cart velocity our pole angle and our pole angular velocity so again you can start to see how this is all 
sort of fitting together you've got those four key components you've got your you've got your agent you've got your environment you've got your action and you've also got your observations now a core thing that i haven't called out yet is the reward right so we saw that we got our model.predict now how do we actually determine what our reward is well we get our reward when we run env dot step so if we actually do that now in b dot step and remember our model just predicted take this action so if we go and take that extract that out and if we pass our action to this what we're actually getting back is those values that are relevant to us so we're going to get our state so this is the state after taking our action on it then this value over here is actually our reward so you can see our reward in this particular case is a value of one now let's actually take a look does it talk about reward uh reward there you go so reward is one for every step taken including the termination steps so this basically means that we haven't let up hole four down completely which means that we get a reward of one if you pass a certain threshold and your pole starts to fall down then you don't get that reward so by basically keeping our pole in the upright position and not falling down we're getting accumulating a value of one every single time which is how we've got this value of 200 here so that in a nutshell sort of shows you the theory all the way through to the practical so these five steps are getting or what is what are we up to step six these six steps sort of show you how to define an environment how to train a model how to evaluate it as well as how to test it out as well so we've done quite a fair bit there now while you're training so this obviously trained really really quickly right so we're able to spin it up train it really quickly and get it up and running now if you are training a way larger and way more sophisticated environment what you might want to do is view the training logs inside of tensorboard so what we can actually do is do exactly that now i'm going to start it up from within jupiter notebooks but ideally you would want to run this from a command prompt so that you're not locking up your notebook because once you run this it's going to run continuously it's not going to unlock your notebook you're not going to be able to run anything else so i'll sort of show you how to do this and then we'll continue onwards so what we first need to do is we need to get the log directory that we want to view so if we go back to our folders so if i go into so this is our root folder so if i go into training logs you can see that we've got three different training sets now remember we kicked off our training three times so this is where we've got three different sets of logs so our one was our i believe were they all the same no our second one our first and our second one were the longest our third one was only that 1000 training steps so let's actually take a look at ppo2 so what we're going to do is we're going to go into that folder and we're going to specify tensorboard to run from that folder so first up what we need to do is give it a path to that ppo2 folder so let's specify that first up okay so we've gone and specified our training log part so if we go and take a look at that so you can see this is giving us a path to our ppo2 folder so training logs and then ppo2 so this is effectively where we gone so training and then logs and then ppo2 so this file over here is our tensorboard log file 
that we're going to be able to use now all we need to do to kick off tensorboard from within that folder and you're going to need tensorboard lot installed so i believe it's just a pip install tensorboard to go and do that to go and run this we just need to run exclamation mark tensorboard dash dash log dear and then we need to specify our training log path yeah that looks about right so written exclamation mark tensorboard dash dash log dear equals and then inside of squiggly brackets training underscore log path we've written that wrong so let me quickly explain what this line is doing so i think i've had some comments on this before so using an exclamation mark inside of a jupiter notebook is known as using a magic command so this allows you to run command line commands from within your notebook so by me putting through exclamation mark this is akin to me going to a command prompt or to a terminal and writing tensorboard dash dash log blah blah blah whatever so in this particular case what i've actually written is exclamation mark tensorboard dash dash logged let me actually show you it's probably going to make more sense so if i went to d drive cd youtube cd reinforcement learning uh so let's actually go into let's actually specify the exact same thing so written training log pass that's going to go into training logs okay so this is akin to me doing this so tensorboard dash dash log dear equals training slash logs slash ppo2 so right you can sort of see how this is actually running inside of a command prompt and eventually you should get a line that says it's running at http localhost 6006 this line over here that i've written inside of a command prompt is exactly the same as what we would be running over here so what i can do is i can go to this link over here which is being created by tensorboard and you'll get all of your training and doesn't look like we've got any training metrics what's happened there okay let's just go directly into the folder so we'll go into training then we're going to go into ppo2 and then we'll run tensorboard dash dash log dear equals dot that times the charm let's see if this works okay so it should be running at http dash or colon dash dash local host and then six and then six thousand and six let's refresh now okay way better so what we went and did is i went and just seeded into the folder i'm guessing i'm getting this path that i was specifying incorrect but that's fine you can sort of see how to run it there all right so from here you are going to get a whole heap of different metrics now specifically you're going to get train metrics so this sort of shows you the frames per second and you're also going to get a number of train metrics so you're going to get clip fraction approx underscore k and if you want to deeper dive into what these metrics mean how to evaluate them by all means do hit me up in the comments below i'll probably have a little blue box somewhere in the corner up here that sort of explains them as well but you can start to see that you're getting all of your different training metrics so you're getting your entropy loss your explained variance which should go higher your learning rate which looks like it stayed pretty stable your loss which looks like it's going down your policy gradient loss as well as your value loss so again you're getting a whole bunch of different types of training metrics that you can view in tensorboard now this is obviously run from the command prompt which we had to do a little bit of reducing to get to 
work now rather than doing that you can just run it from the notebook as well so if i stop this now all right and close down this command prompt what we can actually do is run this command and it'll effectively do the same for us right so this is currently running you can see the little asterisks there over here now if i go to localhost 6006 that gives us tensorboard all up and running now so you can start to see how to view those logs as well now in this particular case that is our set of metrics now done uh if anyone has any feedback on any of this or has um a better way to launch tensorboard by all means do hit me up in the comments below i'm always welcome to feedback but that sort of brings us to the end of our testing and evaluation step so what we went and did is we went and evaluated our model using evaluate underscore policy we went and tested it out and we also went and viewed our login tensorboard so it looks like it still runs even though you end the cell so that's something i might need to dig into if you do have any problems with that do hit me up in the comments below though now a quick word on performance tuning and performance in general when you are training your model the core metric that you should be looking at is your average reward so this gives you an indication as to how well your model is going to perform in that particular environment using that particular reward function now the other metric that you want to be taking a look at is your average episode length so this ideally aims to identify how long your agent is actually lasting in that particular environment now this is particularly important when you have environments that don't have a fixed environment length so when we take a look at the breakout environment and when we take a look at the self-driving environment those are really really good indicators to be taking a look at now what you can actually do is if your model is not performing that well there's three key strategies that i'd suggest you start taking a look at so these strategies are one train for longer so if you've only trained for say for example ten thousand twenty thousand or a hundred thousand steps try training your model for longer and see if that improves performance the other thing that you can also take a look at is hyper parameter tuning so again when you're dealing with deep learning models or even traditional statistical machine learning models hyper parameter tuning can yield significant results now stable baselines supports hyper parameter tuning using a package called optuna so we're not going to show it in this course but if you'd like to see a little bit more on that do let me know the last thing that i'd suggest you take a look at is take a look at the different algorithms that people are using to perform state of the art training as well so this can be another thing that helps you out when it comes to improving your performance okay on to our next topic so what we're going to do are we skipped all the way through so let's go back to where were we we are now up to step five so call backs alternate algorithms and architectures so what we're going to be doing in this particular step is we're going to be recreating our model but this time what we're going to do is we're going to specify a reward threshold so this means that our training is going to stop once it hits a certain benchmark now this is really really useful when you've got really really large models that you're trying to train and you want to stop them before your model 
starts getting unstable now what we can actually do is we can use some of the helpers from stable bass lines to do this so we can use the eval callback and the stop training on reward threshold callback to do that now the nice thing about this is that you can actually save your model as part of this as well so it will automatically save your best model we're also going to take a look at how we can define a different neural network architecture so remember we specified the mlp policy but we can actually pass through a different neural network architecture as well and then last but not least we're going to take a look at how we can use a different algorithm so that's the last thing that we should be doing in that particular section so let's kick this off and do it so the first thing that we're going to be taking a look at is how we can add a callback to our training stage now the cool thing about this is that if you need to stop your training after a certain reward threshold this gives you the automated capability to do that now this is really really important particularly when you're training huge models or models that are going to take a long time to train so say for example you're training the breakout tutorial the self-driving tutorial you might want to use this but again not mandatory just gives you the ability to extend out your training step so in order to do this we first up need to install a couple of additional dependencies namely some helpers from stable baselines so let's go ahead and import these okay so we've gone and written one additional line of code there so the line that i've written is from stable underscore baselines three dot common dot callbacks import eval call back comma and then stop training on reward threshold so our eval callback is going to be the callback that runs during our training stage and our stop training on reward threshold is going to be think of it like a checker so basically once our model passes a certain reward threshold so remember our reward for our carpool environment was our the reward which indicates that solved for the carpool environment is 200 so we'd basically be stopping our training once it receives or once it achieves that 200 score on average so that's now set up now what we need to do is actually set up these callbacks so let's go ahead and set them up and then we'll kick off another training run using it okay so those are our two callbacks sort of set up so there's one additional thing that we need to do and we need to specify uh where our best model is going to be but i'll come back to that in a second let's actually take a look at what we wrote so first up what we're doing is we're setting up our stop training on reward threshold callback this is the callback that's basically going to stop our training once we pass a certain reward threshold so in order to do that written stop underscore callback equals stop training on reward threshold which is this that we imported up here and then we're passing through our reward threshold so this basically specifies after which average reward we want to stop our training on so in this case i've set it to 200. 
and then i've also specified verbose equals one so we get some additional logging then the next callback that i've actually written is the eval callback so this is the callback that's going to be triggered after each training run now in this particular case i've written eval underscore callback equals eval callback and then two that will pass through a number of arguments so first up we're passing through our environment then we're passing through the callback that we want to run on the new best model so in this case i've written callback on new best and then we've specified stop callbacks so this basically means that every time there's a new best model it's going to run that stop callback and if the stop callback realizes that the reward threshold is above 200 then it's going to stop training overall now we can also specify how frequently we want to run our evaluation callback and in this case i've set it to 10 000 time steps and then i've also specified the best underscore model we actually need to do this so what we can actually do is have this eval callback save the model every time there's a new best model what we do need to do however is specify what we want that model to be called so let's specify that so in this case all i've gone and defined is the save path so this is where we want to save that best model so i've written save underscore path equals os dot path dot join and then i've just specified the same saved model folder so training and then comma saved space models and then what we're going to do is we're going to specify our best model as our save path so this means that after every 10 000 runs it's going to double check whether or not it's past the 200 reward threshold if it has it'll stop the training and it'll also save out the best model to this save pass so so you'll actually see this when we actually trigger it so if we run this cell now it looks like we've written this this should be best model save path my bad there we go cool so we just needed to update that parameter there so it should be best underscore model underscore save underscore path and then we've specified our save path that's all good to go now so those are our two training callbacks now what we now need to do is associate this to our model so we're going to create a new ppo model and assign these callbacks so that's our new model created again this line is exactly the same as what we did when creating an initial model under step three so again exactly the same as this line what's different now is that when we run our training command so model.learn we're going to pass through our callback so let's do that okay so what we've then gone and written is model.learn and then rather than just passing through the total time steps we're also passing through the callback that we want to run and in this case we're passing through our eval callback so this is going to be the callback that checks every 10 000 steps and again on every 10 000 steps it's going to save the best new model if it's got it into that save path and it's also going to check whether or not it's past that average reward threshold so if we go on and kick this off now we should see our training kick off so let's do that and so after 10 000 steps you should see the fact that it's evaluating whether or not it's past our reward threshold so you can see it's about 8 000 right now so it should check on the next one so you can see there so it's gone and evaluated it it's checked the episode length so it looks on average it's 198.8 so if we keep on 
going after another 10 000 we'll see that eval callback run again and there you go so because it's now hit the 200 it stopped our training and again we only had 20 000 but if we train for longer it would stop it regardless because it's now hit that 200 score threshold it's gonna stop the training pretty cool right so this gives you a lot of flexibility when you're actually going out and training really large models and you want to cap it off before it just sort of runs wild now another thing to note is that this will have also have saved our model so if we go into reinforcement learning go into our training folder in our saved models folder you can see this best model folder or this best model is now saved as well so this is a as a result of actually having our callback saved it actually goes about and saves the best model as well okay so that is our callback now the next thing that we can do is also change our policy so say for example we wanted to use a different neural network what we can do is do that as well so let's take a look at how you might do that so in order to change our policies we can actually specify a new neural network architecture now in order to do this we need to specify a network architecture for our custom actor as well as our value function so i'll show you how to do this so this is akin to just changing the number of units and the number of layers inside of your neural network so again pretty simple to do and then we can pass it on to our model so let's do it okay so that is our new neural network architecture defined now what we need to do is actually associate it to our algorithm so what i've actually written here is new underscore arch so new arch equals and then inside of square brackets i've defined a dictionary now the first neural network architecture that i've defined is for our custom actor so in order to do that we need to pass through pi and then we're specifying that we're going to have a new neural network with 128 units in each one of those layers four layers 128 units 128 128 128 and 128 and then we need to specify the same for our value functions again four layers with 128 units in each neural network layer so you can see that there so vf equals and then inside of square brackets 128 128 128 and 128 now you get again you might only do this if you had a really specific reason what i've actually found is the neural networks inside of the baseline algorithms work pretty well but again this sort of shows you how you might go about doing that so let's go ahead and associate this to our model and then kick off our training again oh this should actually be net arch my bad there you go alrighty cool so that is our new neural network now associated to our ppo model and specifically we've gone and updated the policy so what i've gone and written is model equals ppo and then mlp policy because again we're using the multi-layer perceptron policy then our pass through our environment verbose equals one tensorboard log log path so again no no real change from here onwards but then in order to specify the new neural network and specifically the new policy i've written policy underscored kw args equals and then i've passed through a dictionary and then to that dictionary we've specified the net underscore arch value and then we've set that equal to this over here so netarch so again this is just defining a new neural network or a new neural network policy attached to our model now what we can do is again type in model.learn and again we can apply our eval callback 
to this as well so again if we run that this is now using our new neural network architecture and specifically our new policy another thing to call that is that inside of stable baselines you've actually got a few different policies so if you go to the documentation and go to custom policy network there's a whole heap of information on how to actually do this as well so you can define custom feature extractors so on and so forth so again pretty pretty cool what you can actually do with this and it can actually get really really sophisticated now in this case we've got our model looks like it's all training sufficiently and again our eval callback has kicked in so it looks like our episode length has hit 200. looks like we're all good there and we've stopped all righty now the next thing that we want to now do is actually take a look at how we might go about using an alternate algorithm so rather than using our ppo model which we've been using so far we might want to use a dqn for example so how might we go about doing that well first up we need to import the dqn algorithm so let's do that okay so that's our dqn now imported so i've written from stable underscore baselines import dqn and then what we can do is just go and use this in a really similar manner to how we used our ppo models we could actually copy this over paste it down here and all we really need to do is sub out ppo for dqn we're going to get rid of our policy keyword arguments and so this has instantiated our dqm model and again model.learn and we'll pass through total time steps set it to 20 000. and there you go so this is now training a dqn model rather than a ppo model so really really quickly that shows how to apply a different algorithm remember also in stable baselines you've got a bunch of different types of algorithms so you've got a2c ddpg dqn her ppo sac and td3 in stable baselines 2 there's even more algorithms in that as well but because that is now in maintenance mode i figured i'd show stable baselines three cool so that is our dqm model and now done now again to save and export these it's a very similar manner so we just type in model.save the only difference when you're loading from a dqn is that rather than typing in ppo.load you'd now type in dqn.load so that in a nutshell really covers how to add a callback to our trading stage how to change our policy as well as how to use an alternate algorithm now on that note that brings us to our projects so step six we're now going to go through our different projects so we've got three specific projects that we're going to take a look at so project one we're going to take a look at reinforcement learning for atari games and we're specifically going to be trying to solve the breakout problem then project two we're going to try to leverage reinforcement learning to build a racing car and this is sort of like almost going down the path of autonomous driving and then project three we're going to take a look at how we can build our own custom environments using the open ai gym spaces that we talked about a little bit earlier on so first things first let's take a look at reinforcement learning for atari games okay so inside of the github repository you're also going to have a couple of additional projects so you can have project one project two and project three so project one is gonna be breakout so if i actually go to the github repo so you can see you can have project one which is breakout project two which is self-driving and project three which is custom environment 
now in this case what i've gone and done is i've started off with project one which is breakout now again even though we're working on different environments how we actually go about training them is going to be very very similar so whilst we've spent a lot of time going through the basics in the main course how we actually go about applying this to our different environments is going to be pretty much the same so let's start off by importing our dependencies first up okay so i've gone and written six different lines of code there so these are all going to be pretty familiar to you from the previous tutorial so the first one that i've written or the first line that i've written is importing jim again no difference there then from stable baselines three i'm importing a2c this is just a different algorithm so again remember how we import a ppo and then we import a dqn now this time we're going to be importing a2c just a different algorithm then we're importing from stable underscore baselines three dot common.vec underscore emv import vec frame stack so remember how in the main tutorial we didn't vectorize our environment so we only trained on one environment at a time what we're going to do for breakout is we're actually going to train on four environments at the same time so this should allow us to speed up our training then what i've written is from stable underscore baselines three dot dot evaluation import evaluate underscore policy no change there again so we use that previously and then from stable underscore baselines three dot common dot env underscore util import make underscore atari underscore env so this line is a little bit different and just helps us work with the atari environment so atari environments are the environments that allow us to play atari game so if we actually go to the gym documentation and take a look at our environments you can see under atari you've got the ability to try out a lot of these games now we're specifically going to be training on breakouts so let's take a look at that one so it's going to look like this yeah it's actually this one so basically the goal is just to maximize the score that you can see up here and you've got a maximum number of lives as well actually this is your number of lives this is your number of scores so again the goal is to just maximize that score so there's no real cap that you can get to to completely solve the environment it's just about getting the highest possible score so let's go on ahead so what else would we write there and then import os so again this is going to allow us to work with our operating system now another thing that i wanted to call that so say for example we wanted to use gpu acceleration so i said i'd show how to do that well all we need to do is go back to our pi torch link and in this case i'm going to choose the build that i want so stable windows pip python and then i've already got cuda 11 installed on this machine so if you don't you're gonna need to do that in order for this to work so what i'll do is i'll just copy this link and then bring it into my notebook so i'll add in an exclamation mark paste that in and i just need to get rid of this three here so if i run that this is now going to install the cuda accelerated version of pi torch so then what typically what you need to do is just restart your kernel so just hit kernel and then restart hit restart and you should be good to go and then what we need to do because we restarted our kernel is just re-import those dependencies so once that's 
done we should be okay now as of late there has been a change to the entire environment so previously you used to just be able to import them and they used to sort of just work but now you got to do something a slight bit different so what you actually need to do is download the raw files from let me just grab this link you need to download the raw files from this particular link here so it's atarimania.com forward slash roms forward slash roms.ra so you can see that that's now downloading and i'll paste that in i'll make that available in the notebook as well so it'll be this so you can see http colon forward slash forward slash www.atarimania.com forward slash roms forward slash roms dot ra without the one you don't need that so that's going to download all of the files that you need to be able to work with the atari environment so you don't need to do that or you didn't need to do this previously but i think as of late this is a change to the entire environments that you need to do in order to use them so once that's downloaded you'll have a file called roms.ra so let's wait for that to download and then i'll show you what to do with it okay it looks like that has finished downloading let's go and take a look at it so you can see that we've got roms.ra so i'm just going to copy this and paste it into the same directory that i'm currently working with so you can see i've already got roms and i've got hc rom so these files so i can actually delete these and what you need to do is just paste in that roms.ra file and then unzip it so i'm just going to extract it and you can see i've now got hc roms and rom let me zoom in on this so i've now got hc roms and roms so what we'll then do is extract these into the same folder and so that's hc roms let's do roms as well and so once those have extracted there's just a command that you need to run in order to install these so it's pretty straightforward but once that's done you should be able to leverage the atari environment so let's let that finish and then we should be good to go and if you get a warning you can just skip those cool that's good so you should now so we can actually delete these now so we can delete hc roms roms.ra and roms so we don't need those we just need the extracted folders which you can see there okay so that's all well and good now we need to go and ahead and install those so if we go back we just need to run a simple command to go and install those into our environment so let's go ahead and do that and there you go so we're all good so the command to run it is exclamation mark python dash m and then atari underscore pi dot import underscore roms and then you just need to pass through the path to those roms so again if i actually show you so those files are inside of a folder called roms and then roms again so you need to point to this particular file path here so in this case what i've written is exclamation mark python dash m atari underscore pi dot import underscore roms dot backwards roms backwards rom so again if you're on a mac the file path might be a little bit different i believe it's forward slash rather than backward slash but you sort of get the idea so once that's done you should be all good to go ahead and test this environment so let's go ahead set up our environment and we'll actually take a look at it okay so that's our environment now made now if we type in emv.reset we should get our observation oh my bad so there you go so we've got our observations and if we type in uh what's the other one so emv dot 
action space i'll leave that env dot action underscore space so you can see our action space is discrete and we've got four different actions that we can take we can take a look at our observation space in v dot observation says you can see that our observation space is going to be a box with the values ranging from 0 to 255 and the dimensions are going to be 210 in terms of height 160 in terms of width and 3. so this means it looks like it's going to be an image which in this case it's an image based model now what we can do is we can actually go on ahead and test out this model so remember if we cast our minds back to step 6 where we actually tested out our model we can actually copy this block of code and run the same thing here so remember this is just going to go on ahead and test out our particular model actually this is the wrong code let's actually write it from scratch so what we can do is go through a number of episodes and actually play breakout so let's give this a crack okay so let's take a look at what we wrote so this code is really really similar to what we used in step two where we loaded and tested out our environment so again we're setting up the number of episodes that we want to play we're looping through each one of those episodes and then we're basically going on ahead and taking random actions on that environment to see what it looks like so if we run this now should get a little pop-up and you can see we're effectively playing breakouts again it went really really quickly if we didn't want to close our environment we can just comment at that last line and you're going to see it play now you can see it's sort of just playing randomly and it's not exactly getting the highest score so it's what capping added around two four it looks like the highest that it got was four now it gets a point for each block it breaks so you can see they've got one two three in that case we want to try to train a model that's able to play a little bit better now again training this model can take a long time so we'll give it a crack and if you want to take this further train it for longer let me know how you go in the comments below now what we're going to do here is a little bit different to what we did in the main tutorial because what we're going to do is we're actually going to vectorize our environment and train four different environments at the same time so let's go on ahead and test this out okay there you go so that's our environment at the moment so now if we type in emv.reset and env.render you can see that we're actually playing with four different environments at once so this means that when we actually go and train we're going to be training four environments at the same time so hopefully this should give us a little bit of a speed up now we can type in emb.close to close that down you can see it doesn't look like it's closed down in this case let's try that again sometimes it's not going to want to close down and you're just going to have this for shut it but i don't want to foreshadow it because sometimes it'll crash the kernel that's fine for now just leave it open so that is our environment now set up so all well and good now what we can do is actually set up our model to actually go ahead and train this so let's do that and kick off our training oh let's actually take a look at what we wrote to vectorize our environment i completely skipped over that so what i wrote was emv equals make underscore atari underscore emv and then to that we pass through the type of 
environment that we want to run so in this case the environment that we're actually running is breakout dash v0 so this is the breakout game that we're actually working with now a key thing to call that is if you actually take a look at the gym environments there's actually a breakout ram version and a breakout dash v0 version so this one is going to train using images this one is actually going to train using ram we want the image based model because we're going to be using a cnn policy so then to that we've also passed n underscore e and v's equals to 4 because we're going to use four environments at the same time and we're going to specify seed as zero to get some reproducible results then what we've actually gone is we've gone and stacked those environments together so to do that written emv equals vec frame stack so this is that wrapper that we imported up here and then we've passed through our environment and we've specified n underscore stack equals four so this basically stacks our environments together then we're going to go and specify models so let's go and do it okay so that is our model now set up so we've gone and written two lines of code there so first up we've written log underscore path equals os dot path dot join and then we'll specify training and log so again it's similar to how we set up our log path in the main tutorial then we've gone and specified our model so model equals a2c so remember we're just using a different model in this particular case so this is the different algorithm that we're using so rather than ppo or dqm then we'll specify the type of policy that we want to use this is a key differentiator so previously we used the mlp policy which is great for tabular data or tabular observations but because our image and specifically our observations are an image in this case our cnn policy is actually going to be a lot faster to train so we've specified cnn policy so this basically uses a convolutional neural network as part of the policy rather than just a multi-layer perceptron then we've gone and specified our environment which is coming from here specified verbose equals one because we want logging and we've also specified a tensorboard log path now what we can go and do is go on ahead and train this now we're going to train this for a little bit longer so what i might do is might stop the training if it runs for too long and then we'll actually load up one that i pre-trained and see how that performs but for now let's train this on about let's give it 200 000 steps so again if you want to get a really really high performing model you might need to train up to a million or even two million steps but let's give it a hundred thousand and see how long that takes ideally we should be able to get a score higher than four from what you can see up here which is just random actions okay so no real difference there so what i've written is model dot learn and then pass through total underscore time steps and we'll specify that as a hundred thousand so again no different to what we did for our previous tutorial where we're training a model so we're gonna run this and we'll be right back okay so you can see that our training is kicked off so again we'll let this train and then we'll be right back as soon as it's done so it's gonna train for a hundred thousand steps so we'll give it a little bit of time okay so that is our breakout model now finished training so you can see that after a hundred thousand steps we've got an episode reward mean or average episode reward 
of 5.84 and an episode length of about 479 frames so not too bad overall now what we want to do is save this model and reload it before we do anything else so let's go ahead and do that now this is going to be really similar to how we've saved models before so again nothing too crazy there so that is our model now saved so if we go and take a look so you can see that we've gone and saved a2c breakout model now i've also got an a2c model that's been trained for 300 000 steps for breakouts so if you wanted to take a look at that one i'll include that in the github repo as well so you can take a look and try that one out but in this case we are going to test out our own model so let's go on ahead and delete our model and then reload it just to make sure it all works okay so that's our model reloaded now again what we can do is use the evaluate policy method which we had over here to test it out so remember that the max score that we got after testing out our five episodes up here was four so ideally this model should ideally try to get a little bit better than that so let's try that out so what we're going to do is we're going to use evaluate policy and we're going to pass through our model environment and then the number of eval steps so we're going to do let's do 10 and let's do render equals true and let's take a look you must only pass okay so this is actually pretty common so when we actually go and evaluate we can only go and evaluate on a single environment now remember correctly when we went and created our environment right here we had four environments so we vectorized them and trained them a whole heap faster now what we can do is we can actually single this down and leverage our vectorized model on a non-vectorized environment or an environment that only has a single particular environment inside of it so let's go ahead and create one of those first up okay so we've now gone and recreated our environment now this time rather than passing through four environments in our make atari environment function we've only passed through one but we're still stacking it up as though that there's four environments so this is going to allow us to leverage it so now if we go and run our evaluate policy method you can see it should all be running well so you can see it's playing way better now it's looking like got a five six seven six they're still playing way better than the random agent was so you can see there that on average we're getting a value of 6.1 with a standard deviation of 1.9 now what we could also do is we could also test out that bigger model that i had in there so again i can't remember how well that was performing but let's go ahead and test that out so the model path to that model is going to be a2c 300k models we can copy that name and try loading that up so i'm just going to update the a2c path and then load that one and then recreate our environment you don't need to recreate it but then let's try this one out so if you get your environment sort of freezing like this sometimes what you might need to do is restart your notebook so you can see there that it doesn't look like it's opening up so ideally what you should do is just hit restart on your kernel for this but make sure you save down your model before you do this i'm just going to hit restart hit restart again this should ideally close oh we want to stop that yep so that's good that's from another kernel now what we're going to do is re-import our dependencies so just import that and then scroll on down and what we're 
going to do is define our a2c path load up that model we need to recreate our environment from down here then load up that model and then try it again so there you go so that looks like it's performing way better already and you can see that this model obviously it's been trained a lot longer but it's getting into the tens and possibly the 20s when it's actually playing so again the longer that you train the better that this model is actually going to get now you could also try using some of the recurrent policies but at the moment they're not implemented in stable baselines 3. i will let you know once they are in the pinned comment but that sort of gives you an idea as to how you can go about training a reinforcement learning agent for breakout now what we'll do is we'll just clean this up so if we type in enb dot close that'll close our atari environment it's just this one left so that is all well and good so we went and did a ton of stuff here so what we did is we went and imported our dependencies and we imported a couple of new ones to work with atari we also went and installed the atari roms so remember you've got to download them from atari mania and again i'll include this link in the description below we vectorize our environment so rather than using a single environment we trained on four so this gives us a bit of a speed boost and then we also went and trained it up and then we went and saved it and evaluated it at the bottom and i also showed you the 300k model which you can see here it was getting an average score of 12.7 with a standard deviation of five so overall it was better but there was a higher standard deviation so that sort of gives you an idea of what's possible by just training a little bit longer hey guys editing nick here before we jump over to the next project i wanted to let you know that i ended up training the breakout model for an additional two million steps just to see what it would actually take a look like now after training for around about two million steps what we ended up with is an average reward around about the 20 point range now this obviously is a markedly improved result over what we had in our original model so ideally you can see the impact of training for longer this is what it looked like so as i was saying i ended up training the model for a whole heap of additional time steps so all up i ended up training the breakout model specifically with the a2c algorithm for around about two million time steps now when i evaluated this model it looked like we were getting around about an average score of 21 which is again still way better than our random model still better than our 100 000 time step model so you can start to see the impact of trading for longer now i'll also make this model available inside of the github repository so if you want to test it out for yourself you can start to see what that actually looks like so the model name is a2c underscore 2m model so a2c trained for 2 million steps so what we can do is as per usual load this up into our environment and what we'll do is we'll load it into our a2c algorithm we'll create our environment that has a single frame at a particular time and then rather than evaluating for 10 time steps what we'll go on ahead and do is evaluate for 50. 
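just to pull all of the breakout steps together in one place, here's a rough sketch of what that whole workflow might look like as code — it's a sketch of what we walked through rather than a copy of the exact notebook, the folder and file names are just placeholder paths following the same convention as the tutorial, and the import location of make_atari_env can differ between stable-baselines3 versions:

import os
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.env_util import make_atari_env  # lives in common.cmd_util in some older versions
from stable_baselines3.common.evaluation import evaluate_policy

# vectorize: four copies of breakout running at once, stacked four frames deep
env = make_atari_env('Breakout-v0', n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)

# cnn policy because the observations are images
log_path = os.path.join('Training', 'Logs')
model = A2C('CnnPolicy', env, verbose=1, tensorboard_log=log_path)
model.learn(total_timesteps=100000)

# save and reload (the model name here is just an example)
a2c_path = os.path.join('Training', 'Saved Models', 'A2C_Breakout_Model')
model.save(a2c_path)
model = A2C.load(a2c_path, env)

# evaluation wants a single (but still frame-stacked) environment
eval_env = make_atari_env('Breakout-v0', n_envs=1, seed=0)
eval_env = VecFrameStack(eval_env, n_stack=4)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, render=True)
print(mean_reward, std_reward)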
so what i'll do is i'll run this and then i'll leave you to it so you can start to see the boost in performance and then we'll come back at the end and take a look at what our end score was so again we're going to run it for 50 evaluation episodes so let's go on ahead and do this and you can take a look so already you can start to see that this is performing way better we're clearing the tens we're clearing the 20s and every now and then the score is hitting 30 so again way better than what we had in our previous models but i'll leave you to it so you can enjoy the performance and take a look at how it's actually running okay so that is 50 episodes now done so it looked like we actually cleared 50 around about halfway through there so again significantly better performance than our other two models now if we take a look at our scores you can see that our mean reward over 50 episodes is 22.22 so again way better than the other models that we were taking a look at and overall our standard deviation was 9.1 so again way better than before and ideally this begins to show you what is possible when you go and train your model for a little bit longer on to our next project now that is project one done and we're on to project two reinforcement learning for autonomous driving so for this particular environment we're going to be using the racing car environment so this is trying to get a car to drive around a randomly generated race track so let's go on ahead and take a look at this one so again it's still the same five steps that we're going to be going through but in this case it's a slight bit different in terms of how we're going to set it up now the first thing to note is that in order to leverage the racing car environment you do need to install swig so just take a look at installing swig and it's going to vary depending on whether you're installing on a windows machine or a linux or mac machine so for windows i believe all you need to do is download the swig zip extract it and add it to your path for a mac i believe all you need to do is use homebrew to install it let's take a look yep so you can actually use homebrew so brew install swig so way easier if you're doing it on a mac cool so once you've got swig installed again for windows you just download it extract it and add it to your path and you should be good to go for mac you've just got to use brew install but if you need a hand with that hit me up in the comments below then what we're going to do is install two new dependencies so we're going to need the box2d extras for gym and we're also going to need pyglet so let's go ahead and install these we've typed in gym wrong there let's fix that okay all good so what i've gone and written is exclamation mark pip install gym and then inside of square brackets box2d so when using the racing car environment you need to have box2d installed that's what the racing car environment is built on top of and then we're also installing pyglet so again this is a dependency of that particular environment once that's done all you need to do is import your dependencies so let's go ahead and do that alrighty so we've gone and written five lines of code there so the first line is importing open ai gym as per usual so import gym then the second one is from stable underscore baselines three import ppo
next one from stable underscore baselines three dot common dot vec underscore env import dummy vec env so again this is really similar to what we did over here in our main tutorial so again we're going to be wrapping up our environment exactly as we had down there then the next line is importing our evaluate policy function and then last but not least we're importing os alrighty now the next thing that we're going to do is test out our environment as per usual so let's go ahead and do that okay that is our environment created so this is just a warning so you can sort of ignore that now what we can do is take a look at our environment again so emv dot reset so you can see that this is generating our track and we'll talk about that a little bit more we can take a look at emv.actionspace as per usual and you can see it's going to be a box and we've got a three different values between minus one and one if we take a look at our observation space you can see again it's going to be a box and it's going to be values between 0 and 255 and it looks like it's going to be an image so 96 by 96 by 3. so this means that we're going to have an image to be able to go ahead and train our racing environment now if we type in emv.render we can take a look at the environment itself you can see it's not popping up let's bring it up that's sort of what our racetrack looks like so you can see that we've got the entire racetrack there now we're actually going to test this out so we can type in emb.close so env.render should probably talk about this a little bit more so env.render allows you to render the actual environment that you're working in so this is an optional thing you don't need to render it does slow down training if you're rendering while you're training but it gives you the ability to see your agent in action so we can type emv.close to close that down so that should close down that environment so that's the old one cool now what we're going to do is we're going to go ahead and test out our environment again so again we can just copy this from our breakout tutorial so what we can do is just copy this testing code and bring it in and again this is almost identical right so we can uncomment our emv.close and go ahead and run this so this is going to test out our environment so you can see that we're actually trying to drive this car along this racetrack now ideally you're going to get more points the longer it stays on inside of the track and the more turns it does now because this is just taking random actions it doesn't actually know where the track is at the moment so it's just going to go straight you can see it's making a lot of movement it's kind of performing okay but it's not able to take the turn so the first time it gets up to that turn it's failing right all right we don't need to watch so you sort of get the idea so the goal is to get this car to go around the racetrack now we can actually stop this and hit in vr.close to close it this just sort of cleans it up that's fine we can leave that one open and now what we're going to do is go ahead and train our model again so again same sort of process this time we're going to be using the ppo algorithm so rather than using the other algorithm we're going to use a slightly different one so let's go ahead oh so rather than using a2c like we did for breakout we're going to use ppo here so let's go ahead and train up our model we should also take a look at what the different actions are so if we take a look uh what can we do mv doesn't look like 
we've got it so let's search for the car racing environment on open ai gym and take a look at what we've actually got here and whether there's any documentation on the different actions it doesn't look like it so again sometimes you're going to get better explanations of what's actually happening in different environments than others it looks like we've got something here though so the reward actually this is useful let's make this a bit bigger so the reward is negative 0.1 for every frame and plus 1000 divided by n for every track tile visited so this means that for every track tile visited you're going to get plus 1000 divided by n where n is the total number of tiles it's a slightly complicated reward function but you sort of get the idea you lose a little bit every frame and you get points for staying on the track and visiting as many tiles as you can so the game is solved when the agent consistently gets 900 plus points so again this is going to take some time to get to that point and remember it's a powerful rear wheel drive car so don't press the accelerator and turn at the same time some indicators are shown at the bottom so we've got the true speed four abs sensors the steering wheel position and the gyroscope so again pretty cool we've got a whole bunch of information here now you can see that this one does have a defined threshold which dictates whether or not it's solved so ideally you want to get over 900 points and that's going to take a long time i trained for 438 000 steps and i think i was getting in the realm of 40 50 sort of points so again it can take a while to solve now what we're going to do in this case is try to solve it and see how we go so let's do that so we're going to go and train our model we're going to instantiate our environment and then go ahead and train it okay so that is our environment now set up so we've gone and written env equals gym dot make and then the environment name and then again we're wrapping it inside of our dummy vectorized environment wrapper we're not actually going to vectorize this one across multiple environments it's pretty similar to what we did in the main tutorial then what we can go and do is set up our agent and our model so let's do it okay so that is our model set up so again we've gone and specified our logging path and this is where we're going to log out our tensorboard logs so we've written os.path.join and then specified training and then logs so that's going to be our directory and then we've gone and specified our agent so model equals ppo so we're going to be using the ppo algorithm here and then we pass through cnn policy pass through our environment pass through verbose equals one and specify the tensorboard log path now again we're only going to train for 100 000 steps you might want to train for a whole heap longer if you want to try to hit that 900 score and if you do hit it do let me know in the comments below and share it out on twitter and linkedin i'd love to see it so in this case we're going to go ahead and train our model for a hundred thousand steps so to train our model we're just going to write model.learn so you're going to start to see there's a repeatable process to this you instantiate your environment you create the environment vectorize it if you need to and then set up your model
and then go ahead and train it so in this case we're going to go ahead and train it so let's go ahead kick this off and we will be right back so let's just wait for the training to kick off successfully and there you go so you can see that we're starting to get our output from our algorithm so we'll let that train for a hundred thousand steps and we'll be right back okay so that is our self-driving racing car now train so again we've gone and trained it for a hundred thousand steps 100 352 to be exact now again as per usual what we're going to go ahead and do is save our model and then test it out so let's do it okay so that's our model now saved so what we've gone and written is ppo underscore path to set up our path variable so in order to do that written os dot path dot join and then specified that we want it in our training folder and then our saved underscore mod or saved models folder and then we're gonna name it ppo driving model and then again we've used model.save to be able to go and save that down so if we now take a look we've now got this model here ppo underscore driving underscore model now i've also got another model that i trained for 428 000 steps so we'll take a look at that one as well but for now let's go ahead and delete our model as per usual just to double check this all works and then let's load it back up so model equals ppo dot load and then we're going to pass through our ppo path and our environment cool so that's all good now what we can do is go ahead and evaluate as per usual so let's go ahead and do that so we're going to pass through evaluate underscore policy and then we're going to pass through our model our environment and then the number of steps so we're going to do 10 steps and then we are going to pass through render equals true so it looks like our car's ripping it around the track doing a bunch of donuts so a key thing to note is that the car is high powered so it doesn't always so you can see it's going around the corner but it's having a bit of trouble sort of getting there this is great so you can see that because the car is so high powered it starts to lack traction so in this case it can get stuck in this loop whereas instead of actually driving forward it just goes and does donuts okay so it sort of gave up there got around the first corner up spinned out okay i think it's just gonna keep doing this so all right so that's sort of what a hundred thousand steps gets to you so not exactly the best uh racing car driver just kind of all right cool so let's stop this because it's clearly uh it's driving me and saying that it's just going all over the place and we're just gonna type in env dot close all right so that's now closed we've only got that one open and rather than using that one let's go ahead and load up the one that i trained for i think it was 438 000 steps 428 so let's go ahead and load up this model and again i'll make this model available in the github repo so i'm just going to change the name of the ppo path and then load this one up and let's go ahead and test with this model so you can see it is a lot slower now but it is at least sort of sticking to the track it's just cut the corner it's fine it's back on so you can see it's getting the score that it's getting is much higher so it's we're up to 190 200. 
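before we keep watching it drive, here's roughly what that whole car racing flow looks like stitched together as code — again just a sketch, the saved model name is a placeholder following the tutorial's convention, and it assumes swig, gym[box2d] and pyglet are already installed:

import os
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

environment_name = 'CarRacing-v0'
# a single environment wrapped in a dummy vec env, no parallel copies this time
env = DummyVecEnv([lambda: gym.make(environment_name)])

log_path = os.path.join('Training', 'Logs')
# cnn policy because the observation is a 96x96x3 image
model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=log_path)
model.learn(total_timesteps=100000)

ppo_path = os.path.join('Training', 'Saved Models', 'PPO_Driving_Model')
model.save(ppo_path)

model = PPO.load(ppo_path, env)
print(evaluate_policy(model, env, n_eval_episodes=10, render=True))
env.close()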
so ideally you want to be able to get up to 900 so that means that it's going to have to accelerate off the turn so this is obviously trained for 438 000 steps so if you actually train for a lot longer you'd get a car that's actually rips it down the straights takes the corners appropriately in this case it's starting to sort of it's a little bit hesitant on the throttle which you can see there so it's it's going but it's maybe not going as fast as it could there you go it's just accelerated it's back on track there you go but you can see that this is obviously way better than the one that we trained for 100 000 steps so that sort of shows you the difference the training for a lot longer and i did nothing different apart from just trained for a longer period of time so again when you're training these reinforcement learning agents training for a longer period of time can obviously help you out a lot more and ideally produce a much better model so ideally for this i'd be looking at something in the realm of like a million to two million steps to be able to get something great so if someone does have the time and if you do run it for that long by all means do let me know if you'd like to see me do it do hit me up in the comments below i'd love to take a look at this again but for now that is our self-driving car sort of done it is a little bit glitchy on the throttle but you sort of get the idea now we can go ahead and close this so stop that environment and then run emb.close to close it now again remember in our main tutorial we also went through the ability to test it like this so rather than going through and using the evaluate policy we could also do this as well so if i copy step 6 from the main tutorial we could actually plug this in so in this case we've got our environment that's all good our model that's all good we could actually just test this out so let's run it there you go so you can see rather than using evaluate policy we're now using the i know what do you call it flow to be able to go and train this and you can see the car is getting around turn so we're up to what 250 now god i don't know how you can watch this after too many times it does get a little bit glitchy but it's going around its corners it's moving around and keep in mind we've only trained this on the image right so like we don't have any additional information but the image that's actually coming out of this which is actually pretty cool right let's get oh it actually took the corner that's pretty cool come on 270. not bad so again the the max score to consider this solved is 900 so ideally you'd want to train it for a lot longer to be able to get this performing way better and you can actually see our scores being logged down here so 255 181 276 it just got 214. 
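and just so it's written down somewhere, this is roughly the testing loop being used here instead of evaluate_policy — the same step-through loop from the main tutorial but with the trained model picking the actions rather than random sampling; it assumes env is the dummy-vec-wrapped car racing environment and model is the loaded ppo model from above:

episodes = 5
for episode in range(1, episodes + 1):
    obs = env.reset()          # vec envs hand back a batch of observations
    done = False
    score = 0
    while not done:
        env.render()
        action, _states = model.predict(obs)   # the model chooses the action now
        obs, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()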
but this is obviously way better than what we had in that random agent which was getting like negative values also i noticed that if it v is too far off the track and it's not able to see the track anymore it sort of gets stuck and just stops there this is on the 438k model but again you could train it longer and you'd get better results so that sort of gives you an idea as to how you can leverage reinforcement learning for autonomous driving and in this case racing hey guys editing nick here again so i also ended up training the self-driving tutorial for a whole heap more steps now again i trained this model for about two million steps and this significantly improved the performance of this particular model so in the actual tutorial we got around in the range of about 200 to 300 in terms of our reward estimate now when i went and trained it for a whole heap longer we were heading towards the range of around about 700 not quite completely solved based on the environment metrics but ideally you can see again it's performing a whole heap better this is what that looked like so as i was saying i ended up training the self-driving model for a whole heap longer all up i ended up training it for two million additional steps now the reason that i wanted to show you this is just so you can see the impact of training for longer so this is obviously one technique that you can go about leveraging in order to improve the performance of your models so again all up two million steps and i'll make this model available in the github repository so you can pick this up so if you class your minds back we had three different models all up now so we trained the first model which was just doing burnouts it's not really making it past the first corner we had the second model which was super jittery and then we've got this model now as well so i'll make this one available so in order to load this one up all i needed to do is again similar to what we did for our previous models i can just load it using the ppo path and model.load or ppo.load and then we can go and run this model now what you should see is that this model performs a whole heap better than the previous models it'll still spin out on a corner every now and then but again it's getting a lot further and scoring a lot higher than those other models that we train so let's go ahead and take a look at this one [Music] so you can see there it got up to about 800 so not too bad so every now and then it's gonna lose focus and sort of veer off the track but you can see it's performing a whole heap better than the other models that we had trained so it'll spin out but then it works its way back to the track and it eventually starts taking the corners pretty well again so every now and then you'll see one that performs not so well but you can see that this is performing significantly better than what we had before so again getting into the 700 range there we go that was another 770 score so you can see down the bottom again when we're evaluating our model what our performance is looking like so if we bring that a little bit further up [Music] and open it up it doesn't look like we're printing out so let's let these 10 episodes run and then you'll eventually see the total score or the automated sum of the results so i'll be quiet now and i'll let you enjoy this [Music] [Music] [Music] and that is all 10 episodes complete so you can see that we had a significant boost in terms of our performance simply by training for longer now if we actually take a look at the 
results of our evaluate policy you can see that our final score down here our average score over 10 episodes was 741 so again not quite hitting that golden 900 mark but again still way better than what we had in our previous models this also had a standard deviation of about 123 so again a reasonably high standard deviation in this particular case but this sort of shows you what's possible when you ideally go and tune your model and train for a little bit longer on to our next project now on that note that is project two now done now the last project that we're going to be taking a look at is reinforcement learning for custom environments now if you've watched my shower environment or shower custom environment tutorial this is going to be that same environment but we're going to be using stable bass lines as the algorithms to be able to solve this so without further ado let's kick off project three so again all of these notebooks are going to be available inside of the github repository so if you want to pick these up by all means do grab them let me know how you go with them and if you get stuck please do reach out to me i'm more than happy to help so let's go on ahead and do this so there's a bunch of dependencies that we're going to be importing here namely because we're going to be defining our own environment in this case so let's go ahead and import these dependencies and then we'll take a step back and take a look at those okay so we've gone and written nine lines of code there so there's quite a fair bit now in this case what i've gone and written is i've broken it up into three specific sections so these are our gym dependencies or open ai gym dependencies these are some of our helpers that we're going to need and then down here we've got our stable baseline stuff so first up from jim we're importing more than just the gym package this time so written import gym which is going to give us a pretty standard import then what we're doing is we're importing the gym environment class so to do that we've written from gym import env so this is going to be the super class that we're going to be able to use to build our own environment then what we've written is from gym dot spaces import discrete box dict tuple multi binary and multi-discrete so each one of these represents all of the different types of spaces that are available inside of openai gym so i wanted to sort of show you what each one of these different types of spaces looks like and how to actually use them we'll probably only use the two common ones discrete and box in our environment but i wanted to give you an idea as to how these all fit together then we've gone and brought in some helpers so we've imported numpy so import numpy as np we've imported random so import random we've imported operating system so import os and then we've gone and imported our standard stable baseline stuff so from stable underscore baselines three import ppo from stable underscore baselines three import common dot vec underscore emv import dummy vec nv again pretty standard and then we've imported our evaluate policy function so again this is really really common most of this is pretty common and jim is pretty common the new stuff is these couple of lines here so let's go and have a look at our different types of spaces so we've got a bunch that we've imported over here let's take a look at each one of these so first up is discrete so we can dive in discrete and then say we wanted three different actions we can do that so that's going to give us 
our discrete space now we can actually sample it and take a look at all the different types of values so again 0 1 and we should get up to 2. so you can see this is going to give you a value between 0 1 and 2 by passing through discrete equals 3. so if you had an action and an action mapped or each one of those actions mapped to a specific value so 0 1 or 2 that's how you'd use that then we've got a box so that's our box space so to do that we're in box and then we'll pass through zero and then comma one so this is our low value this is our upper value and then this this is the shape of the output that we're going to get so we're going to get an array that's three by three so ideally you'll have a list of lists so if we take a look at that by sampling it so again you've got an array and inside of that array you've got three individual rays and those arrays have three values so again exactly the same formatting so we've got those values between zero and one so again you might use this if you were trying to look at different types of sensors or if you had continuous variables you'd use a box now again if you just wanted three values you could do that and that's just going to give you three values as well so all i've done is i've reduced it into an array of three values then what we've got is a tuple now at the moment stable baselines doesn't support a tuple but if you wanted to use it you could still take a look so to that we can pass through a discrete environment or a discrete space and really your tuple space just allows you to combine different spaces so if we do that you can see our tuple is now combined of our discrete environment and our discrete box space so if we sample it you can see we're getting a discrete value first up and then we're getting our box second okay on to our next one so again so far we've done discrete box and tuple next one that we want to take a look at is dict so this is really similar to a tuple the only difference is that rather than passing through a 2-port or tuple you pass through a dictionary so let's do it okay so that is our dict space so what i've written is dict and then open braces and then to that we've actually passed through this dictionary here now this dictionary has two keys so height and then height is equal to discrete two so really it's no different to typing discrete two up here and then we've created another key which is speed and we've set that equal to box and then two box remember you're going to pass through three key arguments so you're going to pass through your lower value your upper value and then the shape that you want so in this case i've specified a shape of 1 comma which means i'm only going to get a single value back between the values of 0 and 100. 
so if we actually go and sample this you can see we've got our height key which is represented as 0 because remember it's going to be between 0 1 and 2 and then we've also got our speed which in this case is a value between 0 and 100 pretty cool right so that gives us a dict space now the next space that we want to take a look at is multi-binary so in order to create that space we've just written multi binary and then passed through the number of positions that we want in our multi-binary space so multi binary 4 is going to give us 4 positions so we can go and sample it and you can see that we've got four positions and it's going to be a binary set of values so either zeros or ones so if we go and sample it multiple times you can see it's just different combinations of zeros and ones in those four positions cool now the last type of space that we want to take a look at is multi-discrete so again it's going to be pretty similar to multi-binary except rather than being binary values they're going to be discrete values between any values that we want so let's go ahead and do it okay so that is our multi-discrete space so to do that i've written multi discrete and then to that i've passed through a list and the values passed are five two and two so if we go and hit sample you're going to get three different values and these values are going to vary depending on what parameters you've passed through in the list so because i've passed through 5 the first value is going to vary between 0 and 4 and because i passed through 2 the next one is going to vary between 0 and 1 let's actually take a look at what the max is going to get up to yeah so it's going to be zero to four and then zero to one and then zero to one again so the number you pass through is the upper cap and it starts at zero cool so we can keep going through that and you can sort of see what happens and if we go and pass through another value in the list we're just going to get another discrete value appended onto the end of that so that's really a summary of all the different types of spaces that you've got available inside of open ai gym so you've got a discrete space a box space tuple dict multi-binary and multi-discrete so discrete is when you'd have a discrete number of actions and those map through to a single integer box gives you continuous variables remember with your box you just pass through your lower value your upper value and then the shape of the box that you actually want your tuple allows you to combine different types of spaces together but it's not currently supported by stable baselines so something to keep in mind so remember to your tuple you just pass through a set of brackets and then the two different spaces or whatever types of spaces you want so we've passed through discrete here and then box here as well then we've got our dict space so again to our dict we just pass through a dictionary of different types of spaces so we've got our discrete and we've also got our box space there then we've got our multi-binary space now keep in mind you could actually grab that multi-binary space and add it to your tuple as well so if we do that there you go we've now gone and added another type of space to our tuple and we could do the same with the dict so say for example i could call this one color i don't know so again you can add multiple spaces to the tuple or the dict they're sort of like grouping spaces right
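here's the whole tour of spaces condensed into a few lines you can paste into a notebook and sample — the height and speed keys are just made-up example names:

from gym.spaces import Discrete, Box, Tuple, Dict, MultiBinary, MultiDiscrete

print(Discrete(3).sample())                      # 0, 1 or 2
print(Box(0, 1, shape=(3, 3)).sample())          # 3x3 array of values between 0 and 1
print(Box(0, 1, shape=(3,)).sample())            # just three values between 0 and 1
print(Tuple((Discrete(2), Box(0, 100, shape=(1,)))).sample())   # combined spaces (not supported by stable baselines)
print(Dict({'height': Discrete(2), 'speed': Box(0, 100, shape=(1,))}).sample())  # named spaces
print(MultiBinary(4).sample())                   # four positions of zeros and ones
print(MultiDiscrete([5, 2, 2]).sample())         # 0-4, then 0-1, then 0-1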
so we've got multi-binary and then to that you pass through the number of positions that you want in your binary space and we've also got multi-discrete and this gives you a bunch of different discrete types of values so again there's not a lot of documentation out there on these so i figured i'd do a little bit of a crash course on them if you'd like to see more on that by all means do hit me up in the comments below but now we're going to be building our own environment now the goals of this environment are to basically build an agent to give us the best shower possible now what's going to happen is randomly the temperature is going to fluctuate because there's other people in the building so it's going to randomly go up and down now we know that our optimal temperature is between 37 and 39 degrees so we want to be able to train an agent to automatically respond to changes in temperature and get it within that 37 and 39 degrees range now keep in mind that our agent doesn't actually know that we prefer our temperature to be within 37 and 39 degrees so it's going to need to learn what types of adjustments it can make to get it to within that range just something to keep in mind so remember this is a simulated environment so our agent doesn't actually know how it accumulates its reward it just knows that by doing certain actions it's going to get a reward now we know it we want to get it between 37 and 39 but our agent doesn't so let's go ahead and build this environment so there's a few different functions that we need to implement in this environment to get it valid so let's go ahead let's build a shell and then we'll fill it up okay so at a high level that is our shower environment now we obviously haven't gone and implemented the different components into it but these are the four key functions that you need to have inside of your shower environment class so let's take a look at what we've got so far so what i've gone and read in this class and then inside a capitals or camel case i've got shower env and then to that we're passing through our env class which is from our gm environment up there and then colon then we've got four different functions that we've gone and implemented so we've got the init function so which triggers when we create our class the step function the render function and the reset function so to do that we're in def underscore underscore init and then two that were passing through self inside a pair of brackets and then a colon and then right now we've just written pass this allows us to define it without having any errors for now then we've defined a step function so def step and then to that we'll pass through self and then we're passing through the action that we're actually going to pass through to our environment so remember when we use env.step we pass through our action and it actually does something then we've gone and defined a render function so def render and then two that will pass through self and then colon and then pass so we're not going to do anything in our render function for now i actually as part of building this course i actually built out a giant or started building out a giant pie game environment but it was sort of blowing out of proportion so if you'd like to see a video on reinforcement learning for gaming which involves building a custom environment using pi game please do let me know in the comments below i'd love to hear your thoughts and then our last function is reset so def reset and then to that again we're going to be passing 
through self a colon and then right now we've just got pass so let's go ahead and initially set up our init function and then we'll keep going okay that is our initialization function now done so we went and wrote four lines of code there so first up we defined our action space so to do that we wrote self dot action underscore space and we set that equal to discrete three so remember this is no different to saying discrete three and the three actions that we're going to have are whether we turn the tap up whether we turn the tap down or whether we leave the tap unchanged so this basically gives us three discrete actions now you could change this and have it as a box type action space where you actually turn the tap by a certain amount or a certain number of degrees but in this case we're going to keep it pretty simple and say up down or hold then we've gone and defined our observation space so to do that we've written self dot observation underscore space equals box and then we've set it equal to two numpy arrays so in this case we've got a low value so low equals np dot array and then pass through zero and then we've specified a high value so high equals np dot array and to that we've passed through the number 100 so this means that our observation space let's actually extract that so we can take a look so this means that our observation space is going to be a single value between 0 and 100 with a shape of 1 so if we type in dot sample you can see that that's the kind of value that we get back now we can actually change this so we can just make it 0 and 100 and pass through shape equals 1 comma and that should ideally give us the same type of output so delete that so there you go same sort of output and if we type in dot sample again we're going to get the same type of output so again two different ways of defining it in this case i've just swapped it out but you can see that you've got multiple ways of defining that box space then we've gone and defined our initial state so we'll set that equal to 38 plus a random integer between -3 and 3 so this means that our shower is going to start out at 38 degrees plus or minus 3 and the goal of our agent is going to be to get it within that magic range of 37 and 39 degrees now we've also set another variable which is going to effectively represent our episode length in this case it's our shower length so we're only going to shower for 60 seconds it's a fast shower i know so what we've gone and written is self dot shower underscore length equals 60.
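as a quick reference, here's roughly what that init function looks like written out as code — a sketch of the same idea rather than a character for character copy of the notebook:

import random
from gym import Env
from gym.spaces import Discrete, Box

class ShowerEnv(Env):
    def __init__(self):
        # three actions: turn the tap down, hold, or turn it up
        self.action_space = Discrete(3)
        # the temperature is a single value between 0 and 100
        self.observation_space = Box(low=0, high=100, shape=(1,))
        # start somewhere around 38 degrees
        self.state = 38 + random.randint(-3, 3)
        # the shower runs for 60 seconds
        self.shower_length = 60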
so what we're going to do inside of our step function is decrease that by one second every time we go through and take an action so let's go on ahead and now define our step function okay so we've now gone and filled out our step function so in this particular case what we've got is let's say one two three four five six six different code blocks so the first one is applying the impact of our action on our state so remember we had three different actions so zero one or two so zero is going to represent decreasing the temperature of our shower by one degree one is going to represent no change and then two is going to represent increasing the temperature of our shower by one degree so in order to do that reasonably simply we've written self dot state plus equals the action minus one so remember our action is going to be discrete what was it three so if we go and sample that so in this case we've got one so by minusing one we're going to get the value so actually let's actually print it out so in this case if we take the action of one that is going to be akin to leaving the temperature the same so if we minus one again it's going to apply zero change to our temperature so self.state is going to stay the same if we get a different value say for example we get 2 by minusing 1 which is what we're doing here we're going to increase the temperature of the shower by 1 degree and if we get 0 we're effectively going to be subtracting one cool so now the next thing that we've gone and done is then decrease the shower time so every time we take a step or take an action we're going to decrease the length of our shower by one so remember we defined it up here initially to 60 seconds you could change this to something different if you wanted to so we've gone and defined self dot shower underscore length minus equals one so that's going to decrease it and then this is really really important so this is where we actually define our reward so again if you had a really complicated reward schema this is where you'd be doing it so what we've gone and done is we've written if self.state so remember state is our temperature is greater than or equal to 37. 
so remember the magic range it's got to be between 37 and 39 degrees so if self.state is also less than or equal to 39 degrees then the reward is one and in all other cases so say for example if our shower is completely out of that range the reward is going to be negative one now you could also make that reward zero as well then what we're also doing is checking whether or not the shower is done because if our episode is done then we want to stop that particular episode so if self.shower underscore length is less than or equal to zero then done is set to true let's fix that we've gone and screwed that up then else done equals false so again if we haven't fully consumed the 60 seconds then the shower is not done then we're creating a blank info dictionary so if we wanted to pass additional stuff out of here we could do that in there and then out of this we're returning our self.state which is going to be our temperature our reward for that particular step whether or not we're done and our info so again out of this we're returning all of the bits of information that we've gone and calculated in our step function now in this case our render function we're not actually going to do anything in there we could implement this if we wanted to but we're not going to do anything there then the last function that we need to implement is our reset function key thing to note is that if you wanted a more detailed tutorial on how to implement render and again pygame i'd love to do something on that so if you've got a specific use case hit me up in the comments below because i'd love to hear some ideas as well in this case let's go ahead and wrap up this environment so for our reset function we just need to reset our initial temperature and we also need to reset our shower time to 60 seconds so let's go ahead and do it okay i think that is our environment now done so for our reset function what we're effectively doing we could actually potentially drop this self.state up here because we're re-initializing it inside of our reset but that's fine for now so what we've gone and written is self.state equals np dot array and then to that we're passing through our same random initialization function so inside of a set of square brackets we've passed through 38 degrees plus random dot randint between -3 and 3.
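and pulled together, the whole environment ends up looking something like this — it's a sketch of the logic just described, with the 39 degree cutoff written as inclusive, so adjust that if you'd rather use a strict less-than:

import random
import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

class ShowerEnv(Env):
    def __init__(self):
        self.action_space = Discrete(3)
        self.observation_space = Box(low=0, high=100, shape=(1,))
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60

    def step(self, action):
        # apply the action: 0, 1 or 2 maps to -1, 0 or +1 degrees
        self.state += action - 1
        # one second of shower time used up
        self.shower_length -= 1
        # reward of 1 inside the 37-39 band, -1 everywhere else
        if self.state >= 37 and self.state <= 39:
            reward = 1
        else:
            reward = -1
        # the episode is done once the 60 seconds are consumed
        done = self.shower_length <= 0
        info = {}
        return self.state, reward, done, info

    def render(self):
        # nothing to visualise for this one
        pass

    def reset(self):
        # reset to a fresh random temperature and a full 60 seconds
        self.state = np.array([38 + random.randint(-3, 3)]).astype(float)
        self.shower_length = 60
        return self.state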
so again you could choose a broader random initialization if you wanted to in this case i've just chosen three and then we've specified as type float because remember our box is going to be we haven't specified that it's going to be an integer we could do that as well so you could specify d types in this case we're going to leave it as a float now what we're also doing is we're resetting our shower length to 60 so self dot shower underscore length equals 60 and then we're returning our self dot state so that should be our environment all well and good now now what we can do is we can actually test out this environment so emv equals shower env and then we can run inv dot observation space as per usual and you can see we're getting our box back and if we type in dot sample this is our initial temperature and if we keep doing that you can sort of see that there and we can type in emb dot action space and again we've got our discrete space dot sample and there you go so you can see that we've gone through the breakout tutorial the self-driving tutorial and you're probably thinking how these space is built well this is exactly how they're done when you're dealing with gaming implementations there's a lot more work done around the render around the observation space as well as around the reward space because it's a little bit more sophisticated and again if you'd like to see that done let me know i've started i've already got the template code sort of built love to do your tutorial if you guys are interested now in this case what we're actually going to go ahead and do is test our environment and train it as per usual so again what we can go ahead and do is let's just copy the exact same testing code that we used for our driving tutorial which is this here and this is under step two test environment we can actually paste this here now the cool thing is that we've actually gone and defined an environment to a state that we can actually use it as part of a template code so if we go and run this you can see that it's automatically gone and smashed through all of those episodes so it's got our score printed out now remember our score is going to increment by one if we've got the shower between 37 and 39 and it's gonna decrease by one if it's outside that range so you can see that just by running those five episodes we've got a high score of 26 and the lowest of minus 60. 
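for reference this is the testing block being reused here — the same random-action loop from the earlier projects just pointed at the custom environment:

env = ShowerEnv()

episodes = 5
for episode in range(1, episodes + 1):
    state = env.reset()
    done = False
    score = 0
    while not done:
        env.render()                        # does nothing here since render is just a pass
        action = env.action_space.sample()  # random tap adjustments
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()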
so again we can keep running this and you can see it's very quick and this is because we don't have a sophisticated render function and it's just it's all text based so again it's going to go very very quickly now in this case what we're going to go ahead and do is train our model and then save it so you can sort of see how to do this on a custom environment so let's go ahead and do that okay so we've gone and initialized our model and it's automatically wrapped inside of the dummy vec nv so even though we've gone and imported it over here looks like it's automatically wrapped so we're good to go so we've written log underscore path equals os dot path dot join and then true that will pass through training and then log so remember it's going to follow that same sort of logging directory that we set up and then we specified model equals ppo pass through the policy that we want to use in this case mlp policy this is different because in our breakout tutorial in our self-driving tutorial we had an image returned now we've got sort of tabular data or tensorbase data or actually well is sort of the same but we've actually got a array of values rather than an image so we're going to use the mlp policy then we're passing through our environment specifying verbose equals one and then specifying a tensorboard a log path cool now the next thing that we need to do is just go on ahead and train so again this is going to train really really quickly so you don't need to do a super long training run so let's just try it out so model dot learn to train and then total time steps i don't know let's set this to 4000 for example so let's go ahead kick this off and let it train so this should train reasonably quickly because again it's using an mlp policy and it's all tabular data i mean you can see the frames per second is five and that that's actually done so it literally went that quick so let's actually run it for longer so rather than 4000 let's give it 40 000. 
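and here's roughly what that training setup looks like as code — mlp policy this time because the observation is just a number rather than an image, with the same placeholder log folder as before:

import os
from stable_baselines3 import PPO

log_path = os.path.join('Training', 'Logs')
env = ShowerEnv()

# mlp policy: the observation is a single temperature value, not an image
model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=log_path)
model.learn(total_timesteps=40000)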
all right so you can see that that's now running it's doing about 600 frames per second so really really quick and that's one thing to call out so game environments are going to take longer to train versus a simple environment like this so the more sophisticated the environment the longer it's going to take to train just keep that in mind when you're planning your projects and when you're committing to clients when you're building this type of stuff and again if you need help by all means do reach out i'm more than happy to help you can start to see that we're getting our episode reward mean so in this case minus 28.2 then minus 25.9 so it looks like it's improving and should ideally get into the positives minus 21.7 minus 14.9 it's getting close minus 16.9 and again the episode length mean is going to be pretty much the same all the time it's going to be 60 seconds because that's the maximum remember i might need to train this for a little longer it looks like it's getting close but it's not into the positives yet all right it's minus 15.5 let's actually give it another i don't know 20 000 steps and run that so minus 4.12 so again you can see it's starting to get close to the positives minus 5.18 9.8 let's let that run and we'll be right back okay so it got pretty close its episode reward mean over here got to about minus 4.22 so i guess this depends on the starting point and how the model actually develops from there so let's go and test it out and see how we go so again we can use evaluate policy here but first we should save our model so model dot save let's define our path what are we going to call this shower model let's just double check our directory name again so it's training saved models so let's specify that and we're going to call this shower model ppo cool so we can type in model.save shower path and if we go and take a look that should be shower model underscore ppo so that is now saved so we're good again we can delete our model and if we wanted to reload it we can just type in model equals ppo dot load pass through our path pass through our environment and then if we wanted to test it out we can run evaluate policy pass through our model pass through our environment pass through the number of eval episodes and we don't need render this time because we don't have a render function implemented so if we type in render equals true it should throw an error it might actually not throw an error because we've just got pass in our render function so we're all good now in this case we've got a mean episode reward of 12 with a standard deviation of 58.78 so again it's getting there but it's got wide variance so you could train this for a whole lot longer tighten up the environment and make it a little bit more realistic than only being able to adjust up down or hold but that sort of gives you an idea as to how to bring this all together
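the saving loading and evaluation bit looks something like this — the model file name is just what we called it here so swap in whatever you like, and render stays off because the render function is empty:

import os
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

shower_path = os.path.join('Training', 'Saved Models', 'Shower_Model_PPO')
model.save(shower_path)

del model
model = PPO.load(shower_path, env)

# no render=True here since ShowerEnv.render is just a pass
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(mean_reward, std_reward)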
so in this tutorial we went through a bunch of stuff as well so we imported all of our dependencies we took a look at all of the different types of spaces remember there's a discrete space box tuple dict multi-binary and multi-discrete and just keep in mind that stable baselines doesn't support tuples yet we also took a look at how to build our environment so remember we had to define our init function our step function our render function and last but not least our reset function and then in terms of testing training and saving the model it's all very much the same but again if you've got a really sophisticated model that you'd like to build by all means hit me up in the comments below i'd love to help you out and if you do build some really cool environments do let me know i'd love to see them as well on that note we have now finished project number three so it comes to our wrap-up so hopefully you've enjoyed this course in it we've gone and covered a whole heap of stuff and remember the core purpose of this course is to bridge the gap between a lot of the theoretical stuff that you see floating around out there in terms of reinforcement learning and show you a practical set of implementation steps so we went through rl in a nutshell and talked about what reinforcement learning is and how it works we took a look at how to set up our environment with stable baselines we then went and built and took a look at some different types of environments using open ai gym in step number two we then went and trained a model we then went and tested and evaluated it and we took a look at how we can view it inside of tensorboard we then extended out some of our algorithms and specifically we went and implemented callbacks we went and used different algorithms so remember all up we used ppo a2c and we used the dqn algorithm as well we even went and changed our policy architecture so again some pretty cool stuff happening there and then we went through our three different projects so remember we went and trained a model to play breakout we went and trained a model to race a car around a racing track and we also went and built our own custom shower environment so all up we did quite a fair bit now i want to leave you with some additional resources so if you haven't gone through david silver's reinforcement learning course i'd highly recommend you do he's part of the team at deepmind that actually built the alphago model so obviously a super smart dude with some awesome theory floating around out there and by all means i do recommend you check it out there's also a great book called reinforcement learning an introduction by richard sutton and andrew barto some of the pioneers in the reinforcement learning field i highly recommend you go and check that book out it's got some awesome stuff in it as well now in terms of what to learn next i love it when people give me ideas as to where to go from here and i want to give you the same so one of the things that we didn't cover in this course is hyperparameter tuning so one of the ways that you can improve how you train your models is to tune the hyperparameters that you start and progress your algorithms with so that might be something that you take a deeper look into and if you'd like to see a tutorial or a course on that by all means do hit me up another idea is building detailed custom environments so we talked about this a little bit in terms of implementing a render function with pygame as well as integration with other simulation systems like mujoco and unity and then last but not least i think one of the coolest things that you could potentially take a look at learning is how to do an end-to-end implementation so say for example you actually went and built a cartpole robot trained it in a simulated environment and then deployed it on a real robot perhaps using a raspberry pi based robot i think that would be an awesome thing to go and take a look at next but on that note that about wraps it up
hopefully you've enjoyed this and thanks again for tuning in thanks so much for tuning in guys if you enjoyed this video be sure to give it a thumbs up hit subscribe and tick that bell and let me know any feedback or anything that you'd like to see going on from this and if you get stuck at all by all means do hit me up in the comments below i'm happy to help you out thanks again for tuning in peace
Info
Channel: Nicholas Renotte
Views: 408,105
Keywords: reinforcement learning python, reinforcement learning tutorial, reinforcement learning course, reinforcement learning example, reinforcement learning pytorch, reinforcement learning game, reinforcement learning
Id: Mut_u40Sqz4
Length: 181min 58sec (10918 seconds)
Published: Sun Jun 06 2021