Reinforcement Learning - A Simple Python Example and A Step Closer to AI with Assisted Q-Learning

Video Statistics and Information

Captions
So I'm going to talk about reinforcement learning, a topic I'm really excited about. A lot of the walkthroughs and videos out there that cover reinforcement learning describe it as a step closer to AI, or say it feels like AI, and it totally does: there's an extra layer of intelligence to it that's refreshing in machine learning. I've been at this for a few years, and machine learning has always been described to me as either supervised or unsupervised; recently I'm hearing it described as supervised, unsupervised, or reinforcement learning, so reinforcement learning has made its way into what used to be a duo and is now a trio. The main difference is that supervised and unsupervised learning need historical data to build a model; reinforcement learning doesn't. It works on a reward system. Imagine a maze-like example, like the one we'll walk through, where you want to go from a starting point to a goal point: the algorithm does not need to know the map. It will try every combination until it finds the best way to get from the starting point to the goal point, and that's very powerful in and of itself, because there is no data; it's actually searching things out on its own.

The other difficulty I've had trying to understand reinforcement learning is that a lot of the blog posts out there are very complex; they bring in game theory or deep search algorithms. I just wanted a very simple Python example, and that's what I'll show you today. I'm copying most of the code from Mic's blog, which is mentioned in my walkthrough; it finally made things very simple, stripped of all the superlatives and kept to the point, so thank you for that, Mic. We'll then look at a second, slightly more complex example where we bring a little more intelligence into it, which I think is very exciting. But first let's understand what reinforcement learning is all about, and then we'll move forward.

OK, so let's start with the code. Let me just paste it into IPython and then talk about it. All the code is on the blog mentioned in the walkthrough, so please go there for the source. We start with a points list. The points list represents your map, all the potential points you can travel between (don't worry about the reverse direction for now; just think of it as going one way). We can go from point 0 to 1, 1 to 5, 5 to 6, 5 to 4, 1 to 2, 2 to 3, and 2 to 7. Our goal is point 7 and our starting point is point 0. That's all you have to do to create your points list, which keeps things a lot simpler. We use NetworkX to visualize it, which is another nice thing about describing the map this way: you can easily visualize it. The layout is drawn randomly, so rerun it a few times until you get a picture you can understand. This one turned out fine: 0 to 1, 1 to 2, 2 to 7 would be the ideal route, and that's exactly what we want our Q-learning algorithm to figure out on its own. We just give it the map and say, "Hey, you figure it out." Points 3, 6, 5, and 4 are noise, and the algorithm will have to contend with them as things that don't help it reach its goal.

We then declare a matrix size: because our points go from 0 to 7, we know our matrix will be 8 by 8, and we create a rewards matrix, which is basically a representation of the map in a form the Q-learning algorithm can read.
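Here's a minimal sketch of the map and rewards-matrix setup just described, patterned after the structure of the code on Mic's blog (names like points_list and MATRIX_SIZE come from that write-up; treat this as an illustration rather than a verbatim copy of what's run in the video):

```python
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# Each tuple is a connection on the map; point 0 is the start, point 7 the goal.
points_list = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (2, 3), (2, 7)]
goal = 7

# Visualize the map; the layout is random, so redraw until the picture is readable.
G = nx.Graph()
G.add_edges_from(points_list)
nx.draw(G, with_labels=True)
plt.show()

# Rewards matrix: -1 = no connection, 0 = valid move, 100 = move that reaches the goal.
MATRIX_SIZE = 8
R = np.matrix(np.ones(shape=(MATRIX_SIZE, MATRIX_SIZE))) * -1

for point in points_list:
    # Fill in both the point and its inverse, since the bot can travel either way.
    R[point] = 100 if point[1] == goal else 0
    R[point[::-1]] = 100 if point[0] == goal else 0

R[goal, goal] = 100  # the goal loops back onto itself
```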
Don't worry too much about the mechanics of the rewards matrix; if you want more detail, go to Mic's blog or the other blog mentioned in the walkthrough, which covers more of the theory. Those are two really good, simple write-ups. We take the points list we created and use it to fill out the values in the rewards matrix. It starts out as all negative ones, which is just the initialization, and then each point and its inverse get filled in: 0 to 1 is here and 1 to 0 is there, 1 to 5 is here and 5 to 1 is there, and so on, because you can go both ways. Everywhere the bot cannot go stays at -1; everywhere it can go without reaching the reward gets a 0. The only entries treated differently are the goal points: you can go from point 2 to 7, or from 7 to 7, because the algorithm likes to have that loop back onto the goal. Those are the two winning moves. Really we're only interested in going from 2 to 7, not from 7 to 7; it's redundant, but the algorithm needs it, so we leave it in. That gives us our rewards matrix. Of course, this rewards matrix isn't shown to the algorithm up front; the algorithm only uses it after the fact to figure out whether a move earned points or lost points. That's the reward system.

A quick caveat: people have used reinforcement learning on Atari games, where the agent only sees the pixel values on the screen. It doesn't see the actual game; it doesn't know what an asteroid or a paddle is. It just sees the pixels, can read the score, and can move whatever control there is, maybe a paddle that goes left and right. That's all it knows: move the paddle, read the score, read the pixels. It tries every single combination until it sees the score consistently going up, and then it knows it's doing well. That's the concept here as well.

Now we build the brains, and this is very much copied from Mic's blog, so for more details please go there. In a nutshell: we give it an initial state; there's a function called available_actions that returns all the potential moves the bot could take (for example, if it's at point 1 it can go to either 5 or 2, so it returns two potential moves); sample_next_action simply picks one of those at random; and the update function is the brains of it, containing the Q-learning function, the modeling function. Again, go to those two blogs if you want more detail, but the one parameter you can tune is gamma, the learning parameter; everything else pretty much has to be the way it is. The update function updates the Q matrix, which is where the model saves how well it has done on its different routes. It keeps a tally of where it does well, and the biggest numbers eventually mark out the most efficient route. So the update function instantly figures out where the bot went, looks at the reward matrix, and says, "OK, this is where you went; this is the score I give you." The bot doesn't get to look at future moves; it can only look at the moves it has already made, but it gets a score based on how well it did.
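A sketch of those "brains," continuing from the snippet above and again following the shape of Mic's blog rather than the exact code run in the video:

```python
# Q matrix: where the algorithm keeps its running tally of how good each move is.
Q = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))

gamma = 0.8  # the learning parameter mentioned in the walkthrough; the one knob worth tuning

def available_actions(state):
    """Return every move the bot could take from the current state."""
    current_state_row = R[state, ]
    return np.where(current_state_row >= 0)[1]

def sample_next_action(available_act):
    """Pick one of the available moves at random."""
    return int(np.random.choice(available_act, 1))

def update(current_state, action, gamma):
    """Q-learning update: reward for the move plus the discounted best value
    of the state the move lands in."""
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size=1))
    else:
        max_index = int(max_index)
    max_value = Q[action, max_index]
    Q[current_state, action] = R[current_state, action] + gamma * max_value
    # Return a normalized score so the training loop can track convergence.
    return np.sum(Q / np.max(Q) * 100) if np.max(Q) > 0 else 0
```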
OK, moving on: now we have the training and testing step. Let me run this in bite-sized chunks. The first part is the training. It loops through what I just showed you 700 times, which is more or less a good average number of iterations to get things done, and gives us a score each time for how well it's doing. Let me run it, and here is the Q matrix, where it has learned the best moves, the ideal moves to get where it needs to go. Now we run the testing, and it returns the most efficient path: 0, 1, 2, 7. If we go back to the map, 0 to 1 to 2 to 7 is indeed exactly where we want to go: start at 0, finish at 7. Perfect, it figured it out. We can also plot its journey to see how well it did. We had it run 700 times, and it looks like right around iteration 400 it converges, meaning it has found the ideal path; everything after that is extra, it doesn't need to do any more. So that's really it: we gave it a map, said "go at it, try many combinations," and after around 400 combinations it figured out the easiest route.
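A sketch of that training and testing step: 700 random training moves, then greedily following the largest Q values from point 0 to the goal, then plotting the running score.

```python
# Training: let the bot make 700 random moves, scoring each against the reward matrix.
scores = []
for i in range(700):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    scores.append(update(current_state, action, gamma))

# Testing: starting from point 0, always take the highest-valued move in Q.
current_state = 0
steps = [current_state]
while current_state != goal:
    next_step = np.where(Q[current_state, ] == np.max(Q[current_state, ]))[1]
    next_step = int(np.random.choice(next_step, 1)) if next_step.shape[0] > 1 else int(next_step)
    steps.append(next_step)
    current_state = next_step

print("Most efficient path:", steps)  # expect something like [0, 1, 2, 7]

# Plot the running score to see roughly where the learning converges.
plt.plot(scores)
plt.show()
```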
What's exciting about Q-learning and reinforcement learning is that you may not even have a map. You may just know roughly what the goal looks like. In the example we'll look at next, finding honey, we can say, "Hey, find the beehive." I don't know the map, but I want you to find the beehive. At every single step the bot could figure out where it can go next, maybe left, right, up, or down, so it could figure out the map on its own. You, as the owner of the reinforcement learning program, may not know the map, but you can launch your bot, it can figure out where to go, and it will stop once it finds the honey. You may not know where the honey is, but at least you know you want it to find honey. That's what's really exciting about this approach: it's very different, and you really start seeing automated things going out into the real world, looking for something and not stopping until they find it. It's terrifying but awesome at the same time.

So let's take this example one step further, hopefully you're still with me, and add environmental details. Looking at the artwork my kids did: imagine an automated bot looking for honey (that's its jar), and it wants to get to the beehive. The assumption here is that there's a factory that makes a lot of smoke, and bees don't like smoke. So we know that if the bot finds smoke, say it's at point 5 and there's smoke there, it should turn around; and if it sees a bee at point 2, it should keep going, because bees mean a beehive is near. These are environmental factors I'd like it to collect as it tries to reach its goal of point 7. At this point the bot may not know that bees and factories have anything to do with what it's looking for. It still needs to figure out how to get from point 0 to point 7 on its own, but all along the way it takes snapshots of what it sees around it: at point 5 I found smoke, at point 2 I found bees, and it catalogs those things. When it comes back, you can use a more traditional classification model to figure out whether any environmental factors help it find its goal, and you'll find that bees have a positive coefficient for hives and smoke has a negative coefficient. So the next time that bot goes out, it can start thinking, "I don't really have a map, this is reinforcement learning and I work off rewards, but I've learned in the past that bees tend to mean there's a hive around, so let's keep going." It's a bit of a crude example, but it should get the idea across.

So we're using a slightly different map. Our starting point is still point 0, right where it says "start," and we want to get to the beehive. Now we have that hint: we don't know it yet, but there are bees at point 2 and there's smoke at points 4, 5, and 6. Hopefully, when it hits smoke it turns around, and when it sees bees it keeps going. We rerun our Q-learning, filling out the Q matrix along the road as before, but we add a slightly different ability: collecting environmental factors. At this point, did you find bees? At this point, did you find smoke? Let me run that; it runs, and now it returns two extra matrices, one for where it found bees and one for where it found smoke. Then we subtract these matrices to give the bot a bias, so positive values represent bees and negative values represent smoke. Keep in mind that normally you would go out into the field, collect your data, and do the modeling after the fact: what did the bot see on its journey toward the most efficient path, and is there any correlation with the goal we're looking for? Here it tells us bees are positive and smoke is negative, and now we can send the bot back out on its journey with this kind of map. I'm making another assumption here: we're going to run on the exact same map as before, which is why I can hand it the same environmental matrix, the map of where the bees and the smoke are. In reality you would have it go through a completely different map and let it figure out "there's a bee, plus one; there's smoke, minus one," but that would make this walkthrough a bit more complicated, and the concept remains the same. OK, so I run the code, and as we run the training this time we use "available actions with environmental help": it doesn't just look at all the next potential actions, it looks at them with a positive bias. If it finds smoke it rejects that move and doesn't offer it; if it finds bees, or nothing at all, it accepts it.
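Here's a rough, hypothetical sketch of that environmental-factor idea. The helper names (collect_environmental_data, enviro_matrix, available_actions_with_enviro_help) and the exact bookkeeping are placeholders for what the walkthrough describes; the code in the actual walkthrough may differ in detail.

```python
# Hypothetical positions of the environmental clues described above.
bees = [2]          # bees observed at point 2
smoke = [4, 5, 6]   # smoke observed at points 4, 5 and 6

# Tally matrices: what the bot saw at the destination of each move it tried.
enviro_bees = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
enviro_smoke = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))

def collect_environmental_data(current_state, action):
    """Record whether the move just taken ended at a point with bees or smoke."""
    if action in bees:
        enviro_bees[current_state, action] += 1
    if action in smoke:
        enviro_smoke[current_state, action] += 1

# Call collect_environmental_data() after every move of a first exploratory run,
# then fold the two tallies into one bias map: bees positive, smoke negative.
enviro_matrix = enviro_bees - enviro_smoke

def available_actions_with_enviro_help(state):
    """Like available_actions(), but drop moves the bias map marks as negative."""
    current_state_row = R[state, ]
    av_act = np.where(current_state_row >= 0)[1]
    env_row = np.array(enviro_matrix[state, av_act])[0]
    positive_or_neutral = av_act[env_row >= 0]
    # Only restrict the choice if something positive or neutral is left to pick from.
    return positive_or_neutral if len(positive_or_neutral) > 0 else av_act
```

In the second training run you would call available_actions_with_enviro_help in place of available_actions, so moves toward smoke are simply never offered to the bot.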
We get a slightly better plot this way, and it does seem to converge a little earlier: the first run was around 350 iterations and this one is more like 250 (a short sketch for overlaying the two score curves follows below). It's not a huge difference, but the point here isn't really to worry about whether my example makes perfect sense. The point is to see how you can apply Q-learning, which you can imagine physically as a bot going out into the real world and doing its thing until it finds its goal, to gather additional data, so that the next time you go after that goal you can get there a lot faster. That's the idea, and I think it's very exciting. Approaches like these can be applied to all sorts of things, medicine, the financial world, and I think it's fascinating.
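For reference, one minimal way to compare the two convergence curves side by side, assuming `scores` is the list from the plain training run (as in the sketch earlier) and `scores_with_help` is the analogous, hypothetical list collected during the environment-assisted run:

```python
import matplotlib.pyplot as plt

# Hypothetical variable names: per-iteration scores from the plain run and
# the environment-assisted run.
plt.plot(scores, label="plain Q-learning")
plt.plot(scores_with_help, label="with environmental help")
plt.xlabel("training iteration")
plt.ylabel("normalized score")
plt.legend()
plt.show()
```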
Info
Channel: Manuel Amunategui
Views: 69,540
Rating: 4.6996999 out of 5
Keywords: Python, Reinforcement Learning, AI, Machine Learning, Bees, Q-Learning, ML, Reinforcement, Mapping
Id: nSxaG_Kjw_w
Length: 16min 19sec (979 seconds)
Published: Sat Sep 30 2017