AlphaZero from Scratch – Machine Learning Tutorial

Captions
AlphaZero is a game-playing algorithm that uses artificial intelligence and machine learning techniques to learn to play board games at a superhuman level. In this machine learning course from Robert Forster you will learn how to create AlphaZero from scratch.

Thank you for tuning in. In this video we are going to rebuild AlphaZero completely from scratch using Python and the deep learning framework PyTorch. AlphaZero was initially developed by DeepMind, and it is able to achieve magnificent performance in extremely complex board games such as Go, where the number of legal board positions is significantly higher than the number of atoms in our universe. Not only is this impressive from an AI standpoint, but the machine learning system of AlphaZero learns all of this just by playing with itself, and the domains in which the algorithm can be used are almost limitless: it also plays chess and shogi at a very impressive level, and the recent AlphaTensor paper, also published by DeepMind, showed that AlphaZero can even be used to invent novel algorithms inside of mathematics. So together we will first understand how this algorithm works, and then we will build it from the ground up and train and evaluate it on the games of Tic-tac-toe and Connect 4, so that you also see how flexible AlphaZero really is when it comes to adapting it to various domains. So let's get started.

Okay, great, let's start with a brief overview of the AlphaZero algorithm. First of all, it is important to see that there are two separate components. On one hand we have the self-play part: during this phase our AlphaZero model plays with itself in order to gather information about the game, and while it plays with itself it generates data. This data is then used in the next step, during training. In this training phase, the second component, we optimize our model based on the information it gained while playing with itself. Next we complete the cycle and use this optimized model to play with itself again. The idea is essentially to repeat this cycle n times until we have a neural network, a model, that is capable of playing a certain game better than any human could, because it just kept playing with itself, using that information to optimize itself, and then playing with itself again.

In between these two components sits the AlphaZero model, which is just a neural network, so let's look at this neural network more closely and see how it is actually architected. We take the state s as input; in the case of Tic-tac-toe, the state is just the board position, so just some numbers. Once we have the state as input, we receive a policy and a value as output. The policy is a distribution over actions: for each action it tells us how promising that action would be given the state we received as input, so it basically tells us where to play. The value is just a float, a measure of how promising the state itself is for us as the player to be in.
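To make the input/output interface of this model concrete, here is a minimal PyTorch sketch of a network that maps a state to a policy and a value. The actual network built later in the course is a larger ResNet, so the class name, layer sizes, and hidden dimension below are purely illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAlphaZeroNet(nn.Module):
    """Minimal policy/value network sketch: board state in, (policy, value) out."""
    def __init__(self, num_cells=9, num_actions=9, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(num_cells, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, num_actions)  # one logit per action
        self.value_head = nn.Linear(hidden, 1)              # scalar state evaluation

    def forward(self, state):
        x = self.body(state)
        policy = F.softmax(self.policy_head(x), dim=-1)  # distribution over actions
        value = torch.tanh(self.value_head(x))           # squashed into [-1, 1]
        return policy, value

# example: an empty tic-tac-toe board flattened to 9 numbers
net = TinyAlphaZeroNet()
policy, value = net(torch.zeros(1, 9))
```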
A concrete example might make this easier to understand. Here the state is this position, and remember that we are player X. If you are familiar with the rules of Tic-tac-toe, you will see that we should play down here, because then we would have three Xs in the bottom row and would win the game. So if our model understands how to play, we would like to receive a policy looking somewhat like this, where the highest bar corresponds to exactly that square, the one that helps us win the game. That is how the policy should look. For the value, we can see that this state is quite nice for us as a player, because we can win here, so we would like a very high value. The value is just a float in the range of negative one to positive one, so in this case we want a value close to positive one, because this state is close to optimal.

Okay, now we have looked at the general architecture and how the model is built, but next we want to understand the self-play and training parts more in depth. Before we do that, you should also understand what a Monte Carlo tree search is, because it is vital for self-play and for the general algorithm. The Monte Carlo tree search is a search algorithm that takes a state, in our case a board position, as input, and then finds the action that looks most promising to us. The algorithm gets this information by first declaring a root node at the initial state and then building up a tree into the future, just like we see here. We create these future nodes by taking actions that lead us there: this node, for example, is the place we arrive at when we take action 1 from the initial state at our root node. Each node stores a state, like s1 in this case, and s3 would be the state we get when we first play action 1 from the initial state and then play action 3 from that state, so the states come from playing these actions one after another. Additionally, each node stores a variable W, the total number of wins we achieved when we played in this direction into the future: when we took action 1 initially and arrived at this node, we were able to win three times when we then walked further into the future. Next to that we also have the variable N, the total visit count, so the number of times we went in this direction; here we can say that we took action 1 four times and out of those four times we won three times.

After we have built up a tree like this, we can find the action that looks most promising by looking at the children of our root node, these two nodes, and calculating the winning ratio for each. Here this node won three out of four times, while this node won one out of two times, so the first node has a higher winning ratio and thus looks more promising to us. Because of that we can also say that action 1 looks more promising than action 2, since action 1 is the one that led us to this node in the first place.
So the result of the tree we have at this point would be that action 1 is the action we would like to play, and we could then use this information in our actual game of Tic-tac-toe and play that move. This is great, but now the question becomes: how can we even build up a tree like this, and how do we get the information whether we win at a given position or not? The way we come up with all of this data is by walking through these four steps for a given number of iterations.

At the beginning we are in the selection phase, and during selection we walk down the tree until we have reached a so-called leaf node. A leaf node is a node that could still be expanded further into the future. One example of a leaf node is this node here, because it only has one child, but we could also expand it in the other direction if we say that, in this example, each node always has two possible ways in which it could be expanded. This node would also be a leaf node, because it doesn't have any children yet, and this node is a leaf node as well, because it only has one child but could still be expanded in the other direction. These nodes, on the other hand, are not leaf nodes, because each of them already has two children.

That is what selection looks like, but now the question becomes which direction we should pick when we walk down. When we start at the top, we have to decide whether to walk down this direction or that direction, and the way we choose is by taking the child with the highest UCB score. While the formula might look a bit complex, in reality we just take the child that tends to win more often, while also preferring a child that has been visited relatively few times. In other words, when we maximize this formula we want the child with the highest winning ratio, and at the same time a child that has few visits compared to the big N, the total number of visits of the parent. If we calculate the UCB formula for both of these nodes, we see that the UCB score here is higher, because this node has been visited less often, and because of that we would like to walk in this direction.
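Written out, the selection rule just described is the standard UCT formula. With $w_i$ and $n_i$ the win and visit counts of child $i$, $N$ the visit count of the parent, and $C$ an exploration constant:

$$\mathrm{UCB}(i) = \frac{w_i}{n_i} + C\,\sqrt{\frac{\ln N}{n_i}}$$

The first term rewards children with a high winning ratio (exploitation), while the second term grows for children that have been visited rarely compared to their parent (exploration).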
So during selection we now walk down in this direction and arrive at this node. First we check whether we should walk down further or whether we have in fact reached a leaf node, and here we can say that we have reached a leaf node, because we could still expand in this direction, we could still have a child here. Because of that we skip to the next phase of the MCTS, the expansion phase. During expansion we want to create a new node and append it to our tree, so I will create this node right here and give it the state s7. We can say that we took the action a7 to get to this newly created node, and at the beginning it should have a win count of zero and a visit count of zero, so I declare W7 and N7 and set both to zero, because we have just created this node and added it to our tree but have never actually been there. Now we walk further down to the node we just created, and with that we have finished the expansion phase.

Next comes the simulation phase, and here the idea is to play randomly into the future until we have reached a state where the game is over. So we play random actions until the game is finished, and finally we reach this node right here, where the game is over. Once we have reached this terminal node, we check whether we have won the game, lost it, or whether a draw has occurred; in this concrete case we have actually won, and with that we can finish the simulation phase.

Finally we arrive at the backpropagation phase. Here we take the result from our simulation and pass this information up through the parents until we reach the root node. Because we won the game, during backpropagation we first increase the number of wins for this node, so we can say that we won one time, and we also increase its total number of visits, so we have visited it once and won. Since we want to backpropagate all the way up, we can now also say that for this node the number of wins becomes two, the number of points we got when walking in this direction, and its total number of visits becomes three, because we visited it when we selected in this direction. Then we backpropagate all the way up to our root node, where the total visit count becomes seven and the total number of wins becomes five. That would be one whole iteration of the MCTS process, and if we then wanted to choose an action, we would again look at the children of our root node and pick the action based on the number of wins.

Let's also walk through an example where we start right at the beginning, so that the search algorithm becomes easier to understand. Again we want to perform these four steps. At the beginning we are in the selection phase, so we want to walk down until we reach a leaf node, but here you can see that the root node itself already is a leaf node. So we move on to expansion and create a new node: let's say we create this node by taking action 1 and arrive at state s1. At this newly created node the total number of wins and the visit count should both be zero, so W1 and N1 are both set to zero. With that we have finished the expansion part for the first iteration, and next comes the simulation part.
So we have walked down this way, and now we want to simulate into the future by taking random actions. Let's do that, and now we have reached a terminal node, so again we check whether we have won, lost, or whether a draw has occurred. In this case we can say that we have won again, so we can backpropagate: during backpropagation the number of wins here becomes one, with the visit count being one as well, and then we do the same for our root node, so its number of wins becomes one and its visit count becomes one too. Now that we are finished with this iteration, I'm just going to clean everything up, and we can move on to the next one.

Again we start with selection and walk down until we have reached a leaf node, but we can see that our root node is still a leaf node, so let's expand it in the other direction now: here we take action 2 and arrive at state s2. Again we set the total number of wins W2 to zero and the visit count N2 to zero as well. Now we walk in this direction and do simulation again, so we simulate all the way down until we reach a terminal node, and in this case we have actually lost the game. So when I backpropagate all the way up, I only increase the visit count: the visit count here becomes one, but the total number of wins stays at zero because we lost the simulation. For the same reason, when we backpropagate to our root node, we also only increase its visit count while leaving its total number of wins the way it was. Let's finish up this iteration, and now we can move on to the next one.

First we do selection again, so we want to walk down until we reach a leaf node, but now our root node is fully expanded, so we have to calculate the UCB score for each of its children and select the child with the highest score. When we calculate the UCB score for both of them, we see that the UCB score here is higher, because we won the game in this branch, so we walk down in this direction. Now we can move on to expansion, so let's expand here, create a new node and append it to our tree: we took action 3 when we expanded in this direction, so the state should be s3, and again we declare the total number of wins W3 and the visit count N3 and set both to zero at the beginning. Now we run a new simulation: we walk down in this direction, simulate, arrive at a terminal node, and this time we see that we have lost the game after playing randomly. Because of that, when we backpropagate we only want to increase the visit counts without changing the total number of wins, so we set the visit count here to two
and the visit count of the root to three. Great, so now we can clean all of this up again and move on. To keep the process going we again have to do selection, so once more we calculate the UCB formula for both of these children, and here we see that the UCB formula gives a higher result for this node, simply because its visit count is lower. When we walk in this direction, we again check whether we have reached a leaf node, and we have, so we move to expansion and create this new node, which gets state s4, and we can say that we took action 4 to get here. Again the number of wins W4 and the visit count N4 are set to zero. Now we walk in this direction, and after expansion we move on to simulation. When we simulate here, we reach a terminal state, and in this case a draw has occurred. In the case of a draw we just add 0.5 to the total number of wins, because neither player has won. So when we backpropagate, we first increase the visit count here and set the number of wins to 0.5, then up here we set the visit count to two and add 0.5 to the number of wins as well, and when we backpropagate all the way up to our root node, its total number of visits becomes four and its number of wins becomes 1.5.

This was just an example of how the Monte Carlo tree search might look when you start right at the beginning. In practice you repeat this cycle for a set number of iterations that you choose manually, or alternatively you set a time budget and perform as many iterations as you can within that time. In this example we stopped after four iterations, but in practice you might run thousands of them.

Okay, now we can look at the way our Monte Carlo tree search changes when we adapt it to the general AlphaZero algorithm. There are two key changes. The first is that we want to incorporate the policy we get from our model into the search process, specifically into the selection phase, and we do this by incorporating it into an updated UCB formula: that is the P(i) you see here. When we select a child, we still take the child with the highest UCB score, but now we also look at the policy probability that was assigned to that child from its parent's perspective. Remember that the policy is a distribution of likelihoods, and when we expand, we store for each child the policy likelihood at its position inside the node. Because of that we tend to select children more often if they were assigned a high policy by their parent, so our model can guide us through the selection phase inside the Monte Carlo tree search. That is the first key change, this updated UCB formula.
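The updated UCB formula that folds in the policy looks roughly like this, in the spirit of the PUCT rule from the AlphaZero paper and of the code written later in this video:

$$\mathrm{UCB}(i) = Q(i) + C \cdot P(i) \cdot \frac{\sqrt{N}}{1 + n_i}$$

Here $Q(i)$ is the mean value of child $i$ (taken as 0 while $n_i = 0$), $P(i)$ is the policy probability its parent assigned to it, $N$ is the parent's visit count, and $C$ again controls how much we explore.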
For the second change, we also want to use the value we get from our neural network, and we do this by completely getting rid of the simulation phase. We no longer do random rollouts into the future until we reach a terminal state; instead we just use the value our neural network produced when it evaluated the state, and this value is then used for backpropagation. So we use the neural network's policy for selection and its value for backpropagation, and because of that our Monte Carlo tree search improves drastically once we have a model that understands how to play the game. In the end, a better search produces a better model, which in turn produces an even better search, so we can keep this cycle going.

There is also one more small change: since the policy we get from our neural network immediately gives us a probability for each potential child, it is much more convenient to expand in all possible directions during the expansion phase, so we don't create just one new node but rather all of the nodes we can expand to.

To make this easier to understand, let's do some iterations on the whiteboard together, so that it becomes clearer how the Monte Carlo tree search is adapted. First, you can see that we now also store the policy information inside each node; this is the probability that was assigned to it by its parent. The root node has no policy probability, because it has no parent, but for all other nodes we store this policy inside the node. So let's start with an iteration. First we want to walk down until we have reached a leaf node, and we can see right away that our root node is a leaf node, because we could still expand in these two directions. Because of that we move to the next step, the expansion phase, where we create our new nodes. When we expand, we also want to get a policy and a value from our neural network, so I will write policy, value = f(state), using the state of our root node. Let's say this policy is 0.6 for the first child and 0.4 for the second child, and let's say that the value, the estimate of how good the state of our root node is, is 0.4. Now when we expand we create these two new nodes, with action 1 here and action 2 here.
Because of that we also have state s1 here and state s2 here. Now we can add the number of wins, the visit count, and the policy information to our nodes. First I create W1 and set it to zero, then N1 and set it to zero as well, and then the policy probability P1, which should be equal to the policy for action 1, so 0.6. Now we add the same information on the other side: the number of wins, or rather the value sum later on, W2 is set to zero, N2 is set to zero as well, and the policy probability of this child should be 0.4.

So now we have expanded, and we skip the simulation step, which means we move straight to backpropagation, where we backpropagate this value of 0.4. For our root node it doesn't matter much, but it's still nice information, so we add 0.4 here and change the visit count to one. That was one iteration, and now we can move on to the next one, so I will just clean this up.

Again we want to walk down until we reach a leaf node, and now we have to decide between these two children: for each of them we calculate the UCB formula, and we walk down in the direction where the UCB formula is highest. When we calculate this UCB formula, we notice that we can no longer compute the winning probability, because the visit count is zero, so the way this is currently written we would get a division-by-zero error. We therefore only compute the winning-probability part if the visit count is larger than zero (I will put that in brackets); in all other cases we mask it out and only use the remaining part of the formula. Now we find the child where this value is highest, and we see that we pick this child, because it has the higher policy probability and therefore the higher UCB score, so we walk in this direction.

Now we have reached a leaf node and want to expand it, so let's expand in these two directions: this should be action 3 and this should be action 4, and when we create the new states we get state s3 and state s4. Again we have to compute a new policy and a new value by calling our neural network f with the state s1, because that is the position we have walked down to, and from here we want the policy for these children and the value for this node directly. Let's say the policy is 0.5 and 0.5, and the value is 0.1. First we add this information to our nodes: W4 is set to zero, N4 is set to zero, and the policy probability P4 should be 0.5.
On the other side we have W3, which should also be zero, N3, which should also be zero, and the policy probability P3, which should be 0.5 as well. So now we have expanded these two children, created them, and stored the policy information inside them. Next we want to backpropagate again: here we use the value of 0.1, so we add 0.1 to this node and raise its visit count by one, and remember that we want to backpropagate all the way up to our root node, so we also change the value of the root, adding 0.1 to it so it becomes 0.5, and raise its visit count by one as well, so it becomes two. This is the updated tree after these two iterations, and if we kept going we would always first walk down until we reach a leaf node, then expand, using the policy when we store the policy information in the new children, and then use the value to backpropagate all the way up the tree to our root node. So these are the changes that have been made to our Monte Carlo tree search.

Great, now that you know what a Monte Carlo tree search is and how we adapt it for the AlphaZero algorithm, we can move on to the self-play part. During self-play we play games with ourselves from beginning to end. We start with the initial state, which is a completely blank board, and we are assigned a player; based on this state and player we perform a Monte Carlo tree search, and the search returns a distribution, namely the visit-count distribution of the children of the root node. Once we have this distribution, we sample an action from it; in this case, for example, we sample the highlighted action, because it turned out to be quite promising during the search. Once we have sampled this action we act on it: we play here as player X and move to the next state, and we also switch the player, so now we are player circle, player O, and we take this new state as input. Again we perform a Monte Carlo tree search, get the MCTS distribution, sample an action from it, play that action, and change the perspective again so that we are player X. We keep doing this, performing a search, sampling an action, playing it, and switching to the opponent, until we reach a terminal state, that is, until a player has won or a draw has occurred.

When that is the case, we store all of the information we gained while playing as training data. Let's look at how this information is structured: for each state we store the MCTS distribution, and we also need the reward for the given state, where the reward is equal to the final outcome of the game from the perspective of the player at that state. In this example player X won the game, so for all states where we were player X we want the final reward, the final outcome, to be +1, and for the states where we were player O the reward should be -1, because we lost the game. The idea is that since we won the game as player X, we may guess that those states were quite promising, because they eventually led us to win, which is why the reward is +1 when we were player X and -1 when we were player O. These combinations of the state, the MCTS distribution, and the reward are then stored as tuples in our training data, which we can later use to improve our model.
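As a rough sketch of the loop just described, a self-play episode might look like the Python below. It assumes the `game` and `mcts` objects with the interfaces built later in this video (`get_initial_state`, `change_perspective`, `get_next_state`, `get_value_and_terminated`, `get_opponent`, `get_opponent_value`, and `mcts.search`), so treat the names and signatures as assumptions rather than the final implementation:

```python
import numpy as np

def self_play(game, mcts):
    """Play one game against ourselves and return (state, mcts_probs, reward) tuples."""
    memory = []
    player = 1
    state = game.get_initial_state()

    while True:
        # the search always sees the board from the perspective of player 1
        neutral_state = game.change_perspective(state, player)
        action_probs = mcts.search(neutral_state)
        memory.append((neutral_state, action_probs, player))

        # sample an action from the visit-count distribution and play it
        action = np.random.choice(game.action_size, p=action_probs)
        state = game.get_next_state(state, action, player)

        value, is_terminal = game.get_value_and_terminated(state, action)
        if is_terminal:
            # the final outcome becomes +1 for the winner's states, -1 for the loser's
            return [
                (hist_state, hist_probs,
                 value if hist_player == player else game.get_opponent_value(value))
                for hist_state, hist_probs, hist_player in memory
            ]

        player = game.get_opponent(player)
```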
That covers self-play, but now we have to understand how training works, so let's look at that. At the beginning we take a sample from our training data, and by now you know that a sample consists of the state s, the MCTS distribution pi, and the reward z. We use the state s as the input to our model and get a policy p and a value v back as output. For training, the next step is to minimize the difference between the policy p and the MCTS distribution pi at the given state on one hand, and to minimize the difference between our value v and the final reward, the final outcome z that we sampled from the training data, on the other. The way we minimize these differences is with a loss that consists of a mean squared error between the reward and the value, plus a multi-target cross-entropy loss between our MCTS distribution pi and our policy p, and finally some form of L2 regularization at the end. We minimize this loss by backpropagation, which updates the weights theta of our model, so we get a model that better understands how to play the game and has been optimized. We can then use this optimized model to play with itself again in order to gain more information, train again, and so on.
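Written as a formula, this is the loss from the AlphaZero paper, with $z$ the sampled outcome, $v$ the predicted value, $\pi$ the MCTS distribution, $p$ the predicted policy, and $c$ the weight of the L2 regularization on the parameters $\theta$:

$$\ell = (z - v)^2 \;-\; \pi^{\top} \log p \;+\; c\,\lVert \theta \rVert^2$$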
So this is how AlphaZero is structured, and now we can actually get to coding, so let's start programming AlphaZero. First of all, we are going to build everything inside a Jupyter notebook, since the interactivity is nice for understanding the algorithm, and it also helps if you want to use Google Colab, for example, to use its GPUs and train a bit more efficiently. We will start by creating a simple game of Tic-tac-toe, then build the Monte Carlo tree search around it, and once we have come that far we will eventually build AlphaZero on top of the Monte Carlo tree search, and then expand our portfolio to Connect 4 as well. Not only should this be easier to understand, it should also show how flexible AlphaZero really is when it comes to solving different environments, or board games in this case.

For Tic-tac-toe we want to use NumPy as a package, so we import numpy as np (if you are wondering, my version is the one shown here). Let's create a TicTacToe class with a simple __init__. First we want a row_count, which equals three, and a column_count, which should also be three. Then we also need the action_size, which is just the number of rows multiplied by the number of columns, so self.row_count * self.column_count. That's great, and now we can write a method get_initial_state, which is what we call at the beginning: at the start the state is full of zeros, so we return np.zeros, and the shape here is the row_count for the number of rows and the column_count for the number of columns.

Next we also want a method get_next_state that gives us the next state after an action has been taken. As input we take the previous state, the action, and the player. The action itself is just an integer between 0 and 8, where 0 is the top-left corner and 8 is the bottom-right corner, so we want to decode this action into a row and a column so that we can use it to index our NumPy array. We define the row as the action divided by the column_count using integer division, so we get no floats, and the column as the action modulo the column_count. If our action were 4, for example, the row would be 4 // 3, which is 1, and the column would be 4 % 3, which is also 1, so we get row 1 and column 1, the middle of the board, which is exactly what we want. Then we set the state at the given row and column to the player we received as input, and we also return the state so our code stays readable.

Next we want a method that tells us which moves are legal, and those are just the fields that are equal to zero. So we write get_valid_moves, which also takes a state as input, and we return state.reshape(-1), so we flatten the state, and then check whether it is equal to zero. This gives us a boolean array back, true or false, but it is more convenient to have integers here, so I cast the type to np.uint8, unsigned integers.

Now we also want a method that tells us whether a player has won after acting in a certain way, so let's write check_win, which takes the state and the action. First we get the row again, the action integer-divided by self.column_count, and the column as well, and then we can get the player that played at the given position by reading the state at that row and column, basically the expression from before turned around. In Tic-tac-toe there are four different ways in which you can win a game: three in a row, three in a column, three in one diagonal, or three in the other diagonal. We want to check all four of these conditions; if one of them turns out to be true we return True, and if none of them are true we return False. So let's write out these expressions: first we check whether there are three in a row, using np.sum of the state at the given row over all columns of that row, and we check whether that sum equals player * self.column_count.
Then we check whether there are three in a column, using np.sum of the state over all rows for the given column, and again we check whether that equals player * self.row_count (row_count in this case, let's be flexible here). Next we want to check the first diagonal, and we can do this with the np.diag method. I pulled up the documentation here: if we have an array like the one we use for Tic-tac-toe, np.diag gives us the values on its diagonal, so we can use it to check whether someone has won. We use np.sum(np.diag(state)) and check whether that equals player * self.row_count, or self.column_count, it doesn't matter here, because you can only get a full diagonal when row and column count are the same. We also want the opposite diagonal, and there is no way to get it with the plain diag method, but we can flip our state and then use diag again; since the state is flipped, it is as if the diagonal itself were flipped, if that makes sense. So we use np.sum(np.diag(np.flip(state, axis=0))), and we check whether that equals player * self.row_count.

Next we always want to check whether the game has terminated in some other way, for example in a draw, since that is also a scenario where we want to stop the game. So let's write a new method get_value_and_terminated, which takes a state and an action. First we check whether check_win is true: if self.check_win(state, action), we return 1 for the value and True for terminated. Then we check whether there has been a draw, and we know there has been a draw when the number of valid moves is zero, so we take the sum of the valid moves and check whether it is zero: if np.sum(self.get_valid_moves(state)) == 0, we return 0 for the value, since no one has won, and True for terminated. In all other cases the game must continue, so we return 0 and False.

Awesome, so now we have a working game of Tic-tac-toe. Additionally we want a method to change the player, since that might differ between board games, so let's write get_opponent, which takes a player as input and returns the negative player: if our player is -1 we return 1, and if our player is 1 we return -1. So that's great.
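Collected in one place, the TicTacToe class as dictated so far might look like the sketch below. It follows the method names used in the video, but consider it a reconstruction rather than the exact notebook code:

```python
import numpy as np

class TicTacToe:
    def __init__(self):
        self.row_count = 3
        self.column_count = 3
        self.action_size = self.row_count * self.column_count

    def get_initial_state(self):
        # empty 3x3 board
        return np.zeros((self.row_count, self.column_count))

    def get_next_state(self, state, action, player):
        # decode the flat action (0..8) into a row and a column
        row = action // self.column_count
        column = action % self.column_count
        state[row, column] = player
        return state

    def get_valid_moves(self, state):
        # 1 where the field is still empty, 0 otherwise
        return (state.reshape(-1) == 0).astype(np.uint8)

    def check_win(self, state, action):
        row = action // self.column_count
        column = action % self.column_count
        player = state[row, column]
        return (
            np.sum(state[row, :]) == player * self.column_count          # three in a row
            or np.sum(state[:, column]) == player * self.row_count       # three in a column
            or np.sum(np.diag(state)) == player * self.row_count         # main diagonal
            or np.sum(np.diag(np.flip(state, axis=0))) == player * self.row_count  # anti-diagonal
        )

    def get_value_and_terminated(self, state, action):
        if self.check_win(state, action):
            return 1, True
        if np.sum(self.get_valid_moves(state)) == 0:
            return 0, True   # draw
        return 0, False      # game continues

    def get_opponent(self, player):
        return -player
```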
Now we can test the game we built. I write tictactoe = TicTacToe(), then player = 1, and state = tictactoe.get_initial_state(). Then let's write a while loop: while True, we print the state, and then we want to get our valid moves so we know where to play, so valid_moves = tictactoe.get_valid_moves(state), and we print them as well. Let's print the valid moves in a more readable format: currently we only have zeros and ones, but we actually want the positions where we can play, so I build a list [i for i in range(tictactoe.action_size) if valid_moves[i] == 1], so we get all the indices of the actions that are actually valid, and print that. Then we want an action, which is just an input cast to an integer, and we print the player next to the prompt. Then we check whether the action is valid: if valid_moves[action] == 0, we print "action not valid" and continue, so we ask again; in all other cases we want to make the move, so we create a new state by calling tictactoe.get_next_state with the old state, the action, and the player. Then we check whether the game has terminated: value, is_terminal = tictactoe.get_value_and_terminated(state, action), and if is_terminal we print the state so we know where we are; if value equals 1 we print that the player has won (so 1 or -1 for the player), and in all other cases we print that there has been a draw; then we break the while loop to end the game. If the game continues, we flip the player: player = tictactoe.get_opponent(player).

Nice, this should be working, so let's test it. Here are the valid moves, looking nice, so let's pick 0 for starters; okay, we played there, so we are player -1 now and play 4; as player 1 we play 8; as player -1 we play 1, for example; as player 1 we play 2; and as player -1 we play 7, and nice, we see that we have three -1s inside this column, and so we get the result that player -1 has won. Perfect.
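The interactive test loop just dictated might look roughly like this, assuming the TicTacToe class sketched above:

```python
tictactoe = TicTacToe()
player = 1
state = tictactoe.get_initial_state()

while True:
    print(state)
    valid_moves = tictactoe.get_valid_moves(state)
    # print the indices of the fields that are still free
    print("valid_moves", [i for i in range(tictactoe.action_size) if valid_moves[i] == 1])
    action = int(input(f"{player}:"))

    if valid_moves[action] == 0:
        print("action not valid")
        continue

    state = tictactoe.get_next_state(state, action, player)
    value, is_terminal = tictactoe.get_value_and_terminated(state, action)

    if is_terminal:
        print(state)
        if value == 1:
            print(player, "won")
        else:
            print("draw")
        break

    player = tictactoe.get_opponent(player)
```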
Since we now have our game of Tic-tac-toe, we can build the Monte Carlo tree search around it. Let's create a new cell and write a class for our Monte Carlo tree search; I'm just going to call it MCTS for now. In its __init__ we want to pass a game, Tic-tac-toe in this case, and also some args, which are just hyperparameters for the search, so inside the __init__ we say self.game = game and self.args = args. Below that we define our search method, which is what we will actually call when we want to get our MCTS distribution back, so let's write def search, and we pass a state as input. Inside, we first want to define a root node, and then run all of our search iterations, so for now we write for search in range(self.args['num_searches']), and for each search we do the selection phase, then the expansion phase, then the simulation phase, and then the backpropagation phase. After we have done all of that for the given number of searches, we return the visit-count distribution, that is, the distribution of visit counts of the children of our root node, at the end. That is the structure we have inside of our Monte Carlo tree search.

Next we define a class for a Node as well, so let's write class Node with its own __init__. First we want to pass on the game and the args from the MCTS itself, then the state of the node, then a parent, but not all nodes have parents, so for the root node we use None as a default placeholder, and then an action_taken, where again the default is None for the root node. Inside the __init__ we say self.game = game, self.args = args, self.state = state, self.parent = parent, and self.action_taken = action_taken. Next we also want to store the children inside our node, so self.children is just an empty list, and we want to know in which ways the node could be expanded further: for example, if the root node is the initial state, we could expand to all nine fields of a Tic-tac-toe board, and we want to store this information, so self.expandable_moves = game.get_valid_moves(state) at the beginning; as we keep expanding, we will remove moves from this list. Additionally we store the visit count and the value sum of our node, so that later we can compute the UCB score and get the distribution back at the end: self.visit_count = 0 and self.value_sum = 0 when we start. That is the basic structure of our Node, and now we can define our root node: root = Node(self.game, self.args, state), and the parent and action can just stay None. Inside our search loop we say node = root at the beginning.

Next we want to do selection, and from the earlier part of the video you might remember that we keep selecting downwards in the tree as long as our nodes are fully expanded. To check that, we write a method is_fully_expanded, and a node is fully expanded if there are no expandable moves left and if it has children, because if a node is terminal you obviously can't select past it, since the game has already ended. So we return np.sum(self.expandable_moves) == 0 and len(self.children) > 0. With this we can check whether a node is fully expanded, and during the selection phase we simply say: while node.is_fully_expanded(), node = node.select(), so we keep selecting downwards.
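Here is a sketch of the Node class and the MCTS skeleton as described up to this point; the remaining phases are filled in step by step below, and `num_searches` and `C` are assumed to live in the `args` dictionary:

```python
import numpy as np

class Node:
    def __init__(self, game, args, state, parent=None, action_taken=None):
        self.game = game
        self.args = args
        self.state = state
        self.parent = parent
        self.action_taken = action_taken

        self.children = []                                    # expanded children
        self.expandable_moves = game.get_valid_moves(state)   # moves not expanded yet

        self.visit_count = 0
        self.value_sum = 0

    def is_fully_expanded(self):
        # no untried moves left and at least one child to select from
        return np.sum(self.expandable_moves) == 0 and len(self.children) > 0


class MCTS:
    def __init__(self, game, args):
        self.game = game
        self.args = args

    def search(self, state):
        root = Node(self.game, self.args, state)

        for search in range(self.args['num_searches']):
            node = root
            # selection: walk down while the current node is fully expanded
            while node.is_fully_expanded():
                node = node.select()
            # expansion, simulation and backpropagation are added in the next steps,
            # and the visit counts of the root's children are returned at the end
```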
Now we want to write the method that actually selects downwards. The way selection works is that we loop over all the children of a node, calculate the UCB score for each child, and at the end pick the child with the highest UCB score. At the beginning we write best_child = None, since we haven't checked any children yet, and best_ucb = -np.inf; then we write for child in self.children, ucb = self.get_ucb(child), and we check if ucb > best_ucb; if that is the case we set best_child = child and best_ucb = ucb, and at the end we return best_child.

Next we should write the get_ucb method we use for calculating the UCB score; it takes a child as input. For the UCB score we first have our Q-value, which should be the likelihood of winning for a given node; then we have C, a constant that tells us whether we want to focus more on exploration or on exploitation; and then we have a square root of the logarithm of the visit count of the parent divided by the visit count of the child. So we want to select nodes that are promising in terms of win likelihood as perceived by the parent, and we also want to select nodes that haven't been visited often compared to the total number of visits. First we define the Q-value, which is generally the value sum of our child divided by its visit count. The way we implemented our game, it is possible to have negative values here, and we would like our Q-value to be between 0 and 1 so that we can treat it like a probability; currently it is between -1 and +1, so we add 1 and then divide everything by 2. There is one more thing we should think about: the Q-value as written is what the child thinks of itself, not how it should be perceived by the parent, and you should remember that in Tic-tac-toe the child and the parent are generally different players. As the parent we would like to select a child that itself has a very low or negative value, because as a player we want to put our opponent in a bad situation, so we flip this around by writing 1 minus the old Q-value. If our child has a normalized value close to zero, then from the parent's perspective the Q-value is almost 1, which is very good for us, because we walk down the tree in a way that puts the child in a bad position. Now we can return our UCB: q_value plus self.args['C'] multiplied by math.sqrt (we should also import math at the top so we can use it), and inside the square root we have math.log of the visit count of the node itself, self.visit_count, divided by the visit count of the child. That should be it, so now we have a working selection method and we can move on.
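The two methods just dictated, select and get_ucb, might look like this; they are meant to slot into the Node class sketched above (math and numpy imported at the top of the notebook):

```python
import math
import numpy as np

# methods of the Node class sketched earlier

def select(self):
    # pick the child with the highest UCB score
    best_child = None
    best_ucb = -np.inf
    for child in self.children:
        ucb = self.get_ucb(child)
        if ucb > best_ucb:
            best_child = child
            best_ucb = ucb
    return best_child

def get_ucb(self, child):
    # child's mean value normalized to [0, 1], then flipped to the parent's perspective
    q_value = 1 - ((child.value_sum / child.visit_count) + 1) / 2
    return q_value + self.args['C'] * math.sqrt(math.log(self.visit_count) / child.visit_count)
```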
At this point we have selected our way down the tree as far as we could go and have reached a leaf node, or a node that is terminal. Before we expand this leaf node, we still have to check whether the node we finally selected is a terminal one or not. We do this by writing value, is_terminal = self.game.get_value_and_terminated, referencing the method we wrote earlier, and as inputs we pass the state of the node we selected and the last action that was taken, so node.state and node.action_taken. There is one thing we have to remember: the node is initialized so that action_taken is the action that was taken by the parent, not by the node itself, so from the node's perspective it was an action taken by the opponent. That means that if we have reached a terminal node and get the result that a player has won here, then the player who won is the opponent, not the player of the node itself, and if we want to read this value from the node's perspective we have to flip it around. For that we add a new method to our game class, get_opponent_value, which simply returns the negative value: if from the parent's perspective we got a 1 as a return, we turn it into a -1 from the child's perspective, because it is the opponent that won, not the child itself. So we use it here: value = self.game.get_opponent_value(value).

We should also keep one more thing in mind: we are using node.action_taken here, but at the very beginning we only have our plain root node, which is not fully expanded, since it still has all possible ways to expand, so we will call this immediately with our root node, and the way we initialized the root node, action_taken is still None. If we call get_value_and_terminated with the action being None, we go into check_win with action being None, evaluate the row expression, and that will just give an error. So inside check_win we have to say: if action is None, just return False, because for the root node we always know that no player has won, since no player has played yet.

Perfect, so now we are finished with this part, and next we check whether the node is terminal: if it is, we backpropagate immediately, and if not, we do expansion and simulation and backpropagate afterwards. So I write if not is_terminal, and inside this if statement we do expansion and simulation, and at the end we backpropagate. Inside the if statement we can write node = node.expand(), so we get back the node that was created by expanding, and then inside our Node class we write a new method that does the expansion job.
So let's write def expand. The way expanding works is that we sample one move out of the expandable moves we defined above; once we have sampled this move, or action, we create a new state for our child, act inside that state, create a new node with the new state, and append this node to our list of children so that we can later reference it inside our select method; at the end we return the newly created child. First let's get an action by sampling from the expandable moves: we use np.random.choice, which picks a random value out of a list or NumPy array, and inside we say np.where with the condition that self.expandable_moves == 1, because the way we programmed it, all entries that are one are legal and we could expand on them. With np.where we always have to take the first element of the result, so this should work now: we first check which moves are legal, then use np.where to get the indices of all legal moves, and then use np.random.choice to randomly sample one index, and that legal index is the action we take. Now we also have to make this move non-expandable, since we have already sampled it, so we set self.expandable_moves[action] = 0.

Now we can create the state for our child: at the beginning we just copy our own state, and then we act inside the child's state by saying child_state = self.game.get_next_state(child_state, action, 1), so the state, the action, and player 1. You might have noticed that we haven't even defined a player yet, which seems weird, because Tic-tac-toe obviously has two players. But the way we will program our Monte Carlo tree search, we will never change the player or give the node the information of the player; instead we always change the state of the child so that it is perceived from the opponent's perspective, while the opponent still thinks he is player 1. So the parent and the child both think they are player 1, but instead of changing the player we flip the state around: we turn all positive numbers into negative ones and vice versa, and this way we can create a child state that belongs to a different player but still thinks it is player 1. This makes the logic of our game much simpler, and it even keeps the code valid for one-player games, which is very nice. So we write child_state = self.game.change_perspective(child_state, player=-1), because in Tic-tac-toe, when you change the perspective or create a child node, you always want to become the opponent. Let's add this new method to the game class: def change_perspective(self, state, player), and in Tic-tac-toe we can just return the state multiplied by the player, so like I said, when we change the perspective to player -1, to the opponent, all positive ones inside our Tic-tac-toe state become negative ones.

Now that we have our child state ready, we can create the child itself, which is a new Node: child = Node(self.game, self.args, child_state, self, action), so first the game, then the args, then the child state, then ourselves as the parent, and then the action we sampled above as the action taken. Then we append this child to our children, self.children.append(child), and at the end of the method we return the child we have just created. Awesome, so the expansion part is done.
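The expand method and the change_perspective helper just described might look like this; again, these are sketches of the dictated code, meant to be added to the Node and TicTacToe classes shown earlier:

```python
# game method (TicTacToe): flip the board so the opponent sees itself as player 1
def change_perspective(self, state, player):
    return state * player

# Node method: expand one randomly chosen untried move and return the new child
def expand(self):
    action = np.random.choice(np.where(self.expandable_moves == 1)[0])
    self.expandable_moves[action] = 0

    child_state = self.state.copy()
    child_state = self.game.get_next_state(child_state, action, 1)
    child_state = self.game.change_perspective(child_state, player=-1)

    child = Node(self.game, self.args, child_state, self, action)
    self.children.append(child)
    return child
```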
After we have expanded, we want to do simulation. From the video you might remember that we perform these rollouts, and the way rollouts work is that we just play random actions until we have reached a terminal node, and then we look at the value, so the final outcome of the game — who won, essentially — and we use that information for our backpropagation: nodes where the player of the node itself won after the randomly chosen actions are more promising than the ones where the opponent won. So here we can just write value = node.simulate(), and then we obviously need to write a simulate method in the Node class, so let's do that: def simulate(self). First of all, we again have to check whether the node that has just been expanded is itself terminal, so we write value, is_terminal = self.game.get_value_and_terminated(self.state, self.action_taken), and again we have to flip this value around to the opponent's perspective, because whenever a node has won, it is not because of the node itself but because of its parent that took the action; so value = self.game.get_opponent_value(value). If the node is terminal, we just return this value. In all other cases we now perform our rollout. To do that we write rollout_state = self.state.copy() and rollout_player = 1 at the beginning — here we actually do track a player, just because we need to find out at the end whether the value we got should be flipped with get_opponent_value before we use it for backpropagation. Inside the rollout we write while True, and then we again sample an action like we did in expand: first valid_moves = self.game.get_valid_moves(rollout_state), then action = np.random.choice(np.where(valid_moves == 1)[0]). Now we use that action to get the next state: rollout_state = self.game.get_next_state(rollout_state, action, rollout_player). After we have the new state, we check again whether it is terminal: value, is_terminal = self.game.get_value_and_terminated(rollout_state, action), using the newly updated rollout state and the action we took. If is_terminal is true, we want to return our value, but first we have to check whether we were player one when we won or hit the terminal node: if rollout_player == -1 we flip the value with self.game.get_opponent_value(value), so that we don't get a positive return when it was actually the opponent that won, and then we return the value just like we did above. If we haven't reached a terminal node yet, we flip the player around: rollout_player = self.game.get_opponent(rollout_player). So that should be it — now we can do simulation after we have expanded, and the final step will be backpropagation.
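Put together, the simulate method might look like this sketch:

```python
# in the Node class: random rollout until a terminal state is reached
def simulate(self):
    value, is_terminal = self.game.get_value_and_terminated(self.state, self.action_taken)
    value = self.game.get_opponent_value(value)
    if is_terminal:
        return value

    rollout_state = self.state.copy()
    rollout_player = 1
    while True:
        valid_moves = self.game.get_valid_moves(rollout_state)
        action = np.random.choice(np.where(valid_moves == 1)[0])
        rollout_state = self.game.get_next_state(rollout_state, action, rollout_player)
        value, is_terminal = self.game.get_value_and_terminated(rollout_state, action)
        if is_terminal:
            if rollout_player == -1:
                # the opponent finished the game, so flip the result back to us
                value = self.game.get_opponent_value(value)
            return value
        rollout_player = self.game.get_opponent(rollout_player)
```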
We call node.backpropagate(value) with the value we got either from the terminal check or from the rollout, and we write that method inside of our Node class: def backpropagate(self, value). When we backpropagate, we first want to add the value to our value sum, then increase our visit count by one, and then keep propagating up all the way to the root node. So we write self.value_sum += value and self.visit_count += 1. When we propagate up, we again have to remember that our parent is a different player than we are, so we write value = self.game.get_opponent_value(value), and then we say: if self.parent is not None — which is always true except for the root node, where we have None as a placeholder — we call self.parent.backpropagate(value). So we have this recursive backpropagate method, and that should be all we need for the MCTS.
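The recursive backpropagate method as a sketch:

```python
# in the Node class: push the result up to the root, flipping the sign each level
def backpropagate(self, value):
    self.value_sum += value
    self.visit_count += 1

    value = self.game.get_opponent_value(value)
    if self.parent is not None:
        self.parent.backpropagate(value)
```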
Finally, after all the searches we have backpropagated all of our visit counts and values, and now we actually want to return the distribution of visit counts. To do that we create a new variable action_probs — the probabilities of which actions look most promising — and at the beginning that is just np.zeros with the action size of our game as the shape, so self.game.action_size, which for tic-tac-toe is nine. Then we loop over the children of our root: for child in root.children, we write action_probs[child.action_taken] = child.visit_count. Now we want to turn these counts into probabilities, and we do that by dividing by their sum so that the total becomes one: action_probs /= np.sum(action_probs), and then we can return this. That's looking promising, so let's run it — ah, there is some invalid syntax here, let me just remove that — and now it's working. Let's test this out by creating an MCTS object inside of the test script we used for the game: mcts = MCTS(tictactoe, args), and we still have to define our arguments: for C, which we use in our UCB formula, we can roughly say we want the square root of 2, which is what you might use generally, and for the number of searches we can set 1000. Let's also pass the args into our Monte Carlo tree search. Then, during our game loop, we say that we only pick the action ourselves if we are player one, and in the other case, when the player is negative one, we want to run a Monte Carlo tree search: mcts_probs = mcts.search(state). But remember that we always want to be player one when we do our tree search, so first we write neutral_state = tictactoe.change_perspective(state, player), where the player here is always negative one, so we always change the perspective, and then inside of the MCTS search we use this neutral state we just created. Now, out of these probabilities we want to pick an action, and to keep things easy we just take the action that looks most promising by calling np.argmax on the probs from our MCTS — this returns the child that has been visited the most. Let's run it and see if it works. Okay, we still have a few problems: for the parent we need to use self.parent, here I have a typo and want to say best_child, and here we obviously don't want to use self.children but the children of our root node, so instead we write for child in root.children. Perfect — now we can see that the Monte Carlo tree search moved here into the middle, and now let's play a bit as a player to check this out: we play here, the MCTS has played here, then I play seven, and now the MCTS has played here, and from the rules of tic-tac-toe you know that the MCTS has three negative ones in this row, so it has won. Nice, so now we can see that this MCTS is working.
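As a sketch, the end of the search method and the way the test script then calls it (the surrounding game loop from the earlier test script is assumed):

```python
# end of MCTS.search: turn the root's visit counts into a distribution
action_probs = np.zeros(self.game.action_size)
for child in root.children:
    action_probs[child.action_taken] = child.visit_count
action_probs /= np.sum(action_probs)
return action_probs
```

```python
# using it from the test script: the MCTS always searches as player 1
mcts = MCTS(tictactoe, {'C': 1.41, 'num_searches': 1000})
neutral_state = tictactoe.change_perspective(state, player)
action = np.argmax(mcts.search(neutral_state))
```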
Okay, so now that we have our standalone Monte Carlo tree search implemented, we can start building the neural network, so that we can later use the AlphaZero algorithm to train a model that understands how to play these games. Before we start building, let me briefly talk about the architecture of our neural network — this is just a brief visualization. First of all we have the state that we give as input to our network, and in our case this state is just a board position, for example a board of tic-tac-toe. We actually encode the state so that we get three different planes next to each other, like in this image. We have one plane for all of the fields in which player negative one has played, where those fields are turned into ones and all other fields are zeros; then a plane for all of the empty fields, where again those fields become ones and the others zeros; and lastly a plane for the fields in which player positive one has played, where again those fields are ones and everything else is zero. By having these three planes it is easier for our network to recognize patterns and understand how to play the game. Essentially we encode our board position so that it almost looks like an image afterwards, because instead of RGB planes we have a plane for player positive one, one for the empty fields, and one for player negative one. And because our state looks like an image, we also want to use a neural network architecture that works best on images, which is why we have these convolutional blocks inside of our network. First of all we have this backbone, which is the main part of our network where we do most of the computation, and for the backbone we use these convolutional blocks together with the ResNet architecture, the residual network — this comes from a research paper and turned out to work really well for understanding images. What the ResNet architecture does is add skip connections: when we feed forward through our backbone, we don't just keep updating x by moving forward, but we also store the residual — basically the x value that is passed in before we give it to a conv block — and at the end the output is the sum of the x value before the conv block and the x value after the conv block or blocks. Because of that our model is much more flexible, since it could technically also mask out conv blocks entirely by not changing x at all and just using the residual skip connection. Here is also an image from the paper: the main idea is to store this x as a residual and then have the skip connection later on, so that the output is the output of the conv blocks summed with the initial x. So this is just the ResNet architecture that we use inside of our backbone.
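Just to illustrate the residual idea in isolation before we write the real ResBlock (a toy sketch, not the actual class we build below):

```python
# toy illustration of a skip connection: output = transformed(x) + x
def skip_connection(x, conv_block):
    residual = x            # keep the untouched input
    out = conv_block(x)     # pass x through one or more conv blocks
    return out + residual   # sum the transformed output with the original x
```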
After we have gone through our backbone, we split our model architecture into two parts. First we have this policy head up here, and at the beginning of it we just have a single conv block that takes the output from our backbone as input. After that conv block we flatten the result, and then we have a linear layer, a fully connected layer, between the outputs of the conv block and these neurons. At the end we want nine neurons in the case of tic-tac-toe, because we want one neuron for each potential action, and in tic-tac-toe there are nine possible actions. Normally we would just get logits out of these nine neurons, so just raw outputs, but when we want a readable distribution telling us where to play, we also have to apply the softmax function, which turns the outputs of our nine neurons into a distribution of probabilities, where each probability indicates how promising a certain action is. So this is why we have these nine neurons, and in practice we will call the softmax method later on. The second head is this value head, and here we also have a single conv block — a different one from the policy head's — which also takes the output from our backbone as input. Again we flatten the output of this conv block, add a linear, fully connected layer, and finally we want just one neuron, because remember that this value head should tell us how good the state is and should output a float in the range of negative one to positive one; because of that we only need one neuron. The best way to get this range of negative one to positive one is by applying the tanh activation function onto this last single neuron — here is a visualization of the tanh function; it basically squishes all potential values into the range of negative one to positive one, which is exactly what we want for our value head. So ultimately we have this model that takes the encoded state as input and, on one hand, outputs the policy, this distribution telling us where to play, and on the other hand gives us the value, an estimation of how good the state is. Okay, so now we can actually build this inside of our Jupyter notebook. The first thing is that we want to import PyTorch, since that is the deep learning framework we use. I will also print np.__version__ so we always see that, then we import torch and print torch.__version__ as well. This is my current torch version; for the loss we will use later on we need the multi-target cross-entropy loss, which has been added to PyTorch only recently, so I would recommend getting this version or any newer one if you run into problems. I also have CUDA support because of my NVIDIA GPU, but you could also get the CPU-only PyTorch version, depending on your setup. Additionally to torch we want to import torch.nn as nn, and then torch.nn.functional as F, with a capital F. Now we can create the model, and I will do that in a new cell above our MCTS implementation. I will call this class ResNet, and it inherits from nn.Module. In the init we first have a game, then num_resBlocks — this will just be the length of our backbone inside of our architecture — and then num_hidden, the hidden size for our conv blocks. Inside, first of all we have to call super().__init__(), and now we can define the start block, self.startBlock, which should be an nn.Sequential, and first of all we just have this conv block at the beginning.
This is nn.Conv2d, and the input channel size is just 3, because remember that we have these three planes at the beginning; the output dim should be num_hidden; for the kernel size we set 3 and the padding to 1, because with a kernel size of 3 and a padding of 1 this conv block won't actually change the shape of our game state — it always stays number of rows times number of columns for these "images", or pixels, if you want to look at the state like an image. After our Conv2d we also want a batch norm, nn.BatchNorm2d(num_hidden), with num_hidden as the hidden size — the batch norm will essentially just speed up training for us. Then we also add an nn.ReLU(), and the ReLU just turns all the negative values to zeros, clips them off basically, which also makes training faster. So this is our start block: first the Conv2d, then the BatchNorm2d, then the ReLU. Next we can define self.backBone, which should be an nn.ModuleList, and here we want an array of different ResBlocks inside of our backbone — remember the image from before. In order to do that we first have to create this ResBlock class, so let's write class ResBlock(nn.Module). First we have our init again, taking num_hidden as input, and again we have to init our superclass as well. Now, for each ResBlock we want our first conv block, self.conv1, which is nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1); then our first batch norm, self.bn1 = nn.BatchNorm2d(num_hidden); then our second conv block, self.conv2 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1); and then the second batch norm, self.bn2 = nn.BatchNorm2d(num_hidden). Then let's define the forward method: def forward(self, x). Remember that inside of our ResBlock we want to keep this residual so that we can later have the skip connection, so we create residual = x. Then we update x separately, first by calling self.conv1(x), then self.bn1 on top of that, and then a ReLU, so F.relu — with the capital F from torch.nn.functional. That is our x at this point, and then we feed it through our second conv block, so x = self.bn2(self.conv2(x)). Now we sum this output up with the residual, x += residual, then apply the ReLU again, x = F.relu(x), and return x.
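The ResBlock class as just described:

```python
class ResBlock(nn.Module):
    def __init__(self, num_hidden):
        super().__init__()
        self.conv1 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(num_hidden)
        self.conv2 = nn.Conv2d(num_hidden, num_hidden, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(num_hidden)

    def forward(self, x):
        residual = x                          # keep the input for the skip connection
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        x += residual                         # skip connection
        x = F.relu(x)
        return x
```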
This way we have this nice residual connection, and at the end we just return the sum. Now we can actually create the backbone: we build a ResBlock(num_hidden) for i in range(num_resBlocks), and this gives us our backbone. Next we create the policy head: self.policyHead should again be an nn.Sequential. Inside the sequential we first have a conv block, nn.Conv2d with num_hidden as the input channel size and 32 as the output size, a kernel size of 3 and a padding of 1, then again a batch norm and a ReLU. After this conv block, remember that we want to flatten our results, so we call nn.Flatten(), and then we want the linear layer, nn.Linear: for the input size we take these 32 planes multiplied with the number of rows and the number of columns, so 32 * game.row_count * game.column_count — remember that we use a kernel size of 3 and a padding of 1 in all of our conv blocks, so we never change the spatial shape of our state, which is why the flattened size is just the number of hidden planes, 32, times the row count times the column count. For the output of the linear layer we just want game.action_size. Now we can also define the value head: self.valueHead should be an nn.Sequential as well, and here we first have nn.Conv2d with num_hidden as the input size and just 3 as the output size, a kernel size of 3 and a padding of 1, then a BatchNorm2d with 3 as the hidden size, then again a ReLU on top of it, and then we flatten the results. Then we want the linear layer at the end, again with 3 * game.row_count * game.column_count as the input, and for the output we just want 1, because we only want one neuron here; and then we also want the tanh activation function, so we add nn.Tanh(). So this part is finished, and next we write the forward method for our ResNet class: def forward(self, x). First we send x through our start block, x = self.startBlock(x); then we loop over all of the ResBlocks inside of our backbone, for resBlock in self.backBone: x = resBlock(x); and then we get the policy and the value back: policy = self.policyHead(x), value = self.valueHead(x), and we return the policy and the value.
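Putting the whole ResNet class together as a sketch of what we just wrote:

```python
class ResNet(nn.Module):
    def __init__(self, game, num_resBlocks, num_hidden):
        super().__init__()
        self.startBlock = nn.Sequential(
            nn.Conv2d(3, num_hidden, kernel_size=3, padding=1),
            nn.BatchNorm2d(num_hidden),
            nn.ReLU()
        )
        self.backBone = nn.ModuleList(
            [ResBlock(num_hidden) for i in range(num_resBlocks)]
        )
        self.policyHead = nn.Sequential(
            nn.Conv2d(num_hidden, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * game.row_count * game.column_count, game.action_size)
        )
        self.valueHead = nn.Sequential(
            nn.Conv2d(num_hidden, 3, kernel_size=3, padding=1),
            nn.BatchNorm2d(3),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3 * game.row_count * game.column_count, 1),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.startBlock(x)
        for resBlock in self.backBone:
            x = resBlock(x)
        policy = self.policyHead(x)
        value = self.valueHead(x)
        return policy, value
```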
Let's run that for now, and now we also want to test that it is working, so I will create a new cell. Let's start by creating a game of tic-tac-toe, tictactoe = TicTacToe(), get a state by calling tictactoe.get_initial_state(), and then update that state with tictactoe.get_next_state: let's say we play at position 2, as player positive one in this case, and let's also play at position 7 for player negative one. Now we can print the state, and we get this board. Next, remember that we also have to encode our state when we give it to our model, so we have to write this new method: def get_encoded_state(self, state). Remember that we want these three planes, and we get them by writing encoded_state = np.stack(...), where we stack the fields that are equal to negative one, the empty fields (the fields equal to zero), and the fields that are equal to positive one. This would give us booleans, but we would rather have floats, so we set the type with .astype(np.float32), and then we return this encoded state. Let's test this by writing encoded_state = tictactoe.get_encoded_state(state) and printing it. Now we see the state and its encoding: first the plane for all of the fields in which player negative one has played, just this one field, which is encoded as a one here; then the plane for the empty fields; and then the plane for the fields where player positive one has played, again just this one field.
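The encoding helper as a sketch:

```python
# in the TicTacToe class: one plane per "player" (-1, empty, +1)
def get_encoded_state(self, state):
    encoded_state = np.stack(
        (state == -1, state == 0, state == 1)
    ).astype(np.float32)
    return encoded_state
```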
This is great, and now we want to get the policy and the value. First we have to turn this state into a tensor, so we write tensor_state = torch.tensor(encoded_state). When we give a tensor to our model as input we always need a batch dimension, but here we just have one state and not a whole batch of states, so we have to unsqueeze at the zeroth axis — essentially this just creates a further pair of brackets around the encoded state, and this way we can pass it through our model. Now we can get the policy and value by calling our model, so we need to define it: model = ResNet(tictactoe, 4, 64) — for the game we use tic-tac-toe, for the number of ResBlocks 4, and for num_hidden 64. Then we write policy, value = model(tensor_state), and we want to process the policy and the value. First we get the float out of our value by calling value.item(), since this is just a tensor and we would rather have the plain float. Then we change the policy: policy = torch.softmax(policy, dim=1), where we choose axis 1 because we don't want to apply the softmax on the batch axis but rather on the axis where we have the nine neurons; then we squeeze at axis 0 so that we remove the batch axis, detach it so that we don't carry a grad when we convert to numpy, move it to the CPU, and call .numpy(). Now we have this policy, so let's print both of these things. Nice — we get a value of 0.3 for this state, and this is our policy distribution telling us where to play, just these nine probabilities. You know what, let's also make this a bit more visual: we import matplotlib.pyplot as plt, then write plt.bar(range(tictactoe.action_size), policy) and plt.show(). So this is how our policy looks: for each action we get a bar telling us how promising this action is. Obviously our model was just initialized randomly, so we can't expect too much here — nothing, actually — but when we later have a trained model, we can expect a nice distribution of bars telling us where to play. Before moving on, I'd like to apply one small change: let's set a seed for torch so that we have reproducible results and you can be sure you build something at least similar to what I've built here, so we write torch.manual_seed(0).
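Putting this test cell together as a sketch, continuing from the tictactoe game and the encoded_state defined above:

```python
tensor_state = torch.tensor(encoded_state).unsqueeze(0)  # add the batch axis

model = ResNet(tictactoe, 4, 64)
policy, value = model(tensor_state)
value = value.item()
policy = torch.softmax(policy, dim=1).squeeze(0).detach().cpu().numpy()

print(value, policy)

import matplotlib.pyplot as plt
plt.bar(range(tictactoe.action_size), policy)  # one bar per action
plt.show()
```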
So now we have our model built as well, and next we can start incorporating the neural network into the Monte Carlo tree search, so that we can later use the AlphaZero algorithm to train it. Let's update the tree search: the first thing is that we now want to take a model as input — later on this will just be our ResNet — so let's also declare self.model = model. What we want to do is this: after we have reached a leaf node, we want to predict a value and a policy based on the state of our leaf node, and then we want to use this policy when we expand, so that we can incorporate the probabilities of the policy into the UCB formula. That way, when we choose a node during our selection phase, we are more likely to choose nodes with higher policy values, because those are the ones that seemed more promising to our model, so we want to pick them more often and walk down the tree in the directions where our model guides us. That is the first thing we use our model for; the second is that we want to use the value we get from our model for backpropagation once we have reached the leaf node, so we can actually remove the simulation part completely — we no longer do any rollouts with random actions, but instead just use the value from our model. So let's call it right here: policy, value = self.model(...), and we want to pass in the state of our node, but remember that we want to encode it beforehand so that we have the three planes, like we did before. So first we write self.game.get_encoded_state(node.state), and then we turn this encoded state into a tensor instead of just a numpy array, so that we can give it to our model directly: torch.tensor(self.game.get_encoded_state(node.state)), and then we unsqueeze at the zeroth axis, because we always need an axis for our batch, and currently we aren't batching any states — we just create this "batch" with one state inside. Then we get a policy and a value as result, and remember from the model chapter that the policy is currently just logits, just nine raw floats, and we want to turn it into a distribution of likelihoods. So first policy = torch.softmax(policy, dim=1) — again axis 1 because we want to apply the softmax not on the batch axis but on the axis of our nine neurons, in the case of tic-tac-toe — then we squeeze at the zeroth axis to remove the batch axis again, essentially reverting the unsqueeze we did before feeding the state through the model, then we call .cpu() in case we use a GPU later, and then .numpy(). We should also put the @torch.no_grad() decorator on top of our search method, because we don't want to use this policy and value for training here, just for prediction, so we don't want to store the gradients of these tensors — and this way it is also faster to run our Monte Carlo tree search. Great, so now we have our policy, we have turned it into a numpy array and applied the softmax, but next we also want to mask out all of the illegal moves, because when we expand we don't want to expand in directions where a player has already played, for example.
So we have to mask out our policy at these illegal positions. We get the valid moves by calling self.game.get_valid_moves(node.state), so with the state of our leaf node, and then we multiply our policy with these valid moves. Now all illegal moves have a policy of zero, which is great, but we have to rescale the policy again so that we have percentages, and we do that by dividing it by its own sum, so that the policy sums to one — this gives us the nice distribution we want. Let's also change the value: we just want the float from the single neuron of our value head, and we get it by calling value = value.item() — in PyTorch, calling .item() on a tensor that holds only one value gives you that value as a plain float. So now we have our value and our policy; we will use the value for backpropagation and the policy for expanding. Let's call node.expand(policy) with the policy as input, and let's remove the simulation call, so that we use this value for backpropagation later on whenever we reach a leaf node and expand. There are a few smaller things we can remove as well: first we can completely delete the simulate method, and then there are some more changes. When we call the expand method we now immediately expand in all possible directions, because we have this whole policy — it doesn't make sense anymore to only expand in one direction per visit once we reach a leaf node. This also means that we don't need these expandable_moves anymore, because we get the legal moves from get_valid_moves anyway. We can also change the is_fully_expanded method: now we just check whether the length of our children is larger than zero, because when we expand, we expand in all directions at once, so if we have one child we know there aren't any more children we could add. That should be enough here. Next we have to update the node.expand method itself: first we take the policy as input, and then we loop over it, for action, prob in enumerate(policy), so prob is just the probability at a certain action, and we check for each action whether the probability is larger than zero; we can remove the sampling and the expandable_moves bookkeeping. If the probability is larger than zero, we create the new child just like before and append it to the children. But remember that we want to use this probability inside of our UCB formula later on during selection, so we want to store it inside of our node object. We do this by adding a new prior argument to the node — let's default prior to 0 and set self.prior = prior — so the prior is just the probability that was given when we initiated the child, equal to the policy at the given action from the parent's perspective when the parent expanded. Now that we have defined this prior, we also want to use it when we create a new node, so let's pass prob as input for the prior at the end when we create the child.
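A sketch of the model-guided part of search (the search method is decorated with @torch.no_grad()) and the updated expand method:

```python
# inside MCTS.search, at a selected leaf node
value, is_terminal = self.game.get_value_and_terminated(node.state, node.action_taken)
value = self.game.get_opponent_value(value)

if not is_terminal:
    policy, value = self.model(
        torch.tensor(self.game.get_encoded_state(node.state)).unsqueeze(0)
    )
    policy = torch.softmax(policy, dim=1).squeeze(0).cpu().numpy()
    valid_moves = self.game.get_valid_moves(node.state)
    policy *= valid_moves            # mask out illegal moves
    policy /= np.sum(policy)         # renormalize to probabilities
    value = value.item()
    node.expand(policy)

node.backpropagate(value)
```

```python
# in the Node class: expand in every direction the policy allows, storing the prior
def expand(self, policy):
    for action, prob in enumerate(policy):
        if prob > 0:
            child_state = self.state.copy()
            child_state = self.game.get_next_state(child_state, action, 1)
            child_state = self.game.change_perspective(child_state, player=-1)
            child = Node(self.game, self.args, child_state, self, action, prob)
            self.children.append(child)
```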
Great, so now we have all of this ready. Next we have to update the UCB formula, since AlphaZero uses a different UCB formula than standard Monte Carlo tree search — I have brought up an image of it here. Essentially we remove the math.log and just wrap the square root around the visit count of our parent, we add a one to the visit count of our child, and we multiply everything with the prior of our child, so that we also use the policy when we select our way down the tree. We also have to handle one new case: if we call this UCB formula on a child that has a visit count of zero, we can't calculate a Q-value, so let's just set the Q-value to zero in that case — if child.visit_count == 0 the Q-value is zero, and else it is the same formula as before. This situation is possible now because we no longer backpropagate immediately on a node that has just been created during expansion, so get_ucb can be called on a child that has never been visited. So we change that part, and then we update the formula itself below, like in the image: we remove the math.log, add a one to the visit count of our child, and multiply everything with the prior of our child. This should be working now.
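The updated get_ucb as a sketch — the Q-value branch is kept from the earlier standalone MCTS, which I'm assuming here rescales the mean value into [0, 1] and flips it to the parent's perspective; math is assumed to be imported already:

```python
# in the Node class: AlphaZero's selection formula (a PUCT-style UCB)
def get_ucb(self, child):
    if child.visit_count == 0:
        q_value = 0
    else:
        # mean value of the child, rescaled and seen from the parent's perspective
        q_value = 1 - ((child.value_sum / child.visit_count) + 1) / 2
    return q_value + self.args['C'] * (
        math.sqrt(self.visit_count) / (child.visit_count + 1)
    ) * child.prior
```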
Let's see if we have everything ready and run it. First I will update C to be equal to 2, and then we create a model: model = ResNet(...), where for the game we use tic-tac-toe, for the number of ResBlocks we say 4, and for the number of hidden planes 64. You know what, let's also evaluate our model right here — currently it has just been initialized randomly, so we can't expect too much, but let's try it out. Oh sorry, we are still missing the model as an argument inside of our MCTS, so let's pass that in. Nice, so now we see that with our updated MCTS we played right here; let's say I would like to play on one — the MCTS played here — then I play on this edge here, and the MCTS has played here, so negative one has won the game. So now we have our updated Monte Carlo tree search ready, and next we can actually start to build the main AlphaZero algorithm. To do that, let's create an AlphaZero class, and I will define it just below our updated MCTS — sorry, let's remove that and define it down below here. So we write class AlphaZero, and for the init we'd like to take the model, then an optimizer — this will just be a PyTorch optimizer like Adam that we use for training — then a game and some additional arguments as input. Next we define all of these: self.model = model, self.optimizer = optimizer, self.game = game, and self.args = args, and at the end we also define a Monte Carlo tree search object inside of our AlphaZero class, so self.mcts = MCTS(game, args, model), passing in the game, the args and the model. That is just the standard init of our AlphaZero class. Next we want to define the methods inside, and from the overview you might remember that we have these two components, the self-play part and the training part, and around them we just have this AlphaZero wrapper. So let's create our selfPlay method and just pass for now, and let's also create our train method and pass there as well. Then we can add the learn method, and this will be the main method we call when we actually want to start this cycle of continuous learning, where we run self-play, get data, use this data for training to optimize our model, and then do self-play again with the model that is now much smarter. For this cycle we first want to loop over all of the iterations we have: for iteration in range(self.args['num_iterations']). Then, for each iteration, we create this memory list, essentially the training data for one cycle, so let's define memory here. Then we loop over all of our self-play games: for selfPlay_iteration in range(self.args['num_selfPlay_iterations']), and for each of them we extend the memory with the data we get out of our selfPlay method, so memory += self.selfPlay(). We also want to change the mode of our model so that it is in eval mode during self-play, so that layers like batch norm behave in inference mode there. Then we move on to the training part: first we write self.model.train(), and then we loop over the epochs.
So for epoch in range(self.args['num_epochs']), and for each epoch we call this train method and give the memory as input: self.train(memory) — let's also add memory as an argument of train. At the end of an iteration we want to store the weights of our model, so we write torch.save(self.model.state_dict(), ...), and for the path we can use an f-string like f"model_{iteration}.pt". Let's also save the state dict of our optimizer: torch.save(self.optimizer.state_dict(), f"optimizer_{iteration}.pt").
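A sketch of the AlphaZero class skeleton and the learn method up to this point (selfPlay and train are still stubs here and get filled in next; the range calls are later swapped for tqdm's trange):

```python
class AlphaZero:
    def __init__(self, model, optimizer, game, args):
        self.model = model
        self.optimizer = optimizer
        self.game = game
        self.args = args
        self.mcts = MCTS(game, args, model)

    def selfPlay(self):
        pass  # implemented below

    def train(self, memory):
        pass  # implemented below

    def learn(self):
        for iteration in range(self.args['num_iterations']):
            memory = []

            self.model.eval()   # inference mode during self-play
            for selfPlay_iteration in range(self.args['num_selfPlay_iterations']):
                memory += self.selfPlay()

            self.model.train()  # back to training mode
            for epoch in range(self.args['num_epochs']):
                self.train(memory)

            torch.save(self.model.state_dict(), f"model_{iteration}.pt")
            torch.save(self.optimizer.state_dict(), f"optimizer_{iteration}.pt")
```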
Now that we have our main loop ready, we can focus on this selfPlay method. Inside we again define a new memory list — this one holds the data of just one self-play game — then we define a player, which should be 1 at the start, and we create an initial state by calling self.game.get_initial_state(). Then we open a while True loop, and inside it we first want to run the Monte Carlo tree search based on the state we have and get the action probs as a result, then sample an action from this distribution of action probabilities, and use this action to actually play, which gives us a new state. Once we have this new state, we check whether it is terminal, so whether the game has ended, and if it has, we return all of the data as our memory. The data should be structured in this tuple form, where for each instance we have a given state, the action probabilities of our MCTS, and the final outcome — whether the player who was to move at that instance ended up being the player that won the game, lost it, or drew; that should be encoded in the final outcome. Remember that when we call mcts.search we always want to be player one, so first we need the neutral state, which we get by calling self.game.change_perspective(state, player) with the current state and player. With our neutral state we get the action probs back from the MCTS: action_probs = self.mcts.search(neutral_state). Now we store this information inside of our memory so that we can later use it to build the training data: memory.append((neutral_state, action_probs, player)). Then we sample an action from the action probs with np.random.choice: for the number of different options we use the action size of our game — which in the case of tic-tac-toe is nine — so self.game.action_size, and for the probabilities we use the action_probs we got from our MCTS. Now that we have sampled an action, we play it: state = self.game.get_next_state(state, action, player), giving it the old state, the action and the player. With our updated state we check whether it is terminal: value, is_terminal = self.game.get_value_and_terminated(state, action), using the updated state and the action we played. If it is terminal, we want to return our data: first we write returnMemory = [], a new list, and then we loop over all of the instances inside of our memory — for hist_neutral_state, hist_action_probs, hist_player in memory — and for each instance we determine the final outcome at that instance, because that is what we need to store in our training data. The outcome should be equal to the value if hist_player is also the player we were when we reached this result: if we were player positive one when we played the winning move and the value is positive one, then we want the outcome for all instances in which player positive one was to move to be positive one, and the outcome for all instances in which player negative one was to move to be negative one. So hist_outcome should be value if hist_player == player, and else the value as perceived by the opponent. Then we append this information to our returnMemory — again as a tuple — and we append the neutral state, but remember that we want to encode the state when we give it as input to the model, so we can already encode it here: self.game.get_encoded_state(hist_neutral_state), then the action probs of our MCTS distribution, and then the hist_outcome we just computed. We append this to the returnMemory, not to the original memory, and after we have looped through everything we return the returnMemory. In all other cases, if we haven't reached a terminal node yet, we flip the player around, so that we are player negative one if we were player positive one previously: player = self.game.get_opponent(player). There is one more small thing I'd like to update: let's make the code more general by changing the negative value into the value as perceived by the opponent, depending on the game, so self.game.get_opponent_value(value) — this way we could also run this for games that have only one player.
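The selfPlay method as just described, written out as a sketch:

```python
def selfPlay(self):
    memory = []
    player = 1
    state = self.game.get_initial_state()

    while True:
        # the MCTS always searches from player 1's point of view
        neutral_state = self.game.change_perspective(state, player)
        action_probs = self.mcts.search(neutral_state)

        memory.append((neutral_state, action_probs, player))

        action = np.random.choice(self.game.action_size, p=action_probs)
        state = self.game.get_next_state(state, action, player)

        value, is_terminal = self.game.get_value_and_terminated(state, action)
        if is_terminal:
            returnMemory = []
            for hist_neutral_state, hist_action_probs, hist_player in memory:
                # final outcome as seen by the player who was to move at that instance
                hist_outcome = value if hist_player == player else self.game.get_opponent_value(value)
                returnMemory.append((
                    self.game.get_encoded_state(hist_neutral_state),
                    hist_action_probs,
                    hist_outcome
                ))
            return returnMemory

        player = self.game.get_opponent(player)
```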
The next thing is that we want to visualize these loops, and we can do that with progress bars from the tqdm package: from tqdm.notebook we import trange, and then we just replace the range calls below with trange — only a small difference. Now we want to check that this AlphaZero implementation is working as far as we have built it, so we create an instance of AlphaZero, and for that we need a model, an optimizer and some arguments. First let's create an instance of the game: tictactoe = TicTacToe(). Then we build the model: model = ResNet(...), where for the game we use the tic-tac-toe instance we just created, for the number of ResBlocks we say 4, and for the hidden dim 64. Now we need an optimizer, and we use the Adam optimizer built into PyTorch: optimizer = torch.optim.Adam(model.parameters(), lr=0.001) — the parameters of our model and just a standard learning rate for now. Next we define the arguments as another dictionary: for the exploration constant C we choose 2 again, for the number of searches 60, for the number of iterations on the highest level we choose 3 for now, for the number of self-play iterations — so the number of self-play games we play in each iteration — we could set 500, but you know what, let's lower that to 10 for now so we can quickly see our models being saved, and the number of epochs can just be 4. Then we create the instance: alphaZero = AlphaZero(model, optimizer, tictactoe, args), with the model, the optimizer, the game and the args we just created, and we run alphaZero.learn(). Let's see if this works — nice, we get this good-looking progress bar and can see that we are running the self-play games.
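A sketch of this test setup (batch_size is not in the args yet; it gets added once we implement the train method):

```python
from tqdm.notebook import trange   # used inside learn() for the progress bars
import random                      # needed shortly, for shuffling in train()

tictactoe = TicTacToe()
model = ResNet(tictactoe, 4, 64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

args = {
    'C': 2,
    'num_searches': 60,
    'num_iterations': 3,
    'num_selfPlay_iterations': 10,   # lowered from 500 for a quick sanity check
    'num_epochs': 4
}

alphaZero = AlphaZero(model, optimizer, tictactoe, args)
alphaZero.learn()
```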
Great, so now we can actually implement the train method inside of our AlphaZero class. The first thing we always want to do when we call this method is shuffle our training data so that we don't get the same batches all the time, and we do this by calling random.shuffle(memory) — to use it we have to import random at the top. Next we want to loop over our memory in batches, so that we have a batch index, and for each batch index we can sample a whole batch of different samples and use them for training. We do this by writing for batchIdx in range(...), where we start at zero, end at the length of our memory, and step in the size of self.args['batch_size']. Now we take a sample from our memory by slicing it from batchIdx up to batchIdx + self.args['batch_size'] — but remember that we don't want to index past the end of the memory, so we have to be careful and clamp the upper bound with min(len(memory) - 1, batchIdx + self.args['batch_size']); this way we never exceed the limit. Next we want to get the states, the MCTS probs and the final rewards out of our sample, and we do this by writing state, policy_targets, value_targets = zip(*sample) — policy_targets for the MCTS distributions and value_targets for the final rewards we got from self-play. The way this works is that we basically transpose our sample: currently we have a list of tuples, where each tuple contains a state, policy targets and a value, but by applying this asterisk and then zipping we get a list of states, a list of policy targets and a list of value targets. It is much more helpful to have all of these as numpy arrays straight away, so we set state, policy_targets and value_targets to np.array(state), np.array(policy_targets) and np.array(value_targets). For the value targets, remember that currently we just have a flat numpy array with all of our float values directly inside, but it is much more helpful if each value sits in its own sub-array later on, when we compare it to the output of our model, so we call .reshape(-1, 1): -1 for the batch axis and 1 so that each value ends up in its own sub-array. So this is great, and now we can turn all three of these things into tensors. First state = torch.tensor(state, dtype=torch.float32) — the encoded state might not be float to begin with, so it's good to set the dtype here — and the policy targets we also want to turn into a tensor with dtype float32.
So policy_targets = torch.tensor(policy_targets, dtype=torch.float32), and then we have our value targets, which we also turn into a tensor, with value_targets as the input and again dtype torch.float32. Great, so now all of our tensors are ready, and the next step is to get the out policy and the out value from our model by letting it predict on the states we have here: out_policy, out_value = self.model(state). Now we want a loss for the policy and a loss for the value. First we define the policy loss — you might remember that this is the multi-target cross-entropy loss — so we write policy_loss = F.cross_entropy(out_policy, policy_targets). Next we have our value loss, and here we use the mean squared error: value_loss = F.mse_loss(out_value, value_targets). Then we take the sum of both of these losses to get just one loss: loss = policy_loss + value_loss. Now we want to minimize this loss by backpropagating: first optimizer.zero_grad(), then loss.backward(), and then optimizer.step(). This way PyTorch does all of the backpropagation for us and actually optimizes the model, so we get a better model that we can then use during self-play.
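The full train method as a sketch of what we just wrote (random and F are the imports from earlier):

```python
def train(self, memory):
    random.shuffle(memory)
    for batchIdx in range(0, len(memory), self.args['batch_size']):
        # clamp the upper slice bound so we never index past the memory
        sample = memory[batchIdx:min(len(memory) - 1, batchIdx + self.args['batch_size'])]
        state, policy_targets, value_targets = zip(*sample)

        state = np.array(state)
        policy_targets = np.array(policy_targets)
        value_targets = np.array(value_targets).reshape(-1, 1)

        state = torch.tensor(state, dtype=torch.float32)
        policy_targets = torch.tensor(policy_targets, dtype=torch.float32)
        value_targets = torch.tensor(value_targets, dtype=torch.float32)

        out_policy, out_value = self.model(state)

        policy_loss = F.cross_entropy(out_policy, policy_targets)  # multi-target cross entropy
        value_loss = F.mse_loss(out_value, value_targets)
        loss = policy_loss + value_loss

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
```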
First of all you can see that we have copied the board state from the image, encoded it, and obtained this distribution of where to play. The neural network now correctly tells us that we should play on position 7, because we would win there, and for the value we get a result close to +1, because as player +1 we simply win by playing there, so the network has a high estimation of this position. This is even better than the distribution we had before, and remember: this is just what the neural network, a plain mathematical function, puts out; we never ran a Monte Carlo tree search or anything else to get this result. So we can clearly see that the neural network has internalized the Monte Carlo tree search by playing with itself, and now we can simply call it with a given state and get this neat distribution of where to play, plus a float telling us whether the given state is a good one for us as a player. Great, so now we can apply some tweaks to our AlphaZero algorithm. The first tweak is that we also want to be able to use the GPU as a device, since that can be much faster for training depending on your setup. We add GPU support by declaring a device: device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu'), i.e. an NVIDIA GPU if CUDA is available and the CPU otherwise, as it was initially. We then pass this device to our model as an argument, which means editing the ResNet class: we add device as a parameter, store it as self.device, and move the model to the device it was given by calling self.to(device), so that the model lives on the GPU if, for example, we have CUDA support on our machine. That's the first part; now we also want to use this device during self-play and during training. During self-play we call the model inside our MCTS search when we compute the policy and the value, so the tensor we feed in has to live on the declared device; we do that by passing device=self.model.device when we create it. During training we likewise move the data we feed into the model to its device with device=self.model.device, and since we compare the policy targets and value targets to the outputs of our model, we have to move those to the model's device as well, in both places below. With that we have GPU support, so that was fast, and there are several more tweaks: the next one is weight decay, since the AlphaZero loss includes an L2 regularization term, as you can see on this image.
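A minimal sketch of the device handling and the weight-decay setting; TinyNet here is just a stand-in for the ResNet from earlier in the tutorial (its real constructor takes the game and architecture sizes as well), and the learning rate is an assumption:

```python
import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class TinyNet(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = device
        self.fc = torch.nn.Linear(27, 9)
        self.to(device)  # move the parameters to the chosen device

model = TinyNet(device)

# Inputs are created directly on the model's device before the forward pass.
x = torch.rand(1, 27, dtype=torch.float32, device=model.device)
out = model(x)

# weight_decay corresponds to the L2 regularization term in the AlphaZero loss.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
```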
We add weight_decay to the optimizer and set it to 0.001, which should be fine, so now we have weight decay and GPU support added to AlphaZero. Next we also want to add a so-called temperature. Remember that when we sample an action from the action probabilities, we currently just use the visit count distribution of the children of our root node as the probabilities for sampling. We would like to be more flexible: sometimes we want to explore more and pick children or actions that haven't been visited very often, and sometimes we want to exploit more and tend to pick the actions that currently look the most promising. We add this flexibility through temperature-adjusted action probabilities: we tweak the visit count distribution and then use the tweaked distribution as the probabilities when sampling an action. So we write temperature_action_probs, which we get by raising the old action_probs to the power of 1 / self.args['temperature'], and we define temperature in our args, setting it to 1.25 in this case. What does this do? For a temperature larger than one the exponent 1 / temperature becomes smaller, which squishes all of the probabilities closer together, so we explore more; as the temperature goes toward infinity it becomes as if we sampled actions completely at random. On the other hand, if the temperature moves down toward zero (close to zero, not exactly zero), the exponent grows, and it becomes as if we simply took the argmax of the distribution, i.e. pure exploitation rather than exploration. So the temperature gives us the flexibility to decide whether we would rather take more random actions and explore, or exploit and take the actions that look most promising at the current time.
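A small sketch of the temperature adjustment; action_probs is just an example visit-count distribution and the args key name is assumed (the renormalization is added so the probabilities stay valid for sampling):

```python
import numpy as np

args = {"temperature": 1.25}

# Example visit-count distribution over the root's children.
action_probs = np.array([0.1, 0.6, 0.3])

# temperature > 1 flattens the distribution (more exploration),
# temperature -> 0 sharpens it toward the argmax (more exploitation).
temperature_action_probs = action_probs ** (1 / args["temperature"])
temperature_action_probs /= np.sum(temperature_action_probs)

action = np.random.choice(len(temperature_action_probs), p=temperature_action_probs)
print(temperature_action_probs, action)
```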
This is great, and now we come to the final tweak we want to add to AlphaZero: we also apply some noise to the policy of the root node at the beginning of a Monte Carlo tree search. When we give our root node a policy, we add some noise to it so that we explore more and move in directions we maybe haven't checked before, making sure we don't miss any potentially promising actions. To add this noise to the root node we first move these statements up to the top of the search, so that we compute the policy for the root right at the beginning, separately from the search steps: we get a policy and a value by calling self.model on torch.tensor(self.game.get_encoded_state(state)), where the state is just the root state, and for the device we again use the device of our model, self.model.device, so that we can use the GPU; then we unsqueeze at the end so that we have an axis for our batch. Now we have this policy and this value, and since we don't need the value at the top, let's assign it to a throwaway variable so the code is more readable. Next we process the policy as before: policy = torch.softmax(policy, axis=1), using axis 1 so we don't apply the softmax over the batch axis, then squeeze, move the tensor to the CPU, and take the NumPy array out of it. Then we get the valid moves with valid_moves = self.game.get_valid_moves(state), multiply the policy with them so the illegal moves are masked out, and divide the policy by its own sum so that it is a probability distribution again. Now we actually add the noise to this policy so that we explore more during the Monte Carlo tree search, and the way we add it is shown on this image: the updated root policy is (1 − ε) · policy + ε · η, where ε is just a float smaller than one and η is Dirichlet random noise, which we get with np.random.dirichlet. Multiplying the old policy by the coefficient (1 − ε) lowers all of the policy values, and then we add the noise scaled by the opposite coefficient ε, so we change the policy a bit to incorporate some randomness and explore more, rather than always walking down the tree exactly the way our model tells us, because at the beginning the model doesn't know much about the game. So let's add it: policy = (1 - self.args['dirichlet_epsilon']) * policy + self.args['dirichlet_epsilon'] * np.random.dirichlet([self.args['dirichlet_alpha']] * self.game.action_size). The alpha value changes the shape of the Dirichlet distribution and depends on the number of different actions the game has; in the case of tic-tac-toe a value of roughly 0.3 turns out to work well, but it may change depending on your environment. Let's also move that expression onto the next line so it's more readable.
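A sketch of the root-noise computation under the args key names assumed above (the masking comes after the noise, as described in the next step):

```python
import numpy as np

args = {"dirichlet_epsilon": 0.25, "dirichlet_alpha": 0.3}
action_size = 9  # tic-tac-toe

# Root policy after the softmax (example values).
policy = np.array([0.05, 0.10, 0.05, 0.20, 0.30, 0.10, 0.05, 0.10, 0.05])

# Mix the network policy with Dirichlet noise: (1 - eps) * p + eps * Dir(alpha).
noise = np.random.dirichlet([args["dirichlet_alpha"]] * action_size)
policy = (1 - args["dirichlet_epsilon"]) * policy + args["dirichlet_epsilon"] * noise

# Mask out illegal moves afterwards and renormalize.
valid_moves = np.array([1, 1, 0, 1, 0, 1, 1, 1, 1])
policy *= valid_moves
policy /= np.sum(policy)
print(policy)
```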
We also have to move the valid-move masking and the renormalization below the noise addition, so that we mask out the moves after applying the random noise, not before; let's just copy those lines down there. So now we have added this noise to our policy, and we can use the resulting policy to expand our root node right away: root.expand(policy). There is one more minor change: currently, when we expand the root node at the beginning, we don't backpropagate immediately, which means the root will have children but its visit count will still be zero. When we then select, the UCB formula is called with the visit count of the parent being zero, and since the exploration term is multiplied by it, it vanishes, so we don't use our prior the first time we select a child. To fix this we simply set the visit count of the root node to 1 at the beginning, which means the Node class gets a visit_count argument that defaults to 0, while the root is created with visit_count=1, so that the prior information is used immediately when selecting a child at the start of the Monte Carlo tree search. Okay, great, so now we can run all of these cells again and see whether it works; we also have to add dirichlet_epsilon to the args, equal to 0.25, and dirichlet_alpha, which I set to 0.3 like I said. Testing it, we still have a CPU device mismatch somewhere; the error is just that I hadn't run this cell up here, so we add a device argument there too (let's just say torch.device('cpu') for that cell) and run again. Perfect, now it's running. For the current setup with tic-tac-toe it won't be that much faster, but later on, when we use more complex games, GPU support will be much nicer, and these other tweaks should make our model more flexible and make it easier to reliably reach a strong model at the end. So I'll just train this again; feel free to train it too, but you don't have to, since I've uploaded all of the weights and you can find them via the link in the video description. Then we'll briefly check the result and expand our agent so that it can also play Connect Four in addition to tic-tac-toe, so see you soon. Okay, great, now I've trained everything, and we want to check that our model has learned how to play tic-tac-toe, so we move back up to the evaluation cell. Let's also define a device there: device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu'), use that device when we define our model, and also use it for the tensor of the stacked encoded state that we feed to the model to predict the policy and the value.
We also have to use the device when we load the state dict from the file, by passing map_location=device. Let's run this again: we get the nice distribution and value again, so we can see that our model has learned. By the way, we aren't even masking out the policy here, and the model has learned by itself that the occupied fields aren't worth playing, because you can't play where someone else has already played, without us masking the illegal moves out at all, which is nice. Now we can move on and train our model, AlphaZero, on the game of Connect Four. So let's define a class for the game of Connect Four, but before we do that, let's add a representation method to our TicTacToe game: def __repr__(self): return 'TicTacToe'. That string is the result we get when the object is used where a string representation is needed, which means we can use it down below when we save our model at each iteration during learning: we add the game to the filename, so both the model and the optimizer state dicts get the name of the game, and we won't overwrite our models when we train on Connect Four. Now that this is done, let's copy the TicTacToe class over and use it to define the game of Connect Four. For Connect Four the row count is 6, the column count is 7, and the action size is just the column count; let's also add a variable for the number of stones you need in a row to win, self.in_a_row = 4, and for the representation we change the string to 'ConnectFour'. The initial state can stay the same, since we just want a blank array with row_count rows and column_count columns. Next we have to change the get_next_state method, because here an action is just a column telling us where to drop a stone. The first step is to get a row back: we look at the given column, find all of the fields inside that column that are equal to zero, i.e. currently empty, and take the deepest empty field, because that is where the stone would land in Connect Four. We get it with np.max, since in a NumPy array row index 0 is at the top and indices grow as you move down the rows, so we take row = np.max(np.where(state[:, column] == 0)): np.where gives us the indices of all empty fields in the column, and np.max gives us the highest index, i.e. the lowest empty field, where the stone falls until it lands on an already non-empty field. Now we have a row, and we already have the column, so we can write state[row, column] = player, using the player that was given as input.
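A sketch of the ConnectFour class as described so far, with the column/action naming fix that comes up a bit later already applied (method names mirror the TicTacToe class from earlier in the tutorial):

```python
import numpy as np

class ConnectFour:
    def __init__(self):
        self.row_count = 6
        self.column_count = 7
        self.action_size = self.column_count
        self.in_a_row = 4  # stones needed in a row to win

    def __repr__(self):
        return "ConnectFour"

    def get_initial_state(self):
        # Blank board: 6 rows x 7 columns of zeros.
        return np.zeros((self.row_count, self.column_count))

    def get_next_state(self, state, action, player):
        # The action is a column; the stone falls to the deepest empty field.
        row = np.max(np.where(state[:, action] == 0))
        state[row, action] = player
        return state
```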
That's great, and next we move on to the get_valid_moves method. For Connect Four we only need to look at the topmost row: the valid moves are the columns whose field in row 0 is still empty, so we return (state[0] == 0), which is 1 where the move is valid and 0 where the column is already full, i.e. where the action would be illegal. Now we want to check for a win, and frankly I'm just going to copy this method over, because we do the same kind of checking as in tic-tac-toe: based on the action that was given as input, we check for a vertical win, a horizontal win, and the two diagonals. Compared to tic-tac-toe it is just a bit more complex, because we actually have to walk in both directions from the placed stone and keep counting while the fields in those directions contain our own player's sign. The get_value_and_terminated method can stay the same for both games, and get_opponent, get_opponent_value, change_perspective and get_encoded_state stay the same as well. I also noticed a small mistake: in get_next_state we can't reference column, because we named the parameter action, so let's just replace column with action. Now we can check that this actually works in practice. To do that we make our test cell more general: we replace the hard-coded TicTacToe with a game variable, use game everywhere in the cell, remove the remaining TicTacToe references, and then set the game to ConnectFour. Let's lower the number of searches to something small, say 100, so we can quickly validate, and change the size of our model to 9 residual blocks and a hidden dimension of 128, so that we have a larger model than the one we later want to train on Connect Four actually needs to be smaller than. We are also missing the device argument, so we add torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') and pass the device to our model class, then run again. Perfect, now we have the Connect Four board. Let's say we want to play at position 4... sorry, there's still a problem: dirichlet_epsilon isn't set here either, so let's copy the argument values over, but set dirichlet_epsilon to 0 so that we don't use noise here. Okay: I've played here, the model has played there; I play 1, the model plays 0 again; I play 2, the model plays 0 again; and then I can just play 3 and I have won. So at least we've checked that the game of Connect
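To round out the class from the sketch above, here is one possible version of get_valid_moves and of the "walk in both directions and count" win check described earlier in this section. It is a plausible reconstruction under the same attribute names, not necessarily the exact code copied over in the video:

```python
import numpy as np

# Both functions are meant as methods of the ConnectFour class sketched above.
def get_valid_moves(self, state):
    # A column is playable while its top field is still empty.
    return (state[0] == 0).astype(np.uint8)

def check_win(self, state, action):
    if action is None:
        return False
    # The most recently placed stone is the topmost filled field in the column.
    row = np.min(np.where(state[:, action] != 0))
    player = state[row, action]

    def count_direction(dr, dc):
        # Count consecutive stones of `player`, walking away from (row, action).
        r, c, count = row + dr, action + dc, 0
        while (0 <= r < self.row_count and 0 <= c < self.column_count
               and state[r, c] == player):
            count += 1
            r += dr
            c += dc
        return count

    for dr, dc in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        # Stones in both directions plus the one just placed.
        if count_direction(dr, dc) + count_direction(-dr, -dc) + 1 >= self.in_a_row:
            return True
    return False
```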
Four is working, but obviously our model is still terrible, because we don't do many searches here and we haven't trained anything yet. So let's change that and train a model for the game of Connect Four, then test the results again. We only need some minor changes: first we make the training cell more generic as well, writing game and setting it to ConnectFour, passing game instead of the hard-coded TicTacToe to the ResNet, and, because Connect Four is harder to learn, changing the number of residual blocks to 9 and the hidden dimension to 128. We also pass game to the AlphaZero initialization, and we slightly tweak the hyperparameters: num_searches goes up to 600, num_iterations to something like 8 for now, and the batch size up to 128. Then we can just run that. This will take a lot longer, but we will increase the efficiency later on, so feel free to just use the weights I've uploaded in the description below instead of training it all yourself, since on my machine it would take a couple of hours; see you soon. Okay, now I've trained AlphaZero on Connect Four using these arguments, and frankly I only trained it for one iteration, since even that one iteration took several hours. So next we want to increase the efficiency, so that training is much faster even for these rather complex environments, but it was still worth training for this one iteration, because now we can evaluate the model and find out whether it has at least learned something. Let's move to the evaluation cell: when we create the model there, we also want to load the state dict from the weights we saved after that one training iteration, so in the line where we create our model we write model.load_state_dict(torch.load('model_0_ConnectFour.pt', map_location=device)), using iteration 0 since we start counting at zero and only trained for one iteration, ConnectFour for the game, and .pt as the file ending. The map_location means that if your device at evaluation time differs from the one used during training, you can still load the state dict, which is nice. Let's run the cell again: I'll play at 6; the model plays in the middle; I play 6 again and the model defends the position; I play somewhere else and the model keeps building up a tower; I play 0 again, and the model finishes its tower and has won the game. So our model still isn't playing perfectly, and obviously I just played in the corners to probe its basic intuition about the game, but we can see it is already much better than the initial model, which only ever wanted to play in a single column.
Right, so that's great, and now we also want to increase the efficiency of our AlphaZero implementation. The main part is now completely done and you should have an intuitive understanding of how the algorithm works, but like I said, it would be nice to increase the training speed, and especially the speed at which self-play runs. How can we do that? We want to parallelize as much of our AlphaZero implementation as possible, because the neural network is built in a way that lets us batch up our states and get parallel predictions for the policy and the value, especially while running the self-play loop. One way you could parallelize this implementation is with a package like Ray, which would create threads that are independent from each other and use them to harness all of your GPU power, but I'd rather build this parallelized version in a more Pythonic way, right inside our Jupyter notebook. So the plan is to batch up all of the different states, get policy and value predictions for them in one call, and then distribute the results back to the individual self-play games, so that we play several self-play games in parallel; this way we can drastically increase the speed of our AlphaZero implementation. The first thing we do is add an argument num_parallel_games to our args dictionary, which I'll set to 100 for now. The idea is that we play all of these self-play games in parallel, and whenever we reference our model inside the MCTS search we batch up all of their states, which drastically reduces the number of times we call the model in the first place: on the one hand we fully utilize our GPU capacity, and on the other hand we simply call the model far less often, so we get a higher speed. To implement this, we first copy the AlphaZero class over into a new class which I'll call AlphaZeroParallel, and we also copy the Monte Carlo tree search class over and call it MCTSParallel; in these we will change the implementation of the selfPlay method and, because of that, the implementation of the search method. We also want a new class that stores the information of a given self-play game, so that we can move that information out of the loop when we iterate over all of the games. Let's create a class I'll just call SPG, for self-play game. Its __init__ takes a game (tic-tac-toe or Connect Four, for example); it stores a state, which initially is just the initial state of the game, so self.state = game.get_initial_state(); it has a memory, which is an empty list at the beginning; it stores a root, which is None at the beginning; and it also stores a node, which is likewise None at the beginning.
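A sketch of that small container class as described:

```python
class SPG:
    """Holds the per-game information for one of the parallel self-play games."""

    def __init__(self, game):
        self.state = game.get_initial_state()
        self.memory = []   # (state, action_probs, player) tuples for this game
        self.root = None   # root node of the current MCTS search
        self.node = None   # node selected for expansion in the current iteration
```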
That's everything we need for this class, and this way we can later store the per-game information there. Okay, so now we want to update the selfPlay method. First of all we call it less often, since every call will now play num_parallel_games potential games in parallel, so when we loop over our num_selfPlay_iterations we divide by self.args['num_parallel_games'] (and when we create the SPG instances I had forgotten the brackets for the constructor call, so let's add those). Inside the method we can remove the single state variable, since the state now lives in the self-play game objects, and we rename the memory to return_memory, because this is no longer the working memory in which we store the state, action probabilities and player for the current game, but rather the list that we return at the end. The player can stay exactly the same, because even though we play games in parallel, we always flip the player only after we have looped over all of the games, so all of the parallel games are always at the same player: every game starts at player 1 and we flip for all games together. Next we store our self-play games: spGames is a list of SPG objects, one per parallel game, [SPG(self.game) for spg in range(self.args['num_parallel_games'])], each getting the game from our AlphaZero class. Then we change the while True loop, because that would stop after one game has finished; instead we keep running the self-play method until all of our self-play games are finished. We implement it so that a finished game is removed from the spGames list, so the condition becomes while len(spGames) > 0, i.e. we check whether there are any self-play games left in the list. The next step inside the while loop is to collect the states of all self-play games: first as a list, [spg.state for spg in spGames], and then we turn this list of NumPy arrays into one large NumPy array by calling np.stack on it. Now we have all of our states, and next we want the neutral states, so we rename accordingly, and when we change the perspective we simply pass states instead of a single state.
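A sketch of roughly how the start of the parallel selfPlay loop looks under these changes (method placement and attribute names such as self.mcts are assumptions based on the walkthrough):

```python
import numpy as np

def selfPlay(self):
    return_memory = []
    player = 1
    # One SPG container per parallel game.
    spGames = [SPG(self.game) for _ in range(self.args["num_parallel_games"])]

    while len(spGames) > 0:
        # Batch all current board states into one array: (num_games, rows, cols).
        states = np.stack([spg.state for spg in spGames])

        # The perspective change is vectorized, so one call handles every game.
        neutral_states = self.game.change_perspective(states, player)

        # The parallel MCTS fills each spg.root with visit statistics in place.
        self.mcts.search(neutral_states, spGames)
        ...
```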
Changing the perspective for all of our states in one function call is exactly where we gain efficiency, since the perspective change works the same way for several states: we just multiply the values by -1 whenever the player is -1, so the same code handles the whole batch. Now we pass these neutral states to our Monte Carlo tree search, and with that this part is basically finished, so let's change the MCTS search method itself. First of all it now takes states as input, the neutral states we just created. Let's also change the order and move the policy and value computation up to the top, so we can work on it first. As I said, we can run the model on all of the batched states at once, so we get several policies and values back in one call, which is great, but we have to keep two things in mind. First, we no longer need to call unsqueeze, because we already have a batch, so we don't need to create a fake batch axis. Second, we have to update the get_encoded_state method, because the way it is currently implemented it only works if we pass it one state at a time: it creates three planes, one for the fields where player -1 has played, one for the empty fields, and one for the fields where player +1 has played, and inside each plane it stores the board fields. With a batch we want to change the order: the first axis should hold the different states, and then for each state we want the three planes, so the planes should come after the batch axis rather than at the very beginning. In other words, we have to swap the axes of the encoded state so that it also works when we encode several states rather than just one. First we can check whether we have several states, by checking whether len(state.shape) == 3: for a single state the length would be 2 (rows and columns), and if there is also a batch axis it is 3 (batch, rows, columns). If that is the case, we build a new encoded state with np.swapaxes(encoded_state, 0, 1), swapping axes 0 and 1, so that the new shape is (batch_size, 3, row_count, column_count) instead of (3, batch_size, row_count, column_count). Basically we just swap those two axes. We copy this over to the game of tic-tac-toe as well; and sorry, I had a small typo in np.swapaxes, that was just my mistake.
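A sketch of the batched get_encoded_state logic (shown standalone here; in the tutorial it is a method on the game classes):

```python
import numpy as np

def get_encoded_state(state):
    # Three planes: fields of player -1, empty fields, fields of player +1.
    encoded_state = np.stack(
        (state == -1, state == 0, state == 1)
    ).astype(np.float32)

    # With a batch of states the stack above has shape (3, batch, rows, cols),
    # so move the batch axis to the front: (batch, 3, rows, cols).
    if len(state.shape) == 3:
        encoded_state = np.swapaxes(encoded_state, 0, 1)

    return encoded_state

# Single state -> (3, rows, cols); batch of states -> (batch, 3, rows, cols).
print(get_encoded_state(np.zeros((3, 3))).shape)
print(get_encoded_state(np.zeros((10, 3, 3))).shape)
```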
Let's update this for both of our games, and this should work now. Okay, so now we keep updating the MCTS search method. First of all we now get the policies and the values back batched right at the beginning; we don't need the values at the root, so we just keep the policies, and we rename state to states. When we keep processing the policies, applying the softmax this way is fine, but again we don't squeeze, since we want to keep the batch axis as it is, and when we add the noise to the policy we have to make sure that the shape of our noise matches the shape of the policy, so we add a size argument to np.random.dirichlet and set it to policy.shape[0]; this way we create separate noise for each state, i.e. for each self-play game. So now the processed policy is ready, and the next step is to allocate it to the individual self-play games, which means the MCTS search also needs the self-play games as an argument, so we add spGames to the signature and pass them in at the call site. Then we loop over them: for i, spg in enumerate(spGames). Inside each iteration we first get the valid moves from the state of that game, which is states[i], since the index from enumerate is the same index at which we stacked the states. Then we don't multiply the whole policy with the valid moves but only the policy at our position, so we take an spg_policy equal to policy[i], multiply this spg_policy with the valid moves, and divide it by its own sum so that we end up with a distribution of percentages again. Then we create the root for this game, passing states[i] as the state, and we assign it to spg.root, because we want to store it inside the SPG class. Now we have spg.root and we can expand it with the self-play game policy: spg.root.expand(spg_policy). With this alone we have already saved a lot of model calls, because we batch up all of the states here. Next we want to do the same inside the actual search iterations: inside the num_searches loop we again loop over our self-play games, for i, spg in enumerate(spGames), and for each self-play game we first get the node, starting from spg.root, and then continue with the selection as before.
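A sketch of the start of MCTSParallel.search as described up to this point; the class scaffolding and the Node constructor keyword (visit_count=1 for the root, from the earlier tweak) are assumptions based on the walkthrough:

```python
import numpy as np
import torch

@torch.no_grad()
def search(self, states, spGames):
    # One batched model call for the root policy of every parallel game.
    policy, _ = self.model(
        torch.tensor(self.game.get_encoded_state(states),
                     dtype=torch.float32, device=self.model.device)
    )
    policy = torch.softmax(policy, dim=1).cpu().numpy()

    # Dirichlet noise per game: size=policy.shape[0] gives one noise row per state.
    policy = (1 - self.args["dirichlet_epsilon"]) * policy \
        + self.args["dirichlet_epsilon"] * np.random.dirichlet(
            [self.args["dirichlet_alpha"]] * self.game.action_size,
            size=policy.shape[0])

    for i, spg in enumerate(spGames):
        spg_policy = policy[i]
        valid_moves = self.game.get_valid_moves(states[i])
        spg_policy *= valid_moves
        spg_policy /= np.sum(spg_policy)

        # Each game gets its own root, stored on the SPG object and expanded once.
        spg.root = Node(self.game, self.args, states[i], visit_count=1)
        spg.root.expand(spg_policy)
```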
At the point where we would normally call our model, we want to parallelize that model call as well. The way we do this is by turning the if statement around: instead of checking that we haven't reached a terminal node, we check whether we have in fact reached a terminal node, and in that case we backpropagate directly, so I just copy that part over. In all other cases we want to store the node we have reached in this search inside the SPG class, so in the else branch we write spg.node = node. To make sure we can later differentiate the self-play games in which we reached a terminal node from the ones we still want to expand, we always start with the node being None, so we also set spg.node = None for each game beforehand; because of that we can move that assignment out of the if/else, and I believe we don't even need the enumerate in this loop any more, so let's remove it to make it a bit easier to read. Now all of our expandable nodes are stored inside the SPG classes, and first we want to find out which self-play games are actually expandable. So we build expandable_spGames, a list of the mapping indices, i.e. the indices we can later use to allocate the policy and value back to our list of self-play games: we want the index of every self-play game whose node is expandable, which is every game whose node is not None, so [mappingIdx for mappingIdx in range(len(spGames)) if spGames[mappingIdx].node is not None]. This gives us the indices of all of our expandable self-play games. Next we check whether there are any expandable games at all, if len(expandable_spGames) > 0, and in that case we again stack up all of their states and encode them, so that we can then get the batched policy and value: states = np.stack([spGames[mappingIdx].node.state for mappingIdx in expandable_spGames]); sorry, I first wrote this with spg in spGames, but we have to index through the mapping indices of the expandable games. So we get the list of node states for all self-play games that we can expand, and then we encode them the same way as before, so I copy that part over: we get the policy and the value back from the model, feeding in these stacked states, and again we don't have to unsqueeze since we already have a batch,
and we don't squeeze the output either, because we want to keep the batch axis as it is. We also don't add any noise here, since that only happens at the root. Now we loop over our expandable self-play games again; we don't really need an extra if statement around this part, it wouldn't change much. We want two indices: the index i, with which we read the policy at the given position, and the mapping index, with which we allocate the policy at index i back to the self-play game at index mappingIdx. This is necessary because the enumeration index is not aligned with the index into spGames: we only looked at a selected subset of the self-play games. So we write for i, mappingIdx in enumerate(expandable_spGames), and inside we get the valid moves and the policy and then expand and backpropagate (I'll move those calls down here). First we take the spg_policy, which is the policy at position i, and the spg_value, the value at position i. Then we need the valid moves, which we get from the state of the node of the given self-play game, using the mapping index: the node is spGames[mappingIdx].node, and let's put that into a variable, node = spGames[mappingIdx].node, since I think that's nicer, and we use node.state when we get the valid moves. Then we multiply the spg_policy with the valid moves of that node and again divide it by its own sum. For the value, I don't think we need .item() any more, since this is already a NumPy array. Now we expand our node using this spg_policy, and then we backpropagate, copying that call over from above and using spg_value. With this implementation the Monte Carlo tree search code works for the whole batch of parallel games. At the end of the search we used to compute the action probabilities and return them, but I don't think that is helpful inside the MCTSParallel class, because it is awkward to parallelize that step here; instead we cut it out and work with the action probabilities inside our selfPlay method, so I paste it over there. Because of that the search no longer returns anything: it just mutates the self-play games we pass in. So that's great; let's run this cell to make sure we don't have any immediate errors. Okay, torch is not defined, but that's just because I haven't run the cells above, so let me quickly do that and check again. Okay, this is working for now.
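A sketch of the expansion step inside the search loop, following the indexing scheme just described. It is factored into a hypothetical helper purely for readability; in the video this code sits inline inside the num_searches loop, and the surrounding MCTSParallel class and Node API are assumed from earlier:

```python
import numpy as np
import torch

def _expand_and_backpropagate(self, spGames):
    # Hypothetical helper: games whose selected node was not terminal still
    # need an expansion; terminal ones have already been backpropagated.
    expandable_spGames = [mappingIdx for mappingIdx in range(len(spGames))
                          if spGames[mappingIdx].node is not None]

    if len(expandable_spGames) > 0:
        # Batch the states of every node that still needs an expansion.
        states = np.stack([spGames[mappingIdx].node.state
                           for mappingIdx in expandable_spGames])
        policy, value = self.model(
            torch.tensor(self.game.get_encoded_state(states),
                         dtype=torch.float32, device=self.model.device)
        )
        policy = torch.softmax(policy, dim=1).cpu().numpy()
        value = value.cpu().numpy()

    for i, mappingIdx in enumerate(expandable_spGames):
        node = spGames[mappingIdx].node
        spg_policy, spg_value = policy[i], value[i]

        # Mask out illegal moves for this node and renormalize.
        spg_policy *= self.game.get_valid_moves(node.state)
        spg_policy /= np.sum(spg_policy)

        node.expand(spg_policy)
        node.backpropagate(spg_value)
```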
Now we want to update the part below, and then we're finished with this parallel implementation. After the parallelized MCTS search we again loop over all of our self-play games, so we write for i in range(len(spGames)), and remember that we want to remove a self-play game from the list once it is terminal. It's rather bad to remove an item from a list while iterating over it from zero to its length, because once we mutate the list inside the loop the index no longer lines up with the self-play game we actually mean; we fix this by flipping the range around and iterating backwards. So we loop over this flipped range of our self-play games, and then we move the per-move logic in here. The first thing we do is get the self-play game at the given position, spg = spGames[i]. Then, when we compute the action probabilities, we don't loop over the root's children directly any more but over spg.root.children. We don't append these action_probs to a general memory but to the memory of this particular game, spg.memory, and for the neutral state we can reference either the neutral state or just the state of the root node, spg.root.state. Then we have the action_probs, the player, the temperature action probs and the sampled action, all as before, so that is working fine. Next we update the state, and here we again use spg.state, so that we update the state stored inside the SPG class: spg.state = self.game.get_next_state(spg.state, action, player). Then we get value, is_terminated from self.game.get_value_and_terminated, again on spg.state, and check whether the game is terminal. As I said, we don't define a return_memory down here; instead, when we loop over the memory, we loop over spg.memory and append each entry to the general return_memory that we defined at the top, and we can delete the old return statement here. That finishes the per-game part; then we flip the player after we have looped over all of our self-play games, and at the very end we return the return_memory from the top. I also noticed that when a self-play game is terminal we want to delete it from our list of self-play games, so that the while loop works: del spGames[i]. This shortens the spGames list, and because we loop in the reverse direction it still works to keep iterating inside our for loop, since we only ever remove the game at the position we are currently looking at when walking in that direction.
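A sketch of that post-search block, again factored into a hypothetical helper for readability (in the tutorial it sits inline inside the while loop of AlphaZeroParallel.selfPlay, right after the search call). The action-probability and outcome handling mirror the sequential self-play loop from earlier, and the temperature probabilities are already used for sampling, which is the small fix mentioned near the end:

```python
import numpy as np

def _play_moves(self, spGames, player, return_memory):
    for i in range(len(spGames))[::-1]:          # iterate backwards so we can delete
        spg = spGames[i]

        # Visit-count distribution over the children of this game's root.
        action_probs = np.zeros(self.game.action_size)
        for child in spg.root.children:
            action_probs[child.action_taken] = child.visit_count
        action_probs /= np.sum(action_probs)

        spg.memory.append((spg.root.state, action_probs, player))

        temperature_action_probs = action_probs ** (1 / self.args["temperature"])
        temperature_action_probs /= np.sum(temperature_action_probs)
        action = np.random.choice(self.game.action_size, p=temperature_action_probs)

        spg.state = self.game.get_next_state(spg.state, action, player)
        value, is_terminal = self.game.get_value_and_terminated(spg.state, action)

        if is_terminal:
            # Write this game's history into the shared return memory with outcomes.
            for hist_state, hist_probs, hist_player in spg.memory:
                hist_outcome = value if hist_player == player \
                    else self.game.get_opponent_value(value)
                return_memory.append((
                    self.game.get_encoded_state(hist_state),
                    hist_probs,
                    hist_outcome,
                ))
            del spGames[i]                       # finished games drop out of the list

    return self.game.get_opponent(player)        # flip the player once for all games
```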
We also have to change the MCTS here to MCTSParallel, and I noticed a leftover error where we still returned the action_probs; obviously we want to remove that line, since we only return the return_memory, and that return was part of the Monte Carlo tree search code we cut out. Let's run this again, and now we can try to train our model for Connect Four using this parallelized AlphaZero implementation: we write AlphaZeroParallel here instead, and everything else can just stay the same. I believe the speed should now be up by about 5x, which is a massive improvement. So let's run this cell; we get the nice progress bar and I'll just train it for these eight iterations. This will still take a few hours on my machine, but after that we'll have a great model with a very good understanding of how to play Connect Four, and then we can evaluate it again and maybe also play against it in a nicer visual way, and maybe we'll lose against it; we'll see. Okay, I've now trained the neural network using this AlphaZeroParallel class for eight iterations on the game of Connect Four. It still took a few hours; I believe it could have been even faster if we had further increased the number of parallel games, but that's fine. So now we want to test it in our evaluation cell. First we change the path for our weights to model_7_ConnectFour.pt, since we trained for eight iterations and start counting at zero, and we also set num_searches to 600, copying the value from the arguments above; that is still quite a small number compared to the search budgets other algorithms use in practice. Let's check whether we get nice results: I run the cell (sorry, I accidentally ran the wrong one first) and we get the Connect Four board. I believe we want to play in the middle; let's see where the model plays, and it plays on top of it. I think we want to play on this side, so let's play at 4; the model played there, and I think we either want to play here or there or maybe there, so I'll put some pressure on the model by playing on 5.
Now I think we're doing fine if we play here, we just have to be careful, so let's play 4. Now we could also apply pressure here, but then the model could counter there, because it can defend this, so I think I'd rather play here... sorry, I'm quite bad at this; a different move would probably have been much better. Now we can't play there any more; we could still play here or there, and I think this one is valuable, so let's play 3. The model replies; we have some pressure here and the model has the same over there, so I'll play 2, I guess... no, if we play there the model plays here and we'd be trapped, so we can't do that either. Maybe we could play here, but that's not great either; maybe this one is fine since it's not too valuable, so let's put it there. Okay, where did the model play? It played here, so I think we definitely have to answer here; I don't know whether we still lose, so let's play 5 now. Where did it play now... there's some pressure here, oh, we lose this way and also that way, so the model can absolutely destroy us. I'll just play 3 now, I guess; the model took this one and has won the game because of that, nice, and it could even have won over here as well. Admittedly I played quite badly, but it was still super nice to see: even with only around 600 searches, our model was able to soundly beat me in this game, so this is fun. Before we conclude the tutorial, there are some minor mistakes I noticed. First, when we compute the temperature_action_probs in the AlphaZero class, we currently aren't actually using them as the probabilities when we sample, so let's change that and use them here. Second, when we use our optimizer in the train method we want to call self.optimizer, so that we use the optimizer stored on the class rather than relying on a global optimizer variable, which might not work in all situations; that's just a small thing. Let's apply both fixes to AlphaZeroParallel as well: the temperature_action_probs here, and self.optimizer in its train method. Okay, so lastly let's have our model play against itself, and let's also get a nicer form of visualization than just printing out the board state. For this I want to use the kaggle-environments package, so we import kaggle_environments and also print out the version, in case you're interested; that's my version right here. Using this package we can create an environment by setting env = kaggle_environments.make('connectx'), which gives us the standard Connect Four environment (it's tweakable, but here we just take the default). Then we want some players, which is just a list of our agents, and we call env.run(players) with the players we've defined, followed by env.render(mode='ipython') to render inside our Jupyter notebook. So now we also have to define our players, and for that I've created this agent class and copied it over
right here. Basically it does some preprocessing on the state and then either calls our MCTS search or gets a prediction straight from our model, depending on the arguments, so it does something similar to what we do in the evaluation loop above. Now we want to create these players as Kaggle agents, and before we do that we also need the model, the game and the arguments, so let's copy them over from above and paste them here. First we have our game of Connect Four; we don't need a player variable here. Then we add some arguments: we set search to True, and we also need a temperature. The way the agent class is built, with temperature equal to 0 we don't do the power division to reshape the policy but instead take the argmax directly, so let's set temperature to 0, meaning we always take the argmax of the MCTS distribution guided by our model. Let's also set dirichlet_epsilon to 1 so that we still have some randomness at the root. Next we have our device, we create our model, and we load the weights from the model_7_ConnectFour.pt path as before, so that should be fine. Now we can actually define our players: player1 = KaggleAgent(model, game, args), passing the model, the game and the arguments like I said, and we do the same thing for player2, so that we have the flexibility to try different players if we want to. Then we fill the players list with player1.run and player2.run.
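A sketch of the visualization setup with kaggle-environments. The KaggleAgent wrapper below is a simplified stand-in for the class copied over in the video: the board conversion is an assumption, and instead of running our MCTS it just picks a random valid column, so that the snippet stays self-contained.

```python
import numpy as np
import kaggle_environments
from importlib.metadata import version

print(version("kaggle-environments"))

class KaggleAgent:
    """Simplified stand-in: wraps our model/MCTS as a kaggle-environments agent."""

    def __init__(self, model, game, args):
        self.model = model
        self.game = game
        self.args = args

    def run(self, obs, config):
        # Kaggle encodes the board as a flat list with marks 1/2; convert it to
        # our +1/-1 representation from the current player's perspective.
        board = np.array(obs["board"]).reshape(self.game.row_count, self.game.column_count)
        state = np.where(board == obs["mark"], 1, np.where(board == 0, 0, -1)).astype(np.float32)

        # With args["search"] set, the real class would run the MCTS here;
        # this sketch just falls back to a random valid column.
        valid_moves = self.game.get_valid_moves(state)
        return int(np.random.choice(np.where(valid_moves == 1)[0]))

# player1 = KaggleAgent(model, game, args)
# player2 = KaggleAgent(model, game, args)
env = kaggle_environments.make("connectx")
env.run(["random", "random"])            # e.g. [player1.run, player2.run]
env.render(mode="ipython")
```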
Okay, this should be working now, so let's run the cell and get these nice visualizations. We get this neat animation of our models playing against each other: there are two players but really just one model playing against itself, and I believe this comes to a draw, so the model is advanced enough that it can defend against all attacks, and it's just a nice animation. Now we can also briefly do the same for tic-tac-toe; we only have to change a few small things. First we set the game to TicTacToe, and we adjust the arguments: num_searches to 100, and for our ResNet 4 residual blocks and a hidden dimension of 64, I guess. Then we update the path, which I believe was model_2.pt for the last tic-tac-toe model we trained, set the game reference to tic-tac-toe accordingly, and run it. We immediately get the animation of our models playing against each other, and again we have gotten a draw, because the model is able to defend all possible attacks, which is still very nice. Feel free to do some further experiments of your own; for example, you could set search to False so that just the raw neural networks play against each other without doing the hundred searches. And with that I think we're finished with this tutorial; this was a lot of fun. I've created a GitHub repository where a Jupyter notebook is stored for each checkpoint, and there is also a weights folder with the last model for tic-tac-toe and the last model for Connect Four that we trained. If you have any questions, feel free to ask them either in the comments or by sending an email, for example. I might do a follow-up video on MuZero, since I've also built that algorithm from scratch, so if you're interested in that it would be nice to let me know. Thank you.
Info
Channel: freeCodeCamp.org
Views: 100,376
Id: wuSQpLinRB4
Length: 247min 54sec (14874 seconds)
Published: Tue Feb 28 2023