ReBeL - Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Explained)

Captions
Hi there. Take a look at this variant of the game rock paper scissors. It's like the usual rock paper scissors, except with the added complexity that when either player chooses scissors, the rewards and the losses are doubled. For example, you see right here: player 1 chooses rock and player 2 chooses scissors, so both the reward for player 1 and the loss for player 2 are double the size.

Now you might know that in original rock paper scissors the optimal strategy is to play each of the three choices one third of the time; you basically take a fair three-sided coin (or die; does that even exist? I'm not sure), you throw it, and whatever side is up, that's what you play. However, here, since one of the options is different, the optimal strategy shifts, and interestingly it shifts as follows: you want to play rock and paper each with 0.4 probability, and you want to play scissors with only 0.2 probability. That is pretty interesting. You might intuitively conclude that you want to go more where there are more rewards to be had, but of course you also lose more, so you might also conclude that ultimately it doesn't make a difference. But why does the optimal strategy shift such that you want to decrease your likelihood of playing scissors? Let's quickly analyze this game before we jump into the paper, because this game is sort of a microcosm of what today's paper is about.

The paper of today is called "Combining Deep Reinforcement Learning and Search for Imperfect-Information Games" by Noam Brown, Anton Bakhtin, Adam Lerer and Qucheng Gong of Facebook AI Research. This paper brings what AlphaGo or AlphaZero has done for perfect-information games to the domain of imperfect-information games, and we'll see what the difficulties are and what can be done to solve them. And not only do they have an algorithm, they also have an interesting theoretical result: under some conditions, namely under the condition that the neural networks do something useful, it will actually converge to a Nash equilibrium in these games. So that is pretty cool, a practical and a theoretical paper right here. As always, if you like content like this, don't hesitate to share it out and tell me what you think in the comments. This is not my field, so I might get quite a bit of stuff wrong. Also, if you haven't seen the Negreanu poker challenge video, I think it's the last video I did, be sure to check that out, just to see how you have to think about situations like this.

All right, let's get back to this rock paper scissors example. Interesting to note is that these dashed lines here mean that player 2 cannot decide which of these states they're in; to player 2 this is all basically the same state. It would be really easy if player 1 played first and then player 2 saw what player 1 did; then player 2 would just act accordingly and always win. However, player 2 doesn't see that, so they have to decide what to do independent of which state they're in. This is a symmetric game: it is a two-player game because there are two players, it is zero-sum because whenever one player wins a reward the other player loses the same reward, and that makes it symmetric. Both players play at the same time, though that is not necessary in general, but here it's the case.
So this means, in this particular case, whatever strategy player 1 has, player 2 must have as well, so we'll just do the analysis for player 1. Let's say you deviate from this optimal strategy. We claim that this here is the optimal strategy, playing 20% scissors. Let's say player 1 doesn't believe it; player 1 deviates from it and says: nah, there is so much reward there, I'm going to get some more of that. So they up this, let's say to 0.33, doing the classic one third, or even higher; they go more scissors. And they have to take this probability mass from somewhere, so let's say they take it equally from rock and paper and move it towards scissors.

Now player 2 observes this. They can just play against player 1 for a while, or, what we're going to assume, everyone announces their strategy publicly; it's the same thing, you can just observe someone for a while or they can announce their strategy, we'll treat these equally. So player 2 observes player 1 playing scissors too often, and so player 2 knows they are very often in this rightmost state where player 1 chooses scissors. They can't directly observe it, but they can infer it. And look at player 2's payoffs: it's zero here, minus two here, and two here. So player 2 says: I also have this optimal strategy of 0.4, 0.4, 0.2, but knowing that I'm often in this state, I can simply take some mass from paper and put it on rock. So I play rock way more often and I reduce the amount I play paper; scissors doesn't matter. Now I lose 2 less often and I win 2 much more often, and player 1 in turn loses 2 much more often and wins much less often. So player 1 wanted to get more reward, but they're being punished by player 2 for playing scissors too often.

Now you can say, well, player 1 can do the same thing, knowing that player 2 now plays rock too often, since they've taken mass away from paper and put it on rock. Player 1 knows that they're either here or here, and in this case player 1 can say: all right, you play rock too often. Obviously, if I play scissors I'm going to lose, but I've already decided I want to play scissors much more, so I'm trying to make it up right here. What they can do in this case is say: when I play paper I win one, instead of winning zero when I play rock. I know player 2 is playing rock way more often than they should, so I'm going to punish player 2 by playing paper more often. So we keep playing scissors more often, because that's what we started with, and we also move mass from rock to paper, so we're almost never playing rock; we're playing scissors more often and now also paper more often. We're basically doing the same thing that player 2 did to us: we are upping the likelihood of this thing happening and decreasing the likelihood of that thing happening. So now I also play paper more often, I win more often here and you lose more often. But because the rewards are doubled over here, the fact that player 2 can achieve this is much more meaningful than the fact that player 1 can achieve this, and that's why player 1 will be punished harder for deviating here.
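To make this concrete, here is a small numeric check of the argument above. It is only a minimal sketch: the payoff matrix is written down from the description in the video (a win or loss counts 1, doubled to 2 whenever scissors is involved). The 0.4/0.4/0.2 strategy makes the opponent indifferent between all three moves, while a deviation toward scissors hands player 2 an exploitable edge.

```python
import numpy as np

# Player 1's payoffs in the modified rock-paper-scissors game, as described above:
# rows = player 1's move, columns = player 2's move, order (rock, paper, scissors).
# A win or loss normally counts 1, doubled whenever scissors is involved.
A = np.array([
    [ 0, -1,  2],   # rock
    [ 1,  0, -2],   # paper
    [-2,  2,  0],   # scissors
])

equilibrium = np.array([0.4, 0.4, 0.2])

# Against 0.4/0.4/0.2 every pure action of player 1 earns the same expected payoff,
# so no unilateral deviation helps: that is exactly the equilibrium property.
print(A @ equilibrium)              # -> [0. 0. 0.]

# If player 1 shifts mass toward scissors, player 2 can punish by playing more rock.
deviation = np.array([0.3, 0.3, 0.4])
p2_payoffs = -(deviation @ A)       # player 2's expected payoff for each pure response
print(p2_payoffs)                   # -> [ 0.5 -0.5  0. ]  (rock is now clearly best)
print(p2_payoffs.max())             # player 2 gains 0.5 per round at player 1's expense
```

Since every pure action earns the same expected payoff against 0.4/0.4/0.2, no unilateral deviation can do better, which is exactly the Nash property discussed later in the video.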
So that's how you reason about these strategies: if player 1 plays this 0.2 too often, they will be punished harder than player 2 is punished for deviating in response, and the same holds for the symmetric part. This is a very important concept right here, namely that player 2's strategy depends on player 1's strategy, even though you could conceptualize this game as: player 1 plays a move but doesn't show it yet (they take a picture of their hand doing rock, paper or scissors and just don't show the picture yet), and then player 2 plays a move. So now we are basically in a game that is sequential in nature, and usually in a sequential game you can just do a subgame analysis. But here the subgame analysis depends on the strategy of player 1, because you don't know which situation you're in. This is different from a full-information game, and this is illustrated right here.

So here is what something like AlphaZero deals with: your game starts here, and then you have two actions to take. Maybe you take this action; now your opponent has two actions, maybe they take this action, and now you have two actions again. Which one do you take? What something like deep Q-learning or actor-critic learning would do is simply put a neural network here; it would look at this state and simply tell you which action to pick, like: this action right here sounds good to the neural network. In contrast, AlphaZero, if I draw the same situation, will say: well, I could do this or I could do this. If I do the left thing, then my opponent is going to have two options, they could do this or they could do that; if they do the left thing again, and so on, you get the idea. It goes down the tree, it calculates ahead, it uses its internal simulator to look ahead, and it could technically do this until it reaches the end. If it reached the end state every time, it could simply calculate backwards which option is the best for me to take right now. However, these games are often very, very deep, so the depth of the tree is often such that you can't solve the whole game. So what AlphaZero does instead is to say: I'm not going to play until the end, I'm going to think some limited depth ahead (I know AlphaZero does this adaptively, but bear with me). I'm going to think some limited depth d ahead; here in this case d equals two, because we think two layers ahead. And then, at the end, I'm going to replace everything that comes after with a single value that indicates how good this is for me.

This value is of course very hard to get; if you knew how good anything is for you, you would have solved the game. But at this point the neural network comes in: it's a black box, and AlphaZero simply asks it, for each one of these states, how valuable do you think that is? And that one? And so on. So it asks the neural network, for each state, how valuable that particular node is, and then it does the same backwards calculation.
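The lookahead idea just described (expand the tree to a limited depth and let a learned value function stand in for everything below) can be sketched as follows. This is a plain negamax sketch, not the Monte Carlo tree search that AlphaZero actually uses, and `game` and `value_net` are hypothetical interfaces; it only illustrates the backward calculation from value estimates at the frontier.

```python
def depth_limited_value(state, depth, value_net, game):
    """Expand the tree `depth` plies ahead, then let the learned value function
    stand in for everything below and back the values up (plain negamax).
    All values are from the perspective of the player to move in `state`.
    `game` and `value_net` are hypothetical interfaces used for illustration."""
    if game.is_terminal(state):
        return game.reward(state)       # exact value once the game is over
    if depth == 0:
        return value_net(state)         # learned estimate replaces the whole subtree
    # Zero-sum: my value is the best I can do, and the opponent's value flips sign.
    return max(-depth_limited_value(game.next_state(state, a), depth - 1, value_net, game)
               for a in game.legal_actions(state))

def choose_action(state, depth, value_net, game):
    """Pick the action whose resulting position looks best after the lookahead."""
    return max(game.legal_actions(state),
               key=lambda a: -depth_limited_value(game.next_state(state, a),
                                                  depth - 1, value_net, game))
```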
So we've substituted going to the end of the game with the neural network, but this is still more powerful than asking the neural network right at the very beginning, like we do here. The power comes from combining the learning (this is the learning) and the search (this here is the search). So this is what AlphaZero does, and this is what this paper does for imperfect-information games.

Imperfect-information games are games where you don't know a particular thing about the game at any point; there is hidden information, like in poker. And the problem is right here: if you do the same thing for this game and you look from player 1's perspective and say, okay, let's assume this game is too deep for you and you can only afford to look ahead d equals one. You go ahead, and at the end you ask your neural network what the value here is, and the neural network will tell you, accurately, that the value at each of these nodes is zero. The average value of each of these nodes is zero, depending of course on how player 2 acts, but in this case it's zero. So as player 1, this information will not lead you to the correct conclusion, the correct conclusion being this 0.4, 0.4, 0.2. Player 1 is indifferent; any strategy could work here. If there is some regularization, it will probably arrive at one third, one third, one third, since all the values are equal; it might conclude it's probably best to distribute the actions evenly, or something like that.

So you can see the problem: this value right here depends on the strategy of player 1, and this is something that AlphaZero has no concept of. For AlphaZero, the value of a node only ever depends on what comes downstream; in an imperfect-information game, the value of a node also depends on what has happened upstream, on the strategy of the upstream events, and that, as I said, is quite important. Also, for AlphaZero, once I have evaluated a game tree and determined the value of a node like this, I can evaluate the same game tree again and the value is going to be the same. But here, for the same reason, because the value depends on what's upstream, the values change if I change my strategy: if here I pick action one or action two with a certain probability, and this search process results in a recommendation for how often to pick action one that is different from what I searched with, then all of these values down here are going to change, and I basically have to search again. So these are the problems of imperfect-information games that we're going to tackle. You see, this poker thing is sort of a microcosm, and this was already half of the paper, if you understood why exactly searching with a value estimator combined with tree search is a problem in imperfect-information games.

So let's quickly go through the abstract, then we'll define a few terms, and then we can go into the algorithm. The algorithm is called ReBeL. It's a general framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in any two-player zero-sum game. They say that in the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero.
They also show that ReBeL achieves superhuman performance in heads-up no-limit Texas hold'em poker while using far less domain knowledge than any prior poker AI. On the last video I had a comment, which is correct, that this is not the best hold'em AI out there, as far as I can tell; however, it is a very performant one that uses very little domain knowledge of poker. So, like AlphaZero removed basically all domain knowledge from the games it played, this does the same. I think the domain knowledge here is limited to the fact that it is given a restricted set of bet sizes: even though it's no-limit hold'em, where you can bet whatever you want, it's given a limited set of bet sizes, like half the pot, the full pot, two times the pot, and so on, in order to make the actions discrete. I think that's just easier for this algorithm. In any case, the algorithm is applicable pretty much anywhere where you have a two-player zero-sum imperfect-information game, or a perfect-information one.

So let's go over a little bit of background. We're going to need some terms. The first term is what's called a world state, the state of the world. Easy, I know, but it's quite important to see what the world state is in poker. In heads-up no-limit hold'em there are your cards, you get two, your opponent gets two cards, and then there are board cards; at the end there are five, but maybe there are only three, or none yet, depending on the state of the game. So the board cards, say an ace, a king, an eight, you know; your two hole cards, maybe an ace and an ace, you know; but you don't know your opponent's cards. We're also going to assume that the actions are always public for the purposes of this video (not necessarily for ReBeL the algorithm, but for us, let's just say the actions are all public). The world state is the full, fixed state of the world, so it includes your cards, the public cards, and your opponent's cards. The world state is what a sort of super-user could see who can look at all of the cards. No one knows the full world state, but it still exists.

What we also need is the concept of actions. There's an action space, which in poker is something like: you can bet, you can raise, and so on, your classic actions. And there is a transition function, like in classic reinforcement learning: the transition function takes the world state and the action and gives you the next world state. After an action, each agent receives a reward that is also a function of the world state and the action. Important to note: this is the reward you receive, but you don't know the world state; you maybe know the function, but you don't know the world state, so you can't explicitly predict your reward, you can maybe predict its distribution.

The next concepts are the concepts of observations. Since we are in an imperfect-information game, an observation and the world state are not the same thing. In chess, you just look at the board, and that's all there is to know, so the world state and the observation are the same thing. Here there is the concept of private and public observations: a public observation is what everyone knows at each step, whereas private observations are things that are only revealed to you personally.
In poker, the private observation is simply your two hole cards, and the public observation is the cards in the middle. So this is the public observation and this is your private observation. The private observation is different for each player, while the public observation is the same. I guess you could model the public observation as simply another player that doesn't get any hole cards, but that's a question of semantics. The observations can also include the actions that happened so far; just for completeness, if you like, you can get information about hidden actions and so on, there's a lot of mathematical freedom here. But the concept is: you have private observations for each player individually, and then public observations. The subscript i here always denotes an individual player, while you see there is no such subscript on the public observations.

The next concept is a history, and a history is pretty much what you think: a history, or a trajectory, is a finite sequence of legal actions and world states, denoted by this. You can see it's simply the sequence of world states and actions that happened. Again, no one knows the history fully, but it is still there. And I know, quantum mechanics, many-worlds theorem, blah blah blah; we'll just assume that whatever you don't know, these are fixed cards, they're actually there, they have a value even though no one has looked at them yet. So the world state is defined even if you don't know it.

The first really interesting concept here is called an info state. The info state is like the world state, or like the history, but conditioned on what an individual player knows. The info state, also called an action-observation history for agent i, is a sequence of an agent's observations and actions. You can see it's very much like a history, except that it doesn't contain the world states; where the world states would usually be, there is instead the observation for player i at each of the time steps. These observations include public and private observations, along with the actions, but we'll say the actions are public anyway. So an info state is basically the history as it looks to player i.

In our original game, we said that player 2 can't distinguish between the three nodes. If you look at the three nodes individually, node one, node two, node three, these are three different world states with three different histories, but to player 2 they are the same info state, because all player 2 knows is that player 1 has taken some action; it doesn't know which action. The observation that player 2 has is exactly the same in all three cases, therefore it can't distinguish them. So you can see that the info state is the correct abstraction to look at here. In turn, for player 1 it looks different: for player 1 these are also three different world states, but they are also three different info states, because player 1 knows which action they have taken; player 1 can decide which of these three states player 2 is in. So to player 1 this corresponds to three different info states. The info state is always conditioned on a player, and it is the unit we'll look at here. Briefly, the info state includes the observations and actions for a given player, and the observations include the private and the public observations.
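Here is a rough sketch of these objects for the poker example, just to pin down the vocabulary; the class and field names are my own illustration, not data structures from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class WorldState:
    board: Tuple[str, ...]                     # public community cards, e.g. ("As", "Kd", "8c")
    hole_cards: Tuple[Tuple[str, str], ...]    # one pair of private cards per player
    # Only a "super-user" sees all of this; no single player ever does.

@dataclass
class Observation:
    public: Tuple[str, ...]     # what everyone sees (board cards, public actions)
    private: Tuple[str, ...]    # what only this player sees (their own hole cards)

@dataclass
class InfoState:
    player: int
    history: List[Tuple[Observation, str]] = field(default_factory=list)
    # Sequence of (observation, action) pairs as seen by `player`.
    # Many different world-state histories can map to the same InfoState,
    # which is exactly why player 2 cannot tell the three rock-paper-scissors
    # nodes apart.
```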
The unique info state corresponding to a history for agent i is denoted by this, and the set of histories that corresponds to some info state is denoted by capital H. As we said, for a given info state there are many different histories that could have led to it; for player 2, three different histories lead to the same info state. But any given history fully determines the info state: if you tell me what happened, I can give you the info state for each player. You can say, ah, player 1 played rock, therefore player 2 is in that info state and player 1 is in that info state. So there is a unique info state for each history, but a set of histories for each info state.

The last concept from here is a policy. A policy is again what you think it is; usually it's something that maps from an observation to an action, or from a history to an action, or from a world state to an action, but here it is necessarily a function that maps from an info state to a probability distribution over actions. Two things are important here. First, the input to the policy is an info state: since the players can't distinguish between world states that correspond to the same info state, their policy must necessarily take an info state as input. Player 2's policy cannot depend on what player 1 concretely did, because player 2 can't distinguish it; it can depend on the strategy of player 1, but not on the concrete action. Second, we map to a probability distribution over actions. This is usually the case in RL if you frame it as a general principle, but here it's going to be quite important that this is always a probability distribution: very often in these games your strategy is probabilistic. There is no single best move in rock paper scissors; the best strategy is to play each move with one-third probability, or the modified version from the beginning. So it's important to see that a policy outputs a probability distribution, and I will also call this the strategy of a player. The strategy is going to be the policy, and I like to call it a strategy because it's kind of a plan of what you would do in each situation, and we're going to see that this is a central theme in solving these games using ReBeL.

A policy profile is simply a tuple of policies, the policies of all players. If you combine the policy profile with some info state or some history, you can calculate the expected value: the expected value for a given history h, given that the players play policy profile pi, so all players play their strategies in history h, and we look at player i and its value. So given this function v, I can input: here's what happened, and here's everyone's strategy; now tell me, in expectation, what the first player is going to net from this. Solving the value function is pretty much equivalent to solving the game: if you give me a good value function, I can solve the game by simply choosing the next action that gives me the best value. But there's a difficulty: we said the strategies pi are public, but we don't know what history we're in. So even if you had the perfect value function, you don't know what to input, and this is going to be a problem.
All right, the last thing is a Nash equilibrium. You might know this term: a Nash equilibrium is a policy profile such that no agent can achieve a higher expected value by switching to a different policy. Our goal here is going to be to find a Nash equilibrium strategy for these games, and the ReBeL algorithm is going to provably converge to a Nash equilibrium.

There's also the concept of a subgame. A subgame is defined by a root history; it's simply a game that starts at some intermediate state. AlphaZero, for example, constructs subgames; in fact, it constructs depth-limited subgames, because you only solve up to a certain depth and at that point you ask your value estimator what the value is. This is done differently in different methods; you can also do the kind of Monte Carlo estimation where you just play one trace out to the end, and so on. But the notion is: we iteratively construct these depth-limited subgames, meaning we play for a certain depth and then we evaluate at that depth, and the question is how we're going to evaluate.

So this is all the build-up. We've established that we can't deal with world states like in classic games; we need to deal with info states. And with info states we have a problem, namely we can't use the AlphaZero algorithm, because it will result in the thing on the right: if we simply ask our value estimator, even if it's perfect, it won't lead us to the correct strategy, because the value estimator is the wrong tool when we don't know all of the information, due to the fact that the value of a node doesn't only depend on the downstream actions but also on the upstream strategies. In an info state we can't distinguish where we are, and that means our value estimations are going to be rather useless if we just apply that algorithm straightforwardly. So we need a way to transform a game where we don't know everything into a game where we do know everything. It sounds a bit weird, but that's exactly what we're going to do. We're going to go from world states to public belief states. The world states are what we would like to have but don't know; the public belief states are going to be things that everyone knows. So if we go from world states to public belief states, we're again in a situation where everyone knows everything, and therefore it is a perfect-information game. It's going to be a different game, but if we find the solution to this different game, we end up with the solution to the original game.
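Before we get to that construction, here is a quick numeric check of the Nash equilibrium definition from a moment ago, again on the modified rock paper scissors game: an equilibrium is exactly a profile where neither player gains anything by switching to a best response. This is a toy sketch using the same assumed payoff matrix as before.

```python
import numpy as np

A = np.array([[0, -1, 2], [1, 0, -2], [-2, 2, 0]])  # player 1's payoffs, as before

def exploitability(p1, p2, payoff=A):
    """How much each player could gain by switching to a best response.
    At a Nash equilibrium both numbers are (approximately) zero."""
    value = p1 @ payoff @ p2                 # player 1's expected payoff under (p1, p2)
    best_vs_p2 = (payoff @ p2).max()         # player 1's best pure response to p2
    best_vs_p1 = (-(p1 @ payoff)).max()      # player 2's best pure response to p1
    return best_vs_p2 - value, best_vs_p1 - (-value)

print(exploitability(np.array([0.4, 0.4, 0.2]), np.array([0.4, 0.4, 0.2])))  # (0.0, 0.0)
print(exploitability(np.array([1, 1, 1]) / 3, np.array([1, 1, 1]) / 3))      # both ~1/3
```

The 0.4/0.4/0.2 profile has zero exploitability, while the uniform one-third profile, which is optimal in ordinary rock paper scissors, can be exploited for about one third per round here.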
To see how that transformation works, they ask you to imagine the following game: consider a game in which one of 52 cards is privately dealt to each player. So you get a card, your opponent gets a card, one card each. (52, by the way, for those of you in different parts of the world, is the number of cards in a standard deck for games like poker and blackjack; I know different countries have different decks, in Switzerland you'll very often find 36 cards to a deck, which is why 52 might seem like a bit of a weird number.) On each turn, a player chooses between three actions: fold, call, or raise. These are the standard poker actions: you can throw away your card if you don't like it, you can match the bet of your opponent, or you can put in some money, or some more money, yourself. And eventually the game ends and the players receive a reward; let's say whoever has the higher card wins all the money in the middle.

Now consider a modification of this game in which the players cannot see their private cards; instead, their cards are seen by a referee. On a player's turn, they announce the probability that they would take each action with each possible private card. The referee then samples an action on the player's behalf from the announced probability distribution for the player's true private card. This is weird. Usually you'd look at your card, see, say, an ace, and then come up with a policy: an ace is pretty good, so I'm going to raise with probability 0.7, call with probability 0.2, and fold with probability 0.1. That would be an appropriate policy, let's say, for holding an ace at the beginning; maybe this changes as the game goes back and forth, because your beliefs change, since you don't know what your opponent has. Now the game changes: your opponent gets a card and you get a card, and you don't get to look even at your own card. So you don't know your opponent's card and you don't know your card. But what you can do is announce to the referee: okay, referee, if I have an ace, I'm going to raise with 0.7, call with 0.2, and fold with 0.1; if I have a king, I'm going to raise with 0.6, call with 0.3, and fold with 0.1; and so on, until: if I have a two, I'm going to raise with probability zero, call with probability 0.1, and fold with almost all of the rest. So you announce your entire strategy to the referee. The referee, who is a super-user, or God, or your favorite deity, sees everything, sees all the cards. The referee takes this entire table that you give it as input, looks at your card, sees, ah, it's a king, or it's an ace, chooses the appropriate sub-table for you, and then samples an action from that. So instead of you looking at your card and producing just one of these rows, you produce all the rows for all the cards you could have, and the referee does the rest for you. Your opponent does the same, and play proceeds like that.

So you see, it's a bit of a different game; in particular, the actions are different. The policy is no longer that you look at what you have and output the probabilities; now the policy is that you output this whole table, for all the cards you could have and, in each case, for all the things you could do. The important thing, they say, is that when the game starts, each player's belief distribution about their own private card is uniform random, and so is the belief about the opponent's private card. However, after each action by the referee, players can update their belief distribution about which card they themselves are holding via Bayes' rule, and likewise players can update their belief distribution about the opponent's private card through the same operation.
It's important to note that part of this already happened before: even in the original game you would update your belief about the opponent's private card according to Bayes' rule, or whatever rule you want; you would try to infer what they have. The difference now is that you also have to infer what you yourself have, depending on what actions the referee takes. You sort of treat yourself like a different player, like an opponent player whose private cards you don't know. Thus, the probability that each player is holding each private card is common knowledge among all players at all times in this game. You don't know your opponent's card, you don't know your own card, and you have to use the same procedure to determine what everyone has. That means all the knowledge is shared: no one knows the true private cards, but everyone knows the same things. If no one knows, then everyone knows the same; it's a bit like probability socialism, no one has anything, everyone's equal. Sorry, that was a slight right there.

The important thing, they say, the critical insight, is that these two games are strategically identical. That's very surprising, but if you think about it a bit, it becomes clear that your strategy up here is the same as down here; you simply don't fully, explicitly announce it every time, but we said anyway that policies are public. Therefore this game is equivalent to that game; these are the same games, but the latter contains no private information and is instead a continuous-state and continuous-action-space perfect-information game. While players do not announce their action probabilities for each possible card in the first game, we assume that all players' policies are common knowledge, and therefore the probability that a player would choose each action for each possible card is indeed known by all players. You can even lift the restriction that you know the opponent's strategy, so you don't actually need to know it, but we'll simply assume that everyone knows everyone's strategy; they just don't know the private cards.

So this is a new game that we've constructed, and it's a bit different: there are different states and different actions. Let's quickly analyze this. In game one, the state is an info state, and the action is a probability distribution over actions, the probability of each of the actions. In the game down here, we have different states and different actions. The states we'll get to in a minute, but what's the action? The action is to send a table of all these probability distributions, one for each case: in case I have this card, in case I have that card, and so on. So the action is to send this entire table to the referee. Now, what are the states? That's the next section: they refer to the first game as the discrete representation (that's the top game) and to the second game as the belief representation. In the example above, a history in the belief representation, which they refer to as a public belief state, is described by a sequence of public observations and 104 probabilities, the probability that each player holds each of the 52 possible private cards. So this is going to be the state; it's called a public belief state, and it's described by the sequence of public observations and those 104 probabilities.
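Here is a small sketch of what such a public belief state could look like for this card game, together with the Bayes update that keeps the 104 probabilities current after each refereed action. The data layout and the example policy are my own illustration (and, for simplicity, it ignores the small correction that the two players cannot hold the same physical card); only the 52-card, three-action setup comes from the paper's example.

```python
import numpy as np
from dataclasses import dataclass
from typing import Tuple

NUM_CARDS = 52
ACTIONS = ("fold", "call", "raise")

@dataclass
class PublicBeliefState:
    """Everything that is common knowledge in the modified card game.
    Field names are illustrative, not the paper's actual data structures."""
    public_observations: Tuple[str, ...]   # e.g. the sequence of sampled actions so far
    beliefs: np.ndarray                    # shape (2, 52): the "104 probabilities"

def initial_pbs() -> PublicBeliefState:
    # At the start, both players' private cards are uniformly random to everyone.
    return PublicBeliefState((), np.full((2, NUM_CARDS), 1.0 / NUM_CARDS))

def bayes_update(pbs: PublicBeliefState, player: int,
                 announced_policy: np.ndarray, action: str) -> PublicBeliefState:
    """After the referee samples `action` from `player`'s announced 52 x 3 table,
    everyone (including that player) re-weights the belief over that player's card
    by how likely each card was to produce the observed action."""
    likelihood = announced_policy[:, ACTIONS.index(action)]   # P(action | card)
    beliefs = pbs.beliefs.copy()
    beliefs[player] *= likelihood
    beliefs[player] /= beliefs[player].sum()
    return PublicBeliefState(pbs.public_observations + (action,), beliefs)

# Toy announced policy (made up, not from the paper): stronger cards raise more often.
policy = np.tile(np.array([0.4, 0.5, 0.1]), (NUM_CARDS, 1))
policy[-4:] = np.array([0.05, 0.25, 0.70])       # say the last four "cards" raise 70%

pbs = bayes_update(initial_pbs(), player=0, announced_policy=policy, action="raise")
# Observing a raise shifts the public belief about player 0's card toward strong cards,
# and, crucially, this includes player 0's own belief about their own card.
```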
So those 104 probabilities are the probability that you have an ace, a king, a queen, and so on: the distribution over your cards and the distribution over your opponent's cards. It's like the info state of someone who just observes the game; that is going to be the public belief state. Likewise, an action is described by 156 probabilities, one per discrete action per private card. In general terms, a PBS is described by a joint probability distribution over the agents' possible info states. You see, this state is a distribution over info states; they also call this a public belief state. So now we've gone from a game that is imperfect-information to a game that is perfect-information. The original game has unknowns, things that are different for each player, but here all the information is known, and these two games are equivalent. It's just that, as you can already see, the states are way bigger, because each state is a distribution over info states, and the actions are also way bigger, namely one policy for each info state that you could be in. These are massive objects, but in theory that makes no difference.

They say: since any imperfect-information game can be viewed as a perfect-information game consisting of public belief representations (or public belief states), in theory we could approximate a solution of any two-player zero-sum imperfect-information game by running a perfect-information RL plus search algorithm on a discretization of the belief representation. So nothing stops you from taking this and running AlphaZero on this new game, with the states being public belief states and the actions being sending around these giant tables; you might have to discretize it, as it says, but it's feasible in principle. You can think of constructing this game tree, but each node here is going to be a public belief state, instead of a world state like in AlphaZero, or an info state like the ones we started these imperfect-information games with, and then you construct your tree down here. But this is infeasible, because these public belief states are just too large, and the actions are also too large; there are so many actions, and they are super high-dimensional. So this is not feasible, and they have to find a way to do this in the domain of the original game. That, I feel, is the entire trick of this ReBeL paper: take this idea, do the search over the public belief states, but somehow evaluate things down here in the original game. Because what we need are the values of these nodes: if we figure out the value of this public belief state and the value of this one (this is beta 1, this is beta 2), then we would know which action to take. The action is this huge thing, but if we knew the values of these, we would know which action to take. However, this is not feasible directly, so we need to find a way to figure out these values using the original formulation of the game, and that's what they do in the next section.

They say: however, as shown in the example above, the belief representation can be very high-dimensional, so conducting search as is done in perfect-information games would be intractable. They say: fortunately, in two-player zero-sum games, these high-dimensional belief representations are convex optimization problems.
ReBeL leverages this fact by conducting search via an iterative, gradient-ascent-like algorithm. I don't know exactly what the claim that the belief representations are convex optimization problems means; maybe it is misformulated, or maybe I'm just not understanding it well enough. In general, this section is a bit of a mystery to me, but I can tell you what I understand of it.

They say ReBeL's search algorithm operates on supergradients of the PBS value function at the leaf nodes, rather than on PBS values directly. This is the first indication of what we want to avoid: we want to construct this search tree, and at the leaf nodes we need value functions, like in AlphaZero. Since we operate on public belief states, we would need value functions of public belief states; however, ReBeL finds a way to not need that. Specifically, the search algorithm requires the values of info states for PBSs. So they find a way to connect the values of info states to the values of public belief states. Just as a reminder: an info state is a state as it looks to one player, and it could have many different histories behind it; a public belief state contains all the info states that are compatible with the public observations, with all their histories, basically a distribution over all these info states. That entire thing is one public belief state.

Now they are going to say: we can determine the value of a public belief state, so the value of this is going to be equal to, or can somehow be approximated from, the values of these individual info states. We don't need the value of the entire public belief state directly; we connect it to the values of the individual info states. And that's done fairly naturally, because you can say the value of a given info state, conditioned on being in public belief state beta, is the expectation over all the histories that could lead to this info state of the value of each history; you can compute the value of a history given a policy, and therefore you can approximate the value of a given info state. Theorem 1 is where they connect the value of a public belief state to the value of an info state. They say: for any public belief state, for the beliefs over player 1's and player 2's info states respectively, and any policy pi-star that is a Nash equilibrium of the subgame rooted at beta (so now we root subgames at public belief states), this equation holds. As you can see, this connects the value of the public belief state, which is what we need for the search algorithm to work, to the values of info states, and info states are way lower-dimensional than public belief states. It connects the value of this whole thing to the value of, let's say, this particular info state s, and it does so via this term right here: this is just the unit vector in the direction of that particular info state, and this here is a supergradient of an extension of the value function to unnormalized belief distributions. As I understand it, this g is the gradient with respect to, probably, beta 1 (if we care about s1) of V1 of beta, something like this. As I said, this is where I don't one hundred percent see through it, but what I understand is that it connects the value of the public belief state to the values of the individual info states that are part of this public belief state.
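In code, the intuition given here is just a belief-weighted aggregation of info-state values. This is a minimal sketch of that intuition only; the paper's actual Theorem 1 is stated in terms of supergradients of an extension of the value function and is not reproduced here.

```python
import numpy as np

def pbs_value(beliefs, infostate_values):
    """Belief-weighted aggregation of info-state values into a single PBS value.

    beliefs:           probability of each of player i's possible info states under beta
    infostate_values:  the value of each of those info states (e.g. under the subgame
                       equilibrium found by the solver)
    """
    beliefs = np.asarray(beliefs, dtype=float)
    values = np.asarray(infostate_values, dtype=float)
    return float(beliefs @ values)

# e.g. two possible info states, held with probability 0.8 / 0.2:
print(pbs_value([0.8, 0.2], [1.0, -3.0]))   # -> 0.2
```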
So we don't need a value function for public belief states; we can get away with learning a value function for the individual info states, and that's what they do. This is the only learned part in this algorithm, and the first time we see a neural network: since ReBeL's search algorithm uses info-state values, rather than learn a PBS value function, ReBeL instead learns an info-state value function. We input a public belief state and we get a value for each info state; we simply learn a value function with a sort of vector output. You could also input the public belief state together with one info state and get out a single number; I guess that would turn out to be the same thing. The info-state value function directly approximates, for each info state, the average of the sampled values produced by ReBeL at beta. So we're going to learn this in a bootstrapped fashion, like AlphaZero does it, a bit like temporal difference learning.

What we're going to do in this algorithm is start out and construct this subtree, and we're going to do this in the discrete representation of the game. That's the genius of the ReBeL algorithm: we evaluate these things in the discrete representation, the info-state representation, and then we're able to use what we find right here in order to determine the value of the next actions to take, as far as I can tell. So there is only one thing left to do: we need to know how this step works. We said we want to do this search over the public belief states, but we can't, it's too cumbersome; we can now evaluate the values of a public belief state, but we still need to determine the policies, and that's where the self-play reinforcement learning comes in. So bear with me for a second; this is going to snap together everything we've looked at so far.

In this section they describe ReBeL and prove that it approximates a Nash equilibrium. At the start of the game, a depth-limited subgame rooted at the initial public belief state is generated. This subgame is solved by running T iterations of an iterative equilibrium-finding algorithm in the discrete representation of the game, but using the learned value network to approximate leaf values on every iteration. It might seem a bit complicated, but here is what I think happens, and this part is a bit unclear to me. We take any public belief state that we find ourselves in (they say the beginning of the game, but any public belief state works). The public belief state is maybe here, and it contains many different info states. Now, what I think happens is that they may be sampling one of the info states, I don't know, or they may input the public belief state directly; this is unclear to me. But then they solve the game in the discrete representation: they use a classic solver to solve the game up to a limited depth, so these steps into the future, in the classic representation with classic states and classic actions. The solver they use for this is counterfactual regret minimization (CFR), a solver that works with info states.
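CFR itself is a whole family of algorithms, so here is only its core ingredient, regret matching, run in self-play on the modified rock paper scissors game from the beginning. Full CFR applies this same update at every info state of a sequential game, and the CFR-D variant used in the paper additionally truncates the tree with leaf values; this toy version just shows the average strategy converging to the 0.4/0.4/0.2 equilibrium.

```python
import numpy as np

A = np.array([[0, -1, 2], [1, 0, -2], [-2, 2, 0]])   # player 1's payoffs again

def regret_matching_strategy(cum_regret):
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(3, 1.0 / 3.0)

def solve_modified_rps(iterations=100_000):
    """Self-play regret matching at a single decision point. Full CFR applies this
    same update at every info state of a sequential game; CFR-D additionally cuts
    the tree off at a limited depth and plugs in leaf values there."""
    cum_regret = np.zeros((2, 3))
    cum_strategy = np.zeros((2, 3))
    for _ in range(iterations):
        p1 = regret_matching_strategy(cum_regret[0])
        p2 = regret_matching_strategy(cum_regret[1])
        cum_strategy[0] += p1
        cum_strategy[1] += p2
        ev = p1 @ A @ p2                           # player 1's expected payoff
        cum_regret[0] += (A @ p2) - ev             # regret of each pure action, player 1
        cum_regret[1] += (-(p1 @ A)) - (-ev)       # same for player 2 (zero-sum)
    return cum_strategy / cum_strategy.sum(axis=1, keepdims=True)

print(solve_modified_rps())   # both rows approach [0.4, 0.4, 0.2]
```

Note that it is the average strategy over iterations that converges, which matches the remark below that the running average is what matters.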
So you can actually use CFR to solve poker; however, you can't solve all of poker, because the game is too big. But you can solve a subgame, provided that you have good value estimates here at the end. Since they use CFR, that leads me to believe they don't use the entire public belief state as an input to CFR, but either sample an info state or actually sample one particular history that happened; that is unclear to me. In any case, they solve the subgame using CFR, and out of that they get a strategy. You ask your solver: what should I do, given my estimates of the values right here? And CFR will say: I know what you should do, here is a strategy, here is a policy.

Now, if this were AlphaZero, if this were fully observable, you would be done; you'd say, okay, cool, that's what I'm going to do. However, what we saw above is that your values down here depend on what comes before, specifically on this strategy. CFR needs some initial strategy, and it outputs a best strategy for the given values; but now that you have a new strategy, these values are no longer valid, and you computed the strategy with the old values. So what you do is plug in the new strategy to compute new values, construct the same subgame with the new values, and use CFR again to solve it, which gives you the next policy for those values; but then the values change again, and so on. This converges eventually, but you have to run a couple of iterations; in fact, I believe it's the running average, the average policy, that converges. So you solve a number of these subgames until you reach the actual best strategy, and you do that down the game tree: from this thing you construct a subgame and solve it one, two, three times, updating the values each time; once you have it, you sample some state further down, and from that you solve the next subgame again, one time, two times, three times, and so on until convergence. This multiple solving of the same subgame is the price we have to pay for solving the game in the discrete representation, because we can't solve it in the belief representation, which is too big; there we would only have to solve once, but here we have to solve multiple times.

So this is the entire algorithm right here. While we are not in a terminal state, we construct a subgame and initialize some policy, and then, for each step, we first set the leaf values. This setting of leaf values is simply a forward pass: if I know the policy, I can set the leaf values using my neural network; my neural network can tell me what the value at each of the leaf nodes is, that's what we train it for. So inside set_leaf_values there is a neural network; you can see this by the fact that there are parameters right here. And then we repeatedly do the following two things: update_policy, which is where we use the solver, CFR, to determine the best policy given the current value estimates; and then set new values given that policy. So CFR takes in the last policy and outputs the next policy, and set_leaf_values takes in these parameters, meaning this is going to be some kind of MLP or neural network.
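A minimal sketch of what such an info-state value network could look like: a featurized public belief state goes in, one value per possible info state comes out, and training is plain regression toward the values the solver produced. The MLP sizes and the flat feature encoding are illustrative guesses, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class InfoStateValueNet(nn.Module):
    """Maps a featurized public belief state to one value per possible info state.
    Layer sizes and the flat feature encoding are illustrative, not the paper's."""
    def __init__(self, pbs_dim, num_infostates, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pbs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_infostates),
        )

    def forward(self, pbs_features):
        return self.net(pbs_features)

# For the toy card game: featurize the public observations plus the 104 belief
# probabilities (the 160-dim input below is an arbitrary illustrative size) and
# output one value per possible private card, i.e. per info state of the acting player.
net = InfoStateValueNet(pbs_dim=160, num_infostates=52)
values = net(torch.randn(1, 160))
# Training target: the values the solver produced at this PBS (bootstrapping), e.g.
# loss = torch.nn.functional.mse_loss(values, solver_values)
```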
We then loop back and do the same thing: solve the game, set new values, solve the game, set new values, and so on. Eventually, by aggregating all of this information, we are able to compute the expected value, and that's going to be the value of the public belief state altogether. And as we said, if we know the value, we can take the best action; in fact, I believe the policy that comes out, this average policy, is the Nash equilibrium, and we can simply sample an action from it.

That's what they describe here: they describe ReBeL assuming the CFR-D (counterfactual regret minimization with decomposition) algorithm is used. This is a depth-limited version of CFR; that's an entire research direction by itself. Counterfactual regret minimization is simply used as the inner solver, kind of a helper function to call, and that thing by itself is an entire, rather complicated algorithm. On each iteration, CFR-D determines a policy profile in the subgame; next, the value of every discrete-representation leaf node is set to this, and this here is the neural network. So we use the neural network to set the leaf-node values of the discrete representation. This means that the value of a leaf node during search is conditional on the policy; thus the leaf-node values change every iteration. Given pi and the leaf-node values, each info state has a well-defined value; this vector of values is stored. Next, CFR-D chooses a new policy profile, and the process repeats for T iterations.

All right, that's the ReBeL algorithm. They also describe how they actually sample data for learning, with exploration, and they show that running Algorithm 1 with T iterations of CFR in each subgame will produce a value approximator that has an error of at most this, for any PBS that could be encountered during play. So the value approximator, given that it is somewhat idealized, will actually converge to a good value approximator, depending on how many iterations of CFR you do; you can see that the more iterations you do, the better an approximation you get, and if you have a good value estimator, as we already said, you have basically solved the game.

The last thing is what to do at test time. You might not have thought of this; it seems sort of obvious if you know AlphaZero, but they determine that at inference time you can simply run this same algorithm, except you don't produce training data from it and you don't learn anything; you simply run the algorithm. If you run that algorithm at test time, it will actually give you a Nash equilibrium. That's Theorem 3: if Algorithm 1 runs at test time with no off-policy exploration, with a value network with error at most this and this, trained as described in Theorem 2 with T iterations of CFR, then the algorithm plays a this-kind-of-approximate Nash equilibrium, where c1 and c2 are game-specific constants. So you can see that how close you get to a Nash equilibrium depends on how many iterations you do and, I believe, on how accurate your neural network is, on the value network's error: if you make that smaller, your Nash equilibrium is going to be better. Pretty cool. So that was the algorithm.
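Putting the pieces together, here is a schematic of the self-play loop as I understand it from the description above (Algorithm 1 in the paper). The `solver`, `subgame`, `pbs` and buffer objects are hypothetical interfaces used purely for illustration; this is not the released implementation, just the shape of the procedure: repeatedly solve a depth-limited subgame in the discrete representation, re-setting leaf values with the value network on every iteration, store the resulting values as training targets, then sample how play continues.

```python
def rebel_self_play_episode(pbs, value_net, policy_buffer, value_buffer,
                            solver, depth, T):
    """Schematic sketch of the self-play loop described above. All interfaces
    (`solver`, `pbs`, the buffers) are hypothetical stand-ins for illustration.

    At every visited public belief state we:
      1. build a depth-limited subgame in the discrete (info-state) representation,
      2. run T iterations in which the equilibrium solver (CFR-D in the paper)
         improves the policy and the value network re-sets the leaf values,
      3. store the resulting root values as training targets for the network,
      4. sample how play continues and repeat until the game ends.
    """
    while not pbs.is_terminal():
        subgame = solver.build_subgame(pbs, depth)
        policy = subgame.uniform_policy()
        subgame.set_leaf_values(policy, value_net)       # neural net at the frontier
        avg_policy = policy
        for t in range(1, T + 1):
            policy = solver.cfr_iteration(subgame, policy)    # update_policy
            subgame.set_leaf_values(policy, value_net)        # values move with policy
            avg_policy = solver.running_average(avg_policy, policy, t)

        value_buffer.add(pbs.features(), subgame.root_infostate_values(avg_policy))
        policy_buffer.add(pbs.features(), avg_policy)

        # Play on: sample an action (with some exploration during training) and
        # update the public beliefs via Bayes' rule, which yields the next PBS.
        action = avg_policy.sample_root_action(explore=True)
        pbs = pbs.transition(avg_policy, action)
```

At test time, as described next, the same loop runs without the buffers and without exploration.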
They do a bunch of experiments where they look at what kind of network they use, whether they use the value net or not, whether they use self-play or not, and they can also introduce a policy net, I believe for initializing or for searching more effectively. They compare against previous systems like DeepStack, Libratus, and so on, and they do beat top humans. Poker was for a long time a kind of not-so-solved game for machine learning, but that era has been over for a while now. And they do release code: I believe they release the ReBeL implementation for Liar's Dice, but not for poker, and that's what they discuss in the broader impact statement. So let's quickly look at the broader impact statement, and then we're done.

I just want to say: I love this broader impact statement. It describes, it praises the paper, so it's kind of more advertisement for the paper, and it does almost no harm to the paper's reputation; but it is actually accurate. This broader impact statement makes tangible predictions, it mostly doesn't go beyond the tangible things you can say about this algorithm, and it has, as its conclusion, an action that they actually take. Further, it is nothing like what the original specification of broader impact statements asks for, and that makes me happy. So, good job on this one. "We believe ReBeL is a major step towards a general equilibrium-finding algorithm", yada yada; they say this is good because many real situations are these kinds of games, especially if you can extend it to multi-agent settings and so on. That's the technology-is-good section, but then the bad section is interesting: the most immediate risk posed by this work is its potential for cheating in recreational games such as poker. While algorithms for this already exist, they explain why this particular algorithm could be used for cheating where the others can't be used so easily. By the way, this algorithm, by the very nature of performing these searches over and over again, needs a lot of compute; the learning isn't the problem, the problem is performing these searches over and over and over again. So it's not super easy to replicate; don't try this at home. However, if they were to release the pre-trained network, that would make it easy, and they also say that if they released the code, that would maybe make it easier to cheat, if you could simply run it; maybe you don't have the hardware, but given massive poker winnings, who knows. Retraining the existing algorithms to account for arbitrary stack and bet sizes requires more computation than is feasible in real time; that's about the other algorithms. However, ReBeL can compute a policy for arbitrary stack sizes and arbitrary bet sizes in seconds; that's at inference time. Partly for this reason, they have decided not to release the code for poker; they instead open-source their implementation for Liar's Dice, a recreational game that is not played competitively by humans. So it's a concrete prediction of the impact of this work, it has a concrete action as its conclusion, and it doesn't dabble in "if we now solve these two-player imperfect-information games, then surely in the future bombs will fly" and stuff like that. Good job on this, again.
All right, so this was the overview of the paper. We started with the notion of info states; info states are kind of like states in classic reinforcement learning, and we determined that we can't really use the AlphaZero way of doing things, because the value of an info state not only depends on downstream things but also on upstream things, which makes the values at the end of the tree non-constant, and that means we can't use that approach, as we saw in the poker example. Then we converted the game from an info-state representation to a public belief state representation, where it's again an everyone-knows-everything game, so we could in principle use the AlphaZero way of doing things. However, since the states and the actions are so large, because they consist of these giant tables of numbers, we can't use AlphaZero for computational reasons. Luckily, they find a way to connect the value function of public belief states to the value functions of info states, and therefore we can use a solver in the classic, discrete representation inside this search procedure, as long as we run it multiple times and keep updating its values. By doing that iteratively at each step, using bootstrapping and self-play between two agents, as we said, this provably converges to a good value function and to a Nash equilibrium. All right, that was the paper. Thanks for listening, I'll see you next time. Bye bye.
Info
Channel: Yannic Kilcher
Views: 24,834
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, poker, deep neural networks, facebook, facebook ai, rebel, holdem, texas holdem, rock paper scissors, liars dice, liar dice, self play, nash equilibrium, alpha go, alphazero, zero sum, policy, cfr, counterfactual regret minimization, tree search, monte carlo tree search, mcts, public belief state, infostate, value function, supergradient, strategy, actor critic, imperfect information
Id: BhUWvQmLzSk
Length: 72min 21sec (4341 seconds)
Published: Wed Dec 16 2020