Decision Transformer: Reinforcement Learning via Sequence Modeling (Research Paper Explained)

Video Statistics and Information

Captions
Hello there! Today we're going to look at "Decision Transformer: Reinforcement Learning via Sequence Modeling" by Lili Chen, Kevin Lu, and others from UC Berkeley, Facebook AI Research, and Google Brain. At a high level, this paper ditches pretty much everything of reinforcement learning in an offline RL setting and substitutes it with simple sequence modeling, using transformers of course. Through that, they achieve some pretty compelling results, at least on the things they test: they're able to keep up with, and be on par with, the current best frameworks for offline reinforcement learning. So we're going to look at what this paper does in terms of sequence modeling and how that looks. The key ingredient, besides the transformer, is that instead of maximizing the reward, we condition on the desired reward, and through that we can influence what the model is going to do in the future. This turns the offline RL problem pretty directly into a sequence modeling problem. I do have a bit of trouble with the paper in various aspects, and I'm sure we'll come to that, so I'm warning you: this might be a bit of a rant mixed with an explanation of the paper. That being said, the paper is pretty cool, so don't get me wrong on that. There is also concurrent work, as I understand it also out of Berkeley, called the Trajectory Transformer ("Reinforcement Learning as One Big Sequence Modeling Problem"), which uses sequence modeling in a somewhat different way: they use the sequence model as a sort of world model and then use beam search to find good trajectories in it. Just from skimming that paper, it might be more of an approach that I would subscribe to, but I guess we'll see what happens going forward. And... wait, why did this show up? "Reinforcement Learning Upside Down" by Schmidhuber; this must have gotten in here by accident, sorry. Let's go back to the paper at hand. They say: we introduce a framework that abstracts reinforcement learning as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the transformer architecture, and associated advances in language modeling such as the GPT line and BERT. In particular, we present the Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches that fit value functions or compute policy gradients, the Decision Transformer simply outputs the optimal actions by leveraging a causally masked transformer. So, as I said, they ditch things like policy gradients and value functions; none of that. We're simply going to do sequence modeling: by conditioning an autoregressive model on the desired return, past states, and actions, the Decision Transformer can generate future actions that achieve the desired return. A key concept here is going to be this desired return; there are multiple ingredients to this paper and a lot to unpack. Lastly, they say it matches or exceeds the performance of state-of-the-art model-free offline RL baselines. Again, this zooms down into a particular setting: we are in the world of model-free, offline reinforcement learning algorithms.
So first of all: what is offline reinforcement learning? It is contrasted with online reinforcement learning. In online RL you have an agent and an environment: the agent gets to perform actions in the environment, and the environment responds with a reward and a state (not really a state but an observation, though it is the state if the environment is not partially observable). The agent actively interacts with the environment, tries things out, and its goal is to maximize that reward. In offline reinforcement learning the situation is different: your agent does not get an environment, it gets a data set. This data set contains lots of experience from other agents, lots of episodes of what happened in the past to some other agent, and purely by observing that other agent you somehow have to learn a good policy that achieves a good reward. This is different because you cannot go out and test your hypotheses in the world; you can't have a good idea and say "I'm going to try that"; you can't do targeted exploration. You simply get to look at a bunch of trajectories and then decide what you want to do. So we need a different set of approaches here, and there are two main ones they compare to. The first they call BC, behavior cloning, where you simply try to mimic the agent you observe in the episodes where it got good rewards: that agent got a good reward, so I'm just going to clone its behavior, hence behavior cloning. I'm butchering the explanation, but roughly that's what it's supposed to do. The other approach treats this as a more traditional reinforcement learning problem and does Q-learning. In Q-learning you are in a state and have, say, three actions at your disposal, and after each action you again have three actions, so you get a tree of possibilities. In the first state you ask your Q-function: how much is this action worth? Maybe the Q-function says five. How much is this one worth? Six. And this one? Four. The Q-function is supposed to tell you: if you take this action, and after that action you follow the policy (that is, after that you again ask the Q-function for the Q-value), what total reward are you going to get? Q-learning is a very classic reinforcement learning algorithm, and you can actually do Q-learning from a data set like this; it doesn't need to be you yourself who collects the experience. That's the thing about Q-learning: it can be done from offline data, unlike policy gradients, where you need some sort of correction and it usually doesn't work completely offline (it might work, I'm not super informed on this). The currently strong baseline is conservative Q-learning, which you're going to see in this paper, and which fixes, let's say, the tendency of these Q-functions in the offline setting to overestimate the Q-value.
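To make that concrete, here is a minimal sketch of tabular Q-learning run purely from a fixed data set of logged transitions, with no environment interaction at all. The transition format, the hyperparameters, and the function name are illustrative assumptions rather than anything from the paper; conservative variants such as CQL additionally penalize Q-values for actions that are rarely seen in the data.

    # Minimal sketch: tabular Q-learning from a fixed list of logged transitions.
    # The (s, a, r, s_next, done) tuple format and hyperparameters are assumptions.
    import numpy as np

    def offline_q_learning(transitions, n_states, n_actions,
                           gamma=0.99, lr=0.1, epochs=20):
        Q = np.zeros((n_states, n_actions))
        for _ in range(epochs):
            for s, a, r, s_next, done in transitions:
                # Bootstrapped target: reward plus discounted value of the best next action.
                target = r if done else r + gamma * Q[s_next].max()
                Q[s, a] += lr * (target - Q[s, a])
        return Q

    # transitions = [(0, 1, 0.0, 1, False), (1, 0, 1.0, 2, True), ...]  # logged by some other agent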
Conservative Q-learning is, roughly, a more pessimistic approach to that overestimation. So these are the two baselines we're going to compare to. You'll notice behavior cloning has some kind of relation to inverse reinforcement learning (not really, but that's one approach), and Q-learning is the other. Here, we're just going to do sequence modeling. So what does this mean? The key concept, as I said, is going to be conditioning on the reward. So much for offline RL. Now, people have pointed out problems with the approach here, and some of those are simply problems of offline reinforcement learning itself. For example: which data set do you use? It turns out that in their experiments they use a benchmark data set generated by a DQN learner, an active reinforcement learner, so naturally you're going to get some good episodes out of that. It's more like learning from expert demonstrations than from random demonstrations. So it's crucially important which data set you use, but that's a property of the offline RL setting itself rather than of this particular algorithm. I just want to point that out: keep in mind that the data set they use for their main experiments comes from a rather high-performing agent in this world. The second thing is their use of a transformer. Is the transformer crucial to this algorithm? The answer is no. This could be any sequence modeling algorithm; transformers are trendy, but it could be an LSTM that does autoregressive sequence modeling. Anything that does autoregressive sequence modeling is suitable for this task, because the core here is that this is a sequence model, not an RL model. In fact, transformers for RL have been a thing for a while: usually people use LSTMs as a backbone for reinforcement learning algorithms, and using transformers has several advantages in offline and online reinforcement learning alike. Usually you have some sort of state, your history of states, actions, and rewards, and an LSTM will take that in: state, action, reward, state, action, reward, and so on, whatever you did in the past, and it will propagate its hidden state through time. (I realize some of you youngsters might not actually know what an LSTM is: it's a recurrent neural network that processes one time step at a time.) At the end you're supposed to output whatever the next action is going to be: you have your history, you output the next action, you get back a state and a reward, and you incorporate those into choosing the following action. You can train this thing in any of the usual ways, say Q-learning or policy gradients; with Q-learning you don't output an action directly but Q-values, which is a minor modification. What you have to do, and that's the difficulty in reinforcement learning in general, is to somehow make a connection between the rewards you get and something that you predicted.
Let's say this action gets you a reward. You predicted an action here and an action there, and just because you got a reward after this action doesn't actually mean that this action was the smart one. In a chess game it's not the last move that is the good move, even though that move gets you all the reward; the crucial move might have happened twenty moves before. The underlying reinforcement learning problem is to assign that reward to whichever action was actually the smart one, so that in the future you take it more often. Maybe this action way back here was the smart one, and you need a way to figure that out. Backpropagation through time can do this, but with an LSTM you need to backpropagate through one, two, maybe three computation steps to reach it; and that's only three steps, so think about what happens if the good action was 50 steps ago, or 500 steps ago. It quickly gets tricky: normally we can only unroll LSTMs like this for, I don't know, not more than a couple of dozen steps. So what people do is use what's called dynamic programming, and that is exactly the thing that the sequence modeling approach here ditches; this is one of the fundamental points. Instead of having to learn only from the reward and assign it to an action, along with the actions you also output a value, and the value tells you roughly how well you are doing. The Q-function is, in a way, already a value, so if you're doing Q-learning you get this automatically. The way you learn this is called temporal difference learning. Say this here is the final stage of the game, where you always get a reward: maybe plus one here, minus five there. Instead of backpropagating only that final reward, at every step you want to predict a value. Obviously the last value is going to equal the reward itself, but earlier on your value is your expected future reward if you take the good actions you are going to take. So here your value might be, say, not negative 4.5, because you're actually probably going to take the action that gives you a good reward, but more like plus 0.9, because you're fairly sure you'll take that good action; and down here, well, you get five reward from going there... no wait, that's the Q-value; here your value would be something like plus 0.7. It doesn't really matter what the numbers are. What matters is that your learning signal no longer comes just from the reward itself: you are trying to predict the reward, but you're also trying to predict the output of your own value function one, two, or three steps into the future. If you've done an episode and at the end you got a reward, your value function could try to just output that final reward, but that's a really noisy signal. So instead you say: well, I have predicted a value here and here and here and here, so why am I not training my value function to also predict those values?
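Written out in standard textbook notation (not notation taken from this paper), the TD(0) value update and the Bellman recurrence for the Q-function that the next paragraph describes look like this:

    V(s_t) \leftarrow V(s_t) + \alpha \big[ r_t + \gamma V(s_{t+1}) - V(s_t) \big]

    Q^\pi(s_t, a_t) = \mathbb{E} \big[ r(s_t, a_t) + \gamma \, Q^\pi(s_{t+1}, a_{t+1}) \big], \qquad a_{t+1} \sim \pi(\cdot \mid s_{t+1})

The value at one step is trained toward the reward plus the value the function itself predicts one step later; this bootstrapping is the dynamic-programming machinery that the Decision Transformer does away with.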
By "predict" I basically mean: if I was at this value and this transition got me some reward, then this value here should equal that value there minus the reward; that's how the value function is supposed to behave. So you're trying to predict the output of your own value function. This also works with the Q-function, and that is the famous Bellman recurrence relation: the Q-function of a state equals the reward you get from performing an action according to the policy in that state, plus the Q-function at the state that you reach, again following the same policy, where the r is the result of performing the action that the policy gives you. This fundamental relation is the basis of Q-learning, and learning it the way I described is called temporal difference learning, or TD. All of this is based on concepts from dynamic programming, and all of this is what we ditch here, which is why it's important to go through it so that you understand what we are not doing. Why do we need all of this, the Q-functions and the temporal difference learning? Because it's really hard to do credit assignment over long stretches of time. We saw that this is the case with an LSTM, especially if we can't backpropagate all the way through it. What does a transformer do? It uses attention to look at a sequence as a whole: through the attention mechanism it can route information from any sequence element to any other sequence element in a single step. So technically it could do this credit assignment in a single step, if (and that's a big if) everything fits into its context. And that, I think, is one of the crucial criticisms of this paper. You can see there's a trade-off: you're able to do the assignment in one step, but as soon as you would like to capture correlations and do credit assignment across spans longer than the context, you need to resort back to something like the dynamic programming approaches that they say they can ditch. When they say the transformer benefits over something like an LSTM, this is the reason: you can do credit assignment in one step across the context. But that statement always carries an "if": if the credit assignment needs to reach further than one context, if the relevant action for the reward is further away, the transformer is out of luck, because it doesn't fit into the context, and we would need to go back to the classic machinery. There is a second reason, though, and that is the sequence modeling approach itself, which I see a little bit as the core of this. The causal transformer, cool, it's a transformer, but we could use any other sequence model. Viewing RL as a sequence modeling problem is the genuinely different thing. So what does this thing do? Here is the history: the returns you got in the past (disregard the little hat on the R for now), the past states, and the past actions, extending back into the past.
This is the input you get, and you would get that in any other reinforcement learning algorithm too; what you additionally get is the current state. The state goes through a little encoder (they use the DQN encoder, a small convolutional neural network), so the model is technically able to handle fairly complex states by encoding them into a latent space. There is no attention within the state itself; the attention really happens over the sequence. Now, from this input, a classic RL algorithm would try to predict an action that maximizes the future reward. What this does differently is the following: instead of giving me an action that maximizes the future reward, I tell the system what reward I would like, and it is supposed to give me an action that achieves exactly the reward I specified. I ask it for a reward and it gives me the action that corresponds to achieving that reward in the future. That is different, and I can still do reward maximization by simply putting a high number there. I want a lot of reward, and 21 is the maximum in Pong, which is the game shown here, so I can say: I want to achieve 21 reward, please give me an action that achieves 21 reward, and that will correspond to getting as much reward as possible. Notice that you do need to know the maximum reward: it doesn't actually work if you just put in a billion billion billion, as their experiments kind of indicate. That's a drawback of this approach. Now, back to that paper that slipped in "by accident": "Reinforcement Learning Upside Down: Don't Predict Rewards, Just Map Them to Actions" by Schmidhuber. They say they transform reinforcement learning into a form of supervised learning (which sounds like offline RL) by turning RL on its head. The memes are strong in this one: Upside-Down RL; I've actually made a video on Upside-Down RL. They say standard RL predicts rewards, while Upside-Down RL instead uses rewards as task-defining inputs, together with representations of time horizon and other computable functions of historic and desired future data, and it learns to interpret these input observations as commands, mapping them to actions through supervised learning on past, possibly accidental, experience. Of course this wasn't in here by accident: I knew this paper, and when I read the Decision Transformer it immediately sprang to mind. Schmidhuber, as I see it, also wasn't entirely the first to do anything like this; we've known about goal-conditioned reinforcement learning for a while, so this isn't necessarily a brand-new idea. They do reference Schmidhuber's paper, very briefly, stating that it's kind of a Markovian approach, whereas here you have non-Markovian, partially observable interfaces. But the advantages that Schmidhuber names are very much the same ones; for example, he repeatedly says you don't need discount factors, and here too they claim no problems with discount factors. So I wanted to point this out, and point out that the paper is at least referenced here.
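Going back to the Decision Transformer itself, here is a minimal, hedged PyTorch-style sketch of how the inputs are laid out: each time step contributes a return-to-go token, a state token, and an action token, the whole sequence runs through a causally masked transformer, and the next action is read off the state token. The layer sizes, the linear state embedding (the paper uses a convolutional DQN-style encoder for Atari frames), and all the names here are my own illustrative choices, not the authors' code.

    # A stripped-down sketch of the Decision Transformer input layout (illustrative only).
    import torch
    import torch.nn as nn

    class MiniDecisionTransformer(nn.Module):
        def __init__(self, state_dim, act_dim, d_model=128, n_layers=3, n_heads=4, max_tokens=90):
            super().__init__()
            self.embed_rtg = nn.Linear(1, d_model)             # return-to-go token
            self.embed_state = nn.Linear(state_dim, d_model)   # stand-in for the conv state encoder
            self.embed_action = nn.Linear(act_dim, d_model)
            self.pos = nn.Embedding(max_tokens, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)
            self.predict_action = nn.Linear(d_model, act_dim)

        def forward(self, rtg, states, actions):
            # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
            B, T, _ = states.shape
            tokens = torch.stack([self.embed_rtg(rtg),
                                  self.embed_state(states),
                                  self.embed_action(actions)], dim=2)
            tokens = tokens.reshape(B, 3 * T, -1)              # (R_1, s_1, a_1, R_2, s_2, a_2, ...)
            tokens = tokens + self.pos(torch.arange(3 * T))
            # Causal mask: every token may only attend to tokens that come before it.
            mask = torch.triu(torch.full((3 * T, 3 * T), float("-inf")), diagonal=1)
            h = self.backbone(tokens, mask=mask)
            return self.predict_action(h[:, 1::3])             # predict a_t from the s_t token

Training is then plain supervised learning: a cross-entropy or mean-squared-error loss between the predicted actions and the actions that were actually logged.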
So essentially you have three components: offline RL, plus a transformer, plus viewing the problem as a sequence modeling problem by conditioning on the reward. Why does it make sense to condition on the desired future reward? Well, first of all, why don't we do that in classic reinforcement learning? Why don't we say "I want to get this reward, please give me the action for it"? Because it's a lot more work. If I just want to maximize my reward, I need a function: here is my state, here is my neural network (maybe it's a policy gradient method), give me an action, and that action is supposed to maximize the reward. Now I need an additional input, the desired reward, and the network doesn't only need to remember what to do to perform well; it needs to be able to distinguish what to do to perform well, what to do to perform a little bit worse, and what to do to perform terribly. That's a lot more stuff for the network to remember. The hope, of course, is that with all the advances we've seen in sequence modeling, these transformers are capable of learning all of those different things; we know transformers are almost unlimited in their capacity to absorb data, so the hope is that these models can learn that. The nice thing about this, though, is that it's a technique that maps naturally onto offline reinforcement learning. Offline RL is in general a harder task than online RL, for the reasons I outlined, but this particular approach lends itself extremely well to it. What do I mean? Take one history from the data set: I was in this state, I performed this action, I got this reward; then I came to this state, performed this action, got this reward, and so on. What Q-learning tries to do is learn the Q-function that takes state and action, conditioned on the history, and predicts the future rewards; it tries to figure out what it would have needed to do, instead of what this agent actually did, in order to achieve higher rewards. It looks at the observed agent critically, as if to say "you probably didn't do that part well", but it has no way to act in the world, no way to go out and try things itself. This approach, instead, simply accepts the history: "oh well, you did these things and you got this reward, okay, cool". And if you know anything about these sequence models and transformers, it's that they can memorize stuff quite well. So going forward, maybe think of what these transformers do as simply memorizing the training data set. I know that's not exactly the case, but suppose they memorize it. Now, if you've memorized the training data set and you're in this situation (you see a history, you see a state, and the human tells you "I would like to get 21 reward"), what the transformer can do is say: okay, let me go into my training data set and find some sequence where the agent had the same kind of history, was also in this state, and also ended up getting about 21 reward out of its future actions. What did that agent do? Well, it did this action. And it's reasonable to assume that if you have the same kind of history and you want the same reward that agent got, you should probably act the same way that agent did.
It is a lot like behavior cloning, though behavior cloning, as I understand it, still focuses on getting high reward: it takes what comes in as expert demonstrations. Here, instead, you just accept the history as it is, and if you're in a new situation, the question you ask the sequence model is essentially: how would a sequence that evolves like this continue in the training data set? It gives you the action taken by agents who were in a similar situation and ended up getting the reward you now want; do the same thing and you'll probably end up in the same place they did. That's the approach. You can see how this is useful, though again only given that we ditch all of the RL mechanics, which they claim as a positive, and certainly it is one: you don't need to parse out what you would have needed to do differently; you simply accept the history and say "I'm going to do the same kind of thing as agents that had the same kind of history and were in the same kind of situation". But now think back to the context-length problem. What if the future reward crucially depends on an action you took way back here? You could have two agents that have the exact same history as far as the context reaches back, but that took different actions before that. The sequence model would have no chance of differentiating between the two; they look identical. Yet one agent ended up with a really good reward and the other with a really bad one; even worse, the data set might not even contain an agent that ended up with the bad reward, whereas had you done Q-learning, you could maybe have figured it out from other trajectories. So as much as they tout the ability to ditch the whole machinery of reinforcement learning, you run into the same problem: none of this alleviates it. If you want to go beyond how far your context reaches, you need the dynamic programming approaches; I don't see a way around it, though maybe I'm terribly wrong. Transformers are good at doing credit assignment over longer distances than LSTMs, certainly, but that's true for online and offline RL alike, whether you do sequence modeling or not; it doesn't alleviate the problem these approaches were trying to solve in the first place. The sequence modeling approach is different, though, and it does bring a different view on the problem, and again, you can take it because there is hope that these transformers can actually absorb that much data and learn from it. And that is essentially already the whole technique; we're not even past the first page, and that's already the method. You get this data, and you can deterministically transform it into the format they want: state, action, and desired future return, the return-to-go. You simply look into the future, which you can do because it's a data set, and calculate what the future reward is from this particular time step onward, so you can easily generate that training data.
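As a minimal sketch (function and field names are my own, not from the paper's code), that deterministic transformation is just a reverse cumulative sum over the logged rewards:

    # Turn a logged episode into Decision Transformer training targets:
    # the return-to-go at step t is the sum of all rewards from t to the end.
    def returns_to_go(rewards):
        rtg, running = [], 0.0
        for r in reversed(rewards):
            running += r
            rtg.append(running)
        return rtg[::-1]

    # Example: rewards [0, 0, 1, 0, 1] give returns-to-go [2, 2, 2, 1, 1].
    # Each time step then contributes a (return-to-go, state, action) triple, and the
    # model is trained to predict the action given everything up to and including
    # that step's return-to-go and state.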
Then you can use classic sequence modeling on that data. Their idea of what happens is encapsulated in a toy example: they consider the task of finding the shortest path on a directed graph, which can be posed as an RL problem. The reward is zero when the agent is at the goal node and negative one otherwise. They train a GPT model to predict the next token in a sequence of returns-to-go (the sum of future rewards), states, and actions; training only on random-walk data, with no expert demonstrations, they can generate optimal trajectories at test time by adding a prior to generate the highest possible returns. They also say "see more details and empirical results in the appendix". I've looked at the appendix: nothing there. I've looked at the code: nothing there. Just saying. I mean, it is a toy example to illustrate the idea, but there's nothing more of this example anywhere. So what they do: there's a graph, there's a goal, and you're supposed to find the shortest path. You just do random walks. Some of these random walks fail, like this one here, so the return there is negative infinity; some of them succeed, and from those you can generate training data. From here, the future return is negative four for this particular random walk; here you start at a different location and it's also negative four, because you're going to take four steps. Now, with the sequence modeling approach, you say: I want to start from this node, but I would like to get a return of negative three, which is a better return than the walk over there got. (By the way, I'm pretty sure this should say negative two to make their example compelling; I think there's a bit of a flaw in this toy example, but I hope you can still see what they're doing.) So you're saying: I would like a high return, a low-magnitude negative return, starting from here, which corresponds to finding a really short path. What the model does is look at its training data and ask: was I in a similar situation at some point? And it finds: yes, actually, here I was in a very similar situation, and there I wanted exactly that return; the history is a bit different, but who cares, I'm at this node as well. What did the agent do that went on to reach exactly the return I want? It took this action, so I'll just take the same action. This simply comes out of the sequence model: it tells you how a sequence that started like this would continue, and that gives you the action. Then it looks at the next step, and here is a bit where the example fails. Each step gives you negative one reward, so at inference time you would do the following: you got negative one from here, so here you now put negative two. At the beginning you have to specify the return you want to get, and from there on you can calculate the return-to-go for each subsequent step yourself.
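At inference time the loop looks roughly like this. It's a hedged sketch: model.next_action and the classic Gym-style env.step signature are placeholders for however you would actually query the trained model and the environment.

    # Roll out the model: condition on a desired return, then after every step
    # subtract the reward actually received from the return-to-go target.
    def rollout(model, env, target_return, max_steps=1000):
        state = env.reset()
        rtgs, states, actions = [target_return], [state], []
        total_reward = 0.0
        for _ in range(max_steps):
            action = model.next_action(rtgs, states, actions)  # conditioned on history and returns-to-go
            state, reward, done, info = env.step(action)
            total_reward += reward
            actions.append(action)
            rtgs.append(rtgs[-1] - reward)   # whatever reward we still want to collect from here on
            states.append(state)
            if done:
                break
        return total_reward

Re-computing the return-to-go after every step is what lets the model adjust when a step did not yield the reward it was hoping for.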
In their figure they need this value to be negative one (so let's imagine that for some reason you had put negative two here; they need it to be negative one because that is what makes their example work). The sequence model says: was I in this situation at some point and got a negative one? Yes, I was here, and what did I do to achieve that? I went there. Okay, I'll go there, and now I'm at the goal, and technically you've found something like the shortest path. But again, the example as drawn doesn't quite work: if you start with negative three, you're going to end up with negative two at this point, which wouldn't match the blue trajectory; it would actually match this other one, so you would not get the shortest path. You should really start out with an oracle telling you that the shortest path costs negative two. That, of course, wouldn't match any example in your training data, but the sequence model could say "well, this is kind of close to that", so the most likely action is still the one right here; you take it, then you're in the negative-one regime, and then you do match this trajectory. I hope you can see how that roughly works out. This also means the approach can handle the case where you don't get the expected reward, which of course can happen, since not everything is deterministic, because you reassess after every step: you ask your training data set again. And this is very much how we think of these big transformer language models: they sort of interpolate the training data set, they stitch together different pieces of it, and you can see that happening right here. Of course, you already saw the flaw: you need to know what reward you would like to achieve. By the way, the LaTeX here is beautiful, isn't it? Maybe that's just my thing; I don't recall the template looking like this. Also, the code is available, and so is pseudocode: big props. Looking at the results, you can see that the Decision Transformer, in blue, lags a bit behind what they call TD learning in Atari (that TD learning is the conservative Q-learning), while in the OpenAI Gym it outperforms it a little bit; the other baseline is behavior cloning, which they term BC. And then there's this key-to-door task that we're going to get into in just a bit. I just want to quickly mention that their primary comparison is this CQL, and that they make a big deal about not needing discount factors. I'm not really sure what they mean there, because there are usually two different discount factors in these algorithms. One of them sits in the objective formulation: here they say we want to maximize the expected return, which is this quantity right here, so you maximize your expected future return over the episode. This is often formulated differently: some people formulate it as the expected future return discounted by a discount factor raised to the power of the time step, so you're essentially saying that future rewards are less valuable than current rewards, which gives you some stability but also makes the agent short-sighted.
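Side by side, the two objective formulations being discussed are (in standard form, not copied from either paper):

    J(\pi) = \mathbb{E}_\pi \Big[ \sum_{t=1}^{T} r_t \Big]
    \qquad \text{versus} \qquad
    J_\gamma(\pi) = \mathbb{E}_\pi \Big[ \sum_{t=1}^{T} \gamma^{\,t-1} r_t \Big], \quad 0 < \gamma \le 1

As the discussion below points out, the Decision Transformer works with the undiscounted sum on the left, while CQL optimizes the discounted sum on the right.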
However, this is a choice, a choice of problem formulation. People train with the discounted objective, maybe for stability reasons, and then still test and report the undiscounted return at the end. I'm just saying it is a choice, and their choice here is different from what CQL does: CQL explicitly maximizes the discounted future return, while they maximize the undiscounted future return. I just want to point out that there is an actual difference there. The other difference is in the TD learning. By the way, if you don't discount your returns, you can get cycling behaviour: if you get zero or positive rewards for certain transitions, say someone is losing a game (the outcome here would be negative one) and the only two options are to lose or to go back, then the agent can just circle forever, because circling doesn't cost anything, whereas going forward would mean losing. Chess has a built-in protection against this, but in other settings the agent will just go in circles. So you usually discount... no, actually, that's not why you discount, sorry, that's a bad example; there you would implement some sort of per-step penalty like minus 0.1. But there are good reasons to discount future rewards, and even with a penalty the agent could still go in circles if it could still win later. In any case, that's one discount factor. The other discount factor is in the TD learning itself. There you say: I'm going to predict this next step, which is probably a pretty accurate prediction, and that reward is quite a good signal given that I am in this step; the next one is maybe a bit noisier, because it's two steps ahead, I could be taking different actions, and the transition could be stochastic. So when I learn my value function from all of these different targets, I'm going to weight this target from the recurrence relation the highest and this one a little bit less: I try harder to match this value given the one reward than to match that value given the two rewards. Both should be accurate (the value should match this reward plus this value, and it should also match these two rewards plus that value), but the second one is less certain. In TD learning this is classically yet another discount factor, lambda, with which you discount future losses. They say "we don't need the discount factor", and I don't know which of the two they're referring to. What I want to point out is that the objective is different; maybe they say "we can get by with this objective", and fine, that's a choice of the modeler, but you do run into problems with some environments if you don't have a discount factor. In any case, you can see in the experiments, this is Atari, that the Decision Transformer outperforms CQL in some respects and trails it in others; also, these standard deviations are quite high. In the OpenAI Gym it looks a bit better: there it does outperform CQL on quite a number of tasks, and with smaller standard deviations. They also compare against a form of behavior cloning where you retroactively train only on the best such-and-such percent of the experience, and they find that if you hit the correct percentage, which is not necessarily just the very best trajectories, behavior cloning can sometimes actually give you better performance.
However, hitting that percentage requires another hyperparameter search, and you, as an oracle, kind of have to go and filter and try things out; you don't know it in advance, so you'd need some sort of validation set, whereas the Decision Transformer is just one run. Throughout all of this they're touting that they don't need as many searches: here, for instance, you'd need to choose that percentage and figure it out. But if you look at their actual hyperparameter configuration down here, they do things like: we have one architecture for these Atari games but a different one for Pong, and one context length for these Atari games but a different one for Pong, because Pong is actually quite a sparse-reward-ish game compared to the others, so they make the context length bigger in order to capture a longer history, because otherwise they couldn't differentiate the agents and they would need to use TD or some kind of dynamic programming after all. And then there's the return-to-go conditioning: how much reward do you ask for? That's a problem too. Here, again, they do something like: look at the baseline, look at how much CQL achieved, and then just choose a multiple of that. You look at your competitor, the thing you're compared against, and base your decisions off of its result. I kind of get it, and this multiplier is also very much informed by them knowing the games: in Pong you can reach at most 21, so they condition on a return of 20; in Seaquest it's, I think, unbounded, so they use 1.5 times the performance of the baseline. I'm not saying these are invalid experiments, but this business of looking at your competitor and then basing crucial hyperparameters off of its performance... I'm sure it would work otherwise too, but just know that you need to have a good idea of what reward you can even achieve and what's possible given your data set. CQL, for its part, also learns from the same data set, and that's sort of how they know what's possible from it. So, is this a problem? Do you need to know the reward, or can't you just put in a hundred billion billion billion? The answer is no, you can't. You can see it right here: this orange line is the highest return that was observed in the data set (this is gamer-normalized, which is why it's not simply 21).
But the experiment here is actually pretty cool: since you're not only maximizing reward, you can ask the model for any return you want. The green line is the return you asked for, and if the blue line, the return you achieved, matches the green line exactly, then the model always gives you the actions that make the requested return happen. You can see that the green line and the blue line match pretty accurately over a long stretch, which means the sequence modeling approach can really give you not just the maximum return but more or less any return, because it remembers all the sequences (though probably not the very lowest ones, since you're learning from a DQN learner that has mostly good trajectories). But you can also see that as soon as you go past the highest observed return, the achieved return not only stops growing, it actually drops down again. And you can see that pattern pretty much anywhere you have an orange line like this: here maybe it stays flat, maybe it drops; here it kind of seems to stay; it's only in Seaquest that it's a bit better, but that's a gamer-normalized score of three, where a gamer would achieve 100, and even there you can see the drop relative to the green line. So you can't just put in a hundred billion: you need to know the return you're going for. Sometimes that's no problem, sometimes it's an actual problem. And that return depends not only on the game but also on how the data set you learn from is structured; you need to know what your agent can achieve. They do some other ablations with respect to context length and find that a larger context length helps: if you don't provide a long context, performance drops. That makes sense, in that the transformer can match the history to observed trajectories better. On the other hand, these Atari games are technically fully observable if you do frame stacking, so an RL agent shouldn't really need more of the past; but RL algorithms do care, they're not perfect. The last thing is the key-to-door task, a toy setting. By the way, again, I did not find this in the appendix and I did not find code for it, so we don't know too much about this experiment, but as far as I understand it there are three rooms: in the first room there's a key, and in the last room there's a door. You're thrown into the first room and get to walk around a bit; then you're thrown into the second room, where you walk around for a variable length of time; then you're thrown into the last room. If you have picked up the key and you reach the door, you get a good reward; otherwise you fail. The middle room is called a distractor, because if you have something like an LSTM, or something like Q-learning, the problem with Q = r + Q' is that it only looks one step ahead: that recurrence relation means that if there is a learning signal somewhere way down the line, you need to propagate (not backpropagate, but propagate over many learning steps) the fact that there is a signal back there, all the way through the intermediate time steps, whereas a transformer can just attend straight back to it in one hop.
So this is an experiment designed to show that this really helps. They analyze what their system says about the expected reward in the future; you can always ask it how probable a given return is. Whenever the agent does not pick up the key, then as soon as it gets into the second room it immediately knows it has lost, no matter what happens in the last room. If it does pick up the key, in these two situations it estimates a future reward of about 0.5, and you can see that this estimate does not degrade across the distractor room, no matter how long the distractor room is. That's the key difference between this and, say, TD-learning or Q-learning approaches: it doesn't forget, because there is no dynamic programming involved. Then, in the last room, if it reaches the door it obviously says that's a high value, and if it doesn't reach the door it changes its mind. Now, I would have liked to see (and this is why I was keen on seeing the parameters of this experiment) whether this whole span is inside or outside the context length of the transformer they used. I'm going to guess it's still inside, because as soon as it's outside the context length, the sequence model has no way of knowing whether that particular agent picked up the key, so it cannot predict anything. I think what they want to show here is that the attention weighs heavily on those frames where the agent picks up the key or reaches the door, which is fine; we can grant that transformers learn that. However, I'd really like to see what happens when you go beyond that, because again, once you go beyond that, you're going to have to revert back to the old methods. So ultimately the transformer gives you a longer context within which you can do one-step credit assignment, but as soon as you exceed it, just as with LSTMs, you need the classic approaches, and I feel the paper is a little bit shady about this: they get a constant-factor longer context with what they're doing, but it doesn't really solve the underlying problem. That's how I see it; I might be wrong, so please tell me if I'm wrong and read the paper for yourself. It is a good paper. I hope we can cover the Trajectory Transformer in the future, and with that, I wish you all the best. Bye-bye.
Info
Channel: Yannic Kilcher
Views: 29,010
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, decisiontransformer, decision transformer, berkeley, uc berkeley, facebook ai language, fair, deep learning tutorial, what is deep learning, introduction to deep learning, transformers for reinforcement learning, transformers for rl, transformer reinforcement learning, sequence modeling, sequence modelling, sequence modeling reinforcement learning, reinforcement learning with transformers
Id: -buULmf7dec
Length: 56min 48sec (3408 seconds)
Published: Sat Jun 05 2021