Josiah Hanna Final PhD Defense

Captions
Good afternoon, and thank you all for coming; I definitely did not reserve enough room for this. I'm really excited to talk with you all about the work I've been doing during my dissertation. In particular, for those of you who were at the thesis proposal, I'm going to focus on the work that has taken place since that proposal.

The field of reinforcement learning has seen many attention-grabbing successes in the past few years; there has been a lot of exciting progress. The first result I'm showing here, superhuman performance at Atari video game playing, was achieved just as I was beginning my PhD. A year or so after that came superhuman performance at the game of Go, and, from UT Austin, simulated robots that could learn to run, kick soccer balls, and acquire all the skills needed to play soccer. There have also been applications beyond games, in industry tasks such as web marketing, control of data centers, and control of home thermostat systems, where reinforcement learning has been shown to be effective.

But if we look a bit more closely at these results, we can see that while reinforcement learning is effective, it is not always efficient. In the Atari video game playing result, the agent had to try 50 million actions before it had a good policy for the task. For the game of Go, it took 21 days of the computer playing millions and millions of games against itself. As one more example, the simulated robots took a year and a half of compute time to learn all the skills necessary to play soccer. These results have raised the question, for many people in academia and industry, of whether reinforcement learning can be data efficient enough for real-world applications.

My dissertation is motivated by this question, and I focused on two limitations that many reinforcement learning algorithms currently have. The first is that many algorithms need to be what we call on-policy. I'm using this term somewhat loosely: in the strictest sense, on-policy means that the algorithm can only use data generated by its current policy or a policy of interest, but in general many algorithms rely on data from a policy that is similar to their current policy even if it is not exactly the same. The other limitation this dissertation addresses is that many reinforcement learning algorithms need to be what we call in-environment, meaning they must learn in their target environment of interest. This problem is especially felt in robotics, where we may have access to good simulators that model the world well, but if we naively apply reinforcement learning in simulation, we cannot expect what we learn there to transfer to the task of interest.

So, toward addressing this big question of making reinforcement learning more data efficient for real-world applications, my dissertation answers the following thesis question: how can a reinforcement learning agent leverage both off-policy and simulated data to evaluate and improve upon the expected performance of a policy? We answer this question by answering four smaller questions: how should a reinforcement learning agent collect off-policy data, how should a reinforcement learning agent weight its collected off-policy data, how can a reinforcement learning agent use simulated data, and finally, how can a reinforcement learning agent combine both simulated and off-policy data?
I'm not going to talk about all of these in depth today; they are covered in the dissertation document, and I'm happy to take questions about them. Before talking about any of them, I want to present a bit of background on reinforcement learning, both as a way for me to state some of the assumptions I make and for members of the audience who are less familiar with the area.

The reinforcement learning model is an agent interacting with an unknown environment. The agent's behavior is captured by what we call its policy, denoted with the symbol pi throughout the talk. In general, pi is a mapping from any state of the world to a probability distribution over the possible actions the agent can take. Given a policy, the agent interacts with the world as follows: the agent begins in some initial state, it takes an action according to its policy, it receives a reward for doing so, and the environment transitions to a new state. This process repeats for a fixed number of time steps; we call this entire sequence of interaction a trajectory. After a trajectory is complete, the agent can go back to an initial state and start the task again.

There are two problems in reinforcement learning that this dissertation looks at. The first, which most people associate with being the ultimate goal of reinforcement learning, is policy improvement: we want to find the policy which maximizes the expected total reward we will receive when we run it in the environment. The second problem, which is at least as important if not more so, is policy value estimation, where we are given a fixed policy and we want to determine the expected cumulative reward we will receive when we run that policy. The reason I think this problem is at least as important as policy improvement is, first, that many policy improvement algorithms rely on good value estimation to do their improvement, and second, that for any application where we are concerned with the safety of deployed policies, it is necessary to be able to quickly and efficiently evaluate what is going to happen when we run a particular policy in the world. Much of my dissertation work focuses on the policy value estimation question, and I want to mention here that I call this problem policy value estimation to distinguish it from the related problem of policy evaluation, where the goal is to learn a value function, which is also commonly studied.

In policy value estimation, more specifically, we are given a fixed policy that we call the evaluation policy, and we want to know the total reward we should expect to receive for a trajectory when we run it. For solutions to policy value estimation, we want estimation techniques which allow us to accurately estimate this value. The objective that we seek to minimize is the mean squared error: we have some estimate of the policy's value, and we want that estimate, in expectation, to be close to the true value of our policy.

One of the simplest ways to tackle policy value estimation is to just run the evaluation policy and see what happens. This is called Monte Carlo value estimation: you run the policy, generate some large number of trajectories, and then just average the returns you see. This gives you an estimate of the policy's value, and as you generate more and more trajectories, your estimate becomes more and more accurate.
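To make this concrete, here is a minimal sketch of Monte Carlo value estimation. The `env.reset`/`env.step` and `pi_e.sample` interfaces are assumptions for illustration, not part of the dissertation.

```python
import numpy as np

def monte_carlo_value_estimate(env, pi_e, num_trajectories, horizon):
    """Average the returns of trajectories generated by running pi_e itself."""
    returns = []
    for _ in range(num_trajectories):
        state = env.reset()
        ret = 0.0
        for _ in range(horizon):
            action = pi_e.sample(state)            # a ~ pi_e(. | s)
            state, reward, done = env.step(action)
            ret += reward
            if done:
                break
        returns.append(ret)
    # By the law of large numbers, this sample average converges to the
    # true value of pi_e as num_trajectories grows.
    return np.mean(returns)
```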
This is a very straightforward approach, but I want to illustrate what it is doing, because it relates to later parts of the talk. We have a true, unknown value that we are trying to estimate, and we have a sample average estimating it. Imagine a toy environment where there are three possible trajectories. Our evaluation policy and the environment in which we run it induce a distribution over these trajectories, so we will see each of them with some probability. The Monte Carlo value estimate samples trajectories from this distribution, and as the number of trajectories increases, we see them in a sample proportion that approximates the true distribution. As the sample size gets very large, the sample proportion converges to the true probability of each trajectory, and our estimate converges to the correct value.

However, the estimate only converges to the correct value if you are sampling trajectories from the evaluation policy in the target environment of interest. One problem we see in reinforcement learning is distribution shift: if we cannot sample from this particular distribution over trajectories, then we cannot do effective policy value estimation. Distribution shift can arise in two ways that we consider in this dissertation. First, our policy can change: if instead of running the evaluation policy we run a different policy, which we call the behavior policy, then we now have off-policy data. Second, our environment can change, for instance when we use a simulator in place of the true environment; then we have simulated data, which also comes from a different trajectory distribution. Returning to the toy example, if we run a different behavior policy, then no matter how many samples we collect, the sample proportion is never going to converge to the distribution of interest, shown here as the red bars; it converges to the behavior policy's distribution instead.

One very simple way to address this problem is importance sampling. With importance sampling, you can run any policy you want to collect your data, but because that policy is different from the evaluation policy you are interested in, you need to reweight the rewards you receive. If you are not already familiar with importance sampling, it is not important to understand exactly how the reweighting is done. Intuitively, what we are saying is that if a certain trajectory is more likely under the evaluation policy than under the behavior policy that generated it, we increase the weight on it, and if it is less likely, we decrease the weight. Once this reweighting has been done, we can average the reweighted returns and use the data as if it had come from the evaluation policy we are interested in. So again, in this toy example, even though the sample proportion under the behavior policy is never going to match the evaluation policy's trajectory probabilities, because we apply these reweighting factors it doesn't matter: we can still get a good estimate of the policy's value.
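Here is a minimal sketch of the ordinary importance sampling estimator, assuming trajectories are stored as lists of (state, action, reward) tuples and that both policies expose a hypothetical `prob(state, action)` method.

```python
import numpy as np

def ordinary_importance_sampling(trajectories, pi_e, pi_b):
    """Importance sampling estimate of the value of pi_e from trajectories
    collected by running a different behavior policy pi_b."""
    estimates = []
    for traj in trajectories:                      # traj: list of (s, a, r)
        weight, ret = 1.0, 0.0
        for (s, a, r) in traj:
            # Likelihood ratio of the trajectory's actions under pi_e vs. pi_b.
            weight *= pi_e.prob(s, a) / pi_b.prob(s, a)
            ret += r
        estimates.append(weight * ret)
    # Unbiased for the value of pi_e, but the product of ratios can have
    # high variance, so many trajectories may be needed for accuracy.
    return np.mean(estimates)
```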
With that background, I'm now going to dive into the contributions of this dissertation; for most of them I'm going to speak relatively briefly. I'm going to start with the question of how a reinforcement learning agent should collect data for off-policy policy value estimation. I talked about this a lot at the time of my thesis proposal. At a more technical level, what we mean is: how do we choose the behavior policy for importance sampling policy value estimation? The choice of behavior policy is critical, as it determines the variance of the estimate. Before this work, the prevailing wisdom would have been that using the evaluation policy as the behavior policy is the optimal thing to do; the dissertation discusses how this is not in fact the case. The first contribution of this dissertation is to formulate what we call the behavior policy search problem, which searches for a behavior policy that makes importance sampling perform better, and to introduce the behavior policy gradient algorithm as the first solution algorithm for this problem. Since the thesis proposal, we have also done an initial study applying the behavior policy gradient algorithm to the problem of policy improvement; I'm not going to talk about that today, but I'd be happy to say more about it.
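For intuition, here is a rough sketch of the kind of gradient such a method can follow. This is my own derivation under the stated assumptions and notation, not necessarily the exact form used in the dissertation; it assumes the importance sampling estimator is unbiased, so its mean squared error equals its variance.

```latex
% H is a trajectory of length L, g(H) its return, \pi_\theta the behavior
% policy being adapted, and \pi_e the fixed evaluation policy.
\frac{\partial}{\partial\theta}\,
  \mathrm{MSE}\!\left[\widehat{v}_{\mathrm{IS}}(\theta)\right]
  = \mathbb{E}_{H\sim\pi_\theta}\!\left[
      -\,\mathrm{IS}(H,\theta)^{2}
      \sum_{t=0}^{L-1}\frac{\partial}{\partial\theta}
        \log\pi_\theta(A_t\mid S_t)\right],
\qquad
\mathrm{IS}(H,\theta) = g(H)\prod_{t=0}^{L-1}
  \frac{\pi_e(A_t\mid S_t)}{\pi_\theta(A_t\mid S_t)}.
```

A sample-based estimate of this gradient can be computed from trajectories generated by the current behavior policy and used to take a descent step that lowers the variance of subsequent importance sampling estimates.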
The main takeaway for the question of how to collect off-policy data is that being off-policy can actually improve upon being on-policy if you choose the behavior policy correctly, and this dissertation has introduced a method that finds a better behavior policy for importance sampling. That is all I'm going to say for now about collecting off-policy data.

I'm going to spend most of my time today on how a reinforcement learning agent should weight off-policy experience. The question of how to weight off-policy data amounts to: how do we correct for the distribution shift we see when we run a different policy than the evaluation policy we are interested in? Importance sampling is the standard technique for doing this, so in some sense we are asking whether there is a better method than importance sampling for correcting distribution shift. The answer is yes, and we show this in several contributions of this dissertation. First, we introduce a family of estimators that we call regression importance sampling estimators and show that they improve on the standard importance sampling estimator I introduced a few slides ago. This improvement is based on the principle of correcting what we call sampling error. We then take that same principle and apply it to policy improvement: we introduce what we call the sampling error corrected policy gradient estimator and use it to speed up policy gradient reinforcement learning.

Before getting into these contributions, I want to take you back to proposal time and talk about what I discussed there along this direction. The problem I proposed to tackle was importance sampling with an unknown behavior policy. Notice that if you do not know the behavior policy that generated your data, there is no way to compute the importance weights, because you need the denominator to compute the ratio. This is a really unfortunate limitation, particularly if you are working with data generated by a human and you cannot just write down the PDF that was in the person's head when they generated the data, or if your data came from some other source where you do not know the behavior policy. A really straightforward baseline for attacking this problem is maximum likelihood behavior policy estimation: you look at your data, you estimate the policy that looks most likely to have generated it (essentially a supervised learning problem), and then you plug the estimated behavior policy in as the denominator of your importance sampling ratio. At the time of the proposal, I wanted to study how much worse it was to use the estimated behavior policy than the true behavior policy, hoping perhaps to derive theoretical bounds on how bad it could be and to give practical advice on how best to estimate the behavior policy in practice. It turns out this framing was incorrect: in fact, estimating the behavior policy is preferable to using the true behavior policy for importance sampling. I'm going to show one empirical result of this before talking more about our contributions. This is from OpenAI's Roboschool Hopper domain, a reinforcement learning benchmark for continuous control, where the y-axis shows mean squared error, so lower is better. Using the true behavior policy for off-policy value estimation does this well, but if you replace the true behavior policy with its empirical estimate, you can do much better.

We are not the first to note that using an estimated behavior policy can improve on using the true one, and some of this work was discussed at proposal time as well: in the causal inference and bandit literatures it has been shown that what are called estimated propensity scores can improve importance sampling, and in work parallel to our own on contextual bandits it has been shown that it can be preferable to use an estimated behavior policy. Our work stands out in that we are the first to show that using an estimated behavior policy improves estimation in multi-step environments such as Markov decision processes.

To recall the problem we are tackling, policy value estimation: we are given a fixed amount of data in the form of state-action-reward trajectories of L time steps each, we are given a fixed evaluation policy, and we want to estimate the expected sum of rewards we would get if we ran this policy in our target environment. As I mentioned earlier, ordinary importance sampling is commonly used to tackle this problem, and what is important to see is that it reweights returns in a way that corrects from the distribution of trajectories seen when running the behavior policy to the distribution of trajectories seen when running the evaluation policy. This is a nice correction with nice properties: it is an unbiased estimator of the value of the policy. The downside is that it may take a long time to converge, meaning the number of trajectories we need to observe can be very high before we get a good estimate of policy value. Our proposed family of estimators is a relatively simple change to ordinary importance sampling: we take the denominator of the importance sampling ratio and replace the true behavior policy with a maximum likelihood estimate of the behavior policy, which we call the empirical policy. One thing I want to point out now, although I will not discuss it in detail until a bit later, is that we are introducing a family of estimators, where the family is defined by a parameter n that controls how much past history we condition on when estimating the behavior policy.
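Here is a minimal sketch of this regression importance sampling idea for discrete states and actions, reusing the trajectory format from the earlier sketch; the count-based maximum likelihood estimate plays the role of the empirical policy.

```python
from collections import defaultdict
import numpy as np

def estimate_behavior_policy(trajectories):
    """Count-based maximum likelihood estimate of the behavior policy
    (the 'empirical policy') for discrete states and actions."""
    sa_counts = defaultdict(int)
    s_counts = defaultdict(int)
    for traj in trajectories:
        for (s, a, _r) in traj:
            sa_counts[(s, a)] += 1
            s_counts[s] += 1
    return lambda s, a: sa_counts[(s, a)] / s_counts[s]

def regression_importance_sampling(trajectories, pi_e):
    """RIS estimate of the value of pi_e: ordinary importance sampling with
    the estimated behavior policy in the denominator instead of the true one."""
    pi_hat = estimate_behavior_policy(trajectories)
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for (s, a, r) in traj:
            weight *= pi_e.prob(s, a) / pi_hat(s, a)
            ret += r
        estimates.append(weight * ret)
    return np.mean(estimates)
```

This sketch corresponds to the n = 0 member of the family, an estimate conditioned only on the immediately preceding state; larger n would key the counts on the preceding n states and actions as well.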
When n is equal to zero, we estimate what is called a Markovian behavior policy, which conditions only on the immediately preceding state; as n increases, we condition on more past states and actions. What is important here is that by using the empirical policy in the denominator, we are no longer correcting from the distribution of the behavior policy to the distribution of the evaluation policy; we are instead correcting from the empirical distribution of our samples to the evaluation policy's distribution. Let me talk about what that means.

Returning to the toy example I introduced earlier: we have three possible trajectories, and we are sampling trajectories by following the behavior policy, so according to the blue bars, but we want to evaluate them as if the trajectories had come from the red bars. Ordinary importance sampling is effectively correcting from the distribution of the blue bars to the red bars, but when we replace the true behavior policy with the empirical policy, we are now correcting from the sample proportion toward the evaluation policy. So for any fixed amount of data, no matter what the sample proportion is, we can correct exactly back to the evaluation policy of interest. We call the difference between the sample proportion and the distribution we are sampling from sampling error, and no matter how much sampling error there is, regression importance sampling can correct for it to obtain a more accurate estimate, whereas ordinary importance sampling needs to wait until it actually observes samples at their expected proportions.
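To make the sampling error argument concrete, here is a small numeric illustration in the spirit of the three-trajectory toy example; the probabilities and returns are made up for illustration.

```python
import numpy as np
rng = np.random.default_rng(0)

p_b = np.array([0.6, 0.3, 0.1])    # behavior policy's trajectory distribution
p_e = np.array([0.2, 0.3, 0.5])    # evaluation policy's trajectory distribution
returns = np.array([1.0, 2.0, 10.0])
true_value = np.dot(p_e, returns)              # what we want to estimate

sample = rng.choice(3, size=50, p=p_b)         # indices of sampled trajectories
counts = np.bincount(sample, minlength=3)
p_hat = counts / counts.sum()                  # empirical (sample) proportions

# Ordinary IS: reweight by p_e / p_b; unbiased, but still affected by the
# gap between p_hat and p_b (the sampling error).
ois = np.mean(returns[sample] * p_e[sample] / p_b[sample])

# RIS-style correction: reweight by p_e / p_hat; in this toy problem the
# weighted average recovers the true value exactly for any sample that
# contains each trajectory at least once.
ris = np.mean(returns[sample] * p_e[sample] / p_hat[sample])

print(true_value, ois, ris)
```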
Now I'm going to show some empirical results of how using the estimated behavior policy improves performance. This is on a grid world domain with discrete states and actions; the y-axis is mean squared error, so lower is better, and the x-axis is how much data we have collected, so we want to be as low as possible with as little data as possible. Using the true behavior policy with ordinary importance sampling, we do this well, but if we replace the true behavior policy with its estimate, we do better. There are other ways to improve ordinary importance sampling, but those techniques are more or less orthogonal to the work we are doing: weighted importance sampling is one alternative, but if you do it with an estimated behavior policy it improves even further, and the same goes for the state-of-the-art weighted doubly robust estimator, which was introduced a couple of years ago. This is not just for discrete states and actions: here is a continuous state and action environment where we use linear regression to estimate the behavior policy. Here is the result using the true behavior policy, and its estimate improves on it; the same goes for per-decision importance sampling and, to a lesser extent, weighted importance sampling.

The next empirical result I'm going to talk about briefly relates to the fact that we are introducing a family of estimators, where the family is defined by how much history we condition on when estimating the empirical policy. As it turns out, conditioning on more history makes our estimator closer to a method from another literature which has some nice theoretical properties. If we use the true behavior policy, we do this well; if we use an estimate that conditions only on the preceding state, we do better; and as we condition on more history, the estimators do worse when there is very little data but asymptotically converge to an estimate of the policy's value that is orders of magnitude better. What is really interesting is that there is a method from the bandit literature that in many realistic settings cannot be implemented for Markov decision processes, but for this very simple test environment we could implement it, and we show that our methods are actually approximating this method, which along some axes has been shown to be theoretically optimal.

The next thing I want to point out is that this is not just for off-policy data: if we make the behavior policy and the evaluation policy the same, we see the same improvement. The reason is that any method which performs Monte Carlo sampling in the action space of the policy can suffer from the same phenomenon, namely that the empirical distribution of actions taken does not match their expected distribution, and if we know the desired action probabilities, as we do when we know the evaluation policy, we can correct for this. This naturally raises the question of whether there are other reinforcement learning algorithms that can be improved by correcting sampling error, which leads to the fourth contribution of my dissertation: the sampling error corrected policy gradient estimator. I'm going to show how this can improve policy improvement algorithms.

Let me give some quick background on policy gradient reinforcement learning. Intuitively, imagine the space of all possible policies that our agent could run along the x-axis of this plot, with the y-axis showing the policy value, that is, how well each policy performs; this curve is the value of each policy. I'm expressing the value slightly differently from, but equivalently to, what I introduced earlier: it is an expectation over states from the policy's distribution and actions taken from the policy, where Q(s, a) says how much reward we should expect to receive after taking action a in state s. Since we have an objective we want to maximize, a really simple approach is to compute the gradient of the objective with respect to the policy parameters, which I've denoted theta, and take a step along that gradient, moving toward an improved policy which will give us more reward. Of course, in practice we typically cannot analytically compute this gradient, so it has to be estimated. Typical approaches to this estimation use Monte Carlo sampling: they run the policy, collect a batch of state-action pairs, and form the estimate by averaging what they see. This approach is fine in the sense that if the sample size is large enough we will converge to the right answer and get a good direction for updating the policy; the problem, of course, arises when we do not have a large amount of data.
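For reference, here is a generic REINFORCE-style Monte Carlo sketch of that gradient estimate. The `grad_log_prob` method and the per-sample Q estimates (for example, observed returns-to-go) are assumed interfaces for illustration.

```python
import numpy as np

def monte_carlo_policy_gradient(batch, policy):
    """Monte Carlo estimate of the policy gradient from a batch of
    (state, action, q_estimate) triples gathered by running the policy.

    policy.grad_log_prob(s, a) returns the gradient of log pi_theta(a | s)
    with respect to the policy parameters theta.
    """
    grads = [q * policy.grad_log_prob(s, a) for (s, a, q) in batch]
    # Both the state visitation frequencies and the action probabilities
    # are approximated by empirical frequencies in the batch, even though
    # the action probabilities under pi_theta are actually known.
    return np.mean(grads, axis=0)
```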
Before going on, I want to expand the definition of what we are actually trying to estimate here, as it gives some intuition for what we do next. Expanding the gradient using the definition of expectation, we have the inner gradient terms, each weighted by the probability of being in a state when following our policy and by the probability of taking an action when in that particular state. The probability of being in a particular state is unknown to us, but the probability of taking an action when following the policy is known. The point I want to make is that Monte Carlo style approximations to the gradient use sampling to approximate both of these probabilities, even though only the first one is actually unknown.

Before getting to our contribution, note that this gradient estimate gets plugged into an algorithm, so let me give a simple framework for policy gradient RL. Most algorithms look something like the following: execute the current policy for some number of steps, update the policy using the Monte Carlo average estimate of the gradient, then throw away the observed data and repeat. This is a simplification in some respects, but it is reflective of many state-of-the-art reinforcement learning algorithms, such as proximal policy optimization. The reliance of policy gradient RL on Monte Carlo sampling means that the algorithm may suffer from what we call sampling error; this can lead to inaccurate gradient estimation and thus slower learning. One way to view this is that our observed data may appear as if it came from a different policy than the policy we are actually interested in. Consider a robot that has a choice of two actions, each with some probability, but takes only a finite number of actions: the proportion with which it takes each action will in general not match the probability of taking each action, which means that for a finite amount of data it may appear as if the data had been generated by a different policy. Of course, with enough experience these sample proportions will eventually converge to the true probabilities, but for a finite amount of data they will differ, and we will have inaccurate gradient estimates.

We take this view and use it to correct sampling error. We first pretend that our data was generated by the policy which most closely matches the observed data; we can determine this policy with maximum likelihood estimation, finding the policy which looks most likely to have generated the data, and in many reinforcement learning problems this is essentially a supervised learning problem. Once we have recovered this empirical policy, we can use it to correct the weighting on every state-action pair so that it better matches the weighting we would expect under the policy that actually took the actions. This gives the sampling error corrected policy gradient estimator: it is essentially the same as the Monte Carlo policy gradient estimator, just with an added importance sampling correction that corrects from the empirical policy, the way our samples actually fell, to the current policy that we wanted the samples to come from. We then plug this into a new policy gradient algorithm with essentially the same framework as before: run the current policy for some number of steps, estimate the empirical policy with maximum likelihood estimation (essentially solving a supervised learning problem), update the policy using the new sampling error corrected policy gradient estimate, and as before, throw away the data and repeat.
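Here is a minimal sketch of that corrected estimator, reusing the interfaces assumed in the earlier Monte Carlo sketch; it is an illustration of the idea rather than the dissertation's exact implementation.

```python
import numpy as np

def sec_policy_gradient(batch, policy, empirical_policy):
    """Sampling-error-corrected policy gradient estimate (sketch).

    Identical to the Monte Carlo estimate except that each term is
    importance weighted from the empirical policy (the maximum likelihood
    fit to the observed actions) to the current policy, so the batch is
    reweighted as if actions had occurred at their expected frequencies.
    """
    grads = []
    for (s, a, q) in batch:
        w = policy.prob(s, a) / empirical_policy.prob(s, a)
        grads.append(w * q * policy.grad_log_prob(s, a))
    return np.mean(grads, axis=0)
```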
I'm going to talk about two empirical results which I think illustrate this work. First, I want to show how sampling error relates to learning and how correcting it leads to a better learner. This is on the same grid world domain with discrete states and actions. The plot I'm showing measures sampling error throughout learning: on the y-axis, higher means more sampling error and lower means less, so ideally we want to be low, and the x-axis is how long we have been learning. If you run a standard policy gradient method, you start with some sampling error, it tends to increase throughout learning, and then, as the policy converges to an optimal one, the sampling error decreases to close to zero. Now, relating this to policy improvement: this plot shows average return on the y-axis, so higher is better; this is what we are trying to maximize. A Monte Carlo policy gradient method does this well, but if we correct sampling error, we do better, and what I find interesting is that we see the most improvement at the point in learning where the sampling error is highest.

The next thing I want to discuss briefly is that probably the biggest limitation of this approach is that we need to solve a supervised learning problem in the inner loop of a reinforcement learning algorithm. This has two disadvantages: we are adding computational cost to the algorithm, and we are potentially adding many more hyperparameters, whose right values might change from iteration to iteration. One way to address this is to estimate the empirical policy using a simpler policy representation than the target policy which actually generated the data. We do this in a cart-pole domain with continuous states and discrete actions, where we train a neural network policy to maximize performance. I'm showing a diagram of the neural network we use; this one is a little smaller than the one used in the experiments I report. We train the black connections of this network using policy gradient reinforcement learning, but the red connections, which condition on an intermediate representation of the policy, are trained only to estimate the probabilities of the actions we have observed. This is a much simpler supervised learning problem, so it gives our algorithm less of a computational burden. In practice, this plot again shows average return on the y-axis, with the x-axis being how long we have spent learning. If we use a Monte Carlo style approach, this is how we learn; if we naively train a whole separate neural network to approximate the empirical policy, we do a little, though not significantly, better in terms of how fast we learn; but if we use a much simpler form for the empirical policy, we do much better and learn more quickly and more robustly.
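Here is one way such a lightweight empirical-policy head could look; the layer sizes, module names, and training loop are illustrative assumptions rather than the architecture used in the dissertation's experiments.

```python
import torch
import torch.nn as nn

class PolicyWithEmpiricalHead(nn.Module):
    """Sketch: a policy network plus a small 'empirical policy' head.

    The trunk and policy head (the 'black connections') are trained with the
    policy gradient; the extra head (the 'red connections') reads the trunk's
    intermediate features and is fit by maximum likelihood on the actions
    observed in the current batch only.
    """
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.empirical_head = nn.Linear(hidden, n_actions)

    def policy_logits(self, obs):
        return self.policy_head(self.trunk(obs))

    def empirical_logits(self, obs):
        # detach() so fitting the empirical head never changes the policy.
        return self.empirical_head(self.trunk(obs).detach())

def fit_empirical_head(model, states, actions, steps=50, lr=1e-2):
    """Supervised maximum likelihood fit of the empirical head only."""
    opt = torch.optim.Adam(model.empirical_head.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model.empirical_logits(states), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```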
The idea of correcting sampling error has been looked at in other ways in the reinforcement learning literature. Probably the most closely related approaches are what we can call expectation methods, such as Expected SARSA, tree-backup methods, and expected policy gradient: these methods try to sum directly over all of the actions to avoid sampling in the action space. There is also work outside of reinforcement learning, in bandit algorithms, which addresses the fact that Monte Carlo methods may suffer from sampling error and should be adjusted.

To conclude this part of the talk: my dissertation makes two contributions toward answering the question of how a reinforcement learning agent should weight off-policy data. Contribution three of the dissertation is a family of regression importance sampling estimators that estimate the behavior policy instead of using the true one for importance sampling. Contribution four is the sampling error corrected policy gradient estimator, which improves policy gradient reinforcement learning. There are additional results in the dissertation, both establishing the theoretical advantages of these methods and providing additional empirical results. In terms of answering the question of how a reinforcement learning agent should weight off-policy data, the main takeaway from this part of the talk is that it is better to estimate the behavior policy than to use the true behavior policy, because doing so corrects sampling error in both policy value and policy gradient estimates.

There are two other questions that this dissertation looked at; I'm going to talk more briefly about each, and then I'm going to conclude by discussing some directions for future work. The other problem my dissertation considered was how to use simulated data. I first want to point out that the problem of using simulated data, although it has traditionally been studied separately from the problem of off-policy reinforcement learning, is actually a related problem: in both cases we are dealing with distribution shift in the trajectories we observe. In off-policy learning, the shift is due to the policy which generates the data changing; when we use a simulator, it is due to the environment changing. Unfortunately, for the environments of interest we do not know the state transition probabilities, so the importance sampling approaches described earlier cannot be applied here, and we need a different class of approaches. The algorithm introduced in my dissertation addresses this issue by using small amounts of real-world data to correct the simulation and produce, in effect, a better simulator.

This work builds upon earlier work done at UT Austin, the grounded simulation learning framework. The basic idea is that if you can collect a small amount of real-world data, you can use that data to ground the simulator, meaning make it more realistic, so that it produces trajectories similar to what you observe in the real world. Once the simulator is grounded, you can apply reinforcement learning directly in simulation and expect the improved policy to also improve performance when it is taken back to the real world, whereas without the grounding you do not expect that transfer. This dissertation introduces the grounded action transformation algorithm, which makes simulation more realistic by changing the actions passed into the simulation. First of all, I want to note that we are not going to try to make every aspect of the simulation more realistic; if we were doing that, we would essentially be in the setting of model-based reinforcement learning, trying to build a simulator of the world from scratch. Instead, we try to make only certain important aspects of the simulation more realistic. In the case of the experiments I'm going to talk about today, dealing with robot control, where we are trying to get robots to learn different tasks, we focus on making the joints of the robots move more realistically. You can think of it this way: the policy sends commands to each of the joints of the robot, and these go into the simulator.
The simulator updates many different state variables, including the joint positions of the robot, and these go back to the policy. So if we want to make things look more realistic to the policy, something has to change between the point where the policy sends joint commands to the simulator and the point where it gets the new joint positions back, and our algorithm makes things more realistic by modifying the joint commands that are passed into the simulation.

I'm only going to describe the algorithm at a high level, as I discussed it in more depth at the proposal. At a high level, the grounded action transformation algorithm augments the simulation with what we call a grounding module. This grounding module takes the commands that the policy sends and modifies them in some way such that the joint positions the simulation produces at the next time step are more realistic. We show how the grounding module can be learned using two different supervised learning problems: we use a small amount of real-world data to learn to predict what would happen in the real world as an effect of the policy's actions, and then we just need to choose the action in simulation that produces the same effect. That is the high-level sketch of the algorithm.

At the time of the thesis proposal, we had already applied this to the task of walking on the NAO robot. We started with an initial hand-coded walk that had been developed for robot soccer; this was a state-of-the-art walk, and as far as we know it was the fastest anyone had gotten this robot to go before. We then applied reinforcement learning with our algorithm entirely in simulation and ended up with the fastest walk that anyone had ever achieved on this robot. That was at proposal time; since then, we have done a couple of additional studies. We looked at the task of getting the NAO robot to learn to move its arms to target positions. We used a simulator called Gazebo as a surrogate for the real world and learned in a different simulator, using that one as the simulation. On this task I'm showing the final policy performance: the y-axis is how close the robot learned to move its arms to the target positions, so lower is better, and we see that by grounding the simulation we do a lot better than without the grounding. Finally, I showed earlier that we were able to transfer a new walking controller to the robot; since the proposal we have also transferred a new type of walking controller. I'm not going to go into the details of this walk, but effectively we use a simple linear function to generate a walking pattern on the robot. If you just optimize this linear controller directly in simulation, the robot is not able to figure out how to get its feet off the ground and move forward, but once we ground the simulation, the robot is able to make progress.

In terms of using simulated data, then, the fifth contribution of my dissertation is the grounded action transformation algorithm, which allows a reinforcement learning agent to learn from simulated data. The dissertation also contains a theoretical result on learning with an inaccurate model, as well as additional empirical results with the grounded action transformation algorithm. The main takeaway, in terms of how a reinforcement learning agent can use simulated data, is that simulated data can be used effectively if we first ground the simulation using a small amount of real-world data, which leads to more accurate simulation, and we can do this by modifying the actions that are passed into the simulation.
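Here is a minimal sketch of such a grounding module; the `forward_model`, `inverse_model`, and `sim.step` interfaces are assumptions for illustration, and their exact form (for example, neural networks fit by supervised regression) is an implementation choice not fixed by the dissertation text quoted here.

```python
class GroundingModule:
    """Sketch of a grounded-action-transformation-style grounding module.

    forward_model(state, action) predicts the next joint positions the real
    robot would reach for the policy's action (fit on a small amount of
    real-world data); inverse_model(state, desired_next) returns the
    simulator action that produces those joint positions in simulation.
    """
    def __init__(self, forward_model, inverse_model):
        self.forward_model = forward_model
        self.inverse_model = inverse_model

    def ground(self, state, policy_action):
        # What would this action do on the real robot?
        predicted_real_next = self.forward_model(state, policy_action)
        # Which simulator action reproduces that effect in simulation?
        return self.inverse_model(state, predicted_real_next)

def step_grounded_sim(sim, grounding, state, policy_action):
    """Replace the policy's action with its grounded counterpart before
    stepping the simulator, so simulated joint movement better matches
    the real robot."""
    grounded_action = grounding.ground(state, policy_action)
    return sim.step(grounded_action)
```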
I'm going to talk even more briefly about the final question that this dissertation addressed; I'm mainly going to tell you what the contributions were. In terms of using both simulated and off-policy data, we looked at the problem of how to place a confidence bound on the performance of an untested policy. We introduced two new algorithms which rely on a technique called statistical bootstrapping to construct these confidence intervals, and one of these algorithms makes use of both off-policy and simulated data through a technique called control variates. The main takeaway from these contributions is that by using statistical bootstrapping and control variates, we can tighten the confidence intervals for policy value estimation compared to previous methods.
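For reference, here is a generic percentile-bootstrap sketch of a lower confidence bound on a policy's value; it is not the dissertation's exact procedure, which additionally exploits simulated data through control variates.

```python
import numpy as np

def bootstrap_lower_bound(per_trajectory_estimates, confidence=0.95,
                          num_resamples=2000, seed=0):
    """Percentile-bootstrap lower confidence bound on a policy's value.

    per_trajectory_estimates are per-trajectory value estimates, for
    example importance-weighted returns from off-policy data.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(per_trajectory_estimates)
    means = []
    for _ in range(num_resamples):
        resample = rng.choice(data, size=len(data), replace=True)
        means.append(resample.mean())
    # The (1 - confidence) percentile of the resampled means serves as a
    # high-confidence lower bound on the policy's value.
    return np.percentile(means, 100 * (1 - confidence))
```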
So, to summarize: this dissertation addressed the question of how a reinforcement learning agent can leverage both off-policy and simulated data to evaluate and improve upon the expected performance of a policy. We addressed this question by answering four smaller questions, and the answers are as follows. Off-policy learning can be preferable to on-policy learning if the behavior policy is chosen correctly. It is necessary to correct both distribution shift and sampling error in off-policy learning, and one way to do that is to estimate the behavior policy instead of using the true one for importance sampling. Simulated data can be used if we first ground the simulator using a small amount of real-world data. And finally, combining simulated and off-policy data can tighten confidence intervals for high-confidence off-policy value estimation.

Before wrapping up, I want to talk briefly about three future directions for research coming out of the work I've talked about today; there are many more in the dissertation document that I'd be happy to discuss. The first is that the work I've done with simulated data has assumed that the simulator models the target environment pretty well. What if the simulator is only a very abstract model of the target environment? Consider wanting these simulated robots to learn to coordinate and play soccer together, but our simulator for the task is just a very abstract model of playing soccer, like these 2D robots passing a ball. Is it possible to use small amounts of experience from the true environment of interest to ground such an abstract model, and does doing so benefit learning in the target environment?

The next direction, which I find really interesting, is how to collect data for the regression importance sampling estimator that I introduced; I wasn't able to go into much detail about that in the talk today. We introduced a problem called behavior policy search: how do you select the behavior policy for importance sampling? The answer was that you basically want to increase the probability of seeing trajectories which are rare under the evaluation policy and lead to high-magnitude sums of rewards. So the question is: is the better behavior policy for ordinary importance sampling also the better behavior policy for regression importance sampling? I would argue that it likely is not. I don't have time to talk about the preliminary empirical results on this, but there is some discussion in the dissertation document; what I believe is that a good behavior policy for regression importance sampling needs to actually experience new trajectories that it has not yet seen.

The final direction for future work that I want to mention is how we can apply what we've done for policy value estimation to the problem of policy evaluation, that is, learning a value function. A value function, for those of you who may be unfamiliar, tells us for any state how much cumulative reward we should expect to receive when we run our fixed evaluation policy from that state. Value function methods typically learn in an incremental fashion, moving toward an estimate of the value of the policy at a particular state; I'm showing that as the target, U_t, here. One definition of the target U_t would include an off-policy correction, an importance weight, if we were generating the data in an off-policy fashion. So, given the contributions of this dissertation, is there a way they can improve the use of importance sampling for value function learning? For collecting data: what is the optimal behavior policy when the value function is changing during learning, or when we have several different evaluation policies that we would like to evaluate? And for weighting data: how can we estimate the behavior policy when doing online value function learning, and more importantly, how do we correct sampling error in value function learning?

Finally, I want to thank my collaborators, in particular those who have been very involved with the work that went into my dissertation. Of course, Peter has been involved with everything, Scott has been involved with a lot of this work, and Phil Thomas and Chungu were really involved with the work on behavior policy search. With that, thank you all for your attention and for coming to the defense; I'm now happy to take questions.
Info
Channel: Josiah Hanna
Views: 88
Rating: 5 out of 5
Id: yhp8rlw7Frc
Length: 45min 34sec (2734 seconds)
Published: Fri Dec 27 2019