Stanford Seminar - Human in the Loop Reinforcement Learning

Captions
Thanks, Michael. I'm super excited to be here, and I'm happy to make this interactive. My own background is probably quite different from some of yours in the audience, in the sense that I primarily do machine learning and artificial intelligence, so if I use a term you're not familiar with, feel free to raise your hand; or if you think "I know what that is, but she seems to mean something very different from what it sounds like," just let me know.

If you look at the news right now, almost any day there's some article about whether AI is going to automate humans. Are we going to automate things like skin cancer detection? Are we going to automate taxi drivers, or drone flying? And what implications will that have for society and the workforce? Really, a lot of the history of artificial intelligence has been about making artificial humans: maybe not explicitly trying to replace humans, but trying to make agents that can do anything a human can do, trying to replicate human abilities. But I think of artificial intelligence in a quite different way.

So, who here has seen The Matrix? You might remember the following clip: "Can you fly that thing?" "Not yet." "Operator." "Tank, I need a pilot program for a B-212 helicopter. Hurry." [Music] "Let's go." [Music]

This is probably exactly how your own educational experiences have been: you want to learn something, you just download a program into your brain, and instantly you can do the thing you wanted to do. Of course, that is not what we actually have right now; education is mostly the same as it's been for a hundred years. But the reason I often show this clip from The Matrix is that I think it's a really inspiring vision of what technology could enable us to do. It's demand-driven education; it's real-time; it's incredibly fast; and it's exactly targeted to what the user actually wants to learn. So I think it's a really amazing vision of what education, and what lifelong learning, could be. Kentaro Toyama, a technologist who thinks about information and communication technologies for international development, talks about technology being an amplifier of existing things in society. And if we think of artificial intelligence as an amplifier, that to me is a much more exciting goal than thinking of artificial intelligence as a replacement for us.

In my own work, I come from a background of thinking about an area called reinforcement learning, which a number of people may have heard about. In these scenarios, I think of an agent that's trying to help a human do something, or augment the human's own capabilities, in a loop-like process. We might have an agent, represented by the computer here, that is feeding a math exercise to a student; the student responds in some way; and then periodically we might get some sort of reward signal that indicates how good the computer's way of teaching the person is. Maybe we can see whether or not the student has passed an exam. There are really important questions here about what that objective function should be: is it that they pass an exam, is it that they play our educational game for five more minutes, or is it whether they choose a STEM career?
So there are a lot of different choices about what it is we're even trying to optimize for.

When I talk about reinforcement learning, we often talk about a policy, which is a mapping from past interactions to the next activity. A lot of our goals in reinforcement learning can be viewed as: how do we construct a way for our agent to make decisions so as to maximize reward? It should be a strategy that says, depending on how the student has done so far, what is the next math exercise to give, so that over time we optimize our reward.

It turns out this comes up not just in education but in many other types of domains. We're also thinking about healthcare examples within a similar modeling framework. In these cases you could think of giving some sort of medication dosage suggestion, then observing blood pressure; maybe there's a penalty if the blood pressure is too high, or there are other aspects around side effects, and we're trying to figure out a strategy for how to provide drug dosages over time. So we can think about this in the context of healthcare, and in the context of consumer modeling and marketing. There are many different domains with this kind of interactive process, where an artificial agent could be interacting with a person and trying to optimize something through a series of interactions with that person.

In general reinforcement learning, we call that first thing we're doing, whether that's picking a problem or picking a treatment suggestion, an action; then we get some observation back; and then we have a reward signal. So we have a generic framework for thinking about this continual process of an agent selecting actions and then getting observations.

Who here has heard about the recent results in Go and Atari? Maybe not everybody. There's been some really exciting progress recently: in video games like Atari you can now get human-level performance from artificial agents, and those agents typically use these types of reinforcement learning algorithms. Similarly, in the game of Go we have now exceeded human-level performance, and a lot of the ideas come from reinforcement learning. And again in robotics: this is one of Google's arm farms (it's an arm farm because you have many robots, each with a single arm, doing something), and there's a lot of interest in using these types of techniques in robotics as well.

But as Michael mentioned, one of my primary interests is how these techniques apply when there's a human in the loop: instead of a video game we're trying to play, we're trying to help someone learn math, or help a patient have a higher quality of life. And people are not the same as robots or video games. Of course they're not the same, but they differ in a couple of important ways that change the type of techniques we need to develop. In particular, in the context of video games it's often fairly cheap to try things out. I can have my agent fail to succeed at Pong or some of the other Atari games for on the order of 200 million steps, and that is not a hypothetical number, that is a real number. You can do that, but it takes a lot of computation: we might simulate for something like 12 hours, sometimes more like two weeks, and then we'd have an agent that learns to play these games really well.
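To make that framing concrete, here is a minimal sketch, in Python, of the interaction loop described above: a policy maps the history of interactions to the next action, the person responds with an observation, and a reward signal arrives. Everything here (the names, the coin-flip "student", the exam-style reward) is a toy stand-in of mine, not the actual system from the talk.

```python
# Minimal sketch of the agent/person interaction loop: policy -> action,
# person -> observation, occasional reward. Illustrative names only.
import random

def policy(history):
    """Map past (action, observation) pairs to the next action."""
    if history and history[-1][1] == "incorrect":
        return "easy_exercise"   # back off after a miss
    return random.choice(["easy_exercise", "medium_exercise", "hard_exercise"])

def run_episode(num_steps=10):
    history, total_reward = [], 0.0
    for _ in range(num_steps):
        action = policy(history)
        observation = random.choice(["correct", "incorrect"])  # the person's response
        reward = 1.0 if observation == "correct" else 0.0      # e.g., exam-style signal
        history.append((action, observation))
        total_reward += reward
    return total_reward
```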
We then often report the final performance: now we've got an agent that plays this game really well. So it's often cheap to try things, or it's pretty easy to simulate. We don't have perfect models of the physics of the world, but we can do pretty well: in these robotics examples we can often write down pretty good simulators of how the world works, and then use those to, say, construct a policy for how to grasp an object.

When we think about people, it's quite different. People, I would argue, are high stakes: I don't want to build a tutoring system that only learns to teach people fractions after it has failed to teach a million people fractions. It's high stakes because the data I'm getting comes from really interacting with people. And yet people are also really hard to model: we don't have very good models of student learning, and we don't have great models of human physiology. So we're not really in either of the two camps that are often explored in reinforcement learning. What I'm going to talk about today is some of the technical challenges that come up when we try to do this human-in-the-loop reinforcement learning, some of the things we do have access to in these people-focused domains, and what we can do about it all.

Let's imagine we have a set of classrooms, call them the A classrooms. In those classrooms the students first did an exercise on a computer and then an exercise on a chalkboard, and they got an average score of 95 on an exam. Then I have another set of classrooms, the B classrooms, where they did the activities in the opposite order: first the chalkboard exercise, then a computer exercise, and they got an average score of 92. Now a new student walks in and says, okay, what do you want me to do? We want to pick whatever will maximize the score. So what would you do, or what questions would we need to ask, in order to make this decision for a new person?

Right: it might be that the A classrooms are kindergartners and the B classrooms are you guys, so maybe I should do the opposite of what the scores suggest; there might be some really big differences between the types of students in A and in B. Is there anything else we should ask? How many students were in each classroom, right: if it turns out that classroom A is Michael and classroom B is me, then maybe we can't generalize from that to what we should do for all new students. Another thing people sometimes suggest is: is that test really even a reliable indicator of whatever it is they're learning? Maybe they're doing math exercises here and this was a history test, in which case it tells us nothing about what we should be doing. All of those are the right considerations: what are the differences between these groups, how many people do we have, and is this a reliable indicator? Let's imagine it is a reliable indicator: it really was a math test, and these really were math exercises.

This challenge of doing counterfactual reasoning, of asking what we should do and what we can generalize from our past experience, is a really foundational one.
It comes up in many domains. Imagine something like maintenance scheduling; we're doing a collaboration with Siemens right now. There are different orders in which you could do maintenance, or get your oil changed, and you want to optimize cost over time, and you have old data where you can see what the outcomes were. Similarly in healthcare: with electronic medical record systems we now often have the sequence of treatments given to patients, and we can observe their outcomes, different measures of quality of life. What we'd like to do is use that information to make better recommendations for future patients.

Why is this hard? Well, it involves this counterfactual reasoning. There's a really lovely talk by Judea Pearl, who is one of the founders of causal reasoning, at least in computer science, and he argues that our ability to do counterfactual reasoning is a core aspect of what it means to be human: we can imagine things we haven't observed. We have to do that here, because I didn't tell you what would have happened in the B classrooms if those students had gotten the other intervention, the same one as in the A classrooms. We don't know what would have happened; we can't observe these parallel universes, we can only observe a subset, and we can't rerun it.

The second challenge is generalization. Here there were only two things we could do, a computer exercise or a chalkboard exercise, and the question was which order to do them in, so there's a very small number of possibilities. But if you start to think about all the drugs you could prescribe, all the different types of medical interventions, and not just two time steps but many time steps, there's a huge combinatorial explosion, and we don't want to try all of those; we don't have the time to trial all of those. So we have to be able to generalize somehow from our previous experience, to say: okay, would it actually be better to use the computer twice?

In general, this problem, where we have access to old data about decisions and their outcomes and we want to use it to make better decisions, can be framed as follows; we often call it batch data policy selection. Again, when I say policy I mean a mapping from a given scenario to the decision we should make. Say we have this old data and a bunch of different policies we could consider; these could be constructed using reinforcement learning, they could come from experts, they could be built in different ways. We want to evaluate each of them: given our old data, how good do we think each of these strategies would be if we were to use it in the future? Should I teach all future students using the first strategy or the second one, and how good do I think that will be, say in terms of their test scores? Then I probably want to do some sort of policy selection at the end, and actually decide which one to go try with my students, or deploy in the hospital, and so on.
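As a schematic of that batch policy selection pipeline — with hypothetical names, and the off-policy estimator left abstract — the whole loop is just: estimate each candidate policy's value from the logged data, then pick the apparent best.

```python
# Schematic of batch data policy selection: evaluate each candidate policy
# offline on logged data, then select the best-looking one. `estimate_value`
# is a placeholder for an off-policy estimator, such as the importance
# sampling estimator discussed later in the talk.

def select_policy(candidate_policies, logged_data, estimate_value):
    scores = {name: estimate_value(pi, logged_data)
              for name, pi in candidate_policies.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores
```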
I'll give a little bit of background. My original interest in this came from an educational game we were working on a few years ago, in collaboration with Zoran Popović, who is a professor at the University of Washington. This is Refraction; I think it's been played by around 500,000 students, and it involves splitting laser beams using fractions to fuel up spaceships. We had data from the past, and we wanted to figure out — maybe I can play a little bit of it here and see if the video works. [Music]

There are different levels, involving different fractions and different spatial configurations at different levels of difficulty, and we wanted to order these to keep people engaged and learning. In this scenario the policy we're looking at maps some model of player state, some history of the previous levels they've done and what they did in them, to the next level to give them. Our goal was to maximize engagement, which is just how many levels they play. Now, you might argue: is that really the right optimization criterion? Really we're going for learning. The idea here is that this is an online game, there are no constraints on how long people use it, and people generally play on the order of four to seven levels. We wanted to increase how long they were using it, with the idea that if this is a pedagogically effective game, then that should also hopefully help them learn more. The old data we had was from about eleven thousand students.

One thing you could imagine doing in this case is building some sort of statistical predictive model. What I mean by that here is a model that says: if the student has this series of historical interactions with the game, and then I give them a new level, how are they going to do in that level, and maybe what are they going to do in that level? The reason we might want to build the statistical model is that we can then use it as a simulator. Just like I talked about before, how in robotics there are often these great simulators of what it's like to move a robotic arm, now we have a simulator of what it's like to be a person playing this game. Then we can use reinforcement learning to optimize a policy: if that really is the right model of a person, what series of levels should we give them to keep them engaged?

Unfortunately, there are some problems with this. The model might not be good. I told you before that models of people are hard, even with deep neural networks, and 11,000 data points is a reasonable amount but it's not huge, so we're probably not going to have a perfect model. And if we have a bad model, we know from theoretical results that we can get a bad estimate of the value of a policy: if I pick a particular strategy for how to show levels to a student, and then use my model to estimate how many levels that student will play, my estimate might just be off. That seems reasonable, right? If we don't have very good models, we can't use them as very good simulators. Here's what's worse: more accurate models can lead to even poorer-performing policies. That seems really bad. I already told you that if our model is bad, we're in trouble; now I'm saying that even if your model is good, you might still be in trouble. So why does this come up? This was an interesting issue for us when we started.
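For concreteness, here is a toy sketch — my own illustration, not the Refraction codebase — of the model-as-simulator pipeline just described: fit a simple predictive model of player behavior from logged data, then derive the policy that is optimal for that model. The talk's point is exactly that the derived policy is only as good as the fitted model.

```python
# Toy sketch of the model-based pipeline: fit a predictive player model,
# then pick the policy that is optimal *for the model*. Illustrative only.
from collections import defaultdict

def fit_player_model(logged_data):
    """Estimate P(player keeps playing | level) from (level, continued) logs."""
    counts = defaultdict(lambda: [0, 0])          # level -> [continues, total]
    for level, continued in logged_data:
        counts[level][0] += int(continued)
        counts[level][1] += 1
    return {lvl: kept / total for lvl, (kept, total) in counts.items()}

def policy_from_model(model):
    """Order levels by predicted continuation probability. Optimal for the
    fitted model -- but only as good as the model itself."""
    return sorted(model, key=model.get, reverse=True)
```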
On the x-axis here, you don't have to worry about the exact details of the representation we're using; essentially it's aligned with model complexity, so on the right-hand side are really simple models, and on the left-hand side are still pretty simple models, but a little more complicated. We're looking at two things. The grey line is log-likelihood on a held-out test set from our eleven thousand data points; let's say a thousand were held out, I don't recall the exact number. We built a statistical model and measured how good it was at predicting what people would actually do on that held-out test set. That's a measure of how accurate our statistical model of player behavior is, and the grey line is going up, which says a slightly more complicated model is better able to model player behavior.

Okay, but what is the black line doing? The black line says: take that model, pick the best policy optimized for that model, and then evaluate how good that policy would actually be (we're using an offline estimation procedure here). And it's getting worse. If we use the more complicated model and then try to get a good strategy for mapping student state to levels, we do worse.

Why is that? This was confusing to us for a while, and I think it's a really interesting issue. The issue is that these statistical models are only looking at predictive accuracy. What do I mean by that? Imagine you're trying to build a robot that can make tea. The robot can see everything in the kitchen: it can see outside, where there's a sunset; there's a kettle; there's steam. It's trying to use all of those features to figure out a policy: given that I see the sunset and the stove and the steam, do I take the kettle off or not? And it turns out the only thing you need to care about is whether or not the water is above a hundred degrees. The sunset doesn't matter; whether there's a lot of steam doesn't matter; you just care about whether the water is hot enough to make your steaming beverage. The problem is that a predictive model is going to try to model the sunset, it's going to try to model the steam, and it doesn't know that all you really care about, for making good decisions, is whether your water is hot enough. So you have a mismatch of objectives: what we really want is a good policy that allows us to make good decisions, but what we're optimizing here is predictive accuracy, and there's a disjoint. What this led us to say is: we don't necessarily need to push our data through this step of building accurate models of player behavior; we really just want a policy that gets high-performing results, helping our students get engaged.

Just to summarize that point: a lot of the prior work says if the model is good, the policy is good. The problem is that the model class you're using, and the features you're using, might be wrong, and particularly with small amounts of data you might be very misled, in the way we were here. And it is very hard to know whether your model type is wrong with finite data.
This actually relates to a lot of other challenges coming up right now in robotics; people often call it the simulation-to-real problem. You've got a simulator that is not perfect, and you don't know how it's not perfect, so when you deploy the resulting policy and your robot fails to grasp, how do you know what went wrong?

We wanted to do something robust to this kind of model class mismatch and objective mismatch, given that what we really care about is getting a good policy, a good way to help people learn. So we used a different statistical estimator, the importance sampling estimator, which gives a direct evaluation of a policy's performance: there is no modeling of player behavior at all, we just directly estimate how good a policy is. The idea, for those of you who haven't seen it before, or as a refresher, is that we had some distribution of student outcomes from when we used our previous strategy, which in this case was just randomly selecting levels for the student, and we want to reweight those outcomes to look like the distribution of outcomes we'd get under the alternative policies.

When we did this, we found we could get 30% higher engagement than before. This was significant for two reasons. One was that we could accurately predict this before we tried it in the real world, showing that this statistical estimator really was a good prediction of how well we were going to do, unlike the models. The second is that there was room for improvement: we could use these reinforcement learning algorithms to get much better results. The intuition is that it's hard to write down which level a student should do at each point; there's a lot going on when students interact with this sort of game, so it was hard for game designers to figure out exactly what that mapping should be, and that's exactly where optimization and data can help.

Now, in general this technique of importance sampling can be high variance: the estimator can be quite inaccurate and vary a lot, and its variance can grow with the number of decisions made. I mentioned that in this game people only play on the order of four to six levels; in some of our other educational settings we're making many more decisions, with students using our tutoring system for 20 to 120 problems, and in those settings it can be much harder to do this sort of estimation. So we've been working on a lot of the technical challenges that come up there. This has been a big general focus of our lab: how can we use old data to get better policies for future use, both in education and in healthcare.

As we've gone through this, we've discovered some surprising things. It turns out you can have two statistical estimators of how good a policy will be, both of which are unbiased (as a refresher, that means the estimates are centered around the policy's real value), and yet when you make decisions based on them, the outcome can still not be fair: sometimes one estimate has much more variance than the other, so when you finally decide which thing to deploy in the future, you can be systematically biased against the better solution.
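Here is the basic per-trajectory importance sampling estimator in sketch form — the standard textbook version, since the talk doesn't spell out its exact variant. Each logged return is reweighted by the likelihood ratio of the evaluation policy to the logging policy, with no model of the person fit anywhere.

```python
# Basic (per-trajectory) importance sampling for off-policy evaluation:
# reweight returns logged under the behavior policy pi_b so their average
# estimates the value of the evaluation policy pi_e.

def importance_sampling_estimate(trajectories, pi_e, pi_b):
    """trajectories: list of (states, actions, total_return) tuples.
    pi_e(a, s), pi_b(a, s): action probabilities under each policy."""
    total = 0.0
    for states, actions, ret in trajectories:
        weight = 1.0
        for s, a in zip(states, actions):
            weight *= pi_e(a, s) / pi_b(a, s)   # per-step likelihood ratio
        total += weight * ret                   # reweighted return
    return total / len(trajectories)
```

Because the weight is a product over every decision in the trajectory, a four-to-six-level game multiplies only a few ratios, while a 120-problem tutoring session multiplies far more — which is the horizon-dependent variance problem described above.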
We had a best paper award for that last year, and I think it's a really interesting technical issue, because this machinery is used in a lot of robotics problems and a lot of other areas of reinforcement learning. We're continuing this work, and there are some really interesting questions about how to do this, particularly in these cases where it is really hard to model people.

The next thing I want to talk about is expanding the space. One of our other tutoring systems was a histogram tutor. Histograms can be thought of as a precursor to understanding basic probability and statistics, and they tend to be really confusing: things like knowing what is on the y-axis of a histogram are often really confusing when you're first learning. We had this vision of a continually improving tutoring system that got better and better the more people it taught. The student would do problems selected by our tutoring system, getting them right or wrong; at the end we'd give a post-test; and based on all of that data, our system would automatically change how it taught people, improving across the many, many students going through this interaction process.

The problem was that over time the tutoring system stopped giving some problems to students. There was a series of about eight skills we were teaching, and about six of them stopped being taught. That was a little strange, because we were actually trying to teach histograms. This gets back to the issue of what the reward function is, what we are trying to optimize. What we were optimizing in this case, briefly, was essentially the post-test score, how well they did at the end, divided by the square root of the number of problems we gave them. What we were doing was trading off learning effectiveness against learning efficiency: we want students to do really well, but if they can do really well with ten problems of practice instead of fifty, it's better not to waste their time, so we should give fewer problems. And the problem was that it just stopped giving any problems for a whole bunch of skills.

Does somebody have a hypothesis about what happened here, about why, given that objective function, it might stop teaching some skills? I know it's right after lunch, so everybody's probably in need of coffee. [Audience: maybe those skills were not on the post-test?] Great question; in fact, all the skills were on the post-test. If they weren't, that would certainly do it. [Audience: with zero problems, wouldn't the denominator be zero?] To that point: we did correct for that. Read strictly, it would be zero; we made a slight change so it wouldn't be, but your reasoning is exactly along the right lines. [Audience: maybe it tries to give easier problems because it wants students to get a good score, so it avoids the problems people tend to miss, which are probably exactly the skills you're trying to teach.] That definitely can happen, and all these ideas are really great.
It was even simpler than some of those suggestions: the post-test score didn't change whether or not we gave them any problems. And if it doesn't change, you just don't give any problems, because you want to minimize the denominator: if the numerator is a constant, the best thing to do is give as few problems as possible. It's a little like going to a lecture a lot and realizing the lecture isn't helping you learn (you've never had that experience, I'm sure), so you start to skip lecture; or you realize the textbook doesn't help you learn, and you stop reading the textbook. What the system had basically self-diagnosed was that our beautiful content, which we had spent hours creating, was completely ineffective at helping people learn some of those skills. So it self-optimized: it decided, I don't need any of that, the students seem to do the same on the post-test whether we give them this content or not, so we'll skip it. This was perhaps slightly a failure of our curriculum.
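A back-of-the-envelope illustration, with toy numbers of my choosing, of that degenerate optimum: if the post-test score is flat in the number of problems given, a reward of roughly score divided by the square root of the number of problems is maximized by giving zero problems.

```python
# Why the histogram tutor stopped teaching: if the post-test score doesn't
# depend on how many problems are given, reward ~ score / sqrt(n) is
# maximized at n = 0. The "+ 1" stands in for the slight change mentioned
# in the talk so that giving zero problems isn't a divide-by-zero.
import math

def reward(post_test_score, num_problems):
    return post_test_score / math.sqrt(num_problems + 1)

flat_score = 0.70   # students score ~70% no matter what
for n in [0, 5, 10, 50]:
    print(n, round(reward(flat_score, n), 3))
# prints 0.7, 0.286, 0.211, 0.098 -- the fewest problems wins
```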
But at a high level, I think it introduced a really interesting issue, which is that the system can self-diagnose that something is wrong. There are two levels of self-diagnostics here. The first is that none of our material for those skills was being taught; that was good to know. The second is that the system couldn't reach the performance we expected. We were expecting that if we gave students the right number of problems, everybody would do well on the post-test, and what it found was that everyone was getting around 70% regardless of how many problems they did. People just weren't learning the material; it wasn't that everybody had already understood it, they just weren't learning. So if you, as a human, know how good you want your system to be, you can recognize a gap: the system says "I'm done optimizing" and there's still a gap, which means something needs to change.

In classical reinforcement learning you'd be stuck at this point: that's your system, you've optimized it as well as you can, it's at 70%, and you're done. As humans, it would be very sad if that were the case. It would mean the Zika virus comes along and we say, well, that's too bad, we don't have a vaccine for the Zika virus, guess we're stuck. We don't do that: we invent new things all the time; we encounter new problems for which we have no existing solution and we invent something. In the context of reinforcement learning, that would be called inventing new actions: we see that we're stuck, and we invent. Similarly, there are times when we realize we don't have the information we need to make the decisions we need to make, so we invent new sensors to sense new things, like the Hubble Space Telescope. So why can't we have that in these human-in-the-loop systems? This is a big distinction, because in playing Go I can't suddenly decide to use a new move that takes all of my opponent's pieces or declares victory, and in Atari I can't suddenly teleport my agent to the final room; we can't do that in those closed systems. But we can do that sort of thing in a lot of human-facing applications: we can invent a radically better patient treatment. So, in the context of human-in-the-loop reinforcement learning, we want to augment the system so that it's no longer a closed box but can take outside input.

The first thing we started thinking about is: how could an agent reach out to a human, maybe an expert, maybe a crowd, and say, hey, I've optimized as well as I could and everyone is still only at 70%; what should I do? I need another video to explain histograms, or another problem; can you help me out? But we wanted to do this in a way that directs the human effort. People are busy, and we'd like the system to say: given my optimization of this system, and where I think people might be failing to learn, where I think the system is currently doing poorly, can you add something exactly there? The way we did this, to briefly give a sense of the equations, is to think about the local improvement we could get: imagine there's some situation where the student isn't understanding something, and you could introduce the perfect explanation; how much of a difference would it make? We tried to quantify that, which requires some upper bound on how good things could be under this new action: can we imagine introducing the perfect vaccine for the Zika virus, or the perfect explanation, and how student learning would go afterwards?

One of the challenges is that this all happens while the system is itself still optimizing, so there's often a lot of uncertainty about how the system is currently performing. One of the things we came up with is: don't add actions in places where the existing action might already be good enough and you just don't have enough data yet to tell; add actions in the places where you have a lot of evidence that you are not doing well. If I've given one lecture on a topic that's really hard to learn, maybe it was a good lecture and students just need more practice; but if I've given that lecture to 30 classes and nobody ever understands it, it's probably time to do something new.

To check how well this worked, we ran some simulations; this is a completely toy domain. The idea is to ask: if there's a really large action space, can we grow it over time, adding in these new actions, and get good performance? What we see is that this directed way of asking to augment the space and get better actions for your agent does better over time than either getting random actions or adding actions at states that are simply frequently visited. You could imagine saying, well, where do lots of people get stuck, is that the right place to add a new explanation? Or you could do it in this more targeted way, and the targeted way is better. Another thing I thought was interesting, again in simulation: what if the advice you get from the human, when you ask for these new ideas, is not always good? Sometimes I try to think of a new lecture and it's bad and not effective. The nice thing is that the system is still learning about which actions are effective, so as long as some of the new actions being added are helpful, its performance goes up over time, because it can learn to discard the bad suggestions; in this case some of the suggested actions are bad, and "random" here just refers to how you select where to add those actions.
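A hedged sketch of the "ask only where the evidence is strong" heuristic described above — the particular bound and threshold are my own illustrative choices, not the exact criterion from the talk: request a new action from a human only when even an optimistic confidence bound on the best existing action falls below the target performance.

```python
# Request human help (a new action) only where there is strong evidence that
# the best existing action underperforms the target -- i.e., even its upper
# confidence bound is below the performance we want. Illustrative bound only.
import math

def should_request_new_action(successes, trials, target, z=2.0):
    if trials == 0:
        return False                      # no evidence yet: don't bother a human
    p_hat = successes / trials
    ucb = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return ucb < target

print(should_request_new_action(1, 2, target=0.9))    # False: too little data to tell
print(should_request_new_action(9, 30, target=0.9))   # True: clear evidence of a gap
```

This mirrors the lecture example: after one class it is too early to tell, but after 30 classes of consistent failure the optimistic bound is still below the target, so it is time to invent something new.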
I'll return to this example later, but we then started to think about how we could do this in a real educational setting. This is again a game developed by Zoran Popović's lab, with math word problems. In many tutoring systems we give students hints, and those hints can themselves be thought of as actions. They're not always very good, whether written by game designers or by the researchers involved: sometimes they're good and sometimes they're not. So we thought we could ask people to add new hints exactly where we think the current ones are not being very effective. I'll return to this example in a second.

Another thing we thought about is: all right, maybe the action space is fine, but maybe we're missing something about the student, about the person, that would allow us to make better decisions. For example, maybe I'm talking to a student and I realize halfway through the discussion that they don't actually know algebra. If I ask them, "Do you know algebra?", I can provide a much better explanation than if I just assume they know it, or just pick the best explanation regardless. It is often the case that additional features allow us to make better decisions, but there are a lot of reasons not to put in all possible features in advance; it tends to mean we need more data to learn a good policy. So our idea is to do this strategically, in a utility-driven way: if we were to add a new feature, could we make better decisions? This is ongoing work with my student Ramtin.

As a really small example, some of you might have seen these before if you've taken AI: this is a gridworld, a very small toy domain that we often think about. In this scenario, it's only the big black boxes that constitute the state of the world, so there are nine states; S means a start state, G means a goal state, and you have a little robot trying to learn how to traverse the world. The thing is, this is the world as the robot sees it: it only sees those nine states inside the big black boxes. In reality, the world is more complicated than that: there are things like rainy states and windy states, which cause the dynamics to move in different ways, so the real policy can depend on more distinctions than the robot currently has access to. What Ramtin developed is an algorithm that tries to adaptively split the state space and refine things based on utility: it uses its data to hypothesize whether there could be additional latent features that would allow it to make better decisions, and if it thinks there are, it splits. That is, in effect, simulating asking a human: hey, I think there might be an additional feature here such that, if I knew it, I could make better decisions; can you help me out? One of the interesting things we found is that the number of distinctions it makes is still smaller than if you split the entire space: you may not need to make all the distinctions in the original domain.
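In the same spirit, here is an illustrative sketch of utility-driven state splitting — my own simplification, not the exact algorithm: estimate the value of the best action with and without conditioning on a candidate latent feature (rainy versus windy, say), and only split when conditioning actually improves expected decisions.

```python
# Utility-driven splitting sketch: split a state on a hypothesized latent
# feature only if doing so improves the value of the best available action.
from collections import defaultdict

def best_action_value(transitions):
    """transitions: list of (context, action, reward); value of best action."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, action, reward in transitions:
        totals[action][0] += reward
        totals[action][1] += 1
    return max(s / n for s, n in totals.values())

def split_gain(transitions, feature_of):
    """Estimated utility gain from splitting on feature_of(context)."""
    pooled = best_action_value(transitions)
    groups = defaultdict(list)
    for t in transitions:
        groups[feature_of(t[0])].append(t)    # e.g., rainy vs. windy contexts
    split_value = sum(len(g) * best_action_value(g)
                      for g in groups.values()) / len(transitions)
    return split_value - pooled               # split only if this clears a threshold
```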
[Audience question, partially inaudible, about whether the human will be aware of the feature the system is asking for.] Yeah, I think that's a really interesting issue, and I'll get back to it when we return to the worked examples.

At a high level, our vision here is: how can we use a human expert, or a crowd, to make the system better, to change the definition, the spec, of the system over time, so that it is not only self-improving but also expanding in scope, to really achieve the performance we want? As Michael just mentioned, including humans in the loop has some limitations as well; let me first talk about the grander scheme and then come back to that.

This can be used in other ways too, for example in crowd work. This is some work with my student Shayan Doroudi and our collaborators Eric Horvitz and Ece Kamar. Here we were thinking about crowd work tasks where we'd like to train people to do more complicated tasks. One of the challenges is that we may not know the right way to train them, and it takes time to write those sorts of instructions. We know that worked examples are often helpful, so we were curious whether we could take the solutions people generate and use them to train the crowd workers. We published this a few years ago, showing that yes, you can use old solutions to bootstrap and teach future people to do the task. In general, you could imagine a whole ecosystem: for a complicated task you also need to validate and grade the solutions people generate, so you could use the crowd to create solutions, use those solutions to train crowd workers, and have the crowd validate the results. What I find interesting about this type of vision is that we could use machine learning to select which solutions are effective for teaching; those might not be the same as the solutions that are objectively best, since there can be differences between how you want to demonstrate and how you want to teach. But we could have this closed-loop system, leveraging people particularly for the invention of new content, and machine learning to refine it.

But let's get back to the issue Michael was just mentioning, about metacognition. We haven't looked at that issue in terms of feature splitting, but we have looked at some limitations of asking people for actions. A lot of work right now in reinforcement learning, robotics, and AI is about learning from demonstration, where you have an expert demonstrate, say, how to pick up a cup; this comes up particularly in robotics. But teaching is not the same as doing; this has been noted before. It's not clear that learning from demonstrations is always best, and teaching might be much better, but people face challenges there too: teaching is really hard. We looked at what was happening in our case. We had education students providing these hints to the machine, which were then used by later students. Think of the system, this math word problem system, periodically asking education graduate students: hey, the system isn't working very well, can you provide some new hints that students can use to learn? And then we keep going.
The problem was, it made no difference: it didn't seem like the system was getting better over time. When we dug into this, we realized it was really hard for people to know what new hints to provide, and I think it was hard for a few reasons. Even though they were education experts in general, they weren't experts in this particular sub-game, and they weren't the same demographic as the people playing it. There was a lot of delayed feedback about whether the way they were teaching the agent was effective; it takes a while to assess that, just as in a lot of teaching. And we weren't doing any training of the people to help them do this. When people are teaching, or thinking about how to teach, we want them to be able to think about what the robot or agent is doing and what the task involves. So we may need to train the trainers: think of the humans as training the machine here; we may need to educate the trainers in order for them to provide effective feedback. As we think generally about closer, more collaborative interaction between machines and humans, this will be really important, because many people interacting with machine learning algorithms will not be machine learning experts, and being able to train them to provide effective feedback and labels in their interactions with the machines will matter a lot.

So, to summarize: I talked about how we learn from the past to decide what to do in the future; I talked about how we can expand the space of actions and observations to make systems that can outperform their original specifications; and I talked a bit about how I think we'll need to change some of the interaction between humans and machine learning to get truly effective systems. I just want to thank my group, and open it up to any questions.

[Michael:] Let's open it up for discussion: feedback, questions, suggestions of new actions she could take.

[Audience question about learning reward functions.]

I think the question of learning reward systems is super interesting. To go back to the beginning: it was Trinity who wanted to fly a helicopter; she had a particular goal, and ideally we'd make all these systems able to elicit people's rewards or objectives and then support them toward those. In our own work, we've normally been in situations where we've imposed the reward structure, like post-test performance or engagement. There's certainly a lot of interesting work on preference learning, where you try to infer, say from people's actions, what their reward function is; sometimes it's called inverse reinforcement learning. I think that's a really interesting issue. I also think, particularly in the context of coaching and training, there's the question of how you grow someone's reward function. There's often this challenge of saying: this is maybe what you currently want, but maybe you're not aware of these other potential outcomes; a good coach or mentor often exposes you to additional opportunities that you then decide to incorporate as part of your goals.

[Audience:] Here's what I'm trying to reason through: in our educational system, there's often a lot of value placed on keeping someone within the system.
So maybe you're struggling to learn math; well, let's keep you around and put you into a different math lane, or something, so that you stay in the system. That seems to be optimizing for some reward criterion different from just maximizing the amount they learn. And you're subject to budget constraints, so why doesn't the system say, okay, we need to cut them out, they're not learning math? I don't know which one is better, although I have a hunch. It seems like part of this is about the formulation of effective reward strategies as we move into these sociotechnical domains: how do we reason about what these rewards are, and when are we going to get certain myopic outcomes?

I have a couple of thoughts on that. One thing I find interesting is how people actually use these systems. A lot of these more personalized systems can also skip people ahead, and in terms of keeping people coherently within a class: it's really hard to teach a class where some people are on unit one and some are on unit twenty, so people often get moved up, meaning they're just advanced. Maybe you're struggling, you're still on unit one of pre-algebra, everybody else is halfway through algebra, and your teacher just moves you up. That's observed empirically a lot; people aren't always following these personalized systems, because in a classroom it's very hard, in terms of teacher load, to have people be that disparate. On the second part, about what the reward function is: another big issue, if we look at No Child Left Behind, is that the objective there is that everybody passes, everybody gets to a sufficient level, as opposed to optimizing the expected outcome, which then might be somewhat lower. There's often a tension between constraints that make guarantees for the individual and improving the average, the expected amount. And ideally, society would explicitly decide on the longer-term objective, which is probably not just passing this year's test but things like reducing crime rates later, getting into college, having long-term jobs. The problem is that this requires very long-term data streams and credit assignment. There has been some really cool work connecting tutoring system work and learning around eighth grade to later STEM career choice and college. When we have those data sets, I think that's really exciting, because those are really the outcomes we'd like.

[Michael:] It's the kind of space I end up having to navigate in many different ways; the incentives of the person in the system are not wholly rational, right? If you're trying to optimize my test performance, and at some point I don't think I'm going to learn any more, I might just start messing with the system, answering randomly, doing things such that the system now mis-models me entirely. These seem like really interesting questions to unpack.

Yeah. Another point you just brought up is that these systems generally treat the person as kind of a black box,
and an unassuming black box at that: not an active agent. People could be adversarial, or they could be cooperative, and that should change how we teach and how we interact. There's certainly been a lot of work on gaming the system, on how people game or exploit features of these RL systems, but there's been less work on the cooperative setting: I want to learn this, you want me to learn this; could I need way less data if I can reason about the agent actively choosing examples?

[Audience:] There are assumptions about the action space, sort of open-world assumptions; whether the objective, the reward function, is the right one; or unobserved things about the actor, whose objectives might shift.

Yes, exactly, and those raise the question of which of those parts are the key features to optimize over, or to consider in terms of robustness. One thing we try to think about is robustness to the failures of your models, acknowledging that whatever model we use is going to be limited, and asking how we can reason about that robustly.

[Audience:] Especially in education, maybe in the context of institutionalized racism or bias from teachers, certain classes of students may not have been encouraged, or had expectations set for them, so they haven't tried, for instance, more advanced problems, because those weren't provided to them. How do you deal with that?

I think that's a super interesting issue. In these cases we have not been addressing it. In some of our other work, if you know what some of the variables are that you want to actively correct for, we have developed algorithms for doing that. In the supervised machine learning case, we've looked at, say, predicting entrance to college: can we make sure we don't systematically over-predict for one group and under-predict for another? If you know what the things are that you want to protect against, you can put that into the system; you can make sure your solution respects and solves for it. One very big remaining challenge is latent variables that are correlated with something you don't realize, or features where in retrospect you say, I should have included religion, or I should have controlled for X, and you missed it; then your system will tend to exploit information that is indirectly related, through zip code or something like that. I think that's a really big challenge in machine learning right now.

[Audience:] What about changes in the preferences of the actual user over time? At the beginning they might want to learn something, they have a desire to learn, and at some point that may change, many times; and as Michael said, people make irrational choices. How do you deal with the complexity of the irrationality of human beings over time?

Well, we certainly don't deal with it completely. I think the issue of non-stationarity is super interesting; we had a paper recently looking at how you detect that, how you detect when the system has changed. Related to that, Steve Collins and I are starting a new collaboration; he's in mechanical engineering and has an exoskeleton, and you're trying to optimize the exoskeleton's parameters for a person. One of the really interesting things is that it can take people a while to learn how to wear an exoskeleton,
so they're non-stationary during that time period; then after that, you're kind of teaching them: this is what the exoskeleton does, this is how to change your gait; and after that you want to optimize the parameters for that person. So we're trying to figure out how you can tell whether the system is non-stationary versus when it has converged, and whether you can detect those changes. I think it's a really interesting issue, and often people will be changing over time: maybe they enroll in a MOOC thinking they're just going to learn a little bit about history, and then they decide they love it, and so their goal has changed. In some cases you can ask; sometimes it's fully observable, or worth asking; in other cases it might be harder to ask, and you might want to say, hey, it seems like things have changed, your pattern of behavior has changed. That comes up in dropout predictors too; there's some work on trying to detect it, but it doesn't always model all the dynamics.

[Audience:] What do you see as the future of these tutoring systems; are they always going to be a supplement?

I think it's very unlikely that universities, as collective learning institutions, will go away, because there are a lot of other aspects that come from the social interactions. So I think it's doubtful that people will just interact forever with their own tablet, provided as their learning coach. But I think these systems could have a much larger role, particularly in lifelong learning. Especially with AI automation, people are going to need to do a lot more lifelong training, and having a personalized coach that comes with you along that journey could be way more effective than the ways we do education right now. Niki Kittur and Jeff Bigham, who are professors of HCI at CMU, had a project a couple of years ago, called the Question Reader, with the idea that if I want to learn a new topic, there's kind of no nice middle ground: either I start with a course that assumes no background, or I have really high-cost options, like bugging my colleague to sit down with me for three hours and teach me; and if I just do a web search, there's no personalization to my background knowledge. How would we create a system that could create a personalized tutor for me? I think that sort of thing is within reach: dynamically selecting material by saying, okay, she knows something about optimization and reinforcement learning, so this is the type of material I should give her for learning this new thing.

[Michael:] Last question.

[Audience:] Do you think we focus too much on designing for success, and we should be working on designing for failure? When these algorithms fail, should we use that as an opportunity, maybe through interaction, where the computer can say, "I frankly have no idea what I should do now; why don't you tell me what to do," as opposed to always trying to be right?

That's an interesting question: can the system figure out when it doesn't know, and then reach out to a human, or reach out for augmentation? I think that's a really interesting question, and there are cases where we could do that.
I guess I'll just say, I thought you were going to go in a different direction, which is: should we be optimizing for student success, should that be the objective? I have a lot of thoughts about that. One of the new areas I'm really excited about is optimizing not things like which activity to give, but things like metacognitive strategies, or mindset interventions, or impostor syndrome and belonging interventions, which some of the psychologists here at Stanford have shown to have enormous implications. They're really cool interventions because they have these recursive effects: if you can give people different theories with which to interpret their own experiences, that can have huge recursive effects on how they take advantage of opportunities. If we can put those inside tutoring systems, we can both personalize them and scale up their impact. So I'm much more excited about those these days than about figuring out, you know, video-then-problem or problem-then-video.

[Michael:] She's around; you can take her courses on reinforcement learning and other kinds of cool things, you can collaborate with her research group, and you can swing by her office hours and just chat.
Info
Channel: stanfordonline
Views: 5,590
Rating: 4.9101124 out of 5
Keywords: HCI, stanford CS, computer science, CS, human computer interaction, reinforcement learning, human computer, emma brunskill, AI, artificial intelligence, machine learning
Id: nj5t1Q_ANlw
Length: 55min 57sec (3357 seconds)
Published: Wed Mar 07 2018