DeepMind's Richard Sutton - The Long-term of AI & Temporal-Difference Learning

Reddit Comments

It's from the RLSS. Better video at http://videolectures.net/deeplearning2017_sutton_td_learning/.

Edit: better wrt slides and other features of videolectures

12 points · u/terrorlucid · Oct 05 2017

This is a very enjoyable talk. I like Sutton's style.

6 points · u/visarga · Oct 06 2017
Captions
Thank you, everybody. Thank you, Joelle. Well, it's true, I have the most gray hair here; I've been around for a long time, and I want to talk about the long view a bit. First, let me say it's really exciting just to talk to all you young people learning about the field, wanting to get into AI and reinforcement learning. It's a special opportunity, and I really appreciate it. So I have to think about what I can try to communicate to you, because we have a brief amount of time and the field is immense; I've written a whole book on the subject. My topic today is nominally temporal-difference learning, and certainly I'm going to talk about that, but I also want to speak to you as young people entering the field and tell you the way I think about it from a longer view.

The first thing to be said along those lines is that this is a really special time. We can now look forward to doing an amazing thing: understanding the way the mind works, the way intelligence works. This is a monumental event, not just for this century but for thousands of years, maybe in the history of the earth: intelligent beings, animals, things that can replicate themselves, finally coming to understand the way they work well enough to create intelligence by design. The big thing that is going to happen is that we will come to understand how the mind works, how intelligence can work, how it does work, and the very fact of our understanding it is going to change the world. Obviously there will be lots of applications, but it will also change ourselves: our view of ourselves, what we do, what we play with, what we work at. It's a big event, and we should keep that in mind. I'm not saying it's going to happen tomorrow, but in 10 years it could happen, in 30 years it could happen, and in the great sweep of history that's a small amount of time. It's going to happen within your lives with high probability.

So the first thing we have to ask is why this is happening, and I think it really comes down to Moore's law. I took a slide from Kurzweil, his standard slide of the increasing computation per dollar, which is just reaching the point where it starts to compete with the computation we can do with our own brains, with natural systems. That's happening now, and it's just going to continue increasing dramatically. We have years along the bottom and computations per second per dollar on the other axis. It's a log scale, so a straight line would be exponential increase, and Kurzweil argues it's actually slightly super-exponential. The point is that the computation available to us has become enormous, and that changes things. The rise of deep learning and all that is largely because GPUs made computation cheaper, and that will only continue and become more extreme as computation increases vastly. It alters everything, and it really is profound. I think it's the answer to a question AI has had for 60 or 70 years.
One last bit of preface: my talk today is going to be half on these big things, as I'm doing now, and half on really small things, where we go back to the foundations, basic stuff that we need. The big picture is important, but the details also really matter; we heard a lot of specific algorithms earlier today, and exactly how they work is really important. But the big picture matters too, and sometimes we lose sight of it; sometimes something becomes so obvious that we overlook it. I think the rise of computation is along that line, and its implication, in a phrase, is that methods that scale with computation are the future of AI. By "scale with computation" I mean that as we get more computing power, those methods become more powerful. This has not always been true of all methods, but it is certainly true of learning and search methods, what we call general-purpose methods. And it's the answer to one of the oldest questions in AI, sixty years old: do we want weak methods? That's what they called general-purpose methods in the old days: weak, because they just use the data or use computation; they're general-purpose. The strong methods were the ones that would use human knowledge, human expertise, human insight to make their systems so much better. Now, probably you're thinking that's crazy, the general-purpose ones are the way to go. How many of you are thinking that with me: general-purpose? How many think it's great to go with human insight? All right, some of both; that's a nice compromising position. Today, in this big part of the talk, I'm going to present strong views. Really, maybe one should make compromises and add nuance, but it's good to state strong views because they give you a working hypothesis, a point of view. You don't have to believe it; you can say, well, that's a strong point of view I should be thinking about, and maybe you hold several of them. So my strong view is that this question has been answered, and it has been answered in favor of the weak methods.

I could talk about this all day, but I'll refrain. The next thing to say is that you may be thinking you're into deep learning or reinforcement learning, so you're on the right side of history, but I'm not sure that's right. Take any kind of supervised learning, or even model-free reinforcement learning, the thing I love more than anything: it's only weakly scalable. If I had a million times the computation, I'm still limited in model-free reinforcement learning by how fast I can gather data, and I'm only learning a value function or maybe a policy. What's the big deal? That's a tiny object, some map from states to what to do; it's not a big thing, and it scales only with how many features you have. And supervised learning is certainly only weakly scalable, because it requires people to provide datasets, and they become the bottleneck. If you get people to label things on the web, things scale, but only weakly:
only as fast as you can gather the data, and eventually that becomes the bottleneck. You really want to be able to learn from raw data; you want to be scalable. So if these things we love aren't fully scalable, what is? What would be fully scalable? My answer is simple: it's what I call prediction learning, learning to predict what will happen. It's the unsupervised version of supervised learning, because we have targets without needing anyone to supply them: we just wait and see what does happen, and that's our target. You don't need human labeling; you have a target, but you get it from the data itself, so you could call it unsupervised supervised learning. In any case, it is definitely the scalable model-free learning, and prediction learning is at the heart of all our control methods, where you learn value functions. Maybe that's the big argument I want to make about temporal-difference learning: that it is the scalable model-free method. Of course I haven't given you even a step toward that yet; I'm just saying that the idea of learning to predict is the key problem, the thing we should be thinking about. You may say deep learning and supervised learning are all about prediction, but that's not quite true: they are about predicting what the label will be, not about predicting over time, where you have to wait and see what happens, and that makes all the difference.

With all these words, I want to ground us a little and remind us what the data looks like. I put up a slide of what real life is like: real life is a temporal stream. We do things like playing soccer, where we have to act at every moment. Maybe we are a hyena being chased by a lion, trying to predict whether it is going to live or die. Maybe we are a baseball player, eyes watching a tiny ball flash by really fast, having to swing at just the right moment to hit it. Or maybe we are talking to someone and trying to predict what they will do. These are the data streams that AI should be learning to deal with, and we should keep them in mind when I say "learning to predict." Think of the hyena: fear is your prediction of whether you are going to die. It is trying to predict that; at several points things look good or bad, and at the end here it's not looking so good.

With those premises, let me start talking about temporal-difference learning. TD learning is a method for learning to predict. It is widely used in reinforcement learning to predict future reward, that is, value functions, and it is the core of many methods you know about: Q-learning and Sarsa, TD(λ), deep Q-networks, TD-Gammon, the world-champion backgammon player using deep reinforcement learning from, my god, 25 years ago: 1992. But not all reinforcement learning methods use it. AlphaGo happens not to use TD learning; it uses other kinds of reinforcement learning. The helicopter work you may have heard of doesn't use it. And Peters' talk was interesting because he talked about policy-based methods, and some of those don't use temporal-difference learning,
though eventually you get to it and put it in to make things better. So it can feel optional; many people say it's optional. But I want to argue that you really do want to use it, that you should always be using it. It is ubiquitous, and it seems to be how the brain works. Also, it can be used to predict anything; it is a general method for prediction learning, not just for reward. Don't be fooled because in reinforcement learning everything you read about TD concerns value functions and reward; we can use it for anything, and since my talk is about TD, I want to keep this general-use question in mind.

(On AlphaGo, in response to a question:) I've forgotten the exact details, but the initial learning system learned from actual outcomes, actual games. It would look at a position, make a prediction of whether it was going to win or lose, and then go all the way to the final outcome and use who actually won as the target, instead of a temporal-difference target. So that's the question we want to ask of any of these systems: do they wait and see who actually won, the outcome or the return, or do they update a guess from a guess?

Let me make that clear; that's my next slide. What is TD learning? Basically, it is learning a prediction from another, later prediction. If AlphaGo looks at a position, makes a prediction, and then zooms to the end of the game to see who won, that's not TD: it is not learning a prediction from a prediction. On the other hand, if you make a prediction from a position, then make one move, look at your next prediction, and use that next prediction to form your target, then you are doing TD learning. The quick phrase is: we are learning a guess from a guess. Sounds a bit dangerous, doesn't it? What would constrain it, what would tie it down? But that is the idea; we want to learn an estimate from an estimate, and we will have to talk about whether that is good or not. The TD error is the difference between two temporally successive predictions. If you're playing a game and you think you're winning, then you take another move and now you think you're losing, you try to learn from that. You don't know who is actually going to win, not yet, but you learn from the temporal difference in your predictions. After that it's the same as what you're all used to: you have an error and you send it back, backpropagate it or whatever you're doing. The only question is where the error comes from, where the target comes from: the end of the game, or the next prediction?

Here's the example of TD-Gammon, originally from 1992. Tesauro took a backgammon position and fed it into a neural network (he had many versions; the standard one had a single hidden layer), and out came a probability of winning. The error was the estimated probability of winning in one position minus the estimated probability of winning in the next position: the change in the estimated probability of winning. That was the error backpropagated through the network. He stuck it in a corner playing against itself, learning from its own trial and error, and it came out competitive with the world's best players, really among the best in the world.
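As a concrete, hedged illustration of "learning a guess from a guess," here is a minimal sketch of a TD(0) update for a parametric value function, written for the linear case; TD-Gammon did the analogous thing with a one-hidden-layer network and backpropagation. The function and variable names are mine, not from the talk.

```python
import numpy as np

def semi_gradient_td0_step(w, x_t, x_next, r, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) update for a linear value estimate v(s) = w . x(s).

    The target is the *next* prediction (plus any reward), not the final
    outcome of the game: that is the "guess from a guess".
    """
    v_t = float(w @ x_t)
    v_next = 0.0 if terminal else float(w @ x_next)
    delta = r + gamma * v_next - v_t      # TD error: successive predictions disagree
    return w + alpha * delta * x_t        # for a linear model, grad_w v(s_t) is just x_t
```

In TD-Gammon the only nonzero reward is the final win/loss, so between moves the update is driven purely by the change in the network's own prediction.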
Okay, so that's familiar now. But I'm trying to get to the question (all of this is motivation, and the motivation is the most important part): do you need to use TD learning? Can you get away without it? Because if you go out into the field now, maybe even within reinforcement learning, you'll find a good fraction of people who don't believe in TD learning and think they can get away without it. So we should all be asking: do we need it? I want you, even as people just learning about the field, to be able to engage with this question and know the basic facts pertinent to whether we need TD learning. I'll skip over Atari, since you've seen that already.

So, when do we need TD learning? Well, it is only relevant on what I call multi-step prediction problems: only when the thing predicted is multiple steps in the future. Obviously, if you're predicting the outcome of a game, that is multiple steps in the future. But if you're predicting a label, or, as in AlphaGo, where at least the initial data was "here's a guess, now see who won the game," then if you just use who won the game, it's essentially a one-step prediction.

Now I want to say that it is really broadly applicable. I said it's only applicable when you have multi-step predictions, but really everything you want to do is going to be a multi-step prediction. I have some examples. Predicting the outcome of a game. Predicting what a stock-market index will be: you could just predict the final value, but you get more data every day after that and you make new, repeated predictions about a long-term outcome. Predicting who will be the next president: you can revise that prediction every day as new events happen. Predicting who the US will go to war against next. All of these are long-term predictions; they don't jump straight to the end. Even predicting a sensory observation: predicting just the very next observation would not be multi-step, but predicting ten steps ahead, or a discounted measure of the future, is a long-term, multi-step prediction. And go back to the real world: the hyena and the lions, the conversation, Messi playing soccer. He has to make long-term predictions. It's not "what's the next frame"; it's "am I going to make this goal, will I get around that fellow, where will the ball be a few milliseconds from now." All multi-step predictions.

So can we handle this with our one-step methods? Sure, things happen bit by bit, but can we ignore that, just wait and see what happens (see who won the game) and use our one-step methods? Or can we learn a one-step model and then compose the model? The answer is that we really can't do either of these things well, and I want to try to give some sense of that today. I'm mainly going to talk about the first question, whether we can treat the multi-step case as one big step or have to deal with it bit by bit, but I want one slide on the second: can we learn one-step predictions, like a model, and then iterate them to get a multi-step prediction when we need it?
I just want to say that I think it's a trap. I can't fully explain this here, but I think it is a trap to model the world as a low-level simulation: to treat it as a Markov decision process and model the transition probabilities, or to treat the world as an engineering model where we just have to learn the velocities and the effects of actions on the accelerations and then integrate a low-level differential equation. All of this is a trap. Short-term models plus iteration feel good, because we know that if the one-step model is perfect, it can be iterated perfectly however far into the future we want to look. But there are two problems. First, we can't do it perfectly; there is always an approximation, and when we iterate an imperfect model we get a propagation and compounding of errors and end up with a useless long-term prediction. Second, it is exponentially complex: as we look ahead, each step has many possibilities, because the world is stochastic and we also have different choices of actions to consider, so it quickly becomes computationally intractable. It will always be intractable to look ahead many small steps into the future. Try iterating your model of physics to work out how much fun you'll have going to Montreal and taking the summer school. It's crazy; there is no future in that way of thinking. It's a trap, and in my opinion lots of people are falling into it.

But let's go back to the other question. I'm going backwards here: we have these two ways to get away from TD if we don't like it. Can't we learn a model and iterate it? That was the second one. Or can't we treat it as a one-step thing? That's the first. On the one-step question, I'm about to transition to the low-level part of the talk, but let me answer it at the high level first, and maybe we can take questions after this high-level part. Can't we just use our familiar one-step supervised learning methods? In reinforcement learning these are known as Monte Carlo methods: just roll it out, see what happens, and use what happens as the target. This has costs. Cost number one: you make this prediction, then roll out to the end, and really you're going to make a prediction at every moment in time, so you have to remember a whole mess of predictions as you go out to the end to see the outcome. Or really, if you have returns, you get outcomes at different steps and have to relate them back to the earlier situations. It's horribly complex; it's nasty. First of all you need to remember everything you did. Think about it yourself: maybe you're the lion there, trying to make a good prediction. What do you have? All this stuff swirling around you: the hyena running away, a glimpse of it, the feeling of your feet, everything you can sense, and you want to relate all of that to how well the chase is
going, how good you should feel about this chase. Wouldn't it be much better if you could do it now? Right now all that stuff is in your mind and in your senses; learn now whether it's going well or poorly, as opposed to waiting five seconds, by which time you've had a large number of frames of different sensations, different patterns of sensation you've already forgotten. If you have to wait five seconds until the end, it's too late: you can't remember all of it, and whatever you do remember will be a tiny shadow of the vivid representation you had at the time it happened. And of course the computation is poorly distributed: you can't learn now, so what are you doing now? Later, when you finally see the outcome, you have to do all the learning at once; it's just a poor temporal distribution. You can avoid these problems with specialized methods, and that's really what TD is about: specialized methods for the multi-step case. Another reason you don't want to wait is that sometimes you never know the target. Say you're playing a chess game and there's a fire alarm and you never finish the game, so you never see the final outcome. But if you're doing TD, maybe you thought you were winning and then it started going really poorly, so you know you did something bad, and you can learn that without waiting to be checkmated. Maybe the fire alarm came one move before checkmate, so technically the game never ended, but you can still learn a lot from the experience. We could try to ignore all these things, treat them as nuisances, but I think of them as clues, hints from nature about how we should proceed.

Okay, so now I'm going to get more technical, but I hope you're starting to see the view I'm trying to present: we really need to learn from new predictions, so we can learn as we go along, and I think that's ubiquitous across the different kinds of learning we're looking at for AI. I'm going to use notation a little different from what we heard earlier today; it's what I call my new notation, and I use it in the second edition of the reinforcement learning textbook. The big thing is that we try to follow the statisticians' convention that random variables are capital letters and instances are lowercase letters. So all the things that happen, that make up life, are capital letters: they are the random events that actually happened, not possibilities. S_0 is the first state, A_0 is the first action taken in that state, R_1 is the first reward, which depends on that state and the action taken in it, and at the same time as we get the reward we get the new state. I like to give them the same temporal index, R_1 and S_1: they occur together; they are jointly determined, in fact. And then life goes on. That's the data; that's all we have in terms of data: trajectories, maybe a single trajectory. Classically we are interested in the return, and the return is the sum of rewards. I'm not going to use capital R for the return, because capital R is the actual rewards, the sequence of rewards, so to define the return I need a new letter. I'm calling it capital G, because it is a random event: whatever the sum of future rewards after time t actually was, we call that G_t. G_t is the return.
As we note here, the dot over the equals sign means this is a definition, not a statement that follows from other things I've said. G_t, the return, is the sum of the future rewards, with a discount rate, which is the most common way of dealing with it. That can then be rewritten (and this really is an equality now, not a definition) as the first reward plus gamma times the sum of all the later rewards: we just take one factor of gamma out of each of the later terms, and what's left is the sum of future rewards from one time step later, which is the same kind of thing again; it's G_{t+1}. So this is a true equality: any return can be written as the first reward plus gamma times the next return. And that is going to be the basis for temporal-difference learning, because we are going to use it as a target: the reward plus gamma times the next return can serve, essentially, as a target. That's going to be explained right next.

Then we look at a state-value function. Since we're using capital letters for random variables, I can't use capital V for the true value function; it has to be lowercase, and it's a different function for each policy. So v_π(s), where s is any particular state (an instance, lowercase s), is the expectation of the random return if we start in state s, under π, the policy, which is some way of picking actions. Of course the value of a state depends on what you do: if you dance at the top of the Grand Canyon it might be bad, but if you just walk up to the railing it's fine. Values depend on policies. Then, using the equation above, the return inside that expectation can be written as the first reward plus gamma times the rest of the return, and taking expectations, that is the expected next reward plus gamma times the expected value of the next state.

That naturally leads to the notion of an error: we can compare the estimated value of a state at some time to the reward plus the estimated value of the next state. That is going to be our TD error, and it is what we will use in place of the normal, conventional error. Note that the estimate V is itself a random variable: the estimate depends on what happens, so the estimated values are a random function, and that is why it's a capital letter. And that's our TD error. Was that clear? Good.
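Since the formulas are only spoken in the talk, here they are written out in the notation of the textbook's second edition (the ≐ symbol marks a definition):

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = R_{t+1} + \gamma G_{t+1}

v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
        = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]

\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \qquad \text{(the TD error)}
```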
So now let's talk about methods. I want to contrast supervised learning (which, remember, is called Monte Carlo in this context) with TD. What exactly is Monte Carlo? We take our estimated value for the state we ran into, V(S_t), and update it based on some experience. In the backup diagram, here we are at S_t, and below it is the tree of the things that might happen: we might pick either of two actions, and if we pick this action, either of two states might arise. The black dots are actions, the open circles are states, so this is the tree of all the futures that might happen, and in this case we're imagining there are terminal states: we add things up until we reach one. Here is a particular trajectory that might happen, let's say did happen, from that state down through the tree until it terminates. Once we've terminated, we know what G, the return, is, and we can apply the update rule: compare our estimate for the state at time t to the actual return, form that error, and increment toward the target, where alpha, the step size, is a number like 0.1. That is the standard Monte Carlo learning rule, the supervised learning rule, and it is the competition for the TD method.

The simplest TD method instead looks like this: we only look ahead one step. We're at S_t, we look ahead one step, we see the reward that happens and the next state that happens, and from those we form the TD error, again updating the estimate for the state at time t: the error is between our guess for that state and the new target, the reward plus gamma times the estimated value of the next state.

Now, you've probably also heard about dynamic programming, and you can put dynamic programming in the same kind of figure. The dynamic programming backup considers not a single line through the tree of possibilities but all the possibilities: both actions and both possible next states. This is where you need a model of the world, because although you know how you pick each action, the probabilities of the possible next states are known only to the world; in dynamic programming you assume you know all of that. In dynamic programming the rule is that the estimated value of a state is moved toward the expectation of the first reward plus gamma times the value of the next state. That expectation is what makes it dynamic programming: since you're assumed to know all the probabilities, you can compute it. It doesn't hand you the final answer, because your values are still estimates; you are still learning a guess from a guess.

So we can say what is special about TD methods: they bootstrap and they sample. Bootstrapping is the idea that your target involves a guess, an existing prediction. Monte Carlo does not bootstrap; the whole point is that it looks all the way to the end and sees what the return is, so no estimates play a role in the target. Dynamic programming does bootstrap: it looks ahead one step, takes the expectation using the estimated value of the next state, and backs that up, so you use your estimates and your estimates gradually get better. TD, of course, also uses your estimates. Sampling is the other dimension: Monte Carlo and TD are learning methods; they sample what happens, because you don't know how the world works. Dynamic programming does not sample; it uses the expectation, assuming you know what could happen. So those are the two basic dimensions: whether you sample, and therefore learn, and whether you bootstrap, updating your estimates from other estimates, learning guesses from guesses.
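Here is a minimal tabular sketch of the two learning rules just contrasted; the step size, discount, and episode format are illustrative choices of mine, not something specified in the talk.

```python
ALPHA, GAMMA = 0.1, 1.0   # step size and discount rate (illustrative values)

def td0_update(V, s, r, s_next, terminal):
    """TD(0): move V(s) toward the one-step target r + gamma * V(s_next)."""
    target = r + (0.0 if terminal else GAMMA * V[s_next])
    V[s] += ALPHA * (target - V[s])

def monte_carlo_update(V, episode):
    """Monte Carlo: wait for the end of the episode, then move every visited
    state toward the return actually observed from that state onward.
    `episode` is a list of (state, reward_received_on_leaving_state) pairs."""
    G = 0.0
    for s, r in reversed(episode):
        G = r + GAMMA * G                 # return from this state to termination
        V[s] += ALPHA * (G - V[s])
```

Note that the TD update can be applied on every step as the data arrives, while the Monte Carlo update needs the whole episode stored before any learning happens.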
So, TD prediction: this is just the update you saw before. The Monte Carlo rule is here and the TD rule is there, and the contrast is that in one the target is the actual return, while in the other the target is this one-step estimate of what the return will be.

Now let's do an example. Here I am, coming home after working a hard day at the office, trying to guess how long it will take me to get home. I'm leaving my office; it's Friday at six o'clock; I have some other features; and I make a guess that it will take 30 minutes to get home. That's my prediction of my total time, and since I'm just starting, my elapsed time is zero. Then I come out of the building, go to the parking lot, and see that it's raining. When it's raining it takes longer, because everyone drives slower in the rain. So I think: first, I've already spent five minutes just getting out of the office and into the parking lot; second, it's going to take longer from here, say thirty-five minutes from now, for a total of forty minutes. The first thing I want you to see is that my guess of the total time to get home is constantly changing: as I get more information, I revise my estimates. To carry the example through: I get in my car and drive on the highway, and it turns out it didn't take as long as I feared, what with the rain; I've spent 20 minutes total now and I think it will only take 15 more, so that's 35 minutes total. This column on the slide is my total estimate as it goes up and down. Then I get stuck behind a truck on the secondary road, so I think it's going to take longer; then I reach my home street and I think it will take 43 minutes total, and it does take 43 minutes.

That's a possible trajectory, and I want you to ask what you might learn from it. If you're doing a Monte Carlo method, you just say: well, it took me 43 minutes to get home, and that's the answer. So my initial estimate of 30 minutes gets moved toward 43 minutes; that's the error. In fact all of the estimates get moved toward 43, or toward whatever was remaining in the future at each point: whatever guess I made at each point in time should be moved toward what actually happened. If instead you use a TD method and learn a guess from a guess, something very different happens. Even some of the signs change: my first prediction moves up, because I started at 30 and then found out it was raining, but this one, for example, moves down. All the errors are different, and although in the long, long run it all washes out, for any actual learning it is very different.

Now I also want you to think about the computational consequences. If you're doing TD, then when you're here and you go to the next stage, you get an error and you can update right away.
You can say: why did I make that prediction? What were my features, and how should I change them? What are the contents of my deep network that led me to make that prediction, and how should they change? That's true at each step, so as you go from here to here to here, you can update each state and then forget about it; you're never going to update it again. Whereas in Monte Carlo you have to remember why you made each one of those predictions until you get to the end; then you have to go back and say, okay, why did I make that prediction, and adjust its weights with knowledge of its feature vector, the contents of your network, and so on. And the distribution of computation is terrible, because all that time you're doing nothing: you're driving home and you can't do any learning; you can only wait until the end, when you know the answer, and then go back to all the earlier predictions and learn them. So the distribution of computation is poor, the memory requirement is poor; it's just inconvenient. It's much more convenient to learn as you go along. Think about it: you're in your car trying to drive home, you get stuck behind a truck, and you say, this is bad, it's going to take me longer than I thought, I was too optimistic before. You don't say, well, maybe the truck will disappear, let me withhold all judgment. You could withhold judgment until you get home, but my feeling is that I'm learning as I go along, responding to what I see, and we actually do learn as we go along.

So, comparing TD with Monte Carlo: TD can be fully incremental. You learn as you go along, and you can learn before you know the final outcome, which means you need less memory and less peak computation; you don't do it all at the end. You can even learn if you never find out the final outcome; maybe you get a phone call, you're called away for something important, and you never find out how long it would have taken to get home. When you do the math, both methods converge, so the only question is which is faster. It's the only question, but it's a big question. So let's do a simple experiment to find out.

Here's a trivial experiment, not famous, that I did a long time ago: a random walk, meant to test which of the two is better. We have five states, and we keep an estimate, for each state, of what the outcome will be. The random walk steps right and left with equal probability; you start in the middle and bounce back and forth until you exit at one end. If you exit on the left side you get a zero, and you get zeros all along the way for your rewards; the only way to get a nonzero reward is to end on the right side, where you get a reward of one. Are you with me? So what's the correct prediction? There is no discounting here, so we're just predicting the sum of the rewards up to the end. What's the correct prediction for the start state, C, the expected value of your return? If you end on the right side the return is 1, and if you end on the other side
the return has to be 0. You start in the middle, so what do we expect the return to be? By symmetry it's 0.5. And state B? Less than 0.5, and state A less still. Anyone want to guess the true values of all the states? C is definitely a half. What do you think B is? One third: yes, it's 1/3, and the next one is 1/6. They just go up by sixths: 1/6, 2/6, 3/6, 4/6, 5/6. Those are plotted here; this straight line is the true values: state A has a true value of 1/6, state B 1/3, C 1/2, and so forth. The other lines are the estimated values from applying TD.

With TD you do have to care about the initial conditions, because it is making a guess from a guess, so your initial guesses affect things: they either pollute things or brilliantly provide good starting values. Here the initial estimates are all one half: at episode zero, all the estimated values are 0.5, because the possible returns are zeros and ones, so starting everything at a half seemed reasonable. That happens to be exactly right for the middle state but quite wrong for the other states. Then we run episodes, learning on every time step, updating the states according to the TD rule. After one episode we get this darker line (the label is the episode number). So what must have happened on the first episode? What is the TD rule going to do? Say we start in the middle and move around. The TD error is the reward, which is zero in the middle, plus (forget about gamma, it's one) the change in value. If we go from one state to another whose estimates are both a half, the change is zero, so as we bounce around, nothing really happens until we run into one of the ends. We could run into either end. If we had ended on the right, we would go from a state estimated at a half to the terminal state, and a terminal state by definition has value zero, with a reward of one, so the TD error would be 1 + 0 - 1/2 = +1/2, and the estimate of the rightmost state would go up. That didn't happen, because its estimate is still at 1/2. What happened instead is that we ended on the left side: we went from a state with estimated value 1/2 to the terminal state with value 0 and reward 0, so the TD error is -1/2, and we move down from 1/2 toward 0. You can actually see how far we moved, from 0.5 to 0.45, so the step size alpha was one tenth. We can understand this algorithm; it's very simple. And then as you get more episodes you get closer and closer to the
true values. After 10 episodes you get the blue line; after 100 episodes you get close to the true values. You never get exactly to the true values, because there is always randomness in the individual episodes and alpha is nonzero (it's a tenth), so you keep bubbling around the true values. That's the example. Now let's compare Monte Carlo versus TD on this problem. Now we have to draw whole learning curves, and we have to worry about the value of the step size. What I'm showing is a learning curve: the x-axis is episode number, and the y-axis is a measure of error, the root-mean-square error averaged over the five states and over, I think, 100 runs of the whole experiment. As I said, the curves go down (everything gets better over time), but they won't go to zero, because with a step size of a tenth, or whatever it is, there is always some residual error.

Which should we look at first? Maybe Monte Carlo; it's simpler. You wait until you know the final return, then do your update for all the states that were visited. With a nice slow learning rate of a hundredth, it gradually moves down toward zero error, and in the long, long run this slowest setting actually gets closest to zero, but it's very slow. You might want to go faster: with alpha of a fiftieth you go down faster, but you start to bubble; try going as fast as 0.04 and the initial part is fastest, but now you're definitely bubbling. And you can't really do better: there is no step size that does better than the ones shown here. If you try to go faster, you may be slightly faster at the very beginning, but you level out at a higher error. For TD we see a similar pattern, but all the numbers are lower. The slowest one I'm showing is 0.05; it's slowest but gets lowest, and the faster ones bubble more and don't get as low in the long run.

Now, some of you may be wondering what's going on with the TD curves, because they go down and then seem to come up again. Anybody wondering that? It's weird, isn't it, and it's real; it's not a bug in my program, that's the first thing to be sure of. It has to do with the fact that starting the estimates at 1/2 is not stupid; it's actually a reasonable guess. If you started all the estimates at something really bad, you wouldn't see that bounce. The bounce, going down and then coming up a little higher, is really interesting; it comes from the fact that the initial estimates have some effect in TD, in a way they don't, or not as much, in Monte Carlo. But this is just one random walk, and I haven't been systematic about it. The big picture is that TD is faster; there's a bounce, okay, whatever, but it is still much faster. Still, this is just one problem, just a random walk. Maybe there's something special about the random walk; maybe if I did Atari games I
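A small sketch of this random-walk experiment with tabular TD(0), reconstructed by me from the setup described above, with the same initial values of 0.5 and step size 0.1:

```python
import random

STATES = ["A", "B", "C", "D", "E"]                        # left to right; exiting right gives reward 1
TRUE_V = {s: (i + 1) / 6 for i, s in enumerate(STATES)}   # 1/6, 2/6, ..., 5/6

def run_td0(episodes=100, alpha=0.1, gamma=1.0):
    V = {s: 0.5 for s in STATES}                 # all estimates start at one half
    for _ in range(episodes):
        i = 2                                    # every episode starts in the middle (state C)
        while True:
            j = i + random.choice([-1, 1])       # step left or right with equal probability
            if j < 0:                            # fell off the left end: reward 0, terminal value 0
                r, v_next, done = 0.0, 0.0, True
            elif j >= len(STATES):               # fell off the right end: reward 1, terminal value 0
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[STATES[j]], False
            V[STATES[i]] += alpha * (r + gamma * v_next - V[STATES[i]])   # TD(0) update
            if done:
                break
            i = j
    return V

print("estimates:", run_td0())
print("true:     ", TRUE_V)
```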
would get a more fundamental result. But I like to do simple things. (Question about convergence.) No, they all converge; well, with a nonzero step size that you don't reduce, you don't expect anything to converge exactly. They converge in the mean, and all of them converge to a mean that depends on the step size: a higher step size gives a higher asymptotic error, a lower step size a lower one. The convergence properties are roughly the same in both cases.

Now I want to ask whether I can say anything about the general case, not just the random walk. I will in a minute, but first let me do the random walk again under a setting I call batch updating. Batch updating means we take some training set, say a hundred episodes or ten episodes, and present it over and over again until the algorithm converges. Even with a finite step size you get complete convergence if you repeatedly present the same training set, because there is no randomness in the samples; you're presenting the same data over and over. And the two methods, TD and Monte Carlo, converge to two different things. This is for constant alpha: as long as the step size is small enough that you don't diverge, what they converge to does not depend on the step size, but the two algorithms converge to different things. So we can ask, on this problem, which one converges to the better thing if we present the data over and over again.

Here are the results on the random walk again. We vary the size of the training set along the x-axis, and for each size (say a training set of 50 episodes) we present those 50 episodes over and over until convergence and measure the asymptotic error. That error is now independent of the step size, so I've eliminated the effect of the step size and I'm simply measuring which algorithm gives the better result on this problem. And TD is better. By this procedure we do a lot more computation, we have to go to convergence, we have to present things repeatedly, none of which I like, but it gives us insight into the real difference between the two algorithms. It's as if TD is moving toward a better place even on a single pass, as the earlier results suggested, and if you go over the data again and again, you actually get there.

Again, this is all on the random walk, and you have to ask whether it happens on all problems. One approach would be to try all problems, which is obviously not satisfactory; instead you can try to prove a theorem, and you can try to get insight. I'll try to get insight first and then mention the formal result. So let's try to get insight into ourselves as predictors. I want you to be the predictor. Imagine you have some experience: a training set of these eight episodes, all very short. Most of them are episodes like "B, 0", meaning I'm in state B, I get a reward of 0, and the episode ends; or I see state B, get a reward of 1, and the episode ends. The only non-trivial episode is the first one: I'm in state A,
I get a reward of 0, then I go to state B, get another reward of 0, and that's the end of the episode. So that's the data you see, just these eight episodes, and I want you to tell me what predictions you would make. First question: what prediction would you make for state B? If you found yourself in state B, what would you guess for the expected return ahead of you? Three quarters. I agree, because state B occurred eight times, six of those ended with a one and two ended with a zero, so you'd guess three quarters. That was the easy one. What about state A? State A is much more uncertain: you've only been in state A once. Take a moment and think about it: if you found yourself in state A again, what would you estimate its value to be? I'll say right away that this is a question with multiple good answers. So give me one answer, and why it's a good answer. If you always go from A to B, and the only time you've seen A it went to B, and B has value three quarters, then A should also be three quarters: that makes sense. What's another good answer? You've seen A once, and every time you saw it the return was zero, so why not predict zero?

Those two answers are Monte Carlo's answer and TD's answer. We could say zero: that's what Monte Carlo says. Monte Carlo just looks at what happened: I was in A once, the outcome was zero, so I should predict zero. The other answer is what TD predicts, and it's also what you would predict; it's what the gentleman was doing in his head. He was saying: the only time I've seen A, it went to state B, and by the way the reward was zero on that transition, so let's guess that happens every time; and B I saw eight times, and six out of eight times it went to a one and stopped, and two out of eight it went to a zero. He was building a model in his head, and this has a name: it is the maximum-likelihood model of the MDP. It just means counting how often you go from each state to each other state and turning the counts into probabilities. It is the maximum-likelihood model of the underlying Markov process, and if you take that model and solve it, if that model were the true world, then the true value of A is 3/4. That is the general phenomenon of what TD does: if you present a training set to it over and over again, it gets the answer you would get by collecting all the data, building the maximum-likelihood model of the world, and then solving that model exactly, the true solution if that model were reality. And that is why TD can be faster, can be better: it is using the Markov property. It says: I've gotten to B, and I know B is a Markov state, so it doesn't matter how I got to B. Whereas Monte Carlo just says: I don't care what happened in between; the one time I was in A, I
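To make the two answers concrete, here is the counting argument behind each, written out for the eight episodes above; the data encoding is my own.

```python
# Each episode is a list of (state, reward) pairs, as described in the talk:
# one episode A -> B with two zero rewards, six episodes "B, 1", one episode "B, 0".
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte Carlo's answer for V(A): average the returns actually observed from A.
returns_from_A = [sum(r for _, r in ep) for ep in episodes if ep[0][0] == "A"]
v_A_monte_carlo = sum(returns_from_A) / len(returns_from_A)     # -> 0.0

# Batch TD's answer: build the maximum-likelihood model and solve it.
# A always went to B with reward 0; B ended with reward 1 in six of its eight visits.
v_B = 6 / 8                    # -> 0.75
v_A_batch_td = 0 + v_B         # A -> B with reward 0, no discounting -> 0.75

print(v_A_monte_carlo, v_A_batch_td)
```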
ended up getting a zero. So, to summarize: the prediction that best matches the training data is the Monte Carlo prediction. Remember the training data: you saw A once and it ended with a zero, so if you want to match the training data, the right prediction is V(A) = 0; that is the prediction that minimizes error on the training set. But of course I want to tell you that we don't want to match the training data; we don't want to minimize the mean-squared error on the training set. Weird, huh? It seems like we should, and that's why I've gone through this example at some length: I want you to have some intuition for why we don't want to minimize error on the training set. So what can I offer you instead? Minimizing error on future experience. We don't really care about the training set; it's past, it's spent. We care about the future, and if there are real stakes here, we believe the estimate V(A) = 3/4 will be a better match to future data: a new experience starting in A probably has a 3/4 chance of ending in a one. So it's interesting: we really have to distinguish between minimizing error on the training set and minimizing error on the future. These are different things, and TD can be faster because it takes advantage of the state property and matches future experience better.

Even as I say that, you may immediately sense a possible limitation of TD methods. They take advantage of the state property (once I get to B, it doesn't matter how I got to B), but in real life you don't normally have complete state knowledge. Any time you're using function approximation (here we're just using discrete states), you're going to have imperfect state information. So in the end it's a mix, and it's a real question which will win in practice, though in the end I think it's going to be TD that wins.

Okay, good time for questions. (Question about what the value of B means in the example.) No, V is still predicting the sum of the rewards from that state to the end, and remember the example was: A is followed by a reward of zero, then B, then another zero. Thank you for clarifying that. (Question, partly inaudible, about whether TD's computational advantage depends on the size of the representation.) Yes, let's think that through. Assume that instead of a table lookup (this has all been table lookup) we have a complicated neural network. Then when we get a new error we have to backpropagate through the network, which is a somewhat expensive computation. But should we consider it expensive? I'm going to say no, because backpropagating one error through the network has the same order of complexity as a forward pass through the net, and we already had to make a forward pass to get the prediction in the first place. So the update costs about the same as the prediction, and even though it touches a whole bunch of weights, we should consider it cheap:
It's linear in the size of the network. Good — any other questions? Okay.

So far I've only done one-step methods, tabular methods, model-free methods. All of these qualifiers can be generalized, but even in this simplest case — one-step meaning we look from one step to the next, rather than from one step to five steps ahead as in A3C — we can see the basic ideas. Tabular is easy to think about, but all the ideas really do carry over to the network case, the complicated function approximator. We've seen the basic things: we bootstrap and we sample, combining aspects of dynamic programming and Monte Carlo; TD methods are computationally congenial — just a little bit of work on each step, with no need to wait until the end and then do a whole bunch of work; and if the world is truly Markov, then TD methods are faster. That's what we saw, and it has to do with past data versus future data.

Now, before I go on to something new, I'd like to summarize where we are pictorially. Today we've been contrasting TD methods and Monte Carlo. A one-step TD method is what this little picture summarizes — it's a picture of exactly the same update: it says I go ahead one action and one next state, and I use my estimate there to improve the estimate at the top. It's a picture of the algorithm. And there's the same kind of picture for Monte Carlo: to improve the estimate of this value, you go ahead state, action, state, action, state, all the way to the end of the episode, you see the final outcome, and then you back all of that up. So this is a dimension, and you can occupy intermediate points along it: you can do two-step methods, three-step methods, four-step, five-step methods, and Monte Carlo is like an infinite-step method where you go all the way to the end of the episode. And then there's the parameter λ that you might have heard about in TD(λ), the eligibility-trace parameter. It's really a bootstrapping parameter: it's not literally the number of steps, but it's analogous to the number of steps (the standard notation for both is written out at the end of this passage). Either way, this really is a dimension, and we can occupy any point along it.

Now there's a second dimension, which is: are we going to use planning — are we going to use knowledge of a model of the world? Dynamic programming is this corner. It still means one-step methods — we only go ahead one step and use our estimates at that one step to look into the future — so moving along the top here, you keep these short, one-step backups, but instead of taking one sample you consider all the possibilities at the next step: that's dynamic programming. And then there's a fourth corner, the analogue of Monte Carlo but with planning, which is like exhaustive search: you consider all the possibilities all the way to the end. So we get these four corners as the classic methods, and we can occupy all the area in between them. That's a pretty big space of reinforcement learning methods, although it's certainly not the whole space.

Okay, where should we go next? I don't have time to present everything I wanted to say, and we've done well as a group here — I could almost wrap up and talk about the future from this point — but let me tell you some of the things we would talk about if we had more time.
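For reference, the dimension just described can be written out in standard notation; this follows the usual Sutton-and-Barto conventions rather than the exact slides. The n-step return bootstraps from the value estimate n steps ahead, and TD(λ) mixes all of the n-step returns geometrically:

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}

so n = 1 (equivalently λ = 0) is one-step TD, and n going all the way to termination (λ = 1) is Monte Carlo.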
First, instead of estimating state values, we would talk about estimating action values, because, as you know, for control you really want action values. It's not that different: you estimate q_π instead of v_π — lowercase, because that's the true value function — and you estimate it with an action-value estimate; since this is still tabular, I can write it as Q(S_t, A_t) for the actual state and action encountered. Then the update is essentially the TD update, just on state–action pairs rather than on states. It's still a TD error: here's my original estimate, here's the estimate for the next state–action pair. This is the Sarsa algorithm, and it's quite straightforward: the Sarsa algorithm is just that update rule applied over and over again, and there are some examples.

Then Q-learning. Q-learning is almost the same: the rule is that the new state–action value is the old one plus a step size times the TD error, but the TD error is a little bit different. We're still comparing our original estimate to the reward plus something about the next state, but now it's the max over the possible actions at the next state. That's Q-learning. I like to draw this picture: it says I'm updating a state–action pair; I look at all the possible estimated action values at the next state and take the max of them — that's this part of the rule, the max over all possible a — and then I back that max up to improve the estimate at the top. That's Q-learning: a TD algorithm with that particular target. And here is a nice example comparing Sarsa and Q-learning. Sarsa is an on-policy method, Q-learning is an off-policy method, and we see that here — the y-axis is reward per episode — Sarsa actually works out better. I don't have time to explain this example properly, and I didn't want to show you something you wouldn't really understand, so better to skip over it.

Okay, here's one sort of new algorithm: Expected Sarsa. If we look at the picture, Q-learning takes all the possible things you might do and takes the best of them — the max; the arc in the diagram means max, and if there's no arc it means expectation. So Expected Sarsa says: don't take the best of the things you might do; take the expectation, according to how you would actually behave under your policy. We sum over all the things we might do, weighted by how likely we are to do them under our policy — which we know; we know our own policy, maybe it's ε-greedy — so we take the expectation, the action value times our probability of taking that action, and we back that up. That's arguably an improved version of Sarsa, and it can also be made into an off-policy version; there are some other novelties you can get with an off-policy version of Expected Sarsa. (All three updates are sketched in code after this passage.)

Now, I've used the phrase "off-policy" a couple of times without explaining it — I'm sorry about that. Off-policy means you're learning about a policy that's different from the one you're following; on-policy means you're learning about the same policy as the one you're following, the same one that's generating the data. The way to remember it is that "on-policy" is almost "one policy": in the on-policy methods there is only one policy — the policy you're doing is the policy you're learning about. But very often you want these to be different.
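Here is a sketch of the three tabular action-value updates just described, assuming a Q table indexed as Q[state][action] (a dict of NumPy arrays) and an ε-greedy policy for the Expected Sarsa expectation; the function names and interfaces are mine, not the slides'.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, done):
    # On-policy: bootstrap from the action actually taken next.
    target = r + (0 if done else gamma * Q[s2][a2])
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma, done):
    # Off-policy: bootstrap from the best action at the next state.
    target = r + (0 if done else gamma * np.max(Q[s2]))
    Q[s][a] += alpha * (target - Q[s][a])

def expected_sarsa_update(Q, s, a, r, s2, alpha, gamma, done, epsilon):
    # Bootstrap from the expectation over the epsilon-greedy policy's actions.
    if done:
        target = r
    else:
        n = len(Q[s2])
        probs = np.full(n, epsilon / n)
        probs[np.argmax(Q[s2])] += 1 - epsilon
        target = r + gamma * np.dot(probs, Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])
```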
You might want to behave in a way that is more exploratory, and yet you might want to learn the optimal policy. So if you're going to learn the optimal policy but you're going to gather your data in an exploratory way — which is just not going to be optimal — then you have two policies, and you're in the realm of off-policy learning. Q-learning does this, but off-policy learning is theoretically more difficult and more challenging for our methods. Okay, that's off-policy. So I've basically just extended these ideas to control, and we've seen some methods that handle the on-policy case and the off-policy case. We didn't talk about double Q-learning.

Okay. So I've talked a lot about whether you want to use Monte Carlo or TD, and there's a sense in which you don't have to choose, because if you use TD you can parametrically vary λ — or vary the height of your backups — to get any intermediate point between one-step TD and Monte Carlo; you can get both. A key way of doing this is with the parameter λ, the bootstrapping parameter, which I haven't really had time to talk about, but it is a way to vary parametrically between TD and Monte Carlo. If λ = 0 — the left side of all these graphs — that's pure TD, pure bootstrapping; if λ = 1, that means you're not doing any bootstrapping: it's Monte Carlo. So all these graphs have λ across the bottom, going from pure TD to no bootstrapping at all, and they all have a measure of performance on the vertical axis where, in every case, lower is better — it's like mountain car, where you want to take few steps to get to the top of the hill. And what you see, looking at these, is that performance depends on λ. This is a random walk, and it's actually not best at λ = 0: pure TD is not the best; you can do better with some intermediate amount of bootstrapping, somewhere between pure TD and Monte Carlo. But if you go all the way to λ = 1, things get really bad — that's like the worst case in general, and that's the pattern: λ = 1 is Monte Carlo, and Monte Carlo has really high variance, so it has to be ruled out. It's not a happy place to be if you're committed to Monte Carlo and can't do what TD does and pick any point in between; you want to have this facility of doing some bootstrapping, and that's some evidence for it. Even though this is old data, I think Peter would agree that Monte Carlo, done in a pure way, is not really an efficient strategy.

Okay, now I want to give one slide — and also take questions — on the linear case, the case with real function approximation. I'm not going to go to nonlinear networks, but I want to go to something you can say a lot about theoretically, which is the linear case. So suppose we're doing linear function approximation. Linear function approximation means that our estimated value is formed as an inner product between a parameter vector — a weight vector — and a feature vector. So φ_t is our feature vector for the state at time t, and the parameter vector is θ, which you might think of as the weights of your network; this is a linear network. We take the inner product — the transpose notation means inner product — so θᵀφ_t is our estimated value of the state at time t.
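Putting the two threads of this slide together — the bootstrapping parameter λ and the linear estimate θᵀφ — here is a minimal sketch of linear TD(λ) with accumulating eligibility traces. The trace style and the episode interface are my assumptions; at λ = 0 it reduces to the linear TD(0) rule discussed next.

```python
import numpy as np

def linear_td_lambda(episode, phi, theta, alpha=0.1, gamma=1.0, lam=0.8):
    """episode: list of (state, reward, next_state, done); phi maps a state
    to a feature vector; theta is the weight vector, updated in place."""
    z = np.zeros_like(theta)                   # eligibility trace
    for s, r, s2, done in episode:
        v = theta @ phi(s)
        v_next = 0.0 if done else theta @ phi(s2)
        delta = r + gamma * v_next - v         # TD error
        z = gamma * lam * z + phi(s)           # accumulate the trace
        theta += alpha * delta * z             # credit recently visited features
    return theta
```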
So θᵀφ_t, the inner product with the feature vector for the state at time t, is our estimated value at time t, and θᵀφ_{t+1} is our estimated value at time t+1 — so this really is a TD error, and this is a TD rule. The TD rule is that the new parameters are the old parameters plus a step size times the TD error times, in the general nonlinear case, the gradient of the prediction with respect to the parameters; in the linear case that gradient is just the feature vector φ_t. So the rule should be fairly familiar to you: it's just the TD rule in the form of stochastic gradient descent. Lots can be said about that, but that's the standard algorithm, linear TD(0).

If you look at this rule, you can rewrite it: take the φ_t and carry it inside — with a little bit of transpose manipulation you can write it that way — and this is a vector. Now take the expectation. In expectation, the new parameter vector is the old one plus a step size times this quantity, and this quantity, in expectation, what is it? I'm just going to give names to the pieces: this piece is a vector, and this piece is a matrix times θ. The vector I'll call b — b is just the expected value of that first piece; it's a well-defined vector, you don't know it, but it exists. And the other piece is a matrix, because it's an outer product of the feature vector with the change in the feature vector; the expectation of that matrix is what I'll call A. That lets me write the whole expected update compactly. Now I'm interested in the fixed point: where will this settle, in expectation? Where will the expected update be zero? The expected update is basically this last part, so I want to know when it is zero: it's zero when b = Aθ, that is, when b − Aθ = 0. The θ for which that is true is a special θ; I call it θ_TD, because it's the fixed point that linear TD converges to. And you can just compute it: since b = Aθ_TD, take the inverse of A, and the TD fixed point is θ_TD = A⁻¹b. That, by the way, is the key to another algorithm, least-squares TD: estimate A directly — even though it's a matrix — take its inverse, estimate b directly, and multiply them together. That's how least-squares TD works (a rough sketch in code follows at the end of this passage).

But this way of computing what the algorithm converges to also lets you say something theoretical about it. The guarantee you get is that the mean squared value error — a measure of how far off the values are — at the TD fixed point is bounded by an expansion factor, 1/(1 − γ), times the mean squared value error of the best θ. So we don't find the best θ, but we do get within an expansion of it. Anyway, that's what the theory would look like if we had more time to talk about it.

I just want to mention quickly some of the frontiers, the things people are working on now. Off-policy prediction is a big area: people are trying to generalize all of this to the off-policy case. We'd also like to have theory for the case of nonlinear function approximation — there's just a little bit of that — and there's very little convergence theory for control methods, period; I think Csaba will maybe talk about that tomorrow.
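As a companion to the derivation above, here is a rough sketch of least-squares TD: accumulate sample estimates of A and b from transitions and then solve Aθ = b directly. The small ridge term added for invertibility is my own addition for numerical safety, not part of the derivation.

```python
import numpy as np

def lstd(transitions, phi, d, gamma=1.0, ridge=1e-3):
    """transitions: list of (state, reward, next_state, done);
    phi maps a state to a d-dimensional feature vector."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, r, s2, done in transitions:
        x = phi(s)
        x2 = np.zeros(d) if done else phi(s2)
        A += np.outer(x, x - gamma * x2)   # sample of E[phi (phi - gamma phi')^T]
        b += r * x                         # sample of E[R phi]
    theta_td = np.linalg.solve(A + ridge * np.eye(d), b)
    return theta_td
```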
We'd also like to say things beyond convergence: how well can you do in a finite amount of time, how fast do you converge. And when you combine TD with deep learning, a lot of different issues come up, and I think there's just a lot of uncertainty. Do we really need a replay buffer? One of the folk theorems is that without one you get instability and "correlation", so we need this replay buffer — but I think there are really lots of open questions about what happens when we combine TD with deep learning. And finally, there's the idea of predicting things other than reward. Remember, I started with that: TD is a general multi-step prediction method, we want to use it to predict other things, and in particular we want to use it to learn a model of the world.

So, in conclusion, I guess what I want to say is something like this. TD learning is a uniquely important kind of learning; maybe it's ubiquitous, maybe we're always going to be using it — I think this may be true, though it's a hypothesis. It's learning to predict, which is perhaps the only scalable kind of learning, and it's a kind of learning that's specialized for general multi-step prediction, which may be the key to perception, to modeling the world, to the meaning of our knowledge. Its key idea is to take advantage of the state property, which can make it fast and efficient, but which can also make it asymptotically biased. Its key claim to fame is that it's computationally cheap and congenial, and we're only beginning to explore the different ways we might use it for things other than reward. Thank you very much. [Applause]
Info
Channel: The Artificial Intelligence Channel
Views: 51,842
Rating: 4.84 out of 5
Keywords: singularity, transhumanism, ai, artificial intelligence, deep learning, machine learning, immortality, anti aging
Id: EeMCEQa85tw
Length: 86min 24sec (5184 seconds)
Published: Wed Sep 06 2017