Lecture 06: Nonparametric Regression

Captions
What's happening, everyone? Okay, y'all have got to get here earlier -- I told you, I wasn't kidding when I said that pretty soon I'm going to start locking the door at 1:30. Did everybody who was on the waiting list get in? Is there anybody still on the waiting list? All right, we've solved that problem; that's nice. So we're going to continue with nonparametric regression today; in particular we'll finish up the kernel estimators. Any questions before I begin? Actually, maybe I should try locking the door now -- you're just in time, because I was literally just about to lock it. All right, you've got to be here on time from now on.

So let's remind ourselves what the kernel estimator is. This is the form of the kernel estimator: it's a local average of the Y's. If we call all of this the i-th weight -- oh, someone's trying to get in; you've got to get here before I lock the door at 1:30, I wasn't kidding, and next time I'm adding an electric charge so you get a shock -- so let's call this the i-th weight at the target point x. We've got another one coming in. The important thing is that the estimator is a linear combination of the Y's. The linear combination depends on the target point, but it is a linear combination of the Y's. (In psychology this is called reinforcement training -- reinforcement learning, whatever it's called; I'm a statistician.)

So it's a linear combination of the Y's, and that's important: it's going to be a common theme. In fact, here's an interesting thing that isn't in the notes until later, but I want to mention it now. We have this estimator of the regression function, and remember, it's defined not just at the training data -- it's defined at all x. However, it's interesting to examine its behavior at the training points. So suppose these are the training points, and let's look at what are usually called the fitted values, Y-hat 1 up to Y-hat n; that just means the regression estimator evaluated at X1, X2, ..., Xn. Remember that the estimator is a linear combination of the data, the Y's, but the linear combination depends on what point you're at -- that's the kernel. So I'm writing it as an inner product: m-hat(X1) is some row of weights times the column (Y1, ..., Yn). Now if I look at m-hat(X2), I've moved somewhere else, so of course I'm changing the weights -- I've shifted the kernel over -- but it's still a linear combination: this row times this column gives me m-hat(X2).

So what does that give me? In matrix terms, Y-hat, the vector of fitted values, is some matrix L times the data vector Y. Does that look familiar? In linear regression we had the very same equation: Y-hat was equal to the hat matrix times Y. That was a projection matrix -- an idempotent matrix that projects onto a lower-dimensional subspace. Nonparametric regression has the same form. The matrix is no longer a projection matrix -- it's called a smoothing matrix -- but the estimator is a linear smoother. Now, don't confuse these; this is important: a linear smoother is not the same thing as linear regression. By "linear smoother" I mean a matrix operating on the data, so it's a linear operator.
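A minimal numpy sketch of that point -- the kernel estimator written as a linear smoother, with fitted values L @ Y. The Gaussian kernel, the helper names, and the toy data are illustrative choices, not from the lecture:

```python
import numpy as np

def kernel_weights(x0, X, h):
    """Kernel-regression weights ell(x0) at a target point x0 (Gaussian kernel)."""
    K = np.exp(-0.5 * ((x0 - X) / h) ** 2)
    return K / K.sum()                          # weights sum to one

def smoothing_matrix(X, h):
    """Matrix L whose i-th row is the weight vector used at the i-th training point."""
    return np.vstack([kernel_weights(x0, X, h) for x0 in X])

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 100))
Y = np.sin(4 * np.pi * X) + 0.3 * rng.normal(size=100)

L = smoothing_matrix(X, h=0.05)
Y_hat = L @ Y                                   # fitted values: a linear map of the data
m_hat_new = kernel_weights(0.37, X, 0.05) @ Y   # same weights give the estimate at any new point
```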
That doesn't mean the function we're fitting is linear. But there is a really important connection here: there's a very similar thing going on in linear regression and in nonparametric regression -- it's still a linear smoother. In fact, what we're going to see is that many of the different nonparametric estimators you might know about -- reproducing kernel Hilbert space regression, Gaussian process smoothers, spline smoothers -- are all linear smoothers; they're all just different choices of the matrix L. And here's an interesting thing: you can take any method you want for getting a nonparametric regression estimator, rewrite it this way, and if you plot each row, what you'll see is how much weight it puts on each observation. It doesn't matter what method you start with -- it's going to look like a kernel. So smoothing kernels are a good way to think about these things, because even estimators that start out very differently, when you actually look at what they're doing to the data, typically behave like some sort of kernel implicitly. That's a good thing to keep in mind as we go through the notes. So the kernel estimator is an example of a linear smoother; we'll return to that point in some detail when we talk about different smoothers, but it's a good unifying view of many different nonparametric estimators. However, not all nonparametric estimators are linear smoothers -- there are exceptions. If Ryan talks about what I think he's going to talk about in a week, he'll cover a nonlinear smoother, for example. So many nonparametric estimators are linear smoothers, but not all.

All right. I had written down a key theorem, the bias-variance tradeoff -- this was on page 5. There was a complicated bias term which, as we saw, had some undesirable features like boundary bias and design bias, then a variance term which is 1/(nh) times some other stuff, plus lower-order terms. I want to outline the proof -- the full proof is a little long, but I can outline the main idea -- then we'll talk about the multivariate version, and then about local polynomial smoothers.

So, still on page 5, we're going to look at m-hat minus m. I erased the definition of the kernel estimator, so let me put it back: it's a ratio of two sums. We know it's easy to analyze averages -- averages converge to things -- so I'm going to multiply top and bottom by 1/n, which doesn't change anything, and call the top a-hat(x). Does the bottom look familiar? It's a kernel density estimator. You might notice the 1/h is missing in front, but it's missing on both the top and the bottom, so we can either cancel it or put it in on both; once it's in, you can see the denominator is just the kernel density estimate p-hat(x). (Oh, these guys are really late. You're late, I'm sorry. Where were we?)

So we rewrite the expression in terms of these: a-hat over p-hat, minus m. The first thing we do is multiply and divide by something to get it into a more convenient form -- I want to multiply and divide by a quantity of the form one minus something. That might not seem like it accomplishes much, but the idea is that I'm trying to get rid of the random denominator.
What makes the analysis a bit more complicated is that the denominator and the numerator are both random. Now, if I multiply this term by this, and that term by that, I get two terms, and the nice thing is that this p-hat kills that p-hat, so I end up with a deterministic denominator. After a little algebraic simplification you get this term plus the next term -- I really haven't done anything except rearrange terms. This is just a proof outline, so notice the following: the first piece has some error -- it's the thing we're trying to analyze, something going to zero at some rate -- but the second piece is a product of two things that are each going to zero, so it goes to zero even faster. So the dominant term is the first one, and the whole thing is approximately equal to it. There's a bit of hand-waving there, but hopefully it's convincing. The simplification is that the denominator is now deterministic instead of random, which makes the analysis easier. So clearly we need to analyze the random numerator a-hat(x) and the density estimate p-hat(x). We analyzed p-hat(x) in 705, and a-hat(x) is very similar -- we can use very similar techniques.

Remember that a-hat(x) is an average, so its expected value is the expected value of any one term: (1/h) times the expected value of Y_i K((x - X_i)/h). Now write Y_i = m(X_i) + epsilon_i, so I can break this into two pieces, one with m(X_i) and one with epsilon_i. What's the mean of the epsilon_i piece? Zero -- epsilon_i always has mean zero whether or not it's independent of X_i; first take the conditional expectation given X, then uncondition. So I can just stick in m(X_i), and what's random is X_i, so what I get is the integral of m(u) K((x - u)/h) p(u) du. If you remember how we analyzed the kernel density estimator, this is pretty much the same thing, except we didn't have the m term. Do you remember the trick -- how we put that into a form we could analyze? Change of variables. Define t = (x - u)/h; the Jacobian kills the h, and you get the integral of m(x + th) K(t) p(x + th) dt. Hopefully that rings a bell: if it weren't for the m factor, it would be exactly the analysis we did for the kernel density estimator. And what was the next step there? A Taylor series expansion: p(x + th) is p(x) plus th p'(x) plus (th)^2 / 2 times p''(x), and we do the same thing for m(x + th). Now let me explain what happens: I have two factors with three terms each, so when I multiply them out I get something like nine terms -- that's why it gets a little messy. Some terms are simple, like the leading m(x) times p(x); the question is what happens to the cross terms when I multiply through and integrate against the symmetric kernel.
When I integrate against the symmetric kernel, anything with an odd power of t -- a t or a t cubed -- integrates to zero by symmetry. So you multiply everything out, integrate, all the odd terms vanish, anything with an h^4 is lower order, and you get the expression at the top of page 6 -- that's how I got it. Then you have to subtract off the corresponding term for the denominator, and for that we use the same analysis from 705, which I wrote down to remind you: in 705 we showed that the expected value of p-hat(x) is p(x) plus (h^2 / 2) p''(x) times something I'm calling mu_2. Whenever you see a mu_2 in the notes, it's just the second moment of the kernel, the integral of t^2 K(t) dt. So you take this times this, subtract what I just wrote out, and it's all just algebra -- that's why the final expression ends up being so messy. But the important point is that now you can see where the m' comes from, why there are derivatives, and where the p' comes from -- remember, we talked about the design bias, the p' term. It's all just coming from the Taylor series approximation: multiply it out, integrate, and we get stuck with all this extra stuff. The important thing is the result, but I wanted you to have some sense for where these terms come from.

The other thing to remember is that there's an h^4 in front of the squared bias. That comes from this analysis: there are h^2's floating around, and when you square the bias you get h^4. But that's only true at an interior point. If you do the whole thing at a boundary point, the Taylor series doesn't work out the same way, and you end up with a bias at the boundary that's much larger -- a squared bias more like h^2. I didn't include that analysis.

So, in case I'm confusing you -- I know it's getting a bit technical -- the two key points are these. Just look at equation 15 and keep in mind that the bias has two unpleasant properties: the presence of p' says that any time the density is irregular or unsmooth you get extra bias, and -- the part I left out -- any time you're near a boundary there's extra bias.
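For reference, here is the standard interior-point expansion that this kind of argument produces; it should match equation 15 in the notes up to notation and constants. Here mu_2(K) is the second moment of the kernel and sigma^2(x) = Var(Y | X = x):

```latex
\mathrm{bias}\big(\hat m_h(x)\big) \;\approx\;
  \frac{h^{2}\mu_{2}(K)}{2}\left( m''(x) + \frac{2\,m'(x)\,p'(x)}{p(x)} \right),
\qquad
\mathrm{Var}\big(\hat m_h(x)\big) \;\approx\;
  \frac{\sigma^{2}(x)\int K^{2}(t)\,dt}{n\,h\,p(x)} .
```

Squaring the bias gives the h^4 term and the variance gives the 1/(nh) term; the m'(x) p'(x) / p(x) piece is the design bias, and the expansion is only valid at interior points -- at the boundary the squared bias is of order h^2 rather than h^4.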
And just so you know this matters in practice: a lot of high-dimensional prediction makes use of something we've noticed in the last ten years, something that's supposed to be helping us -- data in high dimensions often don't tend to be nicely, uniformly spread out. If you've heard about manifold learning, you know that high-dimensional data sets often concentrate around lower-dimensional sets. A famous example: take pictures of someone's face as they move their head around. The number of pixels might be large, so the dimension of the space is very large, but if you follow the data in that space, it's really only moving around on a lower-dimensional set. It might not be perfectly on a subspace, but the point is that the distribution of data in high dimensions shouldn't necessarily be thought of as a nice smooth distribution -- it might live on some really weird low-dimensional set. That's actually a good thing, something we take advantage of in statistics and machine learning. But what does it mean when the distribution sits on some weird lower-dimensional set? It means the density is very peaked -- it means p' is huge. So we're incurring a huge bias in our prediction error from exactly the structure that should be helping us. That's why that expression is important. I'll tell you how to get rid of the bias in a few minutes.

Before I do that, let me say that the theorem so far was for one dimension; Theorem 4 is the exact same theorem for the d-dimensional case. I see a typo, though: in Theorem 4 the expected value of m-hat minus m should be squared -- I forgot the square. But you'll see it's the same thing: there's a bias term and a variance term; it's just that now there are gradients instead of ordinary derivatives, and the proof is the same, using a multivariate Taylor series instead of the one-dimensional one.

This is a nice example, I think, of how analyzing a learning algorithm gives you insight you wouldn't naturally guess ahead of time. Look at the higher-dimensional result: the bias is of order h^4 -- an interesting thing is that the bias is never affected by dimension. But the variance term, ignoring constants, was 1/(nh) in one dimension, and in d dimensions -- it may be a little hard to parse in the theorem -- it's 1/(n h^d). That's where the dimension comes in: it's the variance that increases with dimension, not the bias. (I've left out all the complicated constants, which now involve the gradient of p, so they depend on how the density varies.) So now we can ask: what is the optimal bandwidth? If we minimize, we get h of order n^{-1/(4+d)}, and plugging that back in, the mean squared error is of order n^{-4/(4+d)}. So there's the curse of dimensionality going on.

More generally, by the way, I snuck in some assumptions: there were second derivatives all over the place, so the smoothness I assumed was two derivatives. If I redid the analysis assuming, say, seventeen derivatives, you'd get a different answer. You always get a rate that will start to look familiar: n^{-2 beta / (2 beta + d)}, where beta you can think of as the Hölder or Sobolev smoothness -- roughly, how many smooth derivatives we assume exist. The typical case people work with is beta = 2 -- for some reason that's the most common -- which gives the 4/(4 + d). So there are two different things going on. There's the curse of dimensionality, for sure -- you can see the d there -- and we'll deal with that later to the degree we can. And there's this extra thing nobody talks about, the design bias and the boundary bias, which tends to get overlooked. It doesn't affect the rate, it affects the constants, but it can matter a lot in practice. We haven't talked about the practical aspects of how to choose the bandwidth yet; this is still just an analysis of the basics of the kernel predictor.
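A quick bit of arithmetic (my own, not from the notes) on the n^{-4/(4+d)} rate above: the sample size needed to match the accuracy that n = 1000 points give you in one dimension grows explosively with d.

```python
target = 1000 ** (-4 / 5)               # MSE order achievable with n = 1000 in d = 1
for d in (1, 2, 5, 10, 20):
    n_needed = target ** (-(4 + d) / 4) # solve n^(-4/(4+d)) = target for n
    print(f"d = {d:2d}: n of order {n_needed:.3g}")
# d = 1 gives 1000, d = 5 about 2.5e5, d = 20 about 2.5e14 -- the curse of dimensionality.
```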
So before I go on: any questions about the kernel predictor? One nice thing about it is that I hope it's intuitive -- to predict at a point, you just take a local average. Computationally that has advantages too: you can do lazy prediction. I never have to compute the whole predictor; when you go to predict at a particular point, you just take the points in a neighborhood and average them. You don't need to store the whole predictor. And we have these results about the mean squared error, the bias, the variance, the optimal bandwidth, the optimal rate, and so on. We're going to see later in the course, when we do minimax analysis, that this is the minimax rate under the assumption of two smooth derivatives: if you assume m'' exists and is smooth, you can't beat this rate. So we're only going to try to improve the practical aspects now, not the rate; we'll prove something about minimaxity later on.

Let's move on, then, to this question: is there a simple way to get a slightly better-behaved estimator that still has the nice feature of being simple and based on local smoothing? To go back a bit, when people noticed these properties there was quite a bit of research on how to get rid of design bias and boundary bias. People constructed some fairly complicated things, until finally it was realized there's actually a very simple fix. To see what it is, let's revisit where the kernel regression estimator even comes from in the first place. If I want to predict right here -- give me a prediction at this point -- one way to think about it is that we're minimizing some sort of prediction error. Suppose you had to pick one constant c to use as the prediction no matter what x you were at; then you might minimize the training error, the sum of (Y_i - c)^2. What's the answer if I do that? Y-bar. That's the best predictor in some sense, but obviously it's constant over the whole range. If I want to localize it -- if I say, no, I'm allowed a predictor that's specific to this local region -- then you can kernelize it: put a smoothing kernel that places high weight on points near where you're trying to predict, so the constant depends on where we want to predict. We're localizing the training error. If you take the derivative and set it equal to zero, you get exactly the kernel estimator back. So that's just another way to think about the kernel estimator: it minimizes a localized sum of squares.
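Spelling out that one-line calculation (nothing here beyond what was just said): setting the derivative with respect to c to zero gives the sum of K((x - X_i)/h)(Y_i - c) equal to zero, so

```latex
\hat c(x) \;=\; \operatorname*{arg\,min}_{c}\; \sum_{i=1}^{n} K\!\left(\frac{x-X_i}{h}\right)(Y_i - c)^{2}
\;=\; \frac{\sum_{i=1}^{n} K\!\left(\frac{x-X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\!\left(\frac{x-X_i}{h}\right)}
\;=\; \hat m_h(x).
```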
Thinking about it that way suggests an idea. What we're doing is fitting a constant everywhere: at this point, find the constant that minimizes the localized sum of squares. That suggests the following: why fit a local constant? Maybe I should fit a local line, or a local quadratic -- a local polynomial. That's what a local polynomial estimator is. Instead of minimizing with just a constant -- I'll call it beta_0 now -- which is what we've been doing so far, we can also add a linear term, and we could add a quadratic term. I'm going to do this in one dimension for the moment, and just go up to the linear term. In other words, it's least squares: I'm fitting a straight line, but I'm doing it locally. If I do this, you can think of the estimator in a neighborhood of a point x as m-hat(u) = beta-hat_0(x) + beta-hat_1(x)(u - x), for points u in the neighborhood of x -- I've parameterized the line so that it's centered at x. So I'm going to get out a line instead of a number, where beta-hat_0 and beta-hat_1 come from doing least squares -- from minimizing this criterion -- except there's a kernel in there, so it isn't quite ordinary least squares. Of course, we're only doing this to get an estimate at the point x itself, so the linear part drops out and the estimate is just beta-hat_0(x).

Let me clarify something: we get back a single number, but it's not the same as fitting a constant. We fit a line and then pull off the intercept of that line -- and that's the estimate at x only because we centered the line at x. If you fit the constant, you get the kernel estimator; if you fit the line, you get a beta-hat_0 and a beta-hat_1, and that beta-hat_0 is not equal to the c-hat -- the fact that you also fit beta_1 affects what you get for beta_0. Intuitively, we're just fitting a local line and evaluating it at the point. And the amazing thing is that this removes both the design bias and the boundary bias automatically.

It's also easy, because we can write down the exact solution. Write the criterion in matrix form and take the derivative, and you can solve it explicitly. Here's what you get. The estimate beta-hat -- remember it depends on the target point x; you get a different value at each point -- is a vector; we only really care about the first component, but the process involves fitting both. The formula is just a least squares formula, the only difference being the presence of the kernel. I don't know whether those of you who have taken regression saw weighted least squares as well as ordinary least squares: if you do least squares with weights, you get a modification of least squares. Why put weights in? For example, because some observations have higher variance than others. In this case the weights come from the kernel: points farther from the target point get lower weight. So what we get is the weighted least squares solution, beta-hat(x) = (X^T W X)^{-1} X^T W Y, where the matrix of weights comes from the localization. Let me write out what these matrices are: this looks exactly like an ordinary least squares estimator except for the weight matrix. And remember, we're not doing one least squares fit -- we do this at every target point x -- so the design matrix depends on the point x we're at: its rows are (1, X_i - x), with the column of ones for the intercept, and W is the diagonal matrix of kernel values, K((x - X_1)/h) and so on down the diagonal, zeros elsewhere. This all comes from taking the criterion, writing it in matrix form, differentiating, and setting it equal to zero -- that's it. If I really wanted to be precise, I should put a subscript x on these matrices, because they depend on the point at which you're predicting.
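A minimal sketch of that weighted least squares fit at a single target point, directly from the formula above -- beta-hat = (X'WX)^{-1} X'WY with the intercept as the prediction. The Gaussian kernel, the function name, and the toy data are illustrative choices; R packages that do local polynomial regression do essentially this internally.

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear estimate m_hat(x0) via kernel-weighted least squares."""
    D = np.column_stack([np.ones_like(X), X - x0])     # design rows (1, X_i - x0)
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)             # kernel weights
    W = np.diag(w)
    beta = np.linalg.solve(D.T @ W @ D, D.T @ W @ Y)   # (X'WX)^{-1} X'WY
    return beta[0]                                     # intercept = estimate at x0

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 200))
Y = np.sin(4 * np.pi * X) + 0.3 * rng.normal(size=200)
print(local_linear(0.5, X, Y, h=0.05))
```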
I'm writing it in a very compact form, but you can see it's so simple that you can program it very easily. It's built into R in many packages, and I'd imagine it's built into MATLAB too -- I don't do MATLAB, maybe some of you know. So this is not much harder than doing kernel regression; it's fairly quick. We're not going to prove the results, but it is true -- I'll state the result in a minute -- that this gets rid of all those undesirable properties without increasing the variance. And remember, after you've done this, your estimate is beta-hat_0(x): at each point x you form the corresponding design matrix and weight matrix, do weighted least squares, and pull that first component off -- that's your predictor. There's a multivariate version too; I guess I didn't write it down.

Now let me state the key theorem, which we won't prove because it's quite involved. I should give you a reference: there's a book on local polynomial estimation by Fan and Gijbels -- and that's only an approximation to the second name; I don't know how to pronounce it, I certainly can't spell it, and when I asked Fan how to pronounce his own co-author's name, he said he didn't know either. But it's a very nice book, with all the theory of local polynomial estimation. The key result is this: here is the risk of the local linear estimator. I already see a typo. I actually wrote the result in quite a bit of generality -- I allowed a whole matrix H of bandwidths instead of a single bandwidth -- but nobody does that in practice, so let me write down what the result really looks like. It's h^4 times some constants, plus 1/(n h^d) times sigma^2(x) / p(x), times some constants. So we see the same rate. Here sigma^2(x) is the variance of Y given X = x, and the density does appear in the variance term -- which is natural, because if the density is low you don't have much data there, so you'd expect the variance to be high. But the key thing is that nowhere in the constant do we get that p' term -- there are no first derivatives of p -- and the formula holds at the boundaries as well. So we've cured the boundary bias and the design bias, very simply; it's really a cool thing. Well, there is a second-derivative term in the bias constant, but as written in the notes it says p'', and that's a typo -- it should say
m double prime. If you look at equation 18, we expect the second derivative of the regression function to be there, so the p'' in equation 18 should be m''.

This should be your go-to predictor. It's an interesting thing: in statistics it probably is the go-to predictor, but probably not so much in ML. It's the first thing I'd try -- it's easy, see if it works well. (Question: why is it called design bias? The arrangement of the X's is sometimes called the design, so that's why we called the p'(x) term the design bias. Maybe it's not a great term, but that's what it's called.) And the intuitive explanation for the boundary bias is this: suppose your data look like this. In the interior you're putting down a kernel centered at the target point, but when you're near the edge, the kernel gets chopped off -- you lose the symmetry of the kernel at the boundary. That's the intuitive explanation of the boundary bias. You can see this yourself: simulate some data that look like this -- this is a good exercise, and if this were a more applied class I'd tell you to do it; there's a sketch of it below. Fit a kernel estimator and you'll see the boundary effect; fit a local linear estimator and you won't. You'll actually see the difference.

Now, you might wonder why I stopped at linear. Can you do local quadratic estimators, local cubic, local quartic? Yes, you can. People will do local quadratic estimation in some cases for the following reason: there's another source of bias at peaks and valleys -- these estimators tend to underestimate peaks and overestimate minima. Most of the time that doesn't matter, but maybe you're doing a scientific application where you really do want to get the peaks right; then people use a local quadratic estimator. Beyond that I don't see much practical value in going to higher order -- you do start to add variance as you go up to higher and higher terms; there's no free lunch. So people use local linear, maybe local quadratic, and that's about it.

Oh, and by the way, despite how we derived it, what do we have here? A linear smoother -- thank you -- it's another example of a linear smoother, just with a different smoothing matrix. Any questions about local polynomial predictors? (That was just a random guy in the hallway.) Did any of you know Andrew Moore? Well, you should -- he's the Dean now; I forgot he wasn't at Google anymore. Andrew was a very early adopter, in machine learning, of local linear predictors, so there is precedent in machine learning for doing local linear prediction.
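Returning to the exercise mentioned above: a self-contained sketch that simulates data on [0, 1], fits the kernel (local constant) estimator and the local linear estimator at the right-hand boundary, and shows the difference. The curved trend, noise level, and bandwidth are arbitrary choices for illustration.

```python
import numpy as np

def nw(x0, X, Y, h):
    """Local constant (kernel) fit at x0."""
    k = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return k @ Y / k.sum()

def loclin(x0, X, Y, h):
    """Local linear fit at x0 (intercept of the kernel-weighted line centered at x0)."""
    k = np.exp(-0.5 * ((X - x0) / h) ** 2)
    D = np.column_stack([np.ones_like(X), X - x0])
    beta = np.linalg.solve(D.T * k @ D, D.T * k @ Y)
    return beta[0]

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 300))
Y = X ** 2 + 0.1 * rng.normal(size=300)       # a curved trend makes boundary bias visible

h = 0.1
print("at x = 1: truth 1.00, kernel fit %.2f, local linear %.2f"
      % (nw(1.0, X, Y, h), loclin(1.0, X, Y, h)))
# The kernel fit is pulled toward the interior at the boundary; the local linear fit is not.
```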
All right, Section 6 just repeats what I've already explained: what a linear smoother is, and that many estimators are linear smoothers. But I do want to mention one thing about linear smoothers that we'll find useful; let me write down the equation. (You're going to have to unlock yourselves -- imagine if there were a fire and everybody got trapped in here because I locked the door; I'd be in big trouble.) Linear regression, remember, looked like this: the predicted values are the hat matrix times Y. Does anybody remember what you get if you take the trace of the hat matrix? Basically the number of parameters -- call it d, the number of predictors. It's a measure of the complexity of the linear regression estimator: if you're fitting ten regressors, you get ten. So when people do linear smoothing, they take the trace of the smoothing matrix and call it the effective degrees of freedom. In a kernel smoother, one way to measure the complexity of your predictor is the bandwidth -- smaller bandwidths are more complex -- but another way people use is this effective degrees of freedom. You can check that it's always between 1 and n, and as the bandwidth gets smaller, the effective degrees of freedom increases. It's just another way to summarize the complexity of your smoother. I mention it because you will see it in the literature.
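A small sketch of that quantity -- nu = trace(L) for the kernel smoother -- showing that it grows toward n as the bandwidth shrinks (the helper and the toy data are the same illustrative choices as above, not from the notes):

```python
import numpy as np

def smoothing_matrix(X, h):
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)    # rows are the kernel weights at each X_i

X = np.sort(np.random.default_rng(0).uniform(0, 1, 200))
for h in (0.5, 0.1, 0.02, 0.005):
    nu = np.trace(smoothing_matrix(X, h))      # effective degrees of freedom
    print(f"h = {h:5.3f}   effective df = {nu:6.1f}")
```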
Any questions? All right, let's talk about another brand of predictors: penalized, or regularized, predictors. Here we minimize the sum of (Y_i - g(X_i))^2 plus lambda times some penalty functional -- call it J(g). Remember, if I just minimize the training error I'm going to overfit: I'll just interpolate the data. The way we got to kernel smoothing and local polynomial smoothing was to localize the training error; a different approach, probably one you've seen more often, is to add a penalty term that measures, in some sense, the complexity of the function, and minimize that. There's a tuning parameter lambda which plays the role of the bandwidth: it controls the complexity of the predictor.

Let me write down two penalties that are commonly used, in the one-dimensional case for the moment. For many years in statistics a very popular regularizer was the integral of (g'')^2 -- the idea being that if you try to overfit the data, g'' increases, so the penalty increases. Another is to use a reproducing kernel Hilbert space norm; that's reproducing kernel Hilbert space regression. Here g is an arbitrary function and I'm minimizing over all g, and as you prove in the homework, this is a doable minimization -- one of the homework questions is to show it for the reproducing kernel Hilbert space case -- even though in principle it's over infinitely many functions. In fact the (g'')^2 penalty is itself an RKHS penalty for a particular reproducing kernel Hilbert space, and it turns out that if you minimize with that penalty, what you get out is called a cubic spline: a piecewise cubic fit. Say the data look like this -- the solution turns out to be a smooth function; I'm trying to draw it to look smooth. It's piecewise cubic, but the first and second derivatives also match at the joins, which makes it very smooth. That's popular in some circles. I'm not a big fan of splines, because as soon as you leave one dimension it gets really messy to talk about them; I have colleagues who like splines, but I prefer local polynomial estimators.

What does the solution look like in general? I have a little discussion of splines for you in the notes, but let's talk about the reproducing kernel Hilbert space case. Suppose I use an RKHS penalty. Then, as you proved -- or will prove by tomorrow; it's not a very hard problem -- the solution has the following form. The fascinating thing is that I'm minimizing over all functions in the RKHS, which is a very large space, and at first sight it might seem impossible to do that minimization. But the solution ends up being m-hat(x) = sum over j of alpha-hat_j K(x, X_j): a sum of kernels centered at the data points. And the vector of alpha-hats looks pretty much like a least squares estimator -- actually more like a ridge regression: alpha-hat = (K + lambda I)^{-1} Y, where K is the matrix of kernel values. So the interesting thing about RKHS regression is that despite the abstract, infinite-dimensional nature of the minimization problem, the answer has a very simple form: a finite sum of kernels centered at the data points, with coefficients coming from what is essentially a linear (ridge) regression. And guess what that means -- the fitted values are a matrix times Y, so RKHS regression is another linear smoother, and if you plot the rows of the smoothing matrix they look a lot like ordinary smoothing kernels. So you can see the theme: there are many different approaches to nonparametric prediction, but many roads -- not all, but many -- lead to linear smoothers; they're just different flavors of linear smoother.

Now, I'll tell you that I have a preference for local polynomial or kernel smoothing regression, but I know that in the ML literature RKHS regression is probably more popular. Do whatever you want -- use whatever methods you like -- but I'll tell you why I think the RKHS approach has some disadvantages. First, there are really two things to pick: you have to pick lambda, but you also have to pick the kernel -- people usually use a Gaussian kernel, which is the most common choice and has its own parameter -- and the literature is fairly silent about how to pick both and how they interact, so it seems a bit more difficult to me. Second, it's not so easy to derive the bias and variance and all those properties of an RKHS estimator; that depends on the spectrum of certain operators, and it's much more complicated. I also think it's harder to understand what it's doing: to me, a kernel smoothing estimator is very simple -- if I want to predict here, I take an average of the points around me, and averages are simple -- whereas "do this minimization subject to the function lying in a reproducing kernel Hilbert space" is harder to picture intuitively, at least for me. And then there are things like boundary bias and design bias: nobody talks about them for RKHS estimators, but they're there, and it's not obvious how to fix them.
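A minimal sketch of the RKHS solution described above -- alpha-hat = (K + lambda I)^{-1} Y, with the fit being a sum of kernels at the data points. The Gaussian kernel, its scale, and lambda are illustrative choices.

```python
import numpy as np

def gauss_gram(A, B, scale=0.1):
    """Gram matrix of a Gaussian kernel between point sets A and B."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / scale) ** 2)

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 1, 150))
Y = np.sin(4 * np.pi * X) + 0.3 * rng.normal(size=150)

lam = 0.1
K = gauss_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # ridge-like system (K + lambda I)^{-1} Y

Y_hat = K @ alpha                                      # fitted values = K (K + lambda I)^{-1} Y,
                                                       # so this is yet another linear smoother
m_hat_new = gauss_gram(np.array([0.37]), X) @ alpha    # predict at a new point
```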
So personally -- and maybe I have a strong statistical bias here -- I find kernel smoothers and local polynomial estimators easier to understand and easier to deal with. But in the end, frankly, there's probably not a lot of difference between these methods, because they all involve a tuning parameter that trades off bias and variance, and the really important thing is that you do that well -- we'll talk about it when we talk about cross-validation. If I use a local polynomial smoother and you use a reproducing kernel Hilbert space smoother, but we both choose our tuning parameters carefully, we'll probably make pretty similar predictions. So to some degree it's a matter of taste and of what's convenient for the problem you're dealing with. (Question: could you do a localization and a penalty together? That's an interesting idea -- I've never thought of that; I'm not sure whether there's any advantage to it, but it's an interesting idea.)

So far everything we've talked about is a linear smoother. There are nonlinear smoothers: to give a couple of examples, wavelet smoothers are nonlinear, and so are trend filtering smoothers. So nonlinear smoothers do exist -- I don't want to give you the impression that everything is a linear smoother -- but the most commonly used ones are linear. (And yes, "linear" just refers to the fact that you can think of the smoother as a linear operator acting on the data when you predict -- that's correct.)

All right, any other questions? Section 8 I'm going to skip; all it says is that you can specify a basis ahead of time, like a Fourier basis, and regress on that basis. That's yet another way -- there are a million ways to do this smoothing. I want to talk instead about cross-validation, which is really important for all these methods. Let's check the time -- okay, we're good. No matter what method you use, you're going to have to pick some tuning parameter: the lambda in your regularized estimator, or the bandwidth in the kernel smoother. And of course we want to choose it to balance the bias and the variance somehow. What we're interested in -- let me write it as a function of h, assuming we're doing kernel smoothing with bandwidth h, though you could put any other tuning parameter in there -- is the true prediction error. We'd like to minimize some version of it, but all we have access to is the training error, and we know we can't just minimize that. What would happen if I chose the bandwidth to minimize the training error? I'd get zero, because I could make m-hat equal to Y -- just overfit it; the usual problem. So we usually use some version of cross-validation, and there are many, many flavors. One of the most common is leave-one-out, which looks like this: to get rid of the bias from overfitting, I take out the pair (X_i, Y_i), fit whatever predictor I'm using on the rest -- call it m-hat sub minus i -- compare its prediction to Y_i, and average over i. That's the usual leave-one-out measure of risk, probably one of the more common methods.
Here's a cool thing about linear smoothers: you can compute this without ever actually leaving anything out. This is not an approximation -- it's an identity: the leave-one-out score equals (1/n) times the sum over i of [(Y_i - m-hat(X_i)) / (1 - L_ii)]^2, where m-hat(X_i) is the original fit using all the data and L_ii is the i-th diagonal entry of the smoothing matrix. You'll prove this on the next homework; it's very simple. So one nice thing about leave-one-out is that even though in principle you're leaving points out one at a time, you never actually have to refit the data -- you just use this identity, which is pretty cool, and it's exact. That's a common method: you get an estimated risk and choose the h-hat that minimizes it. I have to say I don't know a lot of good theory for leave-one-out cross-validation. It's an almost unbiased estimator of the true risk, but the terms are all highly correlated, so it's a high-variance estimator of the risk. Still, it's easy and convenient, and most of the R packages that do nonparametric regression will automatically do some version of cross-validation -- I'm sure that's true of MATLAB too.

Now, there's an approximation to leave-one-out that simplifies things even further. What are these L_ii's? They're the diagonal entries of the smoothing matrix. Suppose that instead of using L_11, L_22, and so on individually, I replace every one of them by their average. What is that average? It's the trace divided by n, and we called the trace the effective degrees of freedom, nu (the Greek letter nu -- I'm not writing it very well). What that does is let me pull the factor out front, since it no longer depends on i: I get the training error divided by (1 - nu/n)^2. Be very clear about what I just did: I took the individual diagonal entries and replaced them with their average. There's some asymptotic justification for that, but the payoff is a simpler formula -- just the training error times a factor -- and it turns out to be a bit more stable, because if one of the L_ii happens to get close to one, the leave-one-out formula blows up, whereas this averaged quantity tends to be better behaved. In some types of regression it's also just easier to compute. It's got a funny name: generalized cross-validation. It should be called approximate cross-validation, but it's called generalized cross-validation. So whenever you hear the term generalized cross-validation, think of it as an approximation to leave-one-out. And when software spits out "cross-validation," you'd better check which type it is -- it might be leave-one-out, it might be generalized cross-validation.
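A sketch of those two risk estimates for any linear smoother with fitted values L @ Y -- the exact leave-one-out identity and its GCV approximation using nu = trace(L). The kernel smoother and the data are the same illustrative choices used above.

```python
import numpy as np

def loocv(L, Y):
    """Exact leave-one-out risk via the identity (no refitting)."""
    return np.mean(((Y - L @ Y) / (1.0 - np.diag(L))) ** 2)

def gcv(L, Y):
    """Generalized cross-validation: replace each L_ii by their average trace(L)/n."""
    nu = np.trace(L)
    return np.mean((Y - L @ Y) ** 2) / (1.0 - nu / len(Y)) ** 2

def smoothing_matrix(X, h):
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 200))
Y = np.sin(4 * np.pi * X) + 0.3 * rng.normal(size=200)

for h in (0.2, 0.05, 0.02, 0.01):
    L = smoothing_matrix(X, h)
    print(f"h = {h:4.2f}   LOOCV = {loocv(L, Y):.4f}   GCV = {gcv(L, Y):.4f}")
```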
And there's yet another version. Look at that factor: if I do a Taylor series expansion, 1/(1 - x)^2 is approximately 1 + 2x, and if I apply that here I get yet another approximation to leave-one-out, which I'll write as the training error plus 2 nu sigma-hat^2 / n, where sigma-hat^2, oddly enough, is again essentially the training error, used as an estimate of the noise variance. To be clear about what I did: I took the Taylor approximation, multiplied it out, and got this expression. Why would I do that? Does it look familiar to anybody? It certainly should to those of you who have taken 707: it's the training error plus a penalty term, and the penalty basically involves the effective degrees of freedom. How many of you have heard of Cp? No? Well, there's this thing called the Cp statistic that people often use in linear regression for choosing how many variables to put in the regression, and it has an intuitive feel to it: a training error plus a complexity term. If you try to overfit, the training error gets small but the effective degrees of freedom gets big, and vice versa. But its real genesis is that it's an approximation to generalized cross-validation, and generalized cross-validation is an approximation to leave-one-out cross-validation. So I've tried to unify them here: you'll see people in various places using these different versions of cross-validation, but they're really all morally the same thing -- different approximations to the same quantity. If you've never heard of Cp, this doesn't hurt you, but some of you will come across it, and it's just an approximation to that expression. And these are all still really one flavor -- sub-flavors of leave-one-out cross-validation; we haven't gotten to the other flavors yet.

I want to show you an example -- any questions so far? Here's a fun little example that I like because it's tough; if you want to test something out, this is a good one. Top of page 14: the function is known as the Doppler function. I'll do my best to draw it. It's a really hard function to estimate, even for a one-dimensional problem. I generated some data: I drew points at random, took the function, and added noise, and I got the data set on the right. What you see on the bottom left is the cross-validation score -- I think I used local linear regression -- plotted not against the bandwidth but against the effective degrees of freedom (you could do either one), and it's minimized around 166 effective degrees of freedom. That's almost like saying I'm fitting 166 parameters, which is pretty unusual for a one-dimensional fit, but this is a very, very complicated function. Look at the fit, though -- I'm also using this to illustrate another point. Do you like that fitted function? I do: given the complexity of the function, I think it does a pretty good job. But what you do see is a compromise: it fits pretty well in the middle, on the right I'd say it's under-smoothing, and on the left it's kind of over-smoothing. That's because we're choosing a constant bandwidth -- that's one of the things about linear smoothers, they're homogeneous. This is the kind of example where a nonlinear smoother might actually work better: you might smooth less on the left and more on the right. But it's a pretty special case -- a one-dimensional, high signal-to-noise problem, the kind of thing that comes up in signal processing. If you really want an example of a nonlinear smoother, you'll get one; I'll tell Ryan to do nonlinear smoothers when he lectures next week. (Question: how does the estimator behave outside the range of the data? Good question -- you should never extrapolate. That's one of the basic rules of statistics; you'll be excommunicated if you ever extrapolate beyond the data. The software will give you an answer, but don't trust it.)
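For anyone who wants to reproduce something like that example: the Doppler function in its usual Donoho-Johnstone form (I'm assuming the notes use this or a close variant), plus noisy data you can feed to any of the smoothers above and tune by cross-validation.

```python
import numpy as np

def doppler(x, eps=0.05):
    # Usual Donoho-Johnstone Doppler test function on (0, 1)
    return np.sqrt(x * (1 - x)) * np.sin(2 * np.pi * (1 + eps) / (x + eps))

rng = np.random.default_rng(4)
n = 1000
X = np.sort(rng.uniform(0, 1, n))
Y = doppler(X) + 0.1 * rng.normal(size=n)
# The function oscillates wildly near x = 0 and is smooth near x = 1, which is why a
# single constant bandwidth has to compromise between the two ends of the interval.
```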
So all of this has been leave-one-out cross-validation. Probably more familiar to you -- or at least more common -- is something like 10-fold cross-validation, which in my opinion is more reliable. Just to remind you how it works: you take the data and divide it into chunks -- people usually use 5 or 10 as a rule of thumb -- and you take one chunk at a time: leave out that chunk, fit your predictor on the rest, and compute the prediction error on the chunk you left out. Do that for each chunk, average over the chunks, and you get an estimate of the risk. I'm sure you've all seen some version of this before. It's a much more reliable estimate of the risk, which makes sense: instead of using one point at a time, you're using a big chunk of the data to estimate the risk. We'll talk about a theorem to that effect in a minute if we have time -- this is something where we can actually say something theoretically about how well it's doing.

You might wonder, though, why people use ten folds. I don't know that there's a reason; I think it just became a default. I actually found a paper the other day about finding the optimal number of folds. It's about 60 pages long, very complicated, with a million assumptions, and the answer -- the optimal number of folds -- not surprisingly depends on all kinds of things about the function that you'll never know; the final statement of the paper is essentially "so go ahead and use ten folds." So I think it's pretty hard to make general statements about the optimal number of folds in k-fold cross-validation unless you make a lot of assumptions, and in general we're not going to know much about the function.
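A minimal sketch of that procedure: 10-fold cross-validation over a grid of bandwidths for the kernel smoother. The fold construction, the kernel, and the grid are my own illustrative choices.

```python
import numpy as np

def nw_predict(x_new, X_tr, Y_tr, h):
    """Kernel regression predictions at the points x_new, trained on (X_tr, Y_tr)."""
    k = np.exp(-0.5 * ((np.asarray(x_new)[:, None] - X_tr[None, :]) / h) ** 2)
    return k @ Y_tr / k.sum(axis=1)

def kfold_risk(X, Y, h, n_folds=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    errs = []
    for fold in np.array_split(idx, n_folds):          # hold out one chunk at a time
        train = np.setdiff1d(idx, fold)
        pred = nw_predict(X[fold], X[train], Y[train], h)
        errs.append(np.mean((Y[fold] - pred) ** 2))
    return np.mean(errs)                               # average over the chunks

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 1, 500))
Y = np.sin(4 * np.pi * X) + 0.3 * rng.normal(size=500)

grid = [0.2, 0.1, 0.05, 0.02, 0.01]
risks = [kfold_risk(X, Y, h) for h in grid]
h_hat = grid[int(np.argmin(risks))]
print("10-fold CV picks h =", h_hat)
```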
But I do have a theorem here, which covers just one fold -- a similar theorem applies to 10-fold, but for simplicity imagine the simplest version of this: data splitting. You train on one half of the data, getting m-hat_h for a whole bunch of bandwidths -- say h ranges over some finite set H of bandwidths -- and you estimate the error of each on the other half; then you pick the h-hat that minimizes that estimated error. Here's the question: how close do you come to the best bandwidth? There is some h that minimizes the true prediction error; we just don't know what it is. Theorem 11 says that the final estimator -- the one with the bandwidth chosen by splitting -- satisfies, in expectation, a bound of the form ||m-hat_{h-hat} - m||^2 <= 2 inf_h ||m-hat_h - m||^2 + C log|H| / n. In words: its distance from the truth is at most a constant (I wrote 2, though it can actually be made closer to 1 -- really 1 + delta, with the other constant depending on delta; that would probably have been clearer) times the risk of the estimator you'd use if somebody told you the best possible bandwidth, plus a remainder. What I want to get across is two things. First, in practice I'll have to choose from some finite set of tuning parameters, a grid; luckily the size of that grid only enters logarithmically, so it doesn't have a big effect. (You can actually extend the result to infinitely many bandwidths -- it changes the proof technique -- but it doesn't matter, because that's the smaller term; it's not the bottleneck.) What's really, really important -- the hard part -- is the n: the price you pay for doing cross-validation, for not knowing the best bandwidth, for having to estimate the risk, is of order 1/n. That's really important, because 1/n is smaller than n^{-2 beta/(2 beta + d)}, which is the statistical risk we're already paying to solve the prediction problem itself. If the price were of bigger order than that, it would tell you that choosing the tuning parameter was screwing up the whole problem -- that it was the dominant part. This result says it's not: it's of smaller order. That's very important. And here's a really interesting thing: when we do classification, it's not 1/n, it's going to be 1/sqrt(n). Regression is special, and it's because of the quadratic loss function. (And yes -- this is relative to the best estimator in your class; we're trying to get as close as we can to the error of the best one.)

This is actually a remarkable result. I didn't include the proof; it uses tools we've seen -- concentration of measure -- but it's very, very long. Let me give you the reference, which I think is at the very last page of the notes: the book by Györfi, Kohler, Krzyżak, and Walk. The proof is in that book. It's a Springer book, and at CMU you can get the PDF for free now, which is great -- it has so much about the theory of regression; if you're interested in that stuff, it's a great book to download, and you'll see this theorem in there. If you go through the proof, it's long but not so difficult; there's some concentration-of-measure work, but we've done that, so it's all within the scope of the course. The takeaway message is that we're lucky: cross-validation works. We have a method of predicting, and we have a method of choosing tuning parameters. Any questions? (And no -- this is particular to regression; once you go to a different loss function, like classification, you may not get the same rate without additional assumptions.)

So here's what we'll do next week. I want to tell you some practical ways to deal with high-dimensional problems -- in principle any of these methods, the RKHS predictor, kernel smoothing, local polynomial, can be applied in any dimension, but you can have a lot of problems in high dimensions, so I want to talk about some alternative ways of dealing with that on Tuesday. Then we'll probably start nonparametric classification -- I'll put up some notes about that -- and, as I said, on Thursday there'll be a guest lecture by Ryan Tibshirani. Any last questions? Oh yes, good question: the project. I think your project proposal is due in about two weeks. Should you have started? Yes -- you should be thinking about what you want to do.
Don't leave it to the last minute -- I know everybody ignores that advice, I do too, but don't leave it to the last minute. What you should be doing right now is thinking about what possible topics you might be interested in and starting to look around for papers. Okay, all right, have a good weekend -- is this Thursday? -- I'll see you next week.
Info
Channel: Jisu Kim
Views: 8,881
Rating: 5 out of 5
Id: e9mN6UH5QIQ
Length: 78min 28sec (4708 seconds)
Published: Sat Jan 30 2016