Statistical Methodology and Theory : Professor Richard Samworth, Cambridge

Captions
I was asked a couple of months ago to say something to the ATI doctoral training students; in fact I was asked to cover the whole of statistical methodology and theory in three hours, which did leave me slightly wondering what I'd do with the other two hours. There are many different approaches one could take; the initial one I took was "no way", but then I thought about it a little more and decided there is a story I can tell. Of course it's completely impossible to cover the whole of statistical theory and methodology in three hours, but rather than skating over everything in a wishy-washy way, I thought I'd follow one particular theme that has been very prominent in statistics over the last 15 or 20 years. This will take us on a journey from some very classical statistical models, in particular the linear model, which is at the core of statistics in many ways, through to high-dimensional statistics, so much closer to modern data science, and we'll talk about the lasso as a way of doing automatic, or simultaneous, variable selection and parameter estimation. So we're going somehow from the 1930s through to the 2000s, and I hope it'll be an interesting journey, but let's see.

First, I think we should congratulate the ATI staff on the wonderful job they have done improvising these boards at very short notice; I don't know how visible that is to anyone watching online, but it's quite a remarkable feat of engineering. I am going to be writing on the board, so if you think you can't see it from where you're sitting, you might want to think about moving.

The first hour or so is going to be very classical material, the sort of thing I would cover in a first undergraduate course in statistics, but it should get us all onto the same page, from which we can move on to more modern things. So I'm going to be talking initially about linear models, and let's start with some classical theory. The backdrop is that very often in statistics we're interested in understanding the relationships between different variables: in particular, between a response variable that we're especially interested in and one or more predictor variables, also called covariates, explanatory variables or independent variables; these things have lots of synonyms. We'll consider the situation where Y_1, ..., Y_n are independent responses, with Y_i normally distributed about mu_i with variance sigma^2, where mu_i can be written as the linear combination sum_{j=1}^p x_{ij} beta_j for known covariates x_{i1}, ..., x_{ip} and unknown regression coefficients beta_1, ..., beta_p. This is the linear model, usually written as Y = X beta + epsilon, where Y is my vector of responses, X is what I call my design matrix (an n x p matrix), beta is my vector of regression coefficients, and epsilon is my noise term, which has an n-variate normal distribution with mean vector zero and covariance matrix sigma^2 times the identity. I hope most of you, if not all of you, have seen this model before.
Classically, we assume that p < n and that X has full rank p, so the columns of X are linearly independent. This means that X^T X is positive definite: if z is a non-zero vector in R^p, then z^T X^T X z is just the squared norm of Xz, and since z is non-zero and X has full rank, this is positive. Let me write beta hat and sigma hat^2 for the maximum likelihood estimators of beta and sigma^2. Here's a quick warm-up exercise: what's the formula for beta hat? It's beta hat = (X^T X)^{-1} X^T Y; has everyone seen this formula? Good. The distribution of beta hat is N_p(beta, sigma^2 (X^T X)^{-1}), so this is an unbiased estimator of beta, and it turns out that this is the smallest possible covariance matrix you can achieve among all linear unbiased estimators of beta; that is what the Gauss-Markov theorem tells us. The second part of the exercise is to show that sigma hat^2 = (1/n) ||Y - X beta hat||^2, the squared length of the residual vector divided by n. And what's its distribution? It has a (sigma^2 / n) times chi-squared_{n-p} distribution. In the classical situation, where p is small by comparison with n, we know the expected value of a chi-squared_{n-p} random variable is n - p, so if p is small then sigma hat^2 is approximately, though not exactly, unbiased, and its variance goes to zero as n increases. The third, very important, property is that beta hat and sigma hat^2 are independent. I'm glad I'm starting from the beginning here. These are absolutely crucial properties of our maximum likelihood estimators.

Notice that the vector Y hat = X beta hat of fitted values satisfies Y hat = X (X^T X)^{-1} X^T Y, which I can write as P Y, where P = X (X^T X)^{-1} X^T is a matrix representing (the suspense will kill you) an orthogonal projection; in other words P^2 = P and P^T = P. Notice that I don't say it's an orthogonal matrix; it's definitely not an orthogonal matrix, but it represents an orthogonal projection, and it maps Y onto the p-dimensional subspace U = {Xb : b in R^p} of R^n, so P has rank p: the dimension of its image is p. In other words we have the following picture: here's my origin, here's my vector Y, and this is the subspace U. What I'm doing when I fit my linear model is finding the closest approximation to Y within this p-dimensional subspace; that closest point is P Y, or X beta hat. I'm trying to explain the variability in my data: I've got n responses, and I'm trying to explain that variability in terms of a smaller number, p, of variables, and the way I do it is to realise that the best fitted value I can hope for within this subspace is what I get by projecting the response onto it. Any questions about that?
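Before we move on to hypothesis testing, here is a minimal R sketch, on simulated data of my own, of the quantities just described: the maximum likelihood estimators and the projection matrix P. None of the names or numbers below come from the lecture.

```r
# A sketch on simulated data (all names and numbers are my own, not from the lecture):
# compute the MLEs beta hat and sigma hat^2, and the projection matrix P = X (X'X)^{-1} X'.
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)            # design matrix, full rank with probability one
beta <- c(1, 2, -1)
y <- drop(X %*% beta + rnorm(n))           # Y = X beta + epsilon, with sigma = 1

XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y         # (X'X)^{-1} X' Y
P        <- X %*% XtX_inv %*% t(X)         # orthogonal projection onto the column space of X
y_hat    <- P %*% y                        # fitted values X beta hat
sigma2_hat <- sum((y - y_hat)^2) / n       # MLE of sigma^2 (divides by n, not n - p)

max(abs(P %*% P - P))                      # P is idempotent ...
max(abs(P - t(P)))                         # ... and symmetric
cbind(beta_hat, coef(lm(y ~ X - 1)))       # agrees with lm() fitted without an intercept
```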
Now suppose we partition X as (X_0, X_1), where X_0 is n x p_0 and X_1 is n x (p - p_0), and correspondingly partition beta as (beta_0, beta_1), where beta_0 has p_0 components and beta_1 has p - p_0 components, and suppose we're interested in testing the null hypothesis that beta_1 = 0 against the alternative that it isn't. This is one of the most common things that practitioners want to do with their data: they want to know, do I need this variable in my model or not? So what we're going to study is how to assess whether or not we need a subset of the variables in our model. "Is age a relevant explanatory variable, a relevant covariate, in my model?" is a very common sort of question to ask.

Under H_0 we just have a smaller linear model, so the maximum likelihood estimators are beta_0 double-hat = (X_0^T X_0)^{-1} X_0^T Y and sigma double-hat^2 = (1/n) ||Y - X_0 beta_0 double-hat||^2. The fitted values satisfy Y double-hat = X_0 (X_0^T X_0)^{-1} X_0^T Y, which I'll call P_0 Y, where P_0 represents an orthogonal projection onto the p_0-dimensional subspace U_0 (don't be confused: there's a capital P_0 for the matrix and a little p_0 for the dimension). In other words, we can augment our earlier diagram of what happens when we fit linear models. Here's my origin again and here's my response vector Y; fitting my larger model amounts to projecting it onto U, while fitting my smaller model amounts to an orthogonal projection onto U_0, and the question is which of these models I prefer. What would give me evidence that the smaller model wasn't true? One thing I know is that projecting onto the smaller subspace has to leave me with a longer residual vector than projecting onto the larger one, so whether I should prefer the smaller model should be related to the relative lengths of those two residual vectors.

So let l_0 be the log-likelihood under the null hypothesis; I've also got a log-likelihood under the alternative hypothesis, which is very similar. It can be shown, and it's actually rather straightforward to show, that the likelihood ratio statistic for testing H_0, namely W_LR, defined to be twice the log-likelihood under the alternative evaluated at the maximum likelihood estimators minus the same thing under the null hypothesis, is a strictly increasing function of

F = [ ||P Y - P_0 Y||^2 / (p - p_0) ] / [ ||Y - P Y||^2 / (n - p) ].

This is the ratio of the squared lengths of two residual-type vectors, normalised appropriately: the numerator involves the vector from P_0 Y to P Y, and the denominator the residual vector from P Y to Y. So I'm saying that if the first of these is long relative to the second, that gives me evidence against the null hypothesis. Does that make sense? It's convenient to write it this way because those two vectors are orthogonal, and since we're in a multivariate normal world that means they're independent, so the numerator and denominator of this expression are independent. Who can tell me the distribution of this ratio under the null hypothesis? No, not chi-squared:
it's the ratio of two quantities that have (appropriately scaled) chi-squared distributions, but it is not itself chi-squared; the clue is in the letter I gave it: it has an F distribution. We therefore reject H_0 if F > F_{p - p_0, n - p}(alpha), the (1 - alpha) quantile of the F_{p - p_0, n - p} distribution.

Now I want to show an example, and I'm going to do it in the form of a demo in R. The data I want to look at is the cars data set, one of the built-in data sets in R, and just to emphasise how classical this theory is, we're talking about the initial speeds and stopping distances of cars recorded in the 1920s: initial speed in miles per hour and stopping distances, of course, in feet. There are just 50 cars, and I'm just going to plot the data so you can see what they look like. (The display settings aren't quite right on this screen, but I guess it'll do.) If you look at that data, your first instinct, I think, would be to fit a straight line through it. We're going to do that, but taking into account one fact that I claim is relatively clear: cars that start with zero initial speed don't take very long to stop; in fact they take zero distance to stop, so I know my regression line should go through the origin. I'll let x0 denote the first column of the cars data set, and my response y, the stopping distance, is the second column; then I fit a model of y against x0 without an intercept (that's what the minus one is doing). If I look at a summary of this model, I see that the coefficient, the gradient of the line, is about 2.9, and it's easy to superimpose the fitted values from this model on the plot.

That doesn't look too bad, but if you know something about the physics of car braking you might think, well, maybe I need a quadratic term in the model. How could we do that? I create a new design matrix X by putting together x0 and x0^2, so I get a 50 x 2 matrix, and I fit the model again without the intercept. Looking at the summary of this fit, I've now got a coefficient for the linear term and one for the quadratic term. It's not quite so easy to overlay the fitted values this time, but it's not too difficult: if I create a sequence going from, say, 0 to 30, then I can add a line to my plot by taking the first coefficient times x plus the second coefficient times x^2, drawn as a dotted line, and that's my vector of fitted values. So the question is: which of these models do I prefer? I've created this orthogonal projection here and that orthogonal projection there; which one do I prefer? The easiest way to get R to tell you is just to do an ANOVA test between them.
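Here is my reconstruction of the R demo just described; the exact commands aren't fully visible in the recording, so treat the details as approximate.

```r
# My reconstruction of the demo; the exact commands aren't fully visible in the recording.
data(cars)
x0 <- cars$speed                            # initial speed (mph)
y  <- cars$dist                             # stopping distance (ft)
plot(x0, y, xlab = "speed (mph)", ylab = "stopping distance (ft)")

fit0 <- lm(y ~ x0 - 1)                      # straight line through the origin ("- 1" drops the intercept)
summary(fit0)                               # gradient roughly 2.9
abline(fit0)                                # overlay the fitted line

X <- cbind(x0, x0^2)                        # add a quadratic term; still no intercept
fit1 <- lm(y ~ X - 1)
summary(fit1)
xx <- seq(0, 30, length.out = 100)
lines(xx, coef(fit1)[1] * xx + coef(fit1)[2] * xx^2, lty = 2)   # dotted fitted curve

anova(fit0, fit1)                           # F test comparing the two nested models
```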
Running that ANOVA tells me that my F statistic is 9.4, with a corresponding p-value of about 0.35%. So I'd certainly regard this as significant, and I should include the quadratic term in my model. If I want to compute this by hand, what should I do? I create the matrix P_0 = x0 (x0^T x0)^{-1} x0^T, and I do the same for P with the two-column design matrix. I know that my n is 50; my p_0 is, what, how many parameters are in my null model? One. And my p? Two. Now I compute the ratio. I don't trust myself with all the brackets, so I do the numerator and the denominator separately: the numerator is the squared length of (P - P_0) Y divided by p - p_0, the denominator is the squared length of Y - P Y divided by n - p, and the ratio of the two is my F statistic, which agrees exactly with what R gave us. If I want the p-value, I evaluate the upper tail of the F distribution with p - p_0 and n - p degrees of freedom at this value of F, and I get exactly the same p-value for the test. This sort of analysis is something that gets done, not on a daily basis, not an hourly basis, but second by second around the world; people are doing this type of analysis all the time. Any questions about it? No? Okay.
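To make the by-hand calculation concrete, here is a sketch in R of the projection-matrix computation just described; it assumes the objects x0, y and X from the reconstruction above, and the variable names are my own.

```r
# The same F test by hand, via the projection matrices (my own sketch; uses x0, y and X from above).
X0 <- cbind(x0)                                        # n x 1 design for the null model
P0 <- X0 %*% solve(t(X0) %*% X0) %*% t(X0)
P  <- X  %*% solve(t(X)  %*% X)  %*% t(X)
n <- 50; p0 <- 1; p <- 2

num   <- sum(((P - P0) %*% y)^2) / (p - p0)            # ||PY - P0Y||^2 / (p - p0)
denom <- sum((y - P %*% y)^2)    / (n - p)             # ||Y - PY||^2 / (n - p)
Fstat <- num / denom                                   # matches the anova() output
1 - pf(Fstat, p - p0, n - p)                           # p-value
```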
I want to finish off this chunk of classical theory with a lemma about the performance of my estimators, in particular the performance of beta hat. There are different ways you could measure performance, but two of the most obvious are prediction properties and estimation properties, so this lemma comes in two parts: the first is about prediction, the second about estimation.

(i) The expected value of the average prediction error on the observations satisfies E[(1/n) ||X beta hat - X beta||^2] = sigma^2 p / n. This does often get called prediction error, but in some ways it's more like a training error, because I'm computing the error in the fitted values at my observed x's rather than at a new x.

(ii) If I write Omega = (n^{-1} X^T X)^{-1}, with entries Omega_{ij}, then the expected L1 distance between beta hat and beta is E ||beta hat - beta||_1 = (sigma / sqrt(n)) sum_{j=1}^p sqrt(2 Omega_{jj} / pi). This is about the estimation error: how well beta hat is actually estimating the unknown parameter beta.

The proofs are fairly straightforward, so let's give them. Think about the distribution of X(beta hat - beta). We said beta hat has a p-variate normal distribution about beta, so subtracting beta we get beta hat - beta ~ N_p(0, sigma^2 (X^T X)^{-1}), and multiplying by X, X(beta hat - beta) ~ N_n(0, sigma^2 P). For the expected average prediction error I can take the 1/n out of the expectation, and then I'm just looking at the sum of the marginal second moments of this random vector, which gives (sigma^2 / n) times the trace of P. But remind ourselves what P is: it's X (X^T X)^{-1} X^T, and the trace is invariant to cyclic permutations of its arguments, so I can move the X over to the right-hand side, which leaves the identity matrix in p dimensions. So the answer is sigma^2 p / n.

For the second part, we can write beta hat - beta ~ N_p(0, (sigma^2 / n) Omega), by the way I've defined Omega. Since the expected absolute value of an N(0, 1) random variable is sqrt(2 / pi) (I'm being sloppy when I write it that way; I should really say the expected absolute value of a random variable having the N(0, 1) distribution, but I'm being lazy, so don't do this), the expected L1 distance is obtained by taking the standard deviation of each component, which is sigma / sqrt(n) times the square root of the corresponding diagonal entry Omega_{jj} of this covariance matrix, and multiplying by the expected absolute value of an N(0, 1) variable. That gives exactly what I claimed: (sigma / sqrt(n)) sum_j sqrt(2 Omega_{jj} / pi).

These properties will be useful to me later this afternoon when I come to the lasso and its theoretical performance; we'll see how they compare with what happens in the high-dimensional setting.
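Part (i) of the lemma is easy to check by simulation; here is a quick Monte Carlo sketch, with dimensions, seed and noise level of my own choosing.

```r
# Quick Monte Carlo check of part (i): E[(1/n) ||X beta_hat - X beta||^2] = sigma^2 p / n.
# (My own illustration; dimensions and seed are arbitrary.)
set.seed(2)
n <- 40; p <- 5; sigma <- 2
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
pred_err <- replicate(2000, {
  y <- X %*% beta + sigma * rnorm(n)
  beta_hat <- solve(t(X) %*% X, t(X) %*% y)
  mean((X %*% (beta_hat - beta))^2)                    # (1/n) ||X beta_hat - X beta||^2
})
mean(pred_err)                                         # close to the theoretical value ...
sigma^2 * p / n                                        # ... which is 0.5 here
```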
So that's really all I want to say about very classical linear models. We're now going to think about a situation, increasingly common in today's big-data era, where you collect more and more variables that might explain the variability in the phenomenon you're observing. Going back to the cars data: we only measured the initial speed, but you can easily imagine other factors that might influence the stopping distance: the road conditions, the weather conditions, the depth of the tyre tread. There are all sorts of factors that were presumably too difficult to record back in the 1920s, but nowadays it's so easy to collect and store data on previously unimaginable scales that we very quickly end up with huge numbers of variables that might be relevant for explaining the variability in the response, and at least initially we're not sure which ones we really need in our model. I want to argue that the rules of the game change somewhat when the dimension gets large: what was a good estimator in the low-dimensional setting is no longer good when the dimension is high. Just look at the ratio p/n in the lemma: in all sorts of modern data we might have p much bigger than n. With microarray data you might have many thousands of genes that could be relevant for explaining the variability in your response of interest, and only tens, or in the ideal case hundreds, of repetitions of the experiment, so this ratio is not going to be small. I want to argue that a sensible way to proceed in those situations is to shrink the maximum likelihood estimator towards the origin somehow, and the first step in this direction was ridge regression.

This is an idea that dates back to Hoerl and Kennard, in a Technometrics paper from 1970. I'm going to consider my linear model again, but parameterised slightly differently: I'll treat the intercept term separately from the other variables, writing Y = beta_0 1_n + X beta + epsilon, where 1_n is a vector of ones of length n, beta_0 is the intercept parameter, and all the other variables sit in X, with entries x_{ij}, where we assume that each column of X is centred, so that sum_{i=1}^n x_{ij} = 0 for all j.

Now suppose the columns of X are nearly linearly dependent, which is typically the case when p is nearly as large as n. (If p is bigger than n, the columns must be linearly dependent; I'm not necessarily going quite that far yet, but say p is half as big as n and n is big; then the columns of X will typically be very close to linearly dependent.) Then the determinant of X^T X, which is the square of the volume of the parallelepiped spanned by the columns of X, is small. That determinant is the product of the eigenvalues of X^T X, so at least one, and typically many, eigenvalues of (X^T X)^{-1} will be large. Since, as we recall, beta hat has (X^T X)^{-1} in its covariance matrix, this means at least one component of beta hat has large variance. In many ways the moral of high dimensions is that it's the variance that kills you: the estimator is still unbiased, but now the variance is really big. What people realised gradually over time is that by shrinking your estimator towards the origin you of course introduce a bias, but the hope is that this is more than compensated by a reduction in variance, thinking of the mean squared error as the sum of the squared bias and the variance. That is the idea behind the ridge regression estimator.

For lambda > 0, the ridge regression estimator beta hat^R_lambda (R for ridge, with lambda a tuning parameter) is defined, together with beta hat_0, as the minimiser of Q_2(beta_0, beta) = ||Y - beta_0 1_n - X beta||^2 + lambda sum_{j=1}^p beta_j^2. The least squares estimator, which is the maximum likelihood estimator in the case of Gaussian errors, minimises just the first term alone; here I'm adding a penalty that penalises large values of beta, so compared with the maximum likelihood estimator it shrinks my estimator towards the origin. This is called an L2 penalty (really an L2-squared penalty), and lambda controls the trade-off between fidelity to the data, the fit term, and the extent to which I shrink towards the origin. A really big value of lambda puts a huge penalty on large values of beta, so I do a lot of shrinkage towards the origin; conversely, a small value of lambda gives something very close to the least squares fit.

I think I'll pause there and we can have a ten-minute break or so.

So, this is the definition of the ridge regression estimator: it minimises this L2-penalised least squares loss function.
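Here is a minimal sketch of ridge regression on simulated data, using the closed-form expression (X^T X + lambda I)^{-1} X^T Y that we are about to write down; for simplicity I handle the intercept by centring the response, and everything here (data, names, the value of lambda) is my own choice.

```r
# Ridge regression via the closed form (X'X + lambda I)^{-1} X'Y derived just below.
# (A sketch on simulated data; the intercept is handled by centring, and all choices are mine.)
set.seed(3)
n <- 50; p <- 40
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)   # centred columns
beta <- rnorm(p)
y  <- 1 + X %*% beta + rnorm(n)
yc <- y - mean(y)                                      # centred response

ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(p), t(X) %*% yc)

beta_ls    <- ridge(0)                                 # least squares: p close to n, so very variable
beta_ridge <- ridge(5)                                 # shrunk towards the origin
c(sum(beta_ls^2), sum(beta_ridge^2))                   # the ridge solution has smaller squared norm
```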
One interesting thing to note is that we do not penalise the intercept coefficient: we only start the sum from j = 1, not j = 0. Why do we do that? To preserve location equivariance. This refers to the fact that it shouldn't matter whether I add a constant to all my observations and then compute fitted values, or compute fitted values and then add a constant to them; those two operations should commute, and if you have a location-equivariant estimator they do, and the way we've constructed it, the ridge estimator is location equivariant. However, the estimator is not scale equivariant: if I make a scale transformation to my data and then estimate, that's not the same as estimating and then making the scale transformation. The standard way of dealing with this is to normalise each column of X to have Euclidean length equal to the square root of n, the same Euclidean length a vector of ones would have.

A straightforward exercise is to write down the closed-form expression for the ridge regression estimator: beta hat^R_lambda = (X^T X + lambda I)^{-1} X^T Y. This is a rather similar form to the original maximum likelihood estimator; we've just added lambda I before computing the inverse. That has the effect of moving all the eigenvalues away from zero before taking the inverse, so it stabilises the matrix in some way.

Okay, so we've come up with an estimator and shown it has a closed form, but in what sense might it be a good estimator of beta? We've already seen the properties of the original maximum likelihood estimator. So here is Theorem 2: I'm going to assume for simplicity that X^T X is positive definite, and my claim is that for sufficiently small positive lambda, the mean squared error matrix of the ridge regression estimator is strictly smaller than the mean squared error matrix of the maximum likelihood estimator. I'm comparing a matrix with a matrix, and the ordering is the usual matrix ordering: I'm saying the second matrix minus the first is positive definite. This is somehow surprising, but it confirms that the trade-off, a reduction in variance against an increase in bias, can work in our favour, at least for sufficiently small lambda.

The proof is a rather direct calculation. First, the bias of the ridge regression estimator: its expected value is just a linear transformation of E(Y), and then I have to subtract beta; adding and subtracting lambda I inside, the bias is E(beta hat^R_lambda) - beta = -lambda (X^T X + lambda I)^{-1} beta. Moreover, for the covariance matrix: the estimator is just a linear transformation of Y, and Y is the only random thing here, so computing the covariance amounts to putting the matrix on one side, the covariance of Y in the middle, and the transpose of the matrix on the other side, giving (X^T X + lambda I)^{-1} X^T Cov(Y) X (X^T X + lambda I)^{-1}; and since Cov(Y) = sigma^2 I, this is sigma^2 (X^T X + lambda I)^{-1} X^T X (X^T X + lambda I)^{-1}. Now let me do this the other way around: I'm going to look at the right-hand side minus the left-hand side, putting all this together; this is the matrix I want to show is positive definite.
The mean squared error matrix of the usual maximum likelihood estimator is just sigma^2 (X^T X)^{-1}; from that I subtract the covariance matrix of the ridge estimator and the outer product of its bias with itself, which gives

sigma^2 (X^T X)^{-1} - sigma^2 (X^T X + lambda I)^{-1} X^T X (X^T X + lambda I)^{-1} - lambda^2 (X^T X + lambda I)^{-1} beta beta^T (X^T X + lambda I)^{-1},

and this is what I have to show is positive definite for sufficiently small lambda. Notice that of the three terms, the last two have the same pre-factor and post-factor, (X^T X + lambda I)^{-1} at either end, and I can factor those out provided I adjust the first term. Factoring out (X^T X + lambda I)^{-1} at each end, the middle becomes sigma^2 (X^T X + lambda I) (X^T X)^{-1} (X^T X + lambda I), from putting back what I took out of the first term, minus sigma^2 X^T X from the second term, minus lambda^2 beta beta^T from the third. The outer factors are positive definite, so all I have to show is that the matrix in the middle is positive definite. But that middle matrix is just a quadratic in lambda, and in fact a quadratic whose zeroth-order coefficient is zero, because the sigma^2 X^T X (X^T X)^{-1} X^T X term cancels with the minus sigma^2 X^T X term. Overall the middle works out as 2 lambda sigma^2 I + lambda^2 sigma^2 (X^T X)^{-1} - lambda^2 beta beta^T. So I've got one term that is linear in lambda and two lambda-squared terms, and for sufficiently small lambda it's the linear term, 2 lambda sigma^2 I, that dominates; that term is a positive definite matrix, so the whole thing is indeed positive definite for sufficiently small positive lambda.

I think that's a very interesting result, but it's important to understand its limitations as well as its attractions. The lambda I need in order for the ridge estimator to beat the maximum likelihood estimator depends on things I don't know: it depends on sigma^2 and on beta. And while ridge regression may help reduce the mean squared error, it doesn't help choose a model: all of the coefficients will still be non-zero in the estimate. Although we may initially want to consider many possible covariates so as to avoid modelling biases, there are several reasons for preferring a smaller final model. The first is that we may reduce the mean squared error: if we set coefficients to zero we know we get zero variance there, though of course that may not be the right thing to do. The second, an undoubted advantage, is greater interpretability: with fewer covariates in your model you can hope to understand the data-generating process much more easily. The third, which is probably a bit more controversial, is a belief, or at least a hope, in sparsity. A lot of the work in high-dimensional statistics was motivated by microarray experiments in genetics, and there you might hope, although it's not clear that this is in fact the case in practice, that a relatively small number of genes are responsible for the effect you're observing.
In many, many other contexts statisticians now exploit sparsity, but it's not so clear whether this is really a property of the true, unknown data-generating mechanism or just a convenient assumption for statisticians to use in their models. If we're honest, in many cases I think it's the latter; nevertheless, it is a very convenient assumption, and it's one we're going to work with.

So here are a few classical model selection strategies. The first is called best subset regression. We start with the intercept alone, beta hat_0, which is the mean of the Y's, and then for k = 1, ..., p we choose beta hat^(k) to minimise the residual sum of squares RSS(beta), the least squares objective function with the intercept term plugged in, over all beta in R^p with at most k non-zero entries. I can do this for each value of k, and for fixed k this gives, if we think of models with the same number of predictors as being on the same scale, the one with the best fit to the data. But then I have to choose k, and that choice can be made by minimising either AIC, which is minus twice the fitted log-likelihood evaluated at the estimator plus 2k, proposed by Akaike in 1974; he had the cunning idea of calling it "an information criterion", though it has since been renamed the Akaike information criterion. Or there is BIC, which is almost the same but with a stronger penalty: this is the Bayesian information criterion, due to Schwarz in 1978, and instead of penalising by twice the number of parameters in the model you penalise by k times log n. Pretty much any letter you put in front of "IC" nowadays, you'll find a paper proposing a model selection strategy based on some different penalty; that's a little bit facetious, but only a little. So what are the difficulties here? This strategy involves exploring all 2^p submodels, so it's only really viable when p is at most around 100; in fact, until recently I would have said rather less than that, maybe 30 or 40, but there have been some quite interesting advances in integer programming that mean we can go much further than we could before. Still, 100 is pretty limiting for a lot of practical applications, and even if we could double that to 200, 2^p grows very fast with p. So that's a strategy with its limitations.

Here's another strategy, called forward selection. This is a greedy algorithm: start with the intercept, and at each stage add the variable giving the largest reduction in the residual sum of squares, then refit the model with the new variable included. That's a sensible, naive strategy. It yields a sequence of models as we add more and more variables, and we have to decide when to stop; the traditional strategy is to do this purely on a marginal basis: you just ask whether the reduction in residual sum of squares in going from step k to step k + 1 is sufficient to justify including the extra variable in your model, and we can assess that using an F ratio, as I described in the previous hour.
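Here is a small sketch of greedy forward selection with an F-ratio stopping rule, on simulated data; it's my own illustration, not code from the lecture, and the 5% level and all names are my choices.

```r
# Greedy forward selection by largest RSS reduction, with the F-ratio stopping rule
# (my own illustration on simulated data; the 5% level and all names are my choices).
set.seed(4)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - 1.5 * X[, 3] + rnorm(n)

rss <- function(vars) {
  if (length(vars) == 0) return(sum((y - mean(y))^2))  # intercept-only model
  sum(resid(lm(y ~ X[, vars, drop = FALSE]))^2)
}

selected <- integer(0)
repeat {
  candidates <- setdiff(1:p, selected)
  if (length(candidates) == 0) break
  rss_new <- sapply(candidates, function(j) rss(c(selected, j)))
  k <- length(selected)
  # F ratio for adding one more variable to a model with k variables plus an intercept
  Fstat <- (rss(selected) - min(rss_new)) / (min(rss_new) / (n - k - 2))
  if (Fstat < qf(0.95, 1, n - k - 2)) break            # improvement not significant: stop
  selected <- c(selected, candidates[which.min(rss_new)])
}
selected                                               # ideally variables 1 and 3
```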
So this gives a sequence beta hat^F_0, beta hat^F_1, and so on (F for forward selection), and we stop when

[RSS_k - RSS_{k+1}] / [RSS_{k+1} / (n - k - 2)] < F_{1, n - k - 2}(alpha),

the upper-alpha point of the F distribution with 1 and n - k - 2 degrees of freedom; the numerator is divided by one because we are adding a single variable, and the denominator is the residual sum of squares in the larger model divided by its residual degrees of freedom. This is just the same F ratio I wrote down at the beginning, with one more variable being added. There are several things that make this not quite as satisfactory as one would like. One, of course, is that it's a greedy procedure, and as with all greedy procedures, if you go wrong at the beginning you have the potential to go really, really wrong. Another issue is multiple testing: you're doing many tests of this form and not accounting for that multiplicity in your inferential statements. And there's another drawback, which I'll mention in a moment, after I've talked about backward selection, my third traditional model selection strategy. This, as you can probably guess, involves starting with the full model and at each stage removing the variable giving the smallest increase in the residual sum of squares, then refitting the model, using an F ratio as above to determine when to stop. Again it's a greedy procedure, again it has a multiple testing issue, but there's a further problem, which is the following. You might think of this strategy: use forward selection until your F ratio tells you not to add any more variables; then, starting from there, use backward selection until your F ratio tells you not to remove any more; then use forward selection again, and keep alternating. The trouble is that with that procedure there's no guarantee you converge to any kind of optimum. So I don't find any of these three model selection strategies very satisfactory, but that was the situation we were in, in the mid 90s. Not much happened in the 80s: we've seen a couple of things from the 70s, and not much happened in the 80s.

Now we move on to what I think has been one of the most important developments in modern statistics; in fact, in some ways you could regard it as the birth of high-dimensional statistics. This is the idea of the lasso, due to Rob Tibshirani in 1996: the Least Absolute Shrinkage and Selection Operator, which rather conveniently abbreviates to "lasso". You might think he came up with the acronym first; who am I to say? So, Section 2.1: definition and introduction. Again there is a tuning parameter lambda, and I'll write L for lasso. The estimator is defined somewhat similarly to the ridge regression estimator, as the minimiser of a new objective function Q_1, which is my least squares objective function, normalised just for convenience in the usual way, but instead of penalising the squared L2 norm of the vector of regression coefficients I now penalise with lambda times its L1 norm. It looks like only a small difference, but it has a colossal effect in a certain sense.
Remarkably, it turns out that beta hat^L_lambda typically sets some, and often many, coefficients to be exactly zero. This is possibly the most important thing I'll say today: the lasso therefore facilitates simultaneous model selection and parameter estimation, and that is a really beautiful thing. I talked about those traditional model selection strategies, but I didn't mention yet another disadvantage of them. What often happens among applied statisticians is that they have a data set, they play around with several different models for their data, including terms, deleting terms, transforming variables, doing all sorts of things, and then once they've found a model they like and settle on, they report the uncertainty as if they'd never looked at another model in their lives. We all know this is not a valid way of doing statistical inference, and yet it happens absolutely routinely in applied statistics, and the disastrous fact about it is that you're erring on the wrong side: if anything your confidence intervals are going to be too narrow, because you're overconfident in your conclusions; you're under-reporting the uncertainty because you're not accounting for the model uncertainty. So the lasso has the very attractive feature of doing simultaneous model selection and parameter estimation. It remains a very difficult challenge, and one that has only very recently been addressed, to report uncertainty based on lasso-type estimates, but you've got a much better chance of adequately summarising your uncertainty than you have by the time you've transformed some variables, played around, fitted different models, plotted some graphs and used the graphs to inform which model to fit; then you really are in a disastrous position when it comes to formally quantifying your uncertainty. This is much more of a modern paradigm, and much more effective for high-dimensional data. If you've only got three possible variables to include in your model, it maybe doesn't matter so much if you play around with a few different models, but if p is in the thousands and you're playing around with tens of thousands of models, then you're really not doing justice to the problem if you report your confidence intervals as if you'd never looked at another model.

(Question from the audience: does the lasso behave better with correlated predictors?) What would you mean by "better": will it be easier for it to select a model when you have a lot of correlated predictors? If you've got highly correlated predictors, that's just a fundamentally challenging problem: you've got predictors that are virtually aligned, so how do I know whether this one should be in my model, or that one, or a linear combination? That's just a challenging problem, and I'll say a little more about it in the next hour: there's something called the elastic net, which involves a convex combination of an L1 penalty and a squared L2 penalty, and that's specifically designed for situations where you have highly correlated variables.

It's convenient to replace y_i by y_i minus y bar, so I'm just centring my responses; then the lasso estimator actually minimises a slightly simpler function which, with a slight abuse of notation, I'll also call Q_1(beta), over beta in R^p, with the same L1 penalty. We can also think of this in two other ways: it should remind you very much of constrained optimisation problems, where the penalty enters as a Lagrangian term.
I can think of the lasso either as minimising my least squares objective function subject to the L1 norm of beta being at most s, or, equivalently, as minimising the L1 norm of beta subject to my least squares objective function being at most t, for some s and t that are functions of lambda.

Let's see if we can get some intuition about what's going on with the lasso. For intuition, consider the orthogonal design setting, where (1/n) X^T X is the identity matrix. In particular this means p <= n, so it is definitely a very restricted situation, but it's one where we can write down the lasso solution explicitly, and that will give us some intuition about what's going on. In this case the original maximum likelihood estimator simplifies, because (1/n) X^T X is the identity, and writing Y hat = X beta hat I can simplify Q_1(beta) by subtracting and adding Y hat inside the squared norm: I get a ||Y - Y hat||^2 term, the cross term cancels, and the remaining term is a simple one involving ||Y hat - X beta||^2, and this is what I want to minimise over beta. Now, what can you tell me that's nice about this expression? (People online are shouting at the screen.) "It's convex": that's certainly true, but this is always convex in beta regardless of whether I have an orthogonal design, so what's special here? (Why don't I use the convexity? The convexity will guarantee that a solution exists. And what ensures it is convex? The least squares term is the squared L2 norm composed with a linear transformation of beta, which is a convex function, and I'm adding it to another convex function, the L1 norm of beta, so the sum of the two is convex.) The point is that the first term doesn't depend on beta, so in minimising I can ignore it, and the other two terms both decompose as sums over the components j = 1, ..., p: the squared error term decomposes coordinatewise because of the orthogonal design, and the L1 norm is also a sum over the components. What this is telling me is that it suffices to minimise over each component separately; in other words, I can solve p univariate problems, which is much, much easier than one p-variate problem.

So I can minimise over each component separately, and we get an analytic solution: the j-th component of my lasso estimator is the sign of the j-th coordinate of the maximum likelihood estimator times the positive part of its absolute value minus lambda, that is,

beta hat^L_{lambda, j} = sgn(beta hat_j) (|beta hat_j| - lambda)_+,

where the sgn function is 1 if its argument is positive, 0 if its argument is zero, and -1 if its argument is negative. Thus in this case the lasso is a componentwise soft thresholding estimator. If I plot this against the maximum likelihood coordinate: the line y = x is what I'd get with no shrinking, but what my lasso does is this: every coordinate whose absolute value is at most lambda gets shrunk to exactly zero, which is what gives me the model selection, while if the absolute value of the original maximum likelihood component is sufficiently large it doesn't get set to zero; it just gets soft thresholded, pulled towards zero by lambda in absolute value.
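As a tiny illustration (my own, not from the lecture), the soft-thresholding map can be written in one line of R:

```r
# The componentwise soft-thresholding map (a one-line illustration of my own):
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft_threshold(c(-3, -0.5, 0, 0.4, 2.5), lambda = 1)
# gives -2.0  0.0  0.0  0.0  1.5: small entries are set exactly to zero,
# and large ones are pulled towards zero by lambda
```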
For further intuition, consider the case p = 2, the simplest non-trivial example. I've got a beta_1 and a beta_2, and I'm going to compare the ridge regression solution with the lasso solution. With the ridge regression estimator, the contours of the constraint set are circles, and if this point is my maximum likelihood estimator beta hat, then the contours of the least squares objective function are ellipses, and the ridge regression estimator is found at the intersection of these contours; that's a standard fact about convex optimisation. Now think about the lasso solution. The contours of the L1 ball are like diamonds, and if here is my maximum likelihood estimator again with its elliptical contours, what you see is that there's a decent chance the intersection occurs at a corner of the L1 ball; but a corner of the L1 ball corresponds to a place where one of the coordinates is zero. So again it's setting some of the components to be exactly zero. Any questions? (What about the L0 norm? The L0 norm is not convex, so the contours aren't convex and you can't quite use this analogy, but in many ways you are using the L1 norm to approximate the L0 solution.)

Let me very quickly do an example to see this in practice, and then we can have another break. The usual way lasso solutions are computed in R is using a package called glmnet. I'm going to have 100 observations in a thousand dimensions, so it's a pretty challenging problem, and I'm going to construct my coefficient vector beta so that the first 10 variables are non-zero and the remaining p minus 10, that's 990, are zero. If I look at the first 20 components of beta I see ten non-zero values and then zeros from there onwards. Now I construct my design matrix, an n by p matrix of standard normal random variables, and my response y = X beta plus some error. Then I apply the glmnet function with inputs X and y, telling it I'm working with a Gaussian family; bear in mind we've got 100 observations, a thousand variables, and we're computing an entire path of solutions as lambda varies. That's a pretty challenging convex optimisation problem; let's see how long it takes, and this is a pretty crappy laptop. Not long, right? Let's see what it looks like; it isn't as readable as I'd hoped, never mind. The numbers on the plot are the indices of the individual coefficients, and perhaps it's just about clear that these ones are single-digit numbers: these are my signal variables coming out, and these other paths down here, in a complete mess, are all the rest. In particular, this is coefficient one; if I go back and look at what it was when I generated it, it was both large and positive, 3.17, so it's not surprising that its path comes out large and positive. Let's find another one: 8.32, which should be number five; I can't quite read the labels well enough to be confident, but if my screen resolution were better you could see what the numbers are. Now, this was a relatively easy problem in the sense that the signals were strong: I gave them standard deviation five.
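Here is my reconstruction of the glmnet demo; the random seed and the exact way the ten signal coefficients are drawn aren't visible in the recording, so those details are guesses.

```r
# My reconstruction of the glmnet demo; the seed and the way the ten signal
# coefficients are drawn are guesses, since they aren't visible in the recording.
library(glmnet)
set.seed(5)
n <- 100; p <- 1000
beta <- c(rnorm(10, sd = 5), rep(0, p - 10))   # 10 signal variables, 990 noise variables
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% beta + rnorm(n))

out <- glmnet(X, y, family = "gaussian")       # entire path of lasso solutions over lambda
plot(out, label = TRUE)                        # coefficient paths, labelled by variable index
```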
Let's say I make the problem more challenging and take the standard deviation to be 0.5 instead. Now if I look at my signal strengths you see they're much smaller; again I compute my y in the same way, play the same game computing the solution path, and plot it, and now I see a very different sort of picture. I've got noise variables and signal variables entering the model together. I can still see the single-digit numbers, so the big signals are still coming out, but there's probably no point on this path, no single lambda I could fix, where I'd say "these are exactly the right variables", whereas in the previous setting the signal strength was large enough that there was a range of lambda values where I was getting all the signal variables and none of the noise variables. So that's just an easy problem versus a hard problem. After a break, we'll have a look at some of the theoretical properties of the lasso, and if I have time I'll say something about how you compute it as quickly as we just saw.

Okay, let's come back. I'm now going to talk about the theoretical properties of the lasso. Let's assume that Y = beta_0 1_n + X beta + epsilon, where each column of X is centred and scaled to have Euclidean length sqrt(n), and where epsilon has an N_n(0, sigma^2 I) distribution. I need some notation: for a matrix A in R^{n x p}, a vector b in R^p and a subset S of {1, ..., p}, let A_S denote the n by |S| matrix obtained by extracting the columns of A with indices in S, and let b_S in R^{|S|} be the vector obtained by extracting the entries of b with indices in S. Now I fix the particular S I'm interested in: S = {j : beta_j != 0}, the set of components where beta_j is non-zero, my signal set; I define my noise set N = {j : beta_j = 0}; and I let s (little s) be the cardinality of S.

I need one condition on my design matrix for the estimation and prediction properties. To study the estimation and prediction properties of the lasso we will work with the following, so-called compatibility, condition. It's a little bit hard to get your head round, but let me write it down and then try to say something about it. It says: there exists phi > 0 such that for all b in R^p with ||b_N||_1 <= 3 ||b_S||_1, we have

||b_S||_1^2 <= s ||X b||^2 / (n phi^2),

where ||X b|| is the Euclidean length of X b. That looks a little unpleasant, and it may not even be obvious that this is a condition on the design matrix, but it is. A remark: if we replace ||b_S||_1^2 by its upper bound s ||b_S||^2, that is, if we strengthen the condition, we get what's called a restricted eigenvalue condition; you see that term used quite a lot in the literature. The point is that without the restriction ||b_N||_1 <= 3 ||b_S||_1 we would be requiring X^T X to be positive definite, so this condition is saying something about the directions in which X^T X must not map things to zero. When p is bigger than n, I know X^T X can't be positive definite any more, but I'm asking for nice behaviour in certain directions; and these are not directions I know in advance, because I don't know what the noise set and the signal set are.
Nevertheless, this is an assumption I require on the underlying data-generating mechanism in order to prove something about the estimation and prediction properties of the lasso. The following theorem is a variant of a result in a well-known paper by Bickel, Ritov and Tsybakov from 2009, and it's pretty much the final word on the finite-sample estimation and prediction properties of the lasso. Assume the compatibility condition, and define lambda = A sigma sqrt(log p / n). My claim is that with probability at least 1 - p^{-(A^2/8 - 1)},

(1/n) ||X (beta hat^L_lambda - beta)||^2 + (lambda / 2) ||beta hat^L_lambda - beta||_1 <= (3 A^2 / phi^2) sigma^2 s log p / n.

That might look a little complicated, but let me try to explain what it's saying. It simultaneously controls the average prediction error on the observations and the estimation error in the L1 norm. I can cover up the second term on the left-hand side and I'm still bounding the prediction error by the right-hand side, and you should compare this with what we saw in the first hour, where for the linear model the expected average prediction error was sigma^2 p / n. Now the true dimensionality of my model is s, the number of non-zero coefficients, and I'm getting the same sort of sigma^2 s / n as before, but with an additional factor of log p. So the moral of this theorem is that you pay a price of log p for not knowing which variables are true signal variables and which are noise variables. Because the logarithm grows only very slowly with the dimension, we typically think of this as a relatively small price to pay, and you might as well include a lot of variables that you think may be relevant in advance. It's a similar story for the estimation error: because I've defined lambda this way, dividing through by lambda / 2 I can cancel a square root of log p and a root n, and I get a very similar story to the low-dimensional linear model, paying a logarithmic penalty for my absence of knowledge of which variables I need in my model. That's one of the most important messages to take away about the lasso.

There's a lot I really like about this result, exactly as I've just explained; what are its limitations? One is, as I've already said, that you can't check the compatibility condition in practice: you don't know from your data whether it holds, so you have to be willing to assume it in order to get this probability bound. Another is that it's not a completely data-driven procedure, in the sense that the lambda I'm choosing depends on something I don't know: it depends on sigma, the noise variance or noise standard deviation. That's not too serious; it's something you can estimate, and with the square-root lasso you can do things a little differently and handle it simultaneously, so that's not really too much of a problem. And then the other point is that the pre-factor, the constant, could be pretty large: I've got a 3, but in order to be able to conclude this with high probability I'm certainly going to need A^2 to be bigger than 8, so 3 A^2 is at least 24, and the phi^2 I can show is certainly going to be smaller than one.
Nevertheless, we tend not to worry too much about this constant factor, and we think of it as just paying a log p penalty for the absence of knowledge of the signal set. [Audience question] No, it's a scalar, yeah. Any other questions? Okay, so that's one of my big theorems. It's not that difficult to prove, actually; if I were teaching this at Master's level or something I would prove it, but it's probably not quite right for this audience. Okay, one big theorem deserves another. So now I want to talk about the variable selection properties of the lasso: when can I be confident that I'm recovering the true set of signal variables with my estimate, and not getting too many noise variables? So we now consider the variable selection properties of the lasso. This is a variant of a result by Martin Wainwright, again from 2009. For simplicity, consider the noiseless case where epsilon is zero, so I've got no noise at all; this is not essential, but it'll simplify things a little bit. And assume that X_S transpose X_S is positive definite. That is a very weak assumption; I certainly wouldn't want to assume X transpose X is positive definite, because that would rule out p bigger than n, but I'm typically thinking of the signal set being relatively small, and so having linearly independent columns amongst those guys is a pretty weak condition. So I'm going to define a vector Theta, which is X_N transpose X_S, times the inverse of X_S transpose X_S, times the vector of signs of beta_S. I'll say something about that definition in a moment. The claim is that if every entry of Theta is bounded in absolute value by one, and if the absolute value of beta_j is bigger than lambda times the absolute value of the j-th entry of the sign vector of beta_S transposed times the inverse of X_S transpose X_S, for every j in S, then there exists a lasso solution beta hat L lambda whose vector of signs agrees with the vector of signs of the true beta. Okay, so there are two conditions, but if those conditions are satisfied then I can find a lasso solution that on the signal set is giving me something non-zero with the correct sign, and on the noise set is giving me exactly zero. As a partial converse, if there exists a lasso solution beta hat L lambda whose signs agree with those of the true vector, then the L-infinity norm of Theta is less than or equal to one. So in the converse I'm not saying anything about the condition on the sizes of the beta_j's, but certainly the other one is satisfied. Okay, so now let's try and understand what these two conditions are saying. The second one is probably the easier of the two to understand: it's saying that on my signal set the non-zero coefficients need to be sufficiently large. Well, you should definitely expect a condition of that type; it's often called a beta-min condition in the literature. If I've got a variable that's non-zero but absolutely tiny, how can I realistically expect my lasso solution to know whether it's zero or non-zero? I can't, so I need a sufficiently strong signal to be able to get the right sign vector. And then there's the condition on Theta, and I'll write something about this in a moment, but it is somehow saying that no noise variable can be highly correlated with the signal variables, because if I've got noise variables and signal variables which are highly correlated, it's going to be extremely difficult for me to know which of them is really needed in my model, which of them is a signal variable and which of them is a noise variable.
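The two conditions in this result can be checked numerically for a given design and coefficient vector. Here is a short sketch, assuming Python; the function name is mine, and the exact scaling of the beta-min threshold follows the form written above (it would change under a different normalisation of the lasso objective):

```python
# Hedged sketch: checking the irrepresentable and beta-min conditions on a toy design.
import numpy as np

def check_sign_recovery_conditions(X, beta, lam):
    """Check the two conditions of the sign-recovery result for a known beta."""
    S = np.flatnonzero(beta)                 # signal set
    N = np.flatnonzero(beta == 0)            # noise set
    XS, XN = X[:, S], X[:, N]
    sgn = np.sign(beta[S])
    G_inv = np.linalg.inv(XS.T @ XS)         # (X_S^T X_S)^{-1}, assumed to exist
    theta = XN.T @ XS @ G_inv @ sgn          # Theta as defined above
    irrepresentable = theta.size == 0 or np.max(np.abs(theta)) <= 1.0
    # beta-min: |beta_j| must exceed lam * |j-th entry of sgn(beta_S)^T (X_S^T X_S)^{-1}|
    thresholds = lam * np.abs(G_inv @ sgn)   # G_inv is symmetric, so this is that row vector
    beta_min_ok = np.all(np.abs(beta[S]) > thresholds)
    return irrepresentable, beta_min_ok

rng = np.random.default_rng(1)
n, p, s = 50, 20, 3
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:s] = 1.0
print(check_sign_recovery_conditions(X, beta, lam=0.1))
```

In practice, of course, S is unknown, so this is a diagnostic for simulations rather than something you can do with real data.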
So those are the two conditions, and they're very intuitive conditions, I think, that I need to be satisfied in order for my lasso to select the right variables. And you should think of these conditions as being strong. You're asking for a very strong conclusion: you're getting every single sign right, all the non-zeros are declared to be non-zero, all the zeros are declared to be zero. That's a very strong conclusion, and you should expect to require very strong hypotheses for it to be true. So let me just give a remark about this. For j in N, so for my noise variables, Theta_j is the inner product of this vector of signs with the vector of regression coefficients obtained by regressing X_j on my signal variables. So the condition that the L-infinity norm of Theta is less than or equal to one is known as an irrepresentable condition; you see that term used in the literature a lot. We're saying that no noise variable is too highly correlated with the signal variables, again a very intuitive condition that we should expect to require in order to get the vector of signs correct. Any questions about that? So that's really all I was planning to say about the theoretical properties of the lasso. [Partly inaudible audience question about versions of the lasso for groups of variables.] Yes, yes, so there's a very well-known paper on the group lasso, where you expect sparsity at the group level. It's written by Yuan and Lin in 2006; it's had well over 3,000 citations, so it was a good idea. What you end up doing is penalising the L2 norm of the coefficients within a group, and that encourages group sparsity, so all the variables in a group are either selected or not selected. Okay, so the last thing I want to talk about today is computation. We saw that algorithms can compute the lasso extremely quickly nowadays, so how are we doing this? These days lasso solutions are usually computed via coordinate descent. This is a general way of minimising functions f from R^p to R of the form f(x) = g(x) plus the sum over j of h_j(x_j), where g, from R^p to R, is convex and differentiable, and each h_j, as a function on R, is convex. So this is exactly the situation we've got with the lasso: we've got a smooth convex function, the least squares objective, and then we've got one that's not smooth, the L1 penalty. I'm not insisting that h_j is smooth, but it is convex, and it decomposes as a sum of additive terms. So here's the general strategy. We start with an initial guess x^(0), for instance x^(0) = 0, and then for m = 1, 2, and so on, coordinate descent is a rather simple algorithm: at the m-th iteration we set x_1^(m) to be the argmin over x_1 in R of f(x_1, x_2^(m-1), ..., x_p^(m-1)), and then for the second coordinate we define x_2^(m) to be the argmin over x_2 in R of f, where now we plug in the first coordinate at its m-th value, not the (m-1)-th one, and keep the others as they were. And we keep updating the coordinates one by one, until at the last stage, when updating the p-th coordinate, every coordinate up to the (p-1)-th can be taken at its m-th iterate. So this is rather like Gibbs sampling. Who's seen Gibbs sampling? No? Anyone? If anyone did MCMC, then this should look sort of familiar.
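As a concrete illustration of this scheme for the lasso itself, here is a minimal coordinate descent sketch, assuming Python and the (1/(2n)) least-squares normalisation (an assumption on my part); each one-dimensional update is the soft-thresholding step mentioned again at the end of the talk:

```python
# Hedged sketch: cyclic coordinate descent for (1/(2n))||y - Xb||^2 + lam*||b||_1.
# A minimal illustration, not the optimised implementations used in practice.
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta0=None, n_sweeps=200, tol=1e-8):
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_sq = (X ** 2).sum(axis=0) / n            # X_j^T X_j / n for each column
    resid = y - X @ beta                         # maintained residual y - X beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):                       # update one coordinate at a time
            old = beta[j]
            # partial residual term: add back coordinate j's current contribution
            rho = X[:, j] @ resid / n + col_sq[j] * old
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            if beta[j] != old:
                resid -= X[:, j] * (beta[j] - old)
                max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:                     # stop when a full sweep barely moves
            break
    return beta
```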
Okay, so what can we say about its convergence? Well, Tseng in 2001 proved that if the set, which I'll call A_0, of points x where f(x) is less than or equal to f(x^(0)), is compact, then every convergent subsequence of the sequence (x^(m)) converges to a minimiser of f. In our case this set definitely is compact, because my objective function blows up to infinity in the tails, so the set where I'm smaller than wherever I start is certainly compact. And since my sequence is always improving the objective function, if A_0 is compact then the sequence is wandering around in a compact space, so every subsequence of (x^(m)) has a further subsequence converging to a minimiser of f. But that's very interesting, because it means in particular that if f has a unique minimiser at x*, say, then x^(m) converges to x*, so the algorithm converges as I'd like. The only issue is, say I've got two minimisers of this function f: what I can't rule out is the possibility that my algorithm alternates between being very close to one and being very close to the other. But if I've got a unique minimum, then this sequence is wandering around in a compact set, which means every subsequence has a further subsequence that converges, and if the minimum is unique, that further subsequence must be converging to that minimum. And it's a basic result from a first course in analysis that x_n converges to x if and only if every subsequence of x_n has a further subsequence converging to x. So that's a kind of nice result, I think. Okay, thus in order to compute the lasso we compute lasso solutions on a grid of lambda values, and where we start is at the smallest value of lambda that gives me a solution equal to zero; it's not hard to show that this smallest value is exactly the L-infinity norm of X transpose y, divided by n. And then I keep going down, getting these non-trivial solutions, down to some lambda_L. We start with beta hat L at lambda_0 equal to zero, because we know that is the solution there, and use the previous solution as a warm start. So once I've got the solution at a particular value of lambda, that's where I start when it comes to looking for the solution at a smaller value of lambda. This can be combined with an active set strategy, as follows. I initialise my active set A at the l-th grid point to be the set of coordinates for which my solution at the (l-1)-th grid point was non-zero. In step two I perform coordinate descent only on the coordinates in A, so I'm not cycling through all the coordinates, only a restricted subset of them, yielding a candidate solution which I'll call beta hat, where I pad out the values of beta hat off the active set with zeros. Having computed that candidate solution, I need to check whether it really is a solution, and I can do that. If any of you are familiar with convex optimisation, you'll know that the optimal solution is characterised by the KKT conditions, the Karush-Kuhn-Tucker conditions. It turns out, and it's not difficult to show, that the KKT conditions in this case amount to checking whether, for every variable, the absolute inner product of that column with the residual, divided by n, is less than or equal to lambda. So V is my set of violators: these are the violators of my KKT conditions. If V is empty, then I know I've got a solution and I can output beta hat; otherwise I update my active set by adding in the violators and repeat step two. So I go back and perform coordinate descent again, and then I check whether I've got violators.
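Putting the grid, the warm starts, the active set and the KKT check together, a sketch of this pathwise strategy (reusing lasso_cd and soft_threshold from the previous sketch; again an assumption-laden illustration rather than an optimised implementation, and the intercept is ignored) might look like this:

```python
# Hedged sketch of the pathwise strategy: decreasing lambda grid, warm starts,
# active-set coordinate descent, and a KKT check for violators.
import numpy as np

def lasso_path_warm_start(X, y, n_lambdas=50, ratio=0.01):
    n, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / n        # smallest lambda giving the zero solution
    lambdas = lam_max * np.logspace(0, np.log10(ratio), n_lambdas)
    beta = np.zeros(p)                           # solution at lam_max is exactly zero
    path = [beta.copy()]
    for lam in lambdas[1:]:
        active = np.flatnonzero(beta)            # initialise A from the previous solution
        while True:
            # step two: coordinate descent on the active set only, padded with zeros
            cand = np.zeros(p)
            if active.size > 0:
                cand[active] = lasso_cd(X[:, active], y, lam, beta0=beta[active])
            # KKT check: violators are coordinates outside A with |X_j^T residual|/n > lam
            resid = y - X @ cand
            grad = np.abs(X.T @ resid) / n
            violators = np.setdiff1d(np.flatnonzero(grad > lam + 1e-10), active)
            if violators.size == 0:
                beta = cand                      # no violators: cand is the solution
                break
            active = np.union1d(active, violators)
        path.append(beta.copy())
    return lambdas, path
```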
Now this is very quick, because the individual coordinate descent steps only involve a soft-thresholding operation, as we saw, once we fix every other coordinate. So that's why this algorithm is so fast, and we're able to compute solutions in the blink of an eye. Okay, any questions? Well, if not, I think I'll stop there. Thank you very much for your attention. [Applause]
Info
Channel: The Alan Turing Institute
Views: 8,829
Keywords: the alan turing institute, turing, science, data science, data, iMovie
Id: 5m3djtMGyHs
Length: 141min 1sec (8461 seconds)
Published: Fri Nov 11 2016