Analysis of Discrete Data: Model Selection, Akaike and Bayesian Information Criteria

Captions
What we're doing is trying to find the sweet spot in the trade-off between bias and variance, which is another way of framing this problem: we want to use a model rich enough to capture all the true signal in the data, but not so rich that it overfits the data. If we use a model that is too simple, so that parameters that matter are forced to zero or close to zero, it won't generalize well to data we didn't use; in fact, it will already have poor prediction error on the data set we came in with. If we use a model that has a lot of parameters, we can expect the parameters to fit our data exactly, but as a consequence, when we look at new data and try to make inferences about it, the inferences will be very poor; equivalently, the model will generalize poorly to new data.

The idea was to add an extra term to the log-likelihood, our main tool of inference. The maximum likelihood principle says we have a function, the log-likelihood, which is a function of both the parameters and the data, and we should choose the parameters that make the data most likely. As I increase the number of parameters, the complexity of the model, this function will always go up; the data will always get more likely. So the likelihood is blind to the variance side of the phenomenon: it understands how to reduce bias, but not how to deal with variance, with overfitting. What we proposed is to subtract from the log-likelihood another term of the form f(n) g(θ): a function of the sample size (it's not immediately clear why it should be a function of the sample size, but I'll explain more about that in a moment) times a function of the parameters, which should measure somehow how big they are. Last time we talked about g(θ) being just the number of parameters, or the number of parameters divided by two, and f(n) being just 1 or log n. There is a lot of work on alternative ways of measuring the complexity of a model: you can measure how far this vector of parameters is from zero, and you have a choice of distances to use: L2 distance, L1 distance, or L0 distance. The L0 distance just counts the number of parameters that are nonzero, and you call that your penalty. But there are other choices you can make too, and those other choices have some advantages.

The two criteria we mentioned are the AIC, where you take the log-likelihood and subtract the number of parameters (another way of writing the number of parameters is the L0 norm, so AIC = ℓ(θ̂) − p), and the BIC, which uses a little more of this flexibility: the penalty is the number of parameters divided by two, times log n, so BIC = ℓ(θ̂) − (p/2) log n. The point is that while the likelihood alone will always naively tell you to go to a bigger, more complex model, these two give you a reason to stop: as you increase the complexity of the model, at some point the penalty term will overwhelm the likelihood term and the score will start to go down.
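As a minimal sketch of how these scores get used, here is the comparison in code, on the lecture's scale where bigger is better. (Most software instead reports AIC = −2ℓ + 2p and BIC = −2ℓ + p log n, where smaller is better; the rankings agree. The three models and their log-likelihoods below are made up for illustration.)

```python
import numpy as np

def aic(loglik, p):
    """AIC on the lecture's scale: log-likelihood minus number of parameters."""
    return loglik - p

def bic(loglik, p, n):
    """BIC on the same scale: log-likelihood minus (p/2) * log(n)."""
    return loglik - 0.5 * p * np.log(n)

n = 400  # sample size
# Hypothetical fitted models: (maximized log-likelihood, number of parameters).
models = {"independence":    (-520.3, 3),
          "one interaction": (-512.1, 4),
          "saturated":       (-509.8, 7)}

for name, (ll, p) in models.items():
    print(f"{name:16s} AIC={aic(ll, p):8.1f}  BIC={bic(ll, p, n):8.1f}")
# The richest model wins on raw likelihood, but the penalties reverse the ranking.
```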
So both of these give you an objective function that rises and then falls, and that lets you choose a best model. Of course, in the kinds of models we've talked about, the model space is not one-dimensional: we have a lattice of models, with a very simple model like independence at one end and more and more complex models above it, which can depend on each other, and each of these points gets a score. The idea is that for small sets like this we can just evaluate these two criteria at every point and pick the best value, and for big sets, when there are a lot of parameters and this lattice grows quickly, we have to have some heuristic procedure for searching it, because it would take too long to search over every possible node, every possible model.
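For a small lattice, the exhaustive evaluation just described is easy to write down. Here is a sketch on simulated data, scoring every subset of three candidate covariates in a logistic regression with statsmodels; the data-generating model is an assumption of the example:

```python
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                       # three candidate covariates
logit_p = 1.2 * X[:, 0] - 0.8 * X[:, 1]           # the truth uses only the first two
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

best = None
for k in range(4):                                # model sizes 0, 1, 2, 3
    for subset in itertools.combinations(range(3), k):
        design = sm.add_constant(X[:, list(subset)]) if subset else np.ones((n, 1))
        res = sm.Logit(y, design).fit(disp=0)
        if best is None or res.bic < best[0]:     # statsmodels BIC: smaller is better
            best = (res.bic, subset)
print("best subset by BIC:", best[1])             # typically recovers columns (0, 1)
```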
Okay, so stopping short of a full derivation, what I want to do is tell you a little bit about why these functions make sense. Subtracting the number of parameters is already something we do implicitly when we think about something like ΔG², because there we compare the change in G² to a degrees-of-freedom count, which is equal to the number of parameters; we compute the G² and ask whether it's significant. But there is a more precise way to justify it, which I'd like to talk about now. (Note that we're considering a penalty which is subtracted; if you exponentiated, it would be multiplicative, which is how you can connect this to priors; we'll get to the lasso and talk about that towards the end.)

So think of there being two likelihood functions: a true likelihood function and a fitted likelihood function. The true one, f, corresponds to the process generating the data, and the fitted one, f*, is our estimate of what the likelihood function is. We want to compare them, and viewed as functions of the data x, these are just probability distributions. One of the most common ways to measure the distance between two probability distributions is the KL divergence, which stands for Kullback-Leibler distance, or divergence. From a likelihood framework, it's the natural measure of difference between distributions; for example, we can think of maximum likelihood estimation as minimizing this divergence between an empirical distribution and a point on the model. So we want the KL divergence between the true likelihood and the fitted likelihood.

Suppose we have a finite set of models M1, M2, ..., Mk. For example, these might run from the independence model up to the saturated model, or you can think of them as intercept only, intercept plus β1, and so on. For model j, I pick the model, I fit the parameters, and I get out a fitted probability for each outcome in the data, or equivalently a likelihood function. So our model says that the probability of an outcome x (you can think of x as being a single data point or as being the whole data set) generated by the likelihood function is f_j(x; θ*), and it has a log-likelihood, which is usually more convenient to work with, that looks like a sum over all observations of log f_j(x_i; θ*).

Our loss function, how far we've gone wrong, is the Kullback-Leibler divergence between the true distribution and the estimated distribution, the one we've figured out from our model. And we can think about our goal now as follows. The first-order goal is: I have an empirical distribution, I try to fit it with the estimated one, and I can always get a better fit by making my model bigger. But now I want to think about the situation before I've rolled any dice or thrown any darts, before my experiment: there is the true distribution, and there is the estimated distribution, which is a random variable that will change depending on the outcome. What I'd like to do is minimize the expected difference. This whole loss D(f, f*) is a random variable, because f* depends on the data, which in turn depends on the true distribution; so I want to minimize the expected loss with respect to the true distribution: minimize E[D(f, f*)]. This is a new idea in the sense that it can take into account both bias and variance: if I have a procedure that overfits, I get penalized for it, because the difference between the true distribution and the fitted one will be big; the fit will be tied exactly to the data instead of to the best estimate. So this is a fairly reasonable strategy for baking in that penalty.

If I calculate what that is: the definition of the KL divergence between f and f* is a sum over the outcomes, D(f, f*) = Σ_x f(x) log( f(x) / f*(x) ). Because the log of a ratio is a difference of logs, this quantity breaks up into two pieces: D(f, f*) = Σ_x f(x) log f(x) − Σ_x f(x) log f*(x) = C − A(f, f*). The first piece C is a constant, and we know what this kind of quantity is called: up to sign it is the entropy of f, since H(f) = −Σ_x f(x) log f(x). Then I have left over this other piece, the adjustment, A(f, f*) = Σ_x f(x) log f*(x), and that's where something interesting happens, because it involves the estimated distribution. Fixing the true distribution f, I don't care about the entropy term, because it doesn't change when I change my model; only A can change. So if we want to minimize this expected risk, we should maximize the part that gets subtracted: maximizing A minimizes the total. Minimizing risk, minimizing expected error, works by linearity of expectation: the expectation of the difference is the expectation of the first piece, which is fixed given the truth, minus the expectation of A.
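A tiny numerical check of that decomposition, with two made-up distributions on three outcomes:

```python
import numpy as np

f      = np.array([0.5, 0.3, 0.2])    # "true" distribution (made up)
f_star = np.array([0.4, 0.4, 0.2])    # fitted distribution (made up)

D = np.sum(f * np.log(f / f_star))    # KL divergence D(f, f*)
C = np.sum(f * np.log(f))             # the fixed piece, equal to -H(f)
A = np.sum(f * np.log(f_star))        # the adjustment the model can change
assert np.isclose(D, C - A)           # D = C - A, as on the board
print(f"D = {D:.4f}, C = {C:.4f}, A = {A:.4f}")
```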
So minimizing this total error, the total amount by which we expect our fitted model to differ from the true model, is the same as maximizing E[A(f, f*)]. Now, if I wanted to plug this maximization problem into my computer and try to solve it, there's a problem: I can't do it. Why not? It's just a function of f and f*, and I have observed some data. I can't do it because I don't know the true distribution f; all I have is the data and my model. So what do we usually do when we don't know a distribution? We use a plug-in estimate; we do the best we can with the information we have. We estimate A using the data and the fitted model, essentially acting as if f and f* were the same. Of course this has problems, because we're using the model to evaluate itself, so this is a biased estimate; it's biased because it doesn't know which model is correct. But fortunately, with some work, which I won't repeat here but which you can read about, the bias can be shown to be approximately this quantity p, sometimes written |M_j|, the size of model j, which is equal to the number of parameters. You can think of the justification as being something like the way we get the chi-squared distribution asymptotically, with its degrees of freedom; it's a similar argument. So the idea is that the AIC is just the likelihood corrected for this bias: ℓ − p is an approximately unbiased estimator of the expectation of A, or at least maximizing the one will approximately maximize the other. I might have lost a constant somewhere, but solving these two maximization problems is equivalent, at least approximately, under certain conditions of smoothness and regularity of the model. And that's where this ℓ − p comes from: it comes from an argument that we ought to minimize the expectation of this difference between the true and fitted distributions; we make some approximations, and it turns out to be about that, which, if you're thinking about G² and degrees of freedom, is not a surprise.

(To questions from the audience: yes, this is the definition of D as a function, and I'm plugging in f*; no, it's not quite that, and I'll write up the full derivation with a little more care, it's not very hard. The idea is just to use a plug-in estimate. In other words, we estimate the expectation of A with respect to f by approximately Σ_x f*(x) log f*(x), which is, up to sign, the entropy of f*; but then you have to compute the bias, and that's the additional step.)

So what we're doing is minimizing the expected wrongness of the fit. Another way of saying it is that the AIC tries to choose the model with the highest expectation, with respect to the truth, of the log-likelihood with respect to the fitted model. The inside part, maximizing the likelihood of the fitted model, is just straight maximum likelihood estimation; what we're doing with the AIC is essentially adding a new expectation on the outside: what do we expect, given the truth, of that procedure?
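That bias claim can be checked by simulation. Below, an assumed categorical truth with four cells (so p = 3 free parameters) is fit by maximum likelihood many times, and the average gap between the plug-in log-likelihood and the expected log-likelihood of fresh data hovers near p:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.array([0.4, 0.3, 0.2, 0.1])   # assumed true distribution; p = 3 free parameters
n, reps = 200, 20000

gaps = []
for _ in range(reps):
    counts = rng.multinomial(n, f)
    theta_hat = np.maximum(counts / n, 1e-12)       # MLE, guarded against log(0)
    ll_plugin = np.sum(counts * np.log(theta_hat))  # log-likelihood at the fit
    ll_true   = n * np.sum(f * np.log(theta_hat))   # expected log-lik of new data
    gaps.append(ll_plugin - ll_true)

print(np.mean(gaps))   # approximately 3 = p: the optimism the AIC subtracts off
```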
One other observation to note: if you took our 501 course, or you know about linear regression, this AIC criterion is equivalent to something called Mallows' Cp; in other words, you can manipulate the AIC and get to Mallows' Cp. If you don't know what that is, don't worry about it, but remember it for regression.

Okay, so what's next: the other one, the Bayesian information criterion. Let's see how we can get to that: BIC = −2ℓ + p log n. Now, we don't do a lot of Bayesian estimation in this class, but I think it's important; it's a very important method for inference, and it's especially important in situations where the frequentist methods that we use don't work very well, when they hit boundary conditions or other issues, or when you don't have enough data, in the sense that you have way too many variables and your number of variables is bigger than your number of data points. It's also good when you just have a lot of knowledge about your data set that you need to incorporate. For example, if you're trying to estimate gene-gene or protein-protein interaction networks, which is a common pastime when doing omics of some kind, it often makes a lot more sense to use a Bayesian technique, because we have so much information: there's a whole body of scientific literature that says which interactions exist already, and you want to incorporate that prior knowledge somehow into your new test. If you just work blind, without using any of that information, it's very unlikely that you're going to get the right answer.

So what is the basic Bayesian idea, applied to model selection? We have some models again, and you should think about these as running from our independence model up to some more complex model, or as different choices of which main effects and which interactions to include in a logistic regression. And now we have two additional pieces of information. First, we have a prior on the models: a judgment, before we see any data, or based on previous data we've seen in different experiments, in the form of a probability distribution over which models are most likely. So maybe we think independence is very unlikely, the saturated model is very unlikely, and maybe we think it's quite likely to be one of the conditional independence models, for some reason. Second, we have a prior on the parameters of each model: a probability distribution on the parameter vector for a model, given that model. So I have a probability distribution over which model I should use, and then inside of that I have a further probability distribution over which parameters to use. We haven't seen much of this; we mentioned it a little when we talked about the beta-binomial at the very beginning. You could think, for example, of a Bernoulli setting: I'm flipping coins, with probability p of heads, and maybe I think that almost all the time the true probability is going to be close to one half, and it's very unlikely to be 1 or 0. That would be a prior on the parameter. A prior on the model would be, for example, whether I thought it more likely that the coins were fair or biased, or whether the coins were somehow related to each other.
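A toy sketch of that two-level prior, the "world draws a model, then parameters, then data" picture that comes up below. The model weights and the Beta(5, 5) parameter prior are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(2)

model_prior = {"fair": 0.7, "biased": 0.3}   # hypothetical prior over two coin models

def draw_dataset(n_flips=20):
    # 1. Draw a model from the model prior ...
    model = rng.choice(list(model_prior), p=list(model_prior.values()))
    # 2. ... then a parameter from that model's parameter prior ...
    p = 0.5 if model == "fair" else rng.beta(5, 5)   # Beta(5,5): mass near 1/2
    # 3. ... then generate the data.
    flips = rng.binomial(1, p, size=n_flips)
    return model, p, flips

print(draw_dataset())
```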
Then, combining those two pieces of information, we obtain the posterior probability of a given model: the probability of a particular model, one of my list of models, given the data. That's really what we're interested in for model selection: finding the most likely model given the data. But instead of the simple likelihood-function situation, this is a more complicated calculation. Just by Bayes' rule, P(M_j | x) = P(M_j) P(x | M_j) / P(x): the prior probability of the model, times the probability of the data given the model, which involves performing a fit, divided by the probability of the data. That quantity is proportional to P(M_j) P(x | M_j), because I can drop P(x); it's not part of my procedure, since I'm thinking of the data as fixed. And this in turn is proportional to the prior probability of the model times an integral: P(M_j) ∫ P(x | θ_j, M_j) P(θ_j | M_j) dθ_j, where P(θ_j | M_j) is my parameter prior. It's a very reasonable thing to do this kind of full Bayesian inference, where you always make these sorts of calculations, but it often becomes computationally very difficult: you need to run Markov chain Monte Carlo or other, sometimes fairly exotic, numerical integration techniques in order to solve it. You can think of the BIC as saying: I don't really want to do all of that nonsense, but I'd like to have a little bit of the Bayesian juice in my inference procedure; it's a shortcut for doing some of that.

If we did this, and we could compute the probability of a model given the data under assumptions on the priors, then we could compare two models very easily, once we've made all of those assumptions. So suppose I'm contrasting model M_j versus model M_l. We have what's called the posterior odds: P(M_j | x) / P(M_l | x), calculated as written over there, which by the proportionality above equals [P(M_j) / P(M_l)] × [P(x | M_j) / P(x | M_l)]. The first factor is the part that depends only on my beliefs before I began the experiment, the part I can compute ahead of time; the second is the effect of the data, the amount by which I adjust these odds after seeing the data. A lot of the trick of Bayesian methods that work well is making these kinds of updates happen in a clean way; that's the idea of the conjugate prior, and you can think of these quantities as something very similar to a likelihood. So then: if the posterior odds is greater than one, it favors M_j, and we choose model j; if it's less than one, we choose model l. And we have a very natural measure of uncertainty: by looking at this vector of probabilities, or at these ratios, we can quantify our uncertainty about which model is best.
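For the coin example, the marginal likelihoods have closed forms, so the data part of the posterior odds can be computed exactly. A sketch, assuming a Beta(1, 1) prior on the biased coin's parameter and comparing against a point-null fair coin; the flip counts are made up:

```python
import numpy as np
from scipy.special import betaln

def log_marginal_biased(k, n, a=1.0, b=1.0):
    """log P(x | M_biased): a Bernoulli sequence integrated against a Beta(a, b) prior."""
    return betaln(k + a, n - k + b) - betaln(a, b)

def log_marginal_fair(n):
    """log P(x | M_fair): every sequence of n flips has probability (1/2)^n."""
    return n * np.log(0.5)

k, n = 17, 20   # 17 heads in 20 flips (made-up data)
log_bf = log_marginal_biased(k, n) - log_marginal_fair(n)
print(f"data part of the odds, biased vs fair: {np.exp(log_bf):.1f}")
# About 44 here: with equal prior probabilities, the data favour the biased model.
```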
(To a question: yes, that's our prior; we have to start with a distribution on models. For example, thinking about the hierarchical models, say I have three candidate models; for simplicity, maybe I put one third of the probability on each of them and zero on everything else. That would be an example of a prior. I imagine that the world first draws a model, then draws parameters, then generates data. That's the point of view; it's not usually literally true that that's what happens, but it gives you a nice hierarchy of probabilities, and it often works very well in practice, because usually we do have some intuition or some information about the likely state of things. It's very popular in finance, for example, where there are reasonable models and unreasonable models, and it's too easy to overfit because there's tons of data. To another question: well, summation is just a special case of integration; when everything is discrete, integration becomes summation. But here we're integrating over θ. You would sum if you had discrete parameters, but θ_j is usually a vector of real numbers: if θ_j is in R^p, a p-dimensional real vector, this is an integral over R^p, which is why it's more natural to write an integral, although in a computer there are only finitely many floating-point numbers, so you could think of it as a sum over a very large set.)

Okay. So, given these fairly strong, and sometimes hard to make, assumptions, we can choose which model is best by comparing these values, by comparing the posterior odds. We have a portion that comes from our beliefs before we begin the experiment, and we have a portion that comes from the data. The portion that comes from the data is called the Bayes factor, and it reflects the contribution of the data to the posterior odds. And it's by zooming in on that Bayes factor that we can get to the Bayesian information criterion. If you think about a very simple prior, where I list out all possible models and give them all equal weight, the prior factor goes away, because it's always one tenth over one tenth, or one twentieth over one twentieth. Or if the models are very close together in prior probability, if I don't know much about which model is more likely, that factor becomes less important and the Bayes factor is the important part of the decision. So assume the prior over models is uniform, placing equal weight on every model: I don't claim to know anything, so P(M_j) is a constant, one over the number of models. Then it's all about choosing the largest marginal likelihood, which is very much like maximum likelihood, except that this thing is an integral: P(x | M_j) = ∫ P(x | θ_j, M_j) P(θ_j | M_j) dθ_j. There's the actual likelihood, and then this other term that incorporates the prior on the parameters of the model; the whole thing is basically a likelihood. To evaluate it you have to make some approximations, a Laplace approximation and some others; if you think this is interesting, I encourage you to look at the details, which we don't have time to get into. But before doing them: what we want is an estimate of the log of this quantity. That's going to be my score. If I have a whole bunch of these, I want to choose the maximum, so I want the maximum of the log of it; I've created a score for models, and I want to choose the model with the best score, and the log marginal likelihood is a perfectly good score. (Yes: M_j is implicitly in this; the form of the function depends on M_j.)
So what is this thing? By approximation, the log marginal likelihood ends up being the log-likelihood at the maximum likelihood estimate θ*, the maximizer, minus the number of parameters divided by 2 times log n, plus an error term: log P(x | M_j) ≈ ℓ(θ̂_j) − (p_j / 2) log n. I've skipped all the steps, but what that whole chain of argument spits out is this: the log marginal likelihood is the right score function to use; it may be hard to compute; but there's an approximation that works pretty well, and that approximation is exactly the BIC. (To a question: yes, p_j is the number of parameters in model j; in a regression, for example, the constant term, the coefficient that multiplies the one, would be one of those parameters.)

The idea, then, is that choosing the model with the best BIC approximately chooses, given that these approximations are valid, the model with the largest posterior probability, given all these assumptions. Let me write that down. One: we start with a prior over models, and a prior over parameters for each model. Two: we observe data, and we update, integrating that information using Bayes' rule. Three: we get out which model is most likely, given all those assumptions. And this is an honest probability: unlike a likelihood, it really is the probability of a particular model being the true model, under the assumptions. So this is the full Bayesian procedure, which is great to do but of course requires more work and more thinking; the BIC is an attempt at a shortcut, giving a similar answer by making certain approximations, like the Laplace approximation, and certain simplifications, like the prior over models being close to uniform. (To a question: I don't know how it relates to risk aversion or sensitivities; it's an interesting question; an econometrics or game theory text might be a good place to look.)

A couple of other observations about the posterior. The probability of a model given the data, if our assumptions are correct, is approximately P(M_j | x) ≈ exp(−BIC_j / 2) / Σ_l exp(−BIC_l / 2), where the sum runs over all the models. In other words, the approximation doesn't just help you pick the best model; it also gives you an estimate, given the assumptions that you make, of the probability of each model. This is different from a likelihood, which is not an honest probability; this really is a probability, with assumptions. So we get not just which model is best, but a certain point of view on how to measure the relative merits of models, to see when two models are close.
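In code, turning BIC values into those approximate posterior model probabilities is one line plus a numerical-stability trick; the three BIC values are made up:

```python
import numpy as np

def bic_weights(bics):
    """Approximate P(model | data) from BICs on the -2*loglik + p*log(n) scale
    (smaller is better), via exp(-BIC/2) normalized over the candidate set."""
    b = np.asarray(bics) - np.min(bics)   # shift by the min so exp() cannot overflow
    w = np.exp(-0.5 * b)
    return w / w.sum()

bics = [1041.7, 1038.2, 1049.5]           # hypothetical BICs for three models
print(bic_weights(bics).round(3))         # -> [0.148 0.849 0.003]: model 2 dominates
```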
A couple of practical points. The BIC is more asymptotically correct; its assumptions and approximations are less heroic; but it often chooses models that are too small. They're both fine to use, and both should be helpful, especially since they're so easy to calculate. As I mentioned, when you have a very small space of models, a finite space with, say, ten or twenty models, it's perfectly reasonable to just compute the AIC, BIC, and G² for every model in your class and pick the one you think is best; great if they all agree. When you have many variables, that becomes impractical. In fact, this model selection problem, under any reasonable definition, is an NP-hard problem, and so you need some way around it, some heuristic. You've already seen some of these. In a regression context, or a logistic regression context, you can do forward stepwise regression: you start with no covariates, just the intercept, and you add covariates, add more coefficients, one at a time; you have a fan-out of candidate models, or maybe you pick the one path you think is most reasonable for your data, and you stop when the AIC or BIC, whichever you've chosen, goes down or fails to improve. That's basically the same kind of procedure as saying: I'm going to stop when the increase is no longer significant from the perspective of G², the perspective of deviance. The other thing you can do is go backward: you start with everything, all your interactions, the saturated model, and then drop terms one at a time. How do you choose them? Usually by least cost: you look at dropping each coefficient, and you drop the one that costs you the least in terms of your score, or maybe even improves your score, if you're using the AIC or BIC. These aren't a solution to the model selection problem, because there's no guarantee that either of these procedures will choose the correct model, or even the best model, but they'll usually choose a pretty good one.
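A sketch of the forward version on simulated data, greedily adding the covariate that most improves the AIC and stopping when nothing helps. The data-generating model, with signal in columns 0 and 2, is an assumption of the example:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, d = 400, 6
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 1.5 * X[:, 2]))))  # truth: columns 0, 2

def aic_of(cols):
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return sm.Logit(y, design).fit(disp=0).aic     # statsmodels AIC: smaller is better

selected, best = [], aic_of([])                    # start from the intercept-only model
while len(selected) < d:
    score, j = min((aic_of(selected + [j]), j) for j in range(d) if j not in selected)
    if score >= best:                              # stop: no addition improves the AIC
        break
    selected, best = selected + [j], score
print("selected columns:", sorted(selected))       # typically [0, 2]
```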
As I said, there are other information criteria, other scores for models; in fact there's quite a large industry of producing those scores and trying to justify them. Some of that exists because there's some magic in the derivations I showed you, some asymptotic assumptions, and when those don't hold, it's of course not correct to use the AIC or the BIC; so there are other criteria, like the WAIC, the widely applicable information criterion, that apply in more situations, where the assumptions break down. And then there are also what you could call more ad hoc, although increasingly well justified, choices of the adjustment to your objective function, choices of penalty. We considered f(n) equal to 1 or log n, and we considered g(θ) to be the number of parameters; what are other choices? One choice is to take some scalar times the L1 norm of θ. If you have a vector θ living in some real vector space, there are different ways to measure the size of the vector, and they correspond to this norm notation: the usual Euclidean distance is called the L2 norm, written ‖θ‖₂; the L1 norm, ‖θ‖₁, is the taxicab distance, so in two dimensions it's this much across plus that much up, and in three dimensions it's this way plus this way plus that way to get to the vector; the L0 norm, ‖θ‖₀, is just the number of nonzero parameters; and the L-infinity norm is the maximum absolute value. People discovered that searching over the number of nonzero parameters gives us this NP-hard model selection problem and can be quite difficult to solve, but if you switch to this slightly strange L1 norm, measuring the sum of the absolute values of the parameters, it turns out the problem is often computationally easier. In other words, the problem of maximizing the likelihood minus some constant times the sum of the absolute values of the parameters, ℓ(θ) − λ‖θ‖₁, can be solved with convex optimization reasonably efficiently, and this leads to things like the lasso and LARS, the so-called L1 or sparsity-promoting techniques.

If you compare these penalties, there's a connection, made sometimes more and sometimes less formally, between the idea of a prior on the parameters and the choice of norm or loss function. In that connection, the squared L2 penalty roughly corresponds to saying my parameter has a Gaussian prior; using it as the penalty term in regression is called ridge regression. One of the features, or problems, of this penalty is what its derivative does close to zero: the derivative of θ² is 2θ, which flattens out as θ approaches zero. That means that when a coefficient is already close to zero, the reward for pushing it even closer to zero is very small, so the L2 penalty gives you a dense parameter vector: most or all of the coefficients will be nonzero, but they'll be forced to be small. The L1 penalty, by contrast, roughly corresponds to a prior with a big peak at zero, a Laplace prior, and its derivative near zero does not vanish; it stays at plus or minus one. So although it's not explicitly optimizing the number of parameters, it keeps pushing coefficients all the way to zero, and it tends to choose solutions with lots of exact zeros. For that reason, in practice, for large problems, the L1 penalty is much more common than the one that counts the number of parameters, simply because it's computationally a lot easier to work with. One other thing I'd say is that a typical way to use this is to estimate with one of these penalized fits to select the variables, and then redo the fit without the penalty, having thrown out the other variables, to get the actual parameter vector.
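A sketch of that contrast with scikit-learn's logistic regression, on simulated data with a sparse truth; the coefficients and penalty strength are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, d = 300, 20
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:3] = [2.0, -1.5, 1.0]                       # sparse truth: 3 of 20 matter
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.5).fit(X, y)

print("nonzero coefficients, L1:", int(np.sum(l1.coef_ != 0)))  # typically close to 3
print("nonzero coefficients, L2:", int(np.sum(l2.coef_ != 0)))  # all 20, just small
# Per the last remark above, one could now refit an unpenalized model on just the
# columns the L1 fit kept, to get less-shrunken estimates of those coefficients.
```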
Info
Channel: Linear Algebra
Views: 9,009
Rating: 4.8956523 out of 5
Keywords: YouTube Editor
Id: 75BOMuXBSPI
Length: 50min 4sec (3004 seconds)
Published: Fri Mar 18 2016