MLSS 2012: J. Cunningham - Gaussian Processes for Machine Learning (Part 1)

Captions
Can everyone hear me okay? Great. All right, we'll talk about Gaussian processes for the next couple of hours. I wanted to start by saying that Gaussian processes are a mathematical object with a great history in theory, and they are also something that has been used to good effect in quite a number of applications. We're not going to focus on either of those two pieces specifically, but rather on the piece in the middle: from a usability perspective, how can we as machine learners use Gaussian processes?

Here's how we're going to go about introducing Gaussian processes. I want to start by talking about Gaussians in general, in words and in pictures. This is just going to be an easy introduction, so we can think about what a Gaussian process actually is from an intuitive perspective. Then we'll go in and build out some of the equations, and we'll talk about using Gaussian processes in a basic regression setting. That will get us through about the first hour; then we'll take a quick break, come back, and think about moving beyond the basics of Gaussian processes: what kinds of things can we change, and how does that connect to some of the other technologies in machine learning that we've seen. That will just about do it.

All right, so what is a Gaussian as far as machine learning is concerned? This should tie into some of the notions of Bayesian inference that you've seen in the last couple of days. A Gaussian distribution is essentially a handy tool for Bayesian inference on real-valued variables. Here's a specific example that we're going to talk about throughout the course: I'm interested in measuring my heart rate. How might I approach this from a modeling perspective? I'm going to measure my heart rate at 7 a.m., and the fact that I index that measurement by time is going to be important; we'll see why in a moment. I'm a reasonably healthy guy, so a priori I have some belief about what my heart rate is going to be when I measure it at 7 a.m., maybe somewhere between fifty and sixty beats per minute, so I put some Gaussian prior density on that. Now I can go in on a particular morning and measure my heart rate; I measure it at 61. I can go in on a couple of other days and measure it three more times, so now I've got four noisy observations of my heart rate, measured on four different days. What the Gaussian, and the notion of Bayesian inference, allows me to do is take my prior, this grey distribution, and those four observations and do posterior inference. I can come up with a posterior p(underlying heart rate | the noisy observations I've seen), and that is again a Gaussian: I now have more confidence about where my heart rate is, and I see that it's in fact centered around, say, 60 to 62.
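To make that update concrete in code, here is a minimal NumPy sketch of conjugate Gaussian inference for a single real-valued quantity. The prior mean, prior variance, noise variance, and the four observations are illustrative values, not numbers taken from the talk.

    import numpy as np

    # Prior belief about the 7 a.m. heart rate (illustrative values).
    prior_mean, prior_var = 55.0, 25.0   # roughly "between 50 and 60 bpm"
    noise_var = 9.0                      # assumed measurement-noise variance

    # Four noisy measurements taken on four different mornings (made up).
    y = np.array([61.0, 59.0, 64.0, 62.0])
    n = len(y)

    # Conjugate Gaussian update: posterior precision is the sum of precisions;
    # posterior mean is a precision-weighted average of prior mean and data.
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + y.sum() / noise_var)

    print(post_mean, post_var)   # posterior is again Gaussian, centered near 61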
Okay, so let's take that univariate Gaussian up to multivariate Gaussians. We talked about measuring my heart rate at 7:00 a.m.; I could also want to measure my heart rate at 8:00 a.m., so that each observation is no longer a single real-valued variable but a pair of numbers, one at 7:00 and one at 8:00 a.m. To see how that changes things, we can't use the same univariate Gaussian; we need the multivariate Gaussian, which should be an object we're all familiar with by now. We'll conventionally draw the multivariate Gaussian as these ellipsoids of iso-probability. What is this distribution telling us? It says I now have some prior belief, not on a single heart-rate measurement, but on a pair of heart-rate measurements, and it shows some positive correlation: if I've got a higher heart rate at 7:00 a.m., I imagine it will be higher at 8:00 a.m. as well. Then we can do the same thing as before. Here's our prior; I go in and take four measurements on four different days, where each measurement is now a pair of numbers, at 7 a.m. and 8:00 a.m.; I use Bayes' rule in the same way; and I come up with a posterior, a refined belief about what my heart rate is at 7:00 and 8:00 a.m.

Now I'm going to take those two measurements, the one at 7:00 and the one at 8:00, and represent them in a slightly different way. This is the same data, four pairs of numbers, but now I've indexed it by time: the one red point we were looking at before, the pair of numbers at 7 a.m. and 8 a.m., is now placed on the time axis at 7 a.m.; still four data points. The natural thing you would want to do next is to ask: what if I measure at 9:00 a.m., what if I measure my heart rate at 10:00 a.m., and so on?
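Before generalizing, here is a small sketch of why that positive correlation matters: conditioning a bivariate Gaussian on the observed 7 a.m. value shifts and sharpens the belief about the 8 a.m. value. The covariance numbers are illustrative, not from the talk.

    import numpy as np

    # Illustrative joint prior over (heart rate at 7 a.m., heart rate at 8 a.m.).
    mu = np.array([55.0, 57.0])
    Sigma = np.array([[25.0, 20.0],    # positive off-diagonal: the two hours
                      [20.0, 25.0]])   # are positively correlated

    # Standard Gaussian conditioning on the observed 7 a.m. value:
    # mean_2|1 = mu_2 + S21 S11^-1 (x1 - mu_1), var_2|1 = S22 - S21 S11^-1 S12.
    x1 = 61.0
    mean_8am = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x1 - mu[0])
    var_8am = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

    print(mean_8am, var_8am)   # belief about 8 a.m. shifts up and tightens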
What this is getting at is that when we wanted to measure at 7:00 and 8:00 a.m., we used a bivariate Gaussian; to add 9:00 a.m. we might use a three-dimensional Gaussian, then a four-dimensional, then a five-dimensional Gaussian, when really what we care about, or what we might be interested in, is inferring the entire function over time. That's how we get to a Gaussian process intuitively: rather than having some finite set of Gaussian variables at the times we measure, we have, if you will, an infinite set of them, and that is what a function is.

Here's how we're going to represent that throughout the course of this lecture. Each of these curves in color is a single draw from a Gaussian process; each draw is a function. The way we'll represent the prior distribution is with a mean, the grey line here, and an envelope of two standard deviations that sits around it. What this is saying is that we imagine our average draw will look something like the mean, and the draws wiggle around inside that envelope. What this allows us to do, and this is really one of the key features of a Gaussian process, is the following: before, we were measuring rigidly at those hourly times, but now that we've got this nice infinite-dimensional object, we can measure at any time we want, and we'll get into the mathematical reasons why we're able to do this. So I can measure at a particular time: you can see that I've measured a data point at about 10:30, and what that has done, in the same way as in the fixed-dimensional case, is refine my prior into a posterior. It says that I believe my function is not quite flat any more but rather is pulled toward that data point, and around the data point I now have increased confidence. We can then take another measurement, and more measurements still, and you see what happens as we go through this process: as we get more and more data, we're scribing out this nice smooth underlying regression function, and we're getting more and more confident about the envelope around it. Okay, so the intuitive summary is as follows.
When we were taking single measurements at 7 a.m., we were getting a real-valued variable, and the univariate Gaussian is a nice distribution over real-valued variables. When we moved to pairs, or triplets, or what have you, the multivariate Gaussian allowed us to do that. And when we want an infinite number of real-valued variables, in other words a function over a real-valued input, that is what a Gaussian process allows us to do. If you take nothing else away from this lecture, take that away: a Gaussian process is effectively something that allows us to have a Gaussian distribution over an infinite number of variables.

What that drives us to is the notion of regression, so let's look at regression quickly and remind ourselves what it does for us. I'll put this picture up again: we've got all these blue data points that we've observed, and we believe there is some smooth underlying function that is the real description of what the data is doing. One thing regression is quite good for is exactly that, denoising and smoothing: we don't want to follow every little wiggle of these data points, but rather come up with some good description of what is noise and what is true signal. We also want to do prediction and forecasting. I've collected all this data, that's great, but now I want to know what my heart rate might be a couple of minutes after 9:00 in the morning; to do that you can query at that time and say, okay, I believe my heart rate should be centered at this point, with some variance envelope. Furthermore, and you've heard a bit about this from Peter and will hear more later, there are the dangers of parametric models. What I've done here is take this data and fit a quadratic to it. You can see that the quadratic has a reasonably good fit to the data, but it seems to miss some of the features, and that of course is because it's a fixed parametric model and can't respond to a lot of those features. Parametric models also get us into some dangerous places: because of the way this data was fit, there seems to be some magical point around 11 a.m.
where my heart rate peaks for the day and then falls off after that, and if you really take this too seriously and extrapolate, it has my heart stopping around dinner time. So there are some dangers to parametric models. Furthermore, overfitting and underfitting are always going to be a concern with regression, and we'll talk about how Gaussian processes deal nicely with that. Here you can see a model that is overfitting: it's just chasing all the little wiggles, all the little noise, in the data. Conversely, you can have a model that is underfit: we're still fitting the data, so to speak, but we seem to have missed a lot of the interesting structure.

All right, that's basically what I wanted to go through in this first section, Gaussians in words and pictures. Now we're going to fill that intuition in with some equations. I should say: please interrupt me throughout if you have questions; I suppose everyone's been doing that.

Okay, the multivariate Gaussian. This should be a review, I hope. We say that f, which is an n-vector, is normally distributed if it has the following density, parameterized by some mean vector m, which is an arbitrary n-vector, and some covariance matrix K, where the only constraint on that covariance matrix is that it is positive semi-definite. The shorthand we'll use throughout is that f is distributed normal with mean m and covariance K. As we said before, the loose definition of a Gaussian process is a multivariate Gaussian of uncountably infinite length: take that multivariate Gaussian vector and just make it longer and longer and longer, and what that gets to is a function. That is a very loose definition indeed. Here's a slightly more rigorous one: we say that f is a Gaussian process if, for any finite set of index points T, the vector f(T), the function f evaluated at those index points, has a multivariate normal distribution with mean m(T) and covariance K(T, T). I'm using t as real numbers here for familiarity with regression and time, but the domain can be of any dimension, any x in R^D, and we'll show an example of that later.

I rather breezed by a fact in this definition: what are those functions m and k? Let's talk about that now, because that's an interesting part of what makes a Gaussian process. The mean function, by analogy to the mean vector in the multivariate Gaussian case, where we said the mean vector can be just about anything, can be any function that maps index points t onto real values. Often in the Gaussian process literature, because you can mean-subtract your data and because it makes the notation easier, we just set the mean function to zero; in the modeling context, what often makes things most interesting is the kernel, or covariance, function. That covariance function is again a function that maps your input space onto real values, except that it takes a pair of arguments, and it can be any valid Mercer kernel, which connects to all the material on kernels that you've already seen.
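Restating the definitions from a moment ago as equations (the density is the standard multivariate normal form, written with the same symbols used in the talk):

    p(f) = \mathcal{N}(f \mid m, K)
         = (2\pi)^{-n/2} \, |K|^{-1/2} \exp\!\Big(-\tfrac{1}{2}(f - m)^{\top} K^{-1} (f - m)\Big),
      \qquad K \succeq 0

    f \sim \mathcal{GP}\big(m(\cdot),\, k(\cdot,\cdot)\big)
      \;\Longleftrightarrow\;
      f(T) \sim \mathcal{N}\big(m(T),\, K(T,T)\big)
      \text{ for every finite } T = \{t_1, \dots, t_n\},
      \qquad [K(T,T)]_{ij} = k(t_i, t_j)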
Concretely, a kernel is any function of two arguments that is a positive semi-definite function, in other words it needs to obey Mercer's theorem. Mercer's theorem is a very rich result from functional analysis, but whittled down to what we care about in this particular case, it says: if you take any finite subset of index points T and evaluate that function into a matrix K(T, T), in other words take all your time points, evaluate the kernel at all pairs, and build the results into an n-by-n matrix, then that matrix K will be positive semi-definite.

So, to summarize: a GP is fully defined by a mean function and a kernel function, together with the requirement that every finite subset of the domain has this consistent multivariate normal distribution, f(T) distributed according to the mean function evaluated at those points T and the kernel function evaluated at those points T. A couple of notes. First, this is something we can conceptualize pretty easily: I've got these two functions, the mean and the kernel, I can evaluate them to get a mean vector and a covariance matrix, great, and I can stipulate that those always define this Gaussian. But the fact that such a thing should exist as a valid mathematical object is not at all trivial, and furthermore the fact that this is a full specification, in other words that you give me one m and one k and that defines a Gaussian process uniquely, is not at all trivial either. One thing that is also not trivial, and is quite nice, is that most of the interesting properties we're used to when dealing with Gaussian variables, and we'll get into those in a moment, are all inherited.

Okay, so this kernel function is the only really interesting thing that doesn't just slot seamlessly into what we were talking about with Gaussians, so let's unpack it a bit more. The canonical example of a kernel function is probably the squared exponential kernel. I know it looks like a Gaussian, but ignore that for the time being; just consider it a kernel function of two arguments. What I want to do, just to make a very explicit connection between a kernel function and a covariance matrix, is evaluate this kernel function at a handful of points. To do that we choose some hyperparameters: you'll notice that I've slipped in a couple of new parameters here, and we call these hyperparameters because they live in the kernel. We've got two of them: l, the characteristic length scale, and sigma-squared-f, the variance or power of the kernel. So what we're going to do is just evaluate this. Let's say I take three index points in t: 7:00 a.m., 8:00 a.m.,
and 10:30; these are the three measurement times that we care about. Now, what are we asking? We're asking: what is the covariance between my heart-rate random variables at these times? How do I go about computing that? I take this kernel function, I take these pairs of points, I evaluate the kernel at all pairs, and I build the results into a matrix (there is a short code sketch of exactly this step below). What we can then do is change the kernel hyperparameters and see how the covariance matrix changes. Flipping back and forth: we had a length scale of 100, and now we go to a length scale of 500. On the diagonal we've still got the same values, but as we move away from the diagonal the correlation falls off much less quickly. What is that saying? With a higher length-scale value, the variables at 7:00 and 8:00 a.m. are very highly correlated, and as you get further away, to 10:30, the variable is not quite as correlated, but still quite highly correlated. On the other hand, if we make the length scale a smaller number, you see that the correlation drops off very quickly and this becomes nearly a scaled identity matrix; in other words, my heart rate at 7:00 a.m. is nearly independent of my heart rate at 8:00 a.m. We can also change sigma_f and see how that changes things: remember we were putting up that grey envelope around the GP mean; this is what changing sigma_f controls, so if we double it, the envelope doubles. That was just to tie in something that a lot of people get tripped up on when they're learning Gaussian processes, which is connecting the kernel function to covariance matrices. I'm going to repeat that a couple of times, because I think it's valuable to make that connection.

All right, an intuitive summary of GPs so far. GPs offer distributions over functions, and for any finite subset of index points we've got this normal distribution; you see that, as promised, I've dropped the mean function, and the covariance matrix K is calculated by just plugging the index points into the kernel function. To introduce some new notation: before, we were writing that f is normally distributed with mean zero and covariance K; you'll often see it written that f is distributed as a GP with mean function m and covariance function k. I mentioned that most of the important Gaussian properties we care about are inherited by Gaussian processes, so what I want to do is walk through a few properties of the Gaussian that are going to be very useful in the GP context for today's purposes. One is additivity: adding two Gaussians together gives you a Gaussian again, which will be nice for forming a joint. Next is conditioning on Gaussian random variables, which is important for inference. Then there is the ability to calculate expectations, which is going to be interesting for calculating posterior predictive moments. And finally, our ability to marginalize out variables that we don't care about. There are many other nice properties of the Gaussian, of course, but those are the ones we really care about today.
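Here is the kernel-to-covariance-matrix step referred to above, as a minimal NumPy sketch: the squared exponential kernel evaluated at the three measurement times, with the length scale varied to show the effect just described. Times are written in minutes after midnight, and the hyperparameter values are illustrative.

    import numpy as np

    def sq_exp_kernel(t1, t2, length_scale, sigma_f):
        # Squared exponential: k(t1, t2) = sigma_f^2 exp(-(t1 - t2)^2 / (2 l^2)).
        return sigma_f**2 * np.exp(-(t1 - t2)**2 / (2.0 * length_scale**2))

    # Three index points: 7:00 a.m., 8:00 a.m., 10:30 a.m., in minutes.
    t = np.array([420.0, 480.0, 630.0])

    # Build the 3x3 covariance matrix by evaluating the kernel at all pairs.
    def build_K(times, length_scale, sigma_f=1.0):
        return sq_exp_kernel(times[:, None], times[None, :], length_scale, sigma_f)

    print(build_K(t, length_scale=100.0))  # correlation decays noticeably by 10:30
    print(build_K(t, length_scale=500.0))  # long length scale: everything highly correlated
    print(build_K(t, length_scale=10.0))   # short length scale: nearly a scaled identity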
So let's first talk about forming a jointly Gaussian distribution. I've got some Gaussian prior on f, which I'm going to use throughout as our prior, with mean m_f and covariance K_ff. I've got some i.i.d. noise n that I add to that, and I let y = f + n. What is this saying? The underlying function that I care about, the underlying regressor, is f, and what I measure are noisy observations y, which are f plus this independent noise n. What's nice about this is that it allows us to form the joint distribution p(y, f). There are a couple of objects here we haven't seen before: K_ff we gave to you, m_f we gave to you, so you might ask what K_fy and K_yy are. You can just evaluate them: K_fy is this expectation, and in this case it equals K_ff, and K_yy in this case equals K_ff plus the noise covariance. The nice thing, connecting this back to the regression problem, is that the latent f, which we care about, and the noisy observations y are jointly Gaussian.

But wait a second: we just did all this with regular multivariate Gaussians, so where did the GP go? The point I want to make is that if f and y are indexed by some input points T, in other words if m_f is actually just some mean function evaluated at these n index points T and K_ff is a kernel evaluation at those points, then I could have just as easily written this with a GP prior f and an additive noise GP n, used this same additivity property, and then, when I wrote this specific Gaussian, simply indexed y at T and f at T. So you're starting to see one of the really nice features of the GP: all we need to do is bring in a finite set of index points, and then we're working with multivariate Gaussians. As a warning, because of this there is some overloaded notation here: people are generally pretty loose about the notation for f, which can either be infinite, in other words a GP, as when I write f ~ GP, or a finite multivariate Gaussian; it's generally pretty clear from context which is meant.

All right, the next property we care about is conditioning, that is, doing Bayesian inference. Here we've got our latent f and our noisy observations y, and we know that those are jointly Gaussian, distributed according to this joint distribution. Then we can do inference and say that the posterior of f given y is again a normal distribution. This is an important fact about Gaussian distributions, and it's a stock-in-trade manipulation; actually proving that it is the case is something I think everybody should do once, and only once. Once you've done it, just forget about it, because it's rather tedious, but it's a cool fact to know. A couple of things to point out here. We've got this mean and this covariance, and we can unpack them a little bit. One interesting thing is that the posterior mean is just a linear function of our observations y, which is nice to know. Further, this term K_ff is our prior covariance, the uncertainty that we had about the latent, and you see that we've subtracted this other term, which is essentially how much our data explains about the latent.
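Written out, the joint and the conditional just described are the standard Gaussian identities (with sigma_n^2 I as the i.i.d. noise covariance, so that m_y = m_f here):

    \begin{bmatrix} f \\ y \end{bmatrix}
    \sim \mathcal{N}\!\left(
    \begin{bmatrix} m_f \\ m_f \end{bmatrix},
    \begin{bmatrix} K_{ff} & K_{ff} \\ K_{ff} & K_{ff} + \sigma_n^2 I \end{bmatrix}
    \right),
    \qquad K_{fy} = K_{ff}, \quad K_{yy} = K_{ff} + \sigma_n^2 I

    p(f \mid y) = \mathcal{N}\!\big(
    m_f + K_{fy} K_{yy}^{-1} (y - m_y),\;
    K_{ff} - K_{fy} K_{yy}^{-1} K_{yf}
    \big)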
If our data tells us nothing about our prior uncertainty, then that subtracted term will be very small, and our uncertainty remains essentially K_ff. If instead our data tells us a whole bunch, that term approaches K_ff and our uncertainty decreases considerably. The main point, and you don't need to worry about parsing this too closely, is that inference of the latent given the data is simple linear algebra: we've reduced all the complexity of Bayesian inference, and all the problems sometimes associated with it, to a simple set of linear equations.

The next feature: we've talked about forming a joint and about doing inference, so now we can talk about calculating expectations. Again, the simple conditioning result gives us this fact, and what it lets us see, repetitive as it is, is that the expectation of f given y is simply that mean term: that is the MAP estimate, the posterior mean. There are a number of other moments we might be interested in as well. One other thing I want to bring up: so far we've looked at the posterior moments of f given y, the joint Gaussianity between the latent f and the noisy observations y. Instead, we can look at y, which is data that we've collected, and y*, which is data that we haven't collected; in other words, when I said I want to query what my heart rate is going to be a couple of minutes after 9:00 a.m., the convention in the literature is that y* is the data you want to predict. Those are also jointly Gaussian, so there is nothing different: we just use the same conditioning property to get the predictive mean and covariance.

The final property I want to explore is marginalization. Again we have these jointly Gaussian variables, and we can marginalize out the latent, because you might say: maybe I don't care about the latent function at all, I just want to know how well this Gaussian process model describes my data. To do that, for p(y) you integrate out f, and another nice property of Gaussians is that you can just read the answer off: y is distributed with mean m_y and covariance K_yy. This is nice because it gives us the data log likelihood, which is then useful for model selection, model comparison, and things like that. One note, because we'll come back to it: when I introduced those hyperparameters that live in the kernel, they were suppressed here, but the data likelihood is really p(y | those kernel hyperparameters), and this will be the basis of model selection, because we want to tune our data marginal likelihood based on what those hyperparameter settings are.
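For reference, the predictive and marginal distributions just mentioned have the same closed forms (standard Gaussian identities, with all covariances built from the kernel of y, i.e. including the noise term, and theta denoting the kernel hyperparameters):

    p(y_* \mid y) = \mathcal{N}\!\big(
    m_* + K_{*y} K_{yy}^{-1} (y - m_y),\;
    K_{**} - K_{*y} K_{yy}^{-1} K_{y*}
    \big)

    p(y \mid \theta) = \mathcal{N}\big(y \mid m_y,\; K_{yy}(\theta)\big),
    \qquad \theta = \{\ell, \sigma_f^2, \sigma_n^2\}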
Okay, so at this point, how are we doing? At this point you might be complaining, because you might say: I'm bored, all we've done so far is mess around with Gaussians, and I'm familiar with Gaussians; I thought we were coming to talk about infinite-dimensional probability distributions and interesting stuff. If that's your complaint, you're correct, and I'm sorry about that, but in fact this is the whole point: we take this beautiful mathematical theory, and when it comes down to actually dealing with these objects, it's simple linear algebra, simple inference on Gaussian distributions. What I want to convince you of is that even with that rather banal setup we can do some really quite remarkable things.

So now let's look at some of the remarkable things GPs can do. We've talked about Gaussians in words and pictures, and we've talked about some of the equations, so let's talk about using GPs in a regression context. Here is our example model, which we introduced through the equations section, and now we'll see what it can do. We say that f, our latent, is a GP with zero mean and some kernel k_ff, and the kernel has this form; this is the squared exponential. When we get to talking about kernels we'll mess with that, but for now just let it be. We say that y given f is our noise term; in other words, if I give you the latent function value f, the data that I observe is distributed with some independent noise on top of that, and that noise has its own kernel, the white-noise kernel. All this is saying is that the measurement noise I observe at two different time points is independent. What this allows us to do, again because of the additivity property, is add these and see that y is again distributed as a GP with some kernel k_yy, and you see the kernel functions simply add.

All right, so now we've got the probabilistic model, the distribution, fully specified; let's fill it in with some hyperparameters. I'm going to choose sigma_f = 10, which sets the standard-deviation envelope, a characteristic length scale of 50, and a noise power of 1. Let's look again at our visual representation of that. This is the prior on f; to connect it to the equations, we've got a mean function of zero, so the mean continues along zero everywhere, and we've got sigma_f of 10, so this is the two-standard-deviation envelope. Now we can go ahead and take draws from this Gaussian process. This is a single draw from that prior GP f, and hopefully this connects to the notion of how we can draw a function from a GP. The steps should be clear; this is the only code snippet I'll give you throughout the course of this, but it's only one line, and we'll unpack it. How do I actually get this draw? I take a whole bunch of index points, a finite number; here I took the integer index points between 0 and 500. I evaluate the kernel function, just like we did with those three index points, and build that into a 500-by-500 matrix K_ff, and then I take a draw from a Gaussian with zero mean and this covariance. How that's actually done, and this would be your MATLAB code, is according to this, and that procedure will give you this nice draw.
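The talk shows that draw as a MATLAB one-liner; a NumPy sketch of the same procedure might look like the following (the jitter added to the diagonal is a standard numerical stabilizer, not something from the talk).

    import numpy as np

    def sq_exp_kernel(t1, t2, length_scale=50.0, sigma_f=10.0):
        # Squared exponential kernel with the hyperparameters chosen in the talk.
        return sigma_f**2 * np.exp(-(t1 - t2)**2 / (2.0 * length_scale**2))

    # A finite set of index points: the integers 0, 1, ..., 499.
    t = np.arange(0, 500, dtype=float)

    # Evaluate the kernel at all pairs to build the 500-by-500 covariance K_ff.
    K_ff = sq_exp_kernel(t[:, None], t[None, :])
    K_ff += 1e-8 * np.eye(len(t))   # small jitter for numerical stability

    # One draw from N(0, K_ff) is one random function evaluated at these points
    # (equivalently: np.linalg.cholesky(K_ff) @ rng.standard_normal(len(t))).
    rng = np.random.default_rng(0)
    f_draw = rng.multivariate_normal(np.zeros(len(t)), K_ff)

    # Four draws give four functions, like the coloured curves on the slide.
    f_draws = rng.multivariate_normal(np.zeros(len(t)), K_ff, size=4)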
So that was one draw, shown in green; now we've taken four draws, and I'm belaboring this point just to remind ourselves that a draw from a Gaussian process gives you a function, so four draws give you these four nice functions. Before, when we were evaluating the kernel matrix, we messed around with the hyperparameters a little to see how that changed the covariance matrix; let's do that here in pictures. Here is sigma_f of 10 and a length scale of 50. If I leave the length scale at 50 but change the power, the envelope, from 10 down to 4, you see what's happened: the envelope has shrunk and the draws have shrunk, which is not surprising. If I change the length scale from 50 to 10, that is saying that as two points get further apart, their correlation falls off more quickly, so accordingly you get wigglier draws. One final point on this: it should feel a whole lot like when we were evaluating those covariance matrices; in other words, this case corresponds to a covariance matrix that is closer to the identity matrix than, for example, this one, which has much longer-range correlations.

Okay, so far we've been looking at regression in time and taking draws that are nice temporal functions, which are easy to look at and familiar to us. You can also have multi-dimensional inputs. One of the historical application areas where Gaussian processes have been used a lot is geostatistics, where people are often interested in spatial Gaussian processes, for example over latitude and longitude: instead of regressing in time, you want to regress on latitude and longitude. To do that, we make each input a pair of numbers instead of a single real value. So now f is the same GP, and the kernel is almost exactly the same, except that there is an extra squared term for each dimension of the input. What might a draw from that look like? Instead of a single function over time, what you get now is a field over latitude and longitude, and your random function is drawn over that field. A shameless plug: if you are staying for AISTATS and you want to see a multi-dimensional GP in action, we've got a paper on that, and I'm sure there will be other papers with GPs as well.

Okay, so we've got the same model we've been dealing with, and now let's gather some data. This is our GP model, we've got our prior, and we can go in and take a data point. Let's say I gather a data point at time 204. What I know is that I can evaluate y(204), and according to the model this is Gaussian distributed with mean zero and variance k_yy(204, 204); that's a simple univariate Gaussian, and I'm pulling this right out of the definition of the GP y, just evaluating the kernel. Then we can use conditioning to update the posterior. Here I've still got the prior p(f), but now I've got this data observation, so I can use the inference rule that we talked about: take this data point y at 204 and run it through the equations. What does that give me? It refines my posterior estimate of f: the mean function is no longer flat, but instead is pulled toward that data point.
Furthermore, because of our choice of noise parameters, you can see what has happened: right around 204 my covariance envelope has collapsed and I'm awfully sure that the measurement is there, but as soon as I get further away, I forget, because I don't think that what happens at 400 is particularly related to what happens at 204, so by the time I get over here I'm basically back to the prior. One small but important distinction, and it's worth pausing on for a second because when you look at GP work it's often not clear which one people are showing: here we were looking at the posterior, and this here is the predictive distribution. You'll notice only a very small change: the variance envelope has increased ever so slightly, because we think that on top of that posterior there is actually some measurement noise. So this is our belief about the predictive distribution.

Okay, so I had a point at 204; let's say I also gather another data point at 90. I can add that to the distribution in the same way, and you see what's happened: I've added this data point at 90, and the same thing occurs, the variance envelope decreases and the mean function changes. I can keep doing this and add more and more data points as I go, and this is what we get; again, this is the predictive distribution, it's Gaussian, and all I'm doing is making the vector of observations longer and longer.

This gets us to a question: I keep getting more and more data, but when do I get to my actual regression function? I can see that we're doing regression here, but when am I going to produce the parameters of my model? Of course, this is a nonparametric regression model, and that is one of the virtues of Gaussian processes: we're not going to just spit out a couple of parameters of a quadratic function; rather, as we gather our data, the GP regression gets more and more refined. So one of the benefits of GP regression is that it lets the data speak for itself. I suppose the downside is that all the data must be kept around: as this y grows and we collect more and more data, we're solving a larger and larger problem when we invert K_yy in that nice piece of simple linear algebra. You'll often hear the comment that nonparametric models have an infinite number of parameters, and I'd like to refine that slightly in our minds: rather than saying nonparametric models have an infinite number of parameters, say that they have a finite but unbounded number of parameters, and that number of parameters grows with the data. The way you can think about it here is that as the dataset gets larger and larger and tends towards infinity, yes, we have an unbounded number of parameters that can describe what our prediction is going to be, but that number is still just growing finitely with the amount of data that we get.
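Putting the pieces of this regression example together, here is a minimal NumPy sketch of the posterior and predictive computations just walked through, using the squared exponential kernel and the illustrative hyperparameters sigma_f = 10, length scale 50, and noise variance 1; the observed values at t = 204 and t = 90 are made up.

    import numpy as np

    def k_ff(a, b, length_scale=50.0, sigma_f=10.0):
        return sigma_f**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2.0 * length_scale**2))

    sigma_n2 = 1.0                              # noise power
    t_obs = np.array([204.0, 90.0])             # times where data was gathered
    y_obs = np.array([12.0, -5.0])              # observed values (illustrative)
    t_star = np.arange(0.0, 500.0)              # where we want the regression function

    K_yy = k_ff(t_obs, t_obs) + sigma_n2 * np.eye(len(t_obs))   # covariance of the observations
    K_sy = k_ff(t_star, t_obs)                                  # cross-covariance
    alpha = np.linalg.solve(K_yy, y_obs)

    # Posterior over the latent f at the query points (zero prior mean).
    post_mean = K_sy @ alpha
    post_cov = k_ff(t_star, t_star) - K_sy @ np.linalg.solve(K_yy, K_sy.T)

    # Predictive distribution for new noisy observations y*: add the noise back in.
    pred_var = np.diag(post_cov) + sigma_n2

    # post_mean +/- 2*sqrt(diag(post_cov)) reproduces the envelope that collapses
    # around the observed times; pred_var is the slightly wider predictive envelope.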
Okay, so we are almost through the basics. There's one more piece of the basics of Gaussian processes, and of using Gaussian processes for regression, that I want to talk about, and that is model selection, or hyperparameter learning. All throughout this we've been adding data points, looking at the predictive distribution and at posterior inference, and we've been doing it with a fixed model: f, this Gaussian process with K_ff according to this squared exponential function. I want to talk now about those hyperparameters, l and sigma-squared-f. Here we've got l = 50, and we've seen how changing that can change the fit that we get. If I make l quite a bit smaller, you can see that the GP is now overfitting the data: it's chasing each individual wiggle of these data points, it's forgetting very quickly, probably too quickly, such that it effectively says we know nothing about our inference at this point, when in fact this is probably a better description, where we believe the function should be around here. Conversely, if we have a length scale that's too high, then we're underfitting the data: here we've got a very confident prediction, but it's probably missing some of the interesting structure that exists in this data. So the question we want to address right now is: how can we tune, or integrate over, these hyperparameters l and sigma-squared-f, so that we take our data and get to the model that is just right?

There are two popular ways to do that, the first of which is to use the marginal likelihood. We talked about the marginalization property of the Gaussian, which says that we can marginalize out the latent function f and just look at the marginal likelihood of the data y. I said that this is actually hiding those extra parameters, which are now the parameters we care about, the hyperparameters, and we want to do model selection on them. Just looking at this, and particularly when you talk to people outside Bayesian machine learning, it's quite common for people to say that it's not obvious why this should do model selection automatically and not overfit or underfit the data. But it's right there in the math, so let's unpack it for a moment. Here is the log marginal likelihood of the data, and you can see that it has three terms. Let's ignore this one, which is just a normalizing constant. What do the other two do? This term is typically called a data-fit term, and this term is often called a complexity penalty. What you can see is that if we just consider sigma_f, the envelope of the Gaussian process: as sigma_f gets larger, the data-fit term improves, in other words more of the data gets inside that envelope, but as sigma_f gets larger this complexity term also scales up, so you pay a penalty there. These two terms, the complexity penalty and the data fit, are at odds with one another, and that gives you the automatic trade-off between overfitting and underfitting that happens in a Bayesian model. Unpacking this for the length scale is just a little bit trickier, but again it's simple linear algebra: it comes down to the volume of this ellipsoid, which gets larger as the covariance gets closer to white, in other words as you go to a shorter length scale, and that opposes the data-fit term as well. That's worth spending some time unpacking on your own.
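Here is that log marginal likelihood written as a small function, with the data-fit and complexity terms labeled; a minimal sketch, assuming the zero-mean squared exponential model with noise used throughout, and with made-up example data.

    import numpy as np

    def log_marginal_likelihood(t, y, length_scale, sigma_f, sigma_n2):
        """log p(y | theta) for a zero-mean GP with a squared exponential kernel."""
        K_yy = sigma_f**2 * np.exp(-(t[:, None] - t[None, :])**2
                                   / (2.0 * length_scale**2)) + sigma_n2 * np.eye(len(t))
        L = np.linalg.cholesky(K_yy)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

        data_fit   = -0.5 * y @ alpha                    # rewards explaining the data
        complexity = -np.sum(np.log(np.diag(L)))         # -0.5 log|K_yy|, penalizes flexibility
        constant   = -0.5 * len(t) * np.log(2.0 * np.pi)
        return data_fit + complexity + constant

    # Illustrative data; compare a few length scales to see the trade-off.
    t = np.linspace(0.0, 500.0, 30)
    y = 10.0 * np.sin(t / 80.0) + np.random.default_rng(1).normal(0.0, 1.0, size=30)
    for ell in (5.0, 50.0, 500.0):
        print(ell, log_marginal_likelihood(t, y, ell, sigma_f=10.0, sigma_n2=1.0))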
This is why you'll hear the term Bayesian Occam's razor: Occam's razor is implemented via this regularization, this Bayesian model selection, because you've got the data-fit term here and this automatic regularizer here, which discourages overly complex models. The details of dealing with model selection will be fleshed out some more in the practical. Another way to do this, and this falls outside the Bayesian context, I suppose, is to use a cross-validation approach, which is quite popular in machine learning. Instead of considering the marginal likelihood, we consider the predictive distribution for some held-out data; I'll call this the predictive log-likelihood. So I hold out some test data, and as we know, p(y_test | y_train) is again a Gaussian, so you can take derivatives of this and tune your model hyperparameters in the same way that you would in any cross-validation approach. Again, model selection will be dealt with in the practical.

All right, that gets through the basics of what I wanted to talk about: GPs in words and pictures, GPs in equations, and then using GPs in a basic regression context. We're just a few minutes ahead of time, so why don't we take a few questions.

[Audience question, inaudible.] I suppose choosing Bayesian model selection versus cross-validation is something that people can do for a lot of reasons, some of which is just their opinion on this divide between Bayesian and frequentist statistics. I think one reason why people often use cross-validation measures is for out-of-model data: the marginal-likelihood measure is very much within the model that you've chosen, whereas with cross-validation you're saying, let me ignore the model, so to speak; I'm really just interested in the predictive ability on held-out data. So that's one distinction.

[Audience question, partly inaudible: is there a nice kernel that gives you periodic functions?] Yes, we will get to that when we talk about kernel choices; we'll talk about a periodic function, exactly. The kernel that we've chosen, the squared exponential, is just a nice canonical choice that gives you smooth functions over time; we can, and will, mess with that as soon as we go to the next section. If there are no other questions, why don't we take a five-minute breather, and then we'll come back and get into some more interesting details.
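To make the cross-validation route mentioned above concrete as well, here is a minimal sketch that scores held-out data under the GP predictive distribution p(y_test | y_train), using the same zero-mean squared exponential model; the data and the train/test split are made up for illustration.

    import numpy as np

    def sq_exp(a, b, ell, sf):
        return sf**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2.0 * ell**2))

    def predictive_log_likelihood(t_tr, y_tr, t_te, y_te, ell, sf=10.0, noise=1.0):
        """log p(y_test | y_train) under the zero-mean GP regression model."""
        K = sq_exp(t_tr, t_tr, ell, sf) + noise * np.eye(len(t_tr))
        Ks = sq_exp(t_te, t_tr, ell, sf)
        mean = Ks @ np.linalg.solve(K, y_tr)
        cov = sq_exp(t_te, t_te, ell, sf) + noise * np.eye(len(t_te)) \
              - Ks @ np.linalg.solve(K, Ks.T)
        resid = y_te - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (resid @ np.linalg.solve(cov, resid) + logdet
                       + len(t_te) * np.log(2.0 * np.pi))

    # Made-up data split into train/test; pick the length scale that best
    # predicts the held-out points.
    rng = np.random.default_rng(2)
    t = np.linspace(0.0, 500.0, 40)
    y = 10.0 * np.sin(t / 80.0) + rng.normal(0.0, 1.0, size=40)
    train, test = (t[::2], y[::2]), (t[1::2], y[1::2])
    for ell in (5.0, 50.0, 500.0):
        print(ell, predictive_log_likelihood(*train, *test, ell))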
Info
Channel: ML talks
Views: 34,319
Keywords: MLSS, 2012
Id: BS4Wd5rwNwE
Length: 49min 11sec (2951 seconds)
Published: Fri Jul 06 2012