Machine learning - Gaussian processes

Captions
...then doing predictions on a test set simply amounted to conditioning, so a very basic operation on Gaussians is all you need in order to do nonlinear regression. If you have just a few data points, as indicated in this picture, in this case only five, then despite there being very little data we're able to come up with a fit of the data that seems very plausible. Not only do we get a nonlinear fit of the data, but we also get what appear to be good confidence intervals for that data set. In particular, as I have said many times and I repeat: where there is data you should be confident; where you don't have data you shouldn't be confident. Where you don't have data you should rely on your prior knowledge; where you have data you should believe the data.

The key to Gaussian processes is that they capture our belief that a lot of functions in the real world are smooth. Indeed, if you look at the space of natural images and sounds in the world, you will find that smoothness is a very reasonable assumption. So if the data that we acquire from the world satisfies this property, then we would expect Gaussian processes to turn out to be a very powerful way of modeling the world.

We started going over Gaussian processes when there is no noise. To recap, our data set consisted of points and evaluations of a function. Now, it's important to note the following about this function: all I need to do is evaluate it; I do not need a mathematical expression for it. The function could be, for example, his preference for different countries in the world, and I could be asking him questions like: do you prefer Brazil to Canada? Do you prefer the UK to France? Do you prefer Korea to Japan? By asking these questions I will be able to learn a GP that describes his preferences. The GP in this case would be an object in 3-D over the globe, and you could map it, just like a map, to 2-D, and that would give his preferences over countries in the world. He is the cost function, he is f. I'm giving him points, countries or pairs of countries in this case, and he's returning to me the evaluation of those points: I gave him two points and he replies one or zero, one if he prefers the first, zero if he prefers the second. So in this sense the function just needs to be evaluated. GPs are therefore very good when you want intelligent user interfaces, when you want computers that can interact with humans; if you want to build robots this turns out to be a very useful tool.

There was a question: in this case, what's the distance between countries? Right, very good observation; I have not introduced that. The way he was giving me data was as binary labels, "I prefer A to B", and I have not taught you how to handle that yet. What I have taught you to do is regression, so let me repeat the exercise with what I have given you. On a scale from 0 to 1, how much do you like Canada? On a scale from 0 to 1, how much do you like Brazil? How much do you like Costa Rica? How much do you like Antarctica? I could keep going: on a scale from 0 to 1, how much do you like North Korea? This probability is subjective; we've already established that with experiments in this class. So she's giving me different numbers.
His function, on the other hand, would not be very interesting: his map would be just uniform. Nonetheless it would capture the data; his case is really trivial to fit, it's just a line, you just need a bias. Her case is a bit more interesting: one country is 0.1, then Canada, the US, Mexico, Burundi, and her cost function has noise. Now, the important thing is the setup here. When I choose a country, I'm just choosing a point in 2-D. When she gives me a number in return, she's giving me a subjective evaluation for that point; she's giving me f. I give her x_i, she gives me f_i, where i is the index of a country. Just by doing this I can fit a GP and already learn which countries she tends to prefer, whether she prefers Asia, and so on, provided that I assume that if countries are nearby then the preferences are similar; that's the sort of hypothesis I'm using here. The point being that f is a function that might be in the head of the user; f is not a function for which there is a mathematical object that describes it. When I fit a GP, I am learning a mathematical model that describes the world, or the preferences in your head, and so on.

How many people have run the code? Those of you who haven't, please do it soon, because it really maximizes your learning experience. Certainly in the next homework you will get to play with it, but it does help, when I give you code snippets, to have a look at them; take the time, because it makes much more concrete what I'm going over here with this sort of dry math.

OK, back to the dry math. If you know the joint distribution of a Gaussian, and we have the points f in the training set and arbitrary test set points, then computing predictions is just a question of deriving the conditional distribution from the joint, an exercise that we went over in great detail in the last class. You use your x's to construct these matrices of similarities: the similarities between the training set and the test set, the self-similarities in the training set, which is just K, and the self-similarities in the test set, which is K**. Once you've constructed these similarity matrices, f* is unknown, f is known from your training data, and the process of prediction is just computing the posterior distribution of f* given f and the previous x's, for which there is an analytical expression telling us what the mean is and what the variance is. For several test points x*, I create a very fine grid of these x* points, and for each x* I can compute the mean and the variance, and so I can plot it, and it looks like that beautiful plot there, where I'm showing you the mean in blue; the true unknown function is in red. My model seems to capture the world pretty well, and quite importantly, where the model doesn't have data it's very uncertain, and the confidence intervals capture that as well. Since I have a Gaussian distribution for f*, with mean mu* and covariance Sigma*, I can draw samples from it. Each sample is a vector, in this case I think with 50 components, and each color that I'm plotting on the right-hand side is one such vector: each color corresponds to a vector with 50 entries drawn from that Gaussian distribution, and these are essentially the predictions.
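To make this concrete, here is a minimal sketch in Python/NumPy of the noise-free GP regression just described. The kernel (a squared-exponential with width l), the toy data, and all variable names are my own assumptions for illustration, not the course's actual snippet.

```python
import numpy as np

def kernel(a, b, l=1.0):
    # squared-exponential similarity kernel: k(a, b) = exp(-(a - b)^2 / l)
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-(d ** 2) / l)

# toy training data: a few (x, f) pairs, assumed noise-free
X = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
f = np.sin(X)                       # stand-in for the "function in the user's head"

Xs = np.linspace(-5, 5, 50)         # fine grid of test inputs x*

K    = kernel(X, X)                 # similarities within the training set
K_s  = kernel(X, Xs)                # similarities between training and test points
K_ss = kernel(Xs, Xs)               # similarities within the test set

# condition the joint Gaussian on the observed f to get p(f* | f, X, X*)
K_inv   = np.linalg.inv(K + 1e-10 * np.eye(len(X)))   # tiny jitter for numerical safety
mu_s    = K_s.T @ K_inv @ f                            # posterior mean
Sigma_s = K_ss - K_s.T @ K_inv @ K_s                   # posterior covariance

std = np.sqrt(np.diag(Sigma_s))     # pointwise confidence band: mu_s +/- 2*std

# draw a handful of sample functions (each a vector with 50 entries)
samples = np.random.multivariate_normal(
    mu_s, Sigma_s + 1e-8 * np.eye(len(Xs)), size=5)
```

Plotting mu_s, the band mu_s plus or minus two standard deviations, and the rows of samples reproduces the kind of picture described above: the samples agree where there is data and fan out where there is none.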
I used 50 grid points here, but I could have used 50 thousand, or 50 trillion, and you wouldn't be able to perceive the differences, because there's enough resolution there to fool the eye. You could have an infinite number of points and still be able to do this, because indeed a Gaussian process defines a distribution over functions. What you see is that, essentially, what the Gaussian process does is this: you have all these functions, the samples from your prior, and what the data does is grab those functions and squeeze the uncertainty, so that your samples now look like samples from the posterior distribution. It makes sense, because if you look at these curves, they are curves that fit inside the grey confidence intervals on the left.

OK, so we did this in detail in the last class, and we also looked at the effect of choosing the kernel. The kernel has parameters: you can decide whether you want points to be similar only to things that are very close by, or also to things that are a bit further away. That is what this parameter l does; essentially it changes the width of the Gaussian kernel. Here I'm showing you a few values. If you use an l that is very small, in other words a very thin kernel, then only points that are nearby are similar to each other, and so you would expect a more wiggly function, which is what we observe. As you can see here, the mean, in red, is a very wiggly function, and as we increase l the function becomes smoother, and eventually, if I kept increasing it, it would become a straight line; it would become way too smooth. So there's a natural trade-off here: we have to choose this kernel width, and the simplest way, which you already know, is cross-validation, so you could quite reasonably use that.

All right, who's got Skype going on their computer? Can I ask you to please turn off all online media on your computers? It is distracting. Thank you. Where was I? Yes: we control the smoothing by controlling this parameter of the kernel, and one way to choose this parameter is by doing cross-validation, like you have done before.

Go ahead. [Student question about which curve is the true function.] No, I know: the true function is in blue and the prediction is in red, because, as you can see, the red is more wiggly here, it's going up and down, whereas the red here tends to be a much straighter, much smoother function. Actually, I hope I got it right here... oh no, I had it the other way: this seems to be the true function and the other one is the prediction. If you run the code you will see it for yourselves. This is the important thing: I gave you the code that does this, so you can go to that code now, you can tune that parameter, and you can see what happens. Nothing is going to teach you better than going and playing with it yourself: what happens if I make l really large, what happens if I make l really small? Go through those experiments, because you will get a lot from doing them.

[Student:] In the smallest one, the one on the left there, the red one, what's dragging the prediction up like that? There are no data points up there. [Answer:] What's dragging the prediction up on the red curve right there? I have no idea; it could simply be the nature of the curvature of the function. As it's coming in from the left and from the right, both sides have negative curvature, and this might be what's forcing it up.
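For anyone who wants to try the experiment suggested above (making l really small or really large) before the homework, here is a hedged sketch of the kind of loop you might write; the kernel, the data, and the candidate values of l are all made up for illustration.

```python
import numpy as np

def kernel(a, b, l):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-(d ** 2) / l)

X  = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
f  = np.sin(X)
Xs = np.linspace(-5, 5, 200)

for l in [0.1, 1.0, 10.0, 100.0]:      # thin kernel -> wiggly mean; wide kernel -> nearly a line
    K_inv = np.linalg.inv(kernel(X, X, l) + 1e-10 * np.eye(len(X)))
    mu_s  = kernel(X, Xs, l).T @ K_inv @ f
    print(l, np.round(mu_s[::40], 3))   # inspect (or plot) how the fit changes with l
```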
One thing I could do with the GP to get rid of that artifact, the prediction being dragged up where there is no data, is this: if the data consisted of the slopes, the derivatives, as well as the function evaluations, I could match on the slopes too, and I would get a much smoother fit. Was there another question? OK, let me move on. Like I said, play with the code so you can see the effect of this parameter; by doing cross-validation you can set it, and I'm going to talk about a few other ways to tune this parameter today.

[Student:] It would seem, looking at these pictures, that changing l also changes your confidence quite dramatically. [Answer:] Yes, it will change your confidence, because when you change l you change K, and both the mean and the variance are functions of K. So it's very important: depending on the objective of the analysis, if you're just looking for mean predictions, then you do cross-validation; but if you need to model uncertainty, then the choice of l is actually quite tricky, and in the next class on Thursday I'm going to come to a case where it is particularly tricky. For now, imagine that you have lots of data, your purpose is to fit a function, and you can do cross-validation, because it's the one technique I've taught you so far. In that setting you now have a very powerful way of doing function approximation.

Before we get to more detail on how to learn that kernel parameter, let's first deal with a few other things that are equally important. The first is how to deal with noise. So far I assumed that the function f didn't have noise, but of course, sometimes, when I ask a question, like when I asked her for a rating on a country like Burundi, she might not be entirely certain how much she would like it, because she doesn't know what Burundi looks like. Then there is uncertainty in her reply, and you want your model to acknowledge that uncertainty. One way to do it, if I assume that the data is just ratings and that the noise is Gaussian, is to add a Gaussian variable to my function evaluation and get a noisy function evaluation, which I call y; y is the noisy function measurement. In the initial experiment where he was giving me zeros and ones, I would have had to use a Bernoulli model to model that kind of noise.

OK, so if we have Gaussian noise, then in order to compute the distribution of y given x I need to integrate over f; I need to marginalize it out. If I want p(a) and I have p(a, b), I have to sum over b in order to get p(a); that's a basic operation of probability: whenever you have a variable you're not interested in, you integrate over it. We know each of these quantities: p(f) is just the prior, which, as I said, we assume is zero-mean (assuming we have standardized the data), with K our similarity kernel as before. But now we're saying that each data point that I observe is independent and has Gaussian noise; since each point is independent, if I have n points I just multiply n one-dimensional Gaussians, one for each point y_i that I observe. And if we use the convolution theorem, the integral of the product of two Gaussians, which I introduced in the last class, then you'll see that the distribution for y, the noisy function, still has the same mean, but its covariance increases: the covariance for y now includes the covariance of f plus sigma squared times the identity.
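Written out explicitly, the marginalization just described is the following (using the same symbols as above; the last step is the Gaussian convolution identity from the previous class):

```latex
p(\mathbf{y}\mid X)
  = \int p(\mathbf{y}\mid \mathbf{f})\,p(\mathbf{f}\mid X)\,d\mathbf{f}
  = \int \mathcal{N}(\mathbf{y};\,\mathbf{f},\,\sigma^{2} I)\;
         \mathcal{N}(\mathbf{f};\,\mathbf{0},\,K)\,d\mathbf{f}
  = \mathcal{N}(\mathbf{y};\,\mathbf{0},\,K + \sigma^{2} I).
```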
If you do not want to use that theorem, you can think of it a different way: just think of the linearity of expectation, and covariance is indeed just an expectation. If you have independent random variables, then using the linearity of expectation, if you want to know the variance of the sum of two random variables you just add the individual variances; it's a property that follows from linearity. The conclusion is that y will now have covariance K_y. So in mapping f, the noise-free function, to the noisy function y, all I'm doing is modifying the kernel so that it has an extra diagonal entry, which corresponds to the noise in the data, the noise observed when I ask a question.

Once again, we know how to do prediction. Now that I know what the covariance of y is, it still has mean 0; the only new thing is that I replaced K by K_y, where K_y has this extra sigma squared added to the diagonal, and the rest is as before: if we have an expression for the joint distribution of a Gaussian, we just use our theorem to get an expression for the conditional, which gives us the mean and the variance. So having Gaussian noise only involves a minor modification of the algorithm, the addition of a diagonal term sigma squared; everything else is the same. It's a very minor modification to account for uncertainty. When you plot this, however, you observe that even in the regions where there is data, like here, there is still uncertainty. Before, because we assumed no noise, the uncertainty collapsed at the data; now the uncertainty doesn't collapse at the data, because there's always an error we will make: we're assuming we will always make an error of at least sigma squared, because that's the uncertainty in the data. However, we still have the same properties as before: where we have data we have less uncertainty; where we don't have data we have greater uncertainty.

Now, that's essentially all there is to GPs. I think this slide summarizes the whole thing. If you want to fit a GP to data, you construct a matrix K using the similarity kernel; if you have noise in the data, you just need to add sigma squared, the variance of the noise, to the diagonal, and that gives you your K_y matrix. If you've rescaled the data so that it has zero mean, then the only thing left to do is code these two equations for mu* and Sigma*, and that gives you the predictions at any point x*: the mean prediction as well as the confidence interval. And we're done; we have a beautiful way of fitting data, and the code is really simple.
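As a rough sketch of how small that modification is in code (again with an assumed squared-exponential kernel and made-up data; only the diagonal term is new relative to the noise-free sketch above):

```python
import numpy as np

def kernel(a, b, l=1.0):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-(d ** 2) / l)

X  = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
y  = np.sin(X) + 0.1 * np.random.randn(len(X))    # noisy ratings: y = f + Gaussian noise
Xs = np.linspace(-5, 5, 50)
sigma2 = 0.1 ** 2                                  # assumed noise variance sigma^2

Ky     = kernel(X, X) + sigma2 * np.eye(len(X))    # the only change: K -> K + sigma^2 I
Ky_inv = np.linalg.inv(Ky)
K_s    = kernel(X, Xs)

mu_s    = K_s.T @ Ky_inv @ y                       # posterior mean at the test points
Sigma_s = kernel(Xs, Xs) - K_s.T @ Ky_inv @ K_s    # posterior covariance
# note: even at the training inputs the predictive uncertainty no longer collapses to zero
```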
[Student:] How can we measure the variance sigma_y? [Answer:] How can we measure it? You have to estimate it from data. How do you estimate it? Exactly how you did it in homework 1: maximum likelihood. Or which prior would you put on sigma? Exactly, the inverse Wishart. One thing I didn't say is that the inverse Wishart in 1-D has a name: it's called the inverse gamma distribution. In 1-D, if you plot the inverse Wishart distribution, it looks like this. [Student:] It has parameters, so you still need to choose the parameters of the prior. [Answer:] That is correct, and actually that points to a very subtle question. I don't know what my sigma is, so instead I'm going to say that my sigma comes from this distribution; I'm going to assume that this is my prior for sigma, and then, given the data, I will readjust this distribution and get something different: a posterior for sigma squared given the data.

The prior has parameters. Those parameters are useful, because as a designer I want to be able to tune them to say what my preferences are: I choose the shape of the prior to capture my preferences. However, in some cases I might not have information that allows me to choose the best shape of the prior. In such cases you can put in another level of inference and actually estimate those parameters, that is, infer the prior parameters directly from the data. That is called maximum likelihood type II, or, as some people call it, evidence maximization. In this course I've chosen not to go into that, but it's routinely done in practice. [Student:] That's exactly empirical Bayes. [Answer:] Yes, that's exactly empirical Bayes, and some people question its validity, because you're learning the parameters of the prior from data, and learning the prior from data seems somewhat wrong. It isn't, in practice, but there are smarter ways to do it. The other option is that, if you are Bayesian, you can do something called hierarchical Bayes. In hierarchical Bayes, when you have uncertainty over a quantity, you put a prior on that quantity. I might not know what my sigma is, and so what I say is that I believe it's in this range with high probability. That's less commitment than saying sigma is 2 or sigma is 1; all you're saying is that it's in a range. The next question is: what's that range? It's clear that committing to a range is not as big a commitment as committing to a specific value, and if I do not want to commit to the range either, I can put a hyper-prior on it: I believe this quantity is in this interval, and I believe the boundaries of the interval are between these values and those values. So you're making your knowledge very diffuse, and what people most often find in practice is that, if you follow this hierarchical Bayesian modeling strategy, the algorithms tend to be fairly insensitive with respect to the specification of the parameters at the top of the hierarchy, but very sensitive with respect to the specification of the parameter at the bottom. The idea is that you're trying to create an algorithm that is well behaved when you tune the parameters: what you don't want is an algorithm where a slight change of a parameter gives you completely different results. If you can come up with a Bayesian hierarchy such that, when you tune the parameters, choosing 0.1 or 10 doesn't change the results much, then you have a fairly robust prior. However, when doing this it's always important to test that hypothesis; it's always important to verify that you indeed have a robust prior. Often I see papers that build a hierarchy but forget to do the experiment confirming that the prior is indeed insensitive, so it's important to do the sensitivity analysis. For now we will assume that we specify the prior ourselves and that we have good knowledge to do so. For your project, of course, you will likely do things like empirical Bayes if you like, because that's what people do in practice. In fact, here I'm putting a prior over functions and I'm saying that I don't know what l is, the width of the similarity kernel; as such, that width of the similarity kernel is taking me one level up in the hierarchy.
What I'm going to do next is describe an algorithm for choosing that width; in a sense I'm doing maximum likelihood type II, so I'm doing empirical Bayes. Before I get there, though... actually, that comes later; let me jump over a few slides, since we're discussing this, and go straight to it; I'll come back to the other material I wanted to cover. Just to continue answering your question: suppose I want to learn the kernel parameters. One way to do it is cross-validation, and to learn the noise sigma_y I could also do cross-validation, or maximum likelihood, or Bayesian learning; so you start realizing that there's more than one way to do this. What I could also do, in order to learn the kernel parameters, is the following. I could write down the likelihood. The likelihood is just the probability of y given X, and this quantity is actually called the marginal likelihood: in this model I have a latent, or hidden, variable f, and I need to get rid of its effect in the likelihood. So in order to get the probability of y given X, and I should emphasize given l squared as well, where l squared is my parameter, I need to integrate out f. To integrate out f, once again I use the rules of probability, and that gives me a Gaussian with zero mean and covariance K_y. I can then compute the log marginal likelihood, which is this expression. K has parameters, in this case just l, and when I compute the derivative of the log likelihood I just need to compute the derivative with respect to each of the parameters of K. In other words, I can do maximum likelihood to learn l, which is effectively what I'm doing here. Now, p(f) is a prior, and I'm learning the parameters of the prior, the width of the prior, from data; in doing this I'm following the procedure that people refer to as empirical Bayes, empirical because we're using the data to learn the prior. Now, I have not taught you yet how to do optimization. Those of you who know how to do optimization will see this is trivial, because you have an expression for the gradient; if you follow the gradient you will be able to get a good estimate of theta. If you haven't seen optimization, then wait a few lectures and we will cover it, and once you learn optimization you'll be able to actually implement this and learn the optimal theta by maximum likelihood.
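For concreteness, here is one hedged way to score kernel widths by the log marginal likelihood, log p(y | X, l) = -1/2 y' K_y^{-1} y - 1/2 log|K_y| - (n/2) log 2*pi, using a simple grid of candidates instead of gradients, since optimization hasn't been covered yet. The kernel, the data, and the candidate values of l are again just assumptions for illustration.

```python
import numpy as np

def kernel(a, b, l):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-(d ** 2) / l)

X, sigma2 = np.array([-4.0, -2.0, 0.0, 1.5, 3.0]), 0.01
y = np.sin(X) + np.sqrt(sigma2) * np.random.randn(len(X))
n = len(X)

def log_marginal_likelihood(l):
    Ky = kernel(X, X, l) + sigma2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(Ky)     # stable log-determinant of K_y
    return -0.5 * y @ np.linalg.solve(Ky, y) - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

candidates = [0.1, 0.3, 1.0, 3.0, 10.0]
best_l = max(candidates, key=log_marginal_likelihood)   # empirical Bayes choice of the kernel width
print(best_l)
```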
OK, but let's go back to Gaussian processes. What I want to do next is derive Gaussian processes from a different perspective, and hopefully, by deriving them from a different perspective, it will connect Gaussian processes to what you learned when you did linear models, it will connect with homework 1, and it will give you more insight into why these methods work so well. So let's assume we have a single test point x*, a single test input. Then the prediction of f*, which in this case is 1 by 1, is just given by this Gaussian here, which has mean k*' K_y^{-1} y and the covariance that appears in the second argument of the Gaussian. Because we only have one point, k** is just a scalar, and in fact, if we use the similarity kernel that I've been using, the exponential one, it is just 1, because e to the minus (x* minus x*) squared is e to the 0, and e to the 0 is 1. k*' is a vector of size 1 by n, and the matrix K_y is of size n by n, as before. These quantities, K_y and y, come from the training data: y was the vector of answers that she gave me, and K_y was the matrix that I built using the x's that I used to question her; all of this comes from the training set.

I can now take this expression for the mean, and instead of writing it in vector-matrix-vector notation (recall this is 1 by n, this is n by n, this is n by 1, and y is n by 1, so the whole thing is 1 by 1, because it's just the mean for a single point) I can rewrite it by giving K_y^{-1} y a name: I'm going to call it alpha, and that's a vector that is n by 1. Then what I'm left with is the multiplication k*' times alpha, a vector times a vector, which is just a dot product. Now, a vector times a vector I can rewrite component-wise, this way, so each alpha_i now appears multiplying a kernel. What does this remind you of that we've seen in this course? An RBF: this is just the equation of an RBF. An RBF was a linear combination of basis functions, and this is a linear combination of basis functions, because each k is a basis function: the whole thing can be written as the sum over i of alpha_i times e to the minus (1 over lambda) times (x_i minus x*) squared. That's precisely an RBF, and in an RBF, wherever there is data we place a basis function and we scale the basis function by alpha. So when I fit a function, I get a function that looks like this: that function f is a sum of basis functions, each basis function scaled by an alpha, and each alpha is just a real number that scales its basis function. So in a Gaussian process, if you look at the mean, what you're doing is fitting an RBF: you're fitting a nonlinear function using a combination of basis functions, where you place one basis function at each data point and then add them up. It's a completely alternative way to arrive at the same result that we had before, but now at least we know that putting the basis functions where the data is, is the principled thing to do.
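A quick numerical check of that reading, with made-up data and names; it just verifies that the GP mean equals a weighted sum of kernel basis functions placed at the training points.

```python
import numpy as np

def k(a, b, l=1.0):
    return np.exp(-((a - b) ** 2) / l)

X  = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
y  = np.sin(X)
Ky = k(X.reshape(-1, 1), X.reshape(1, -1), 1.0) + 0.01 * np.eye(len(X))

alpha  = np.linalg.solve(Ky, y)                  # alpha = K_y^{-1} y
x_star = 0.7

mean_vector_form = k(X, x_star) @ alpha          # k*' alpha
mean_rbf_form    = sum(a * k(xi, x_star) for a, xi in zip(alpha, X))   # sum_i alpha_i k(x_i, x*)
print(np.isclose(mean_vector_form, mean_rbf_form))   # True: one basis function per data point
```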
Now let's look at something even simpler where a Gaussian process arises: ridge regression. Recall that ridge regression is a linear model, y = X theta, just the equation of a line or a plane, and we put a regularizer on theta because we prefer smaller weights; as you saw in your homework, by using this regularizer you get better answers. The solution, obtained by differentiating the objective function and equating to zero, is given by these equations. Here I remind you that X is a matrix of size n by d, where n is the number of data points and d is the number of features, and y is the output, in this case n outputs, so n by 1. The matrix X I can write as x_1, x_2, all the way up to x_n, where each row x_i is a vector with 1 by d entries; each small x is one of the rows of this matrix.

In your homework you had to prove the following: that the parameter theta can alternatively be written as theta = X' alpha, where alpha becomes the unknown. This is a reparameterization of the solution, and in doing so you had to prove that alpha is equal to this quantity. That's fairly easy to do, because if we go back to the ridge solution we have X'X theta + delta squared theta = X'y, from which delta squared theta = X'(y - X theta), and hence the result follows: theta = X' times delta^{-2}(y - X theta), in other words theta = X' alpha. That was the first thing you had to do: rewrite theta this way. Moreover, once we know that alpha is given by this expression, we went further and said that alpha can itself be written in this other form. To show that, we again go back to the same expression: we know that delta squared alpha = y - X theta, and we know that theta, as before, is just X' alpha, big X transpose alpha. Substituting that in, X X' alpha + delta squared alpha = y, and I can put an identity matrix of size n next to the delta squared, because multiplying a vector by the identity just gives you the same vector, so (X X' + delta squared I_n) alpha = y. In other words, alpha = (X X' + delta squared I)^{-1} y.

So we have two different ways of writing the solution for ridge regression. In the first way we use a parameter theta that is d by 1; in the second way we use a parameter alpha that is n by 1. Which one you use largely depends on how big your n is and how big your d is. If you have an example like bioinformatics, it might be that your n is 20, you have 20 patients, but for each patient you have 20,000 features, which could be how much each gene is expressed. In that case it makes much more sense to parameterize in terms of alpha, because the matrix you have to invert is n by n, which is 20 by 20 and easy to invert; if, however, you were to use the other parameterization, theta, you would have to invert a matrix that is 20,000 by 20,000. So depending on which is bigger, d or n, you use one of the two solutions. This is an incredibly useful computational trick: the inverses cost cubic time to compute, so if you can go from 20,000 cubed operations to 20 cubed operations, that's a huge gain. You couldn't just go and buy computers to get that gain, because computers don't grow in power cubically, so it's really useful to know these tricks.
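A small sketch of that trick with made-up dimensions (n = 5 data points, d = 2000 features), checking that the primal (d by d inverse) and dual (n by n inverse) forms of ridge regression give the same weights:

```python
import numpy as np

n, d, delta2 = 5, 2000, 0.1
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))          # n data points, d features (d >> n)
y = rng.standard_normal(n)

# primal form: solve a d x d system to get theta directly
theta_primal = np.linalg.solve(X.T @ X + delta2 * np.eye(d), X.T @ y)

# dual form: solve only an n x n system, then map alpha back via theta = X' alpha
alpha      = np.linalg.solve(X @ X.T + delta2 * np.eye(n), y)
theta_dual = X.T @ alpha

print(np.allclose(theta_primal, theta_dual))   # True: same solution, very different cost
```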
In addition, this trick allows us to connect ridge regression to Gaussian processes; let's see why. Here I should add that the prediction y* for a new input x* is just x* theta, the usual prediction, or, using the expression we have for theta, we can write this as x* X' alpha. So there are two ways to make predictions, using theta or using alpha, depending on which is easier to compute. Now, the prediction for a new point is x* theta, or equivalently x* X' alpha, or equivalently, using the estimator of alpha that we just derived, x* X' (X X' + delta squared I)^{-1} y. Moreover, recall the definition of these matrices: X is x_1, x_2, up to x_n stacked as rows, so X X' is the product of the stack of x_i's with the stack of x_i transposes, a matrix whose entries are x_1 x_1', and so on.

Recall that each of these vectors, each of the small x's, is 1 by d: x_1 is 1 by d, and so are x_2 and x_n. X is a matrix, and the small x's are its rows, the row vectors of d entries. So what I have in each entry is a row vector of d entries times a row vector of d entries, which is a dot product. The matrix I'm calling X X' is of size n by n (this was n by d, this is d by n), so it's a matrix as big as the number of data points in your training set, and each entry is the dot product between x_i and x_j. Now, the dot product, I argued in the last class, can be interpreted as a measure of similarity: if two vectors are very similar their dot product is high; if they are very dissimilar their dot product is low. Dot products are a measure of similarity; in other words, this is a kernel matrix. Instead of using the exponential kernel, I'm now using a linear kernel, the dot-product kernel. Using that, I can rewrite this whole thing: we had added a delta squared, so let's call all of this K_y. In other words, the dual form of ridge regression is a GP. Ridge regression in its dual form is a GP, at least as far as the mean is concerned: we have proved that the mean of the GP is in fact the same thing as what you get when you do ridge regression in its dual form, where the similarity kernel is just the dot product. We haven't talked about the variance, and I'm not going to go into that part, but the derivation is similar. We use dot products to measure similarity, and it makes sense, because a dot product is just the magnitude of the first vector times the magnitude of the second vector times the cosine of the angle between them: if two vectors are similar, the angle between them is small; if they're very dissimilar, you get a big difference in the cosine. So ridge regression is a GP; that's another way to look at it.
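To make that correspondence concrete, a hedged sketch with assumed data and delta squared, checking that a GP mean with the linear dot-product kernel k(x_i, x_j) = x_i . x_j reproduces the dual ridge prediction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, delta2 = 8, 3, 0.5
X, y   = rng.standard_normal((n, d)), rng.standard_normal(n)
x_star = rng.standard_normal(d)

# dual-form ridge regression prediction: x* X' (X X' + delta^2 I)^{-1} y
alpha      = np.linalg.solve(X @ X.T + delta2 * np.eye(n), y)
pred_ridge = x_star @ X.T @ alpha

# GP mean with the linear (dot-product) kernel and delta^2 on the diagonal
K, k_star = X @ X.T, X @ x_star
pred_gp   = k_star @ np.linalg.solve(K + delta2 * np.eye(n), y)

print(np.isclose(pred_ridge, pred_gp))   # True: same mean prediction
```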
So there are many ways to arrive at Gaussian processes. I like the probabilistic way, because I think it's very simple: provided that you know what a Gaussian distribution is, it's really trivial to apply one theorem about Gaussians to get the predictions, and it gives you a good way to understand it. But knowing how one can arrive at Gaussian processes through these function-space arguments is also useful. Another way to arrive at Gaussian processes, which some folks out there use, and we haven't done this yet, but for those of you who have seen neural networks before: if you take a one-hidden-layer neural network and let the number of neurons go to infinity, you can show that that leads to a Gaussian process as well. [Student:] Do you need infinite neurons, or just one neuron per data point? [Answer:] An infinite number of neurons. I can point you to a reference for the construction: David MacKay, a professor at Cambridge, has a nice tutorial where he shows this, somewhat non-rigorously, but with a very nice intuitive way of explaining it.

[Student question about the delta squared term.] In this case the delta squared term is just doing regularization: it's the delta squared term that we used in ridge regression, which was there for penalizing parameters that are too large, so it's just smoothing your function. [Student:] Isn't that the same thing as sigma squared? [Answer:] Not quite; sigma squared is a noise parameter. But the two are related, because this thing here is what I'm calling K_y, right, and K_y^{-1} is just this whole thing inverted, and this is what I'm calling k*. To be a bit more clear, k*' in this case is a vector whose entries are dot products with x*; x* is, sorry, 1 by d, this is 1 by d and this is d by 1, and the prediction is a scalar. So I would have x* times x_1', where x* is a row vector and x_1' is a column vector, then x* times x_2' would be the next component, all the way up to x* times x_n', and of course this is of size 1 by n, because each entry is the product of two vectors, which is itself a scalar. So choosing delta will affect what my K_y is, and hence it will affect my mean prediction; it will also affect my variance prediction, so it does control smoothness. I have two parameters to control smoothness, the kernel width and also delta, and in addition I also have the variance of the noise, sigma_y, to play with.

[Student:] It just seems that delta squared is playing the same role as assuming that your input data is noisy. [Answer:] Oh wait, I see what you're saying; yes, that is true, that's very true. Now I understand your question. When we modeled noisy data, our trick was to introduce sigma squared, and the way it gets mirrored in the equations is just as an added diagonal term. And when I do the mapping from ridge regression, I end up with this delta squared, which is indeed just the same as sigma_y squared, except that here it has a slightly different interpretation: it came in as a regularization trade-off. But in terms of their functionality they are the same parameter; in the code they are the same quantity. So yes, you would have to tune delta squared, and you would have to tune the width of the kernel, or any other parameters the kernel may have. And how do you tune them? If you have only two parameters, cross-validation does the trick, provided, again, that you have enough data; cross-validation works when you have lots of data. You could also do maximum likelihood, but maximum likelihood is likewise only effective when you have lots of data; a paucity of data causes problems. There are Bayesian ways to deal with this, but they require techniques we haven't discussed yet in the course.

[Student:] Since delta squared is similar to sigma squared, does that mean it's like having noise in your data? [Answer:] Yes, precisely, and that is basically the principle behind a technique we will look at later called dropout nets. Ridge regression is also called Tikhonov regularization, and Chris Bishop has a nice paper which shows that Tikhonov regularization is equivalent to adding noise to your data, and people exploit this to construct algorithms in some crazy ways. [Student:] If we found delta squared by cross-validation, is it an estimator for the variance of the measurements, like for example in the prostate data where we found delta? [Answer:] In this case, in the GP, yes, with this interpretation in the dual space.

OK, so the only thing left is how we implement this, and the implementation is really trivial: there's a short snippet of code available on the course website, so I recommend you go there and download it.
The slides are also there; I will upload them with the changes I've made today. The only other subtle thing you need to know in order to understand the code is that, when we compute the mean, instead of just inverting K we do a Cholesky decomposition of K and then solve a couple of linear systems. That's equivalent to inverting, just a bit more numerically stable, and there are many other ways to do this. There are lots of very good algorithms for solving linear systems; conjugate gradients is one you could use. Inverting a linear system costs n cubed; if you use conjugate gradients you can usually get very good answers in just a few iterations, and the cost is just a few iterations times n squared, so you reduce your cost by a factor of n. Reducing the cost by a factor of n is what determines whether an algorithm gets deployed in a web setting or not. This has happened to me many times working with large data sets in a live search engine: if the algorithm is cubic and there's no obvious way to reduce it to n squared, we abandon the idea, even if we think it is a good one, just because we do not have the computing power, and buying machines and clouds will not solve that problem. So in this case, by solving the linear system with conjugate gradients, it's possible to knock the cost down to a few iterations times n squared, and then it's possible to do a few more tricks to make it even faster.
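I can't reproduce the course's exact snippet here, but a minimal sketch of that Cholesky-based computation might look like this (kernel, data, and noise level assumed as before):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def kernel(a, b, l=1.0):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-(d ** 2) / l)

X  = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
y  = np.sin(X) + 0.1 * np.random.randn(len(X))
Xs = np.linspace(-5, 5, 50)

Ky = kernel(X, X) + 0.01 * np.eye(len(X))        # K + sigma^2 I
L  = cholesky(Ky, lower=True)                     # Ky = L L'

# solve Ky alpha = y with two triangular solves instead of forming Ky^{-1}
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)

K_s   = kernel(X, Xs)
mu_s  = K_s.T @ alpha                             # posterior mean, numerically more stable
v     = solve_triangular(L, K_s, lower=True)
var_s = np.diag(kernel(Xs, Xs)) - np.sum(v ** 2, axis=0)   # pointwise predictive variance
```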
And that's pretty much what a Gaussian process is: essentially that snippet of code summarizes how to do Gaussian process regression. What I haven't told you yet is how to deal with classification, that is, how to deal with the fact that sometimes the labels are binary, or preferences, as when I asked "do you prefer this country to that country". If you are comparing items, then the data is discrete, and there we're going to have to do a bit more work than in regression; the math is a bit more involved. To get there, I will first teach you what classification is and some basic techniques for classification, and then, if time permits at the end of the course, I will discuss classification with Gaussian processes. However, the textbook by Kevin Murphy goes immediately into classification after this point, so if you're interested in quickly seeing how to do classification with Gaussian processes, I refer you to the book, and if you have questions, come to office hours.

In the next lecture we're going to use Gaussian processes, mainly because they're very good estimators of the variance, in order to do active learning; in other words, we'll have learning machines that ask questions. Those are the most commonly deployed learning machines at present. If you think about it, every time you get an ad on Facebook, that's a question being asked of you: do you like this ad? And you decide by clicking on it or not clicking. When you do a Google search for an image, a bunch of images appear, and whether you click on those images, and in which order, sends Google your vote as to which order you think the images should be in. Even if Google started with a very bad image search, if you typed "cat" and the first two images were not a cat but the third image was, you would likely click on that image, and by doing so you would be sending Google information about what an image of a cat is. So even if Google was not able to do object recognition, Google would have learned how to build the image search engine just by relying on your votes. Speech recognition is another example where you are continuously providing data to the people who build recognition systems. So active learning is a big part of machine learning on the web, and it's a huge part of the whole advertising engine out there, which, sadly, is one of the reasons machine learning is so popular: advertising.

All right, so in the next lecture I'm going to go over how you can use Bayesian optimization to build systems that ask questions, and for the remainder of this lecture I'll play a video giving you an example of one such application built using Gaussian processes. In the Google group I shared a couple of links with you. One link is a tutorial on Bayesian optimization, which, if you want to prepare yourselves for the next class, I strongly recommend you read; you've already done Gaussian processes, so you'll find it easy to read. There's also a recent blog entry on how Google Analytics uses multi-armed bandits, and I'm going to cover that in the next class, because it's very similar to the problem we're dealing with here with Gaussian processes. So have a look at those links if you get a chance.

OK, so I'd like to end the class by showing a project on how you could build an interactive interface to assist animators using Gaussian processes. In a nutshell, we humans are very good at looking at an animated movie and telling whether that dinosaur has a nice walk, whether that dinosaur actually walks like Brad Pitt. If you know how Brad Pitt walks, I can play a movie of the dinosaur walking and you can say "no, that walk is not like Brad Pitt's" or "yes, it is"; most humans are very good at doing that. Certainly you're very good at going to an animated movie and saying whether the animation was good or not. What's incredibly hard, and those of you who do animation will agree, is to come up with the mathematical models that make the dinosaur walk like Brad Pitt. Never mind a model that walks like Brad Pitt; coming up with a mathematical model that governs how a human walks is incredibly complex. So there are these problems for which engineering the cost function is extremely hard, but for which it's possible to get good evaluations. I can show a movie of a dinosaur, and in order to produce that movie I need to choose the parameters of my simulator; I show him the movie and ask him, on a scale of 0 to 1, how good a walk is this, and he gives me a rating. Essentially I've given him a point and he's given me an f, an evaluation for that point, so I'm learning the function in his head as to what a good walk is. I keep trying different walks: now let's look at this walk, how much do you rate it? He gives me a rating, and eventually I will learn a function that describes what he thinks is a good walk; in other words, I will have tuned my engine to the user. Of course, different users will like different walks, so this would be personalized. Here is an exercise in building such a system.
[Video:] The computer graphics field is filled with applications that require the setting of tricky parameters. In many cases the models are complex and the parameters unintuitive for non-experts. We present a machine learning method for setting parameters for a procedural fluid animation system. It uses a gallery of animations and collects...

Actually, let me start this again, because the audio of this video is not coming through properly; let's try one more time.

[Video:] The computer graphics field is filled with applications that require the setting of tricky parameters. In many cases the models are complex and the parameters unintuitive for non-experts. We present a machine learning method for setting parameters for a procedural fluid animation system. It uses a gallery of animations and collects user feedback to find settings the user likes. We apply our method to generating smoke animation, which drives a set of passive marker particles through a procedurally generated velocity field. This velocity field is generated by taking the curl of a vector-valued potential function. There are two main components to this potential field: the contribution due to a set of vortex rings, and a spatially varying noise function. In total, the model requires the tuning of between six and twelve real-valued parameters. Generating high-quality renderings is expensive; fortunately, we can quickly generate low-resolution animations in seconds. This allows the user to find the desired settings and then generate a high-quality animation clip. Different parameter settings can be used to produce a variety of distinctive animations; however, finding the right values to generate a specific animation is difficult, especially for non-experts. Our solution is to use techniques from machine learning to present the user with an animation gallery. At each iteration of the gallery, the user is shown four animations, which they rate by indicating preference for animations that look more like their target. By using preferences we reduce the cognitive effort required to evaluate the animations, though users can also directly assign numerical ratings to the animations if desired. When the user completes the task of entering preference feedback, the model of the user's preferences is updated and four new sets of parameters are automatically selected. In selecting the examples, the system automatically evaluates the trade-off between two competing goals: trying to offer improvement on the animations the user is known to like, and exploring regions of the parameter space for which the user's preference is unknown. It is not our intention to supplant user expertise, but to use it more efficiently. In reality, users are unlikely to come to the system with absolutely no idea of what parameter values they are looking for; even novices, after looking at a few samples, can quickly realize that certain parameter values are unsuitable. Furthermore, even when using feedback efficiently, exploring the entirety of a high-dimensional space to a reasonable sample density is a tedious activity. Our interface allows users to restrict the ranges of any or all of the parameters, or fix them at a specific value. These techniques work well for a few parameters, but as the number of free...

All right, I'm going to move on toward the end.

[Video:] ...the more the feedback is used, the more accurate the priors become. It is even possible to adapt to different users or groups by using different priors.

That last part is important. In this case, what you have is a way of generating smoke with some very tricky parameters.
So what the system does is this: it picks some parameters, that's the x; it then gives that x to the animator by presenting it in a nice UI; and the animator gives a rating. The rating the user gives is the function evaluation. You collect data like this and you fit a GP. And then, as the video discusses, there is this trade-off between trying to show things that are very close to the ones the user liked best and exploring different areas of the parameter space. In order to exploit, we use the mean: where the mean is high, we know those are good values. Where there is uncertainty, where the variance is large, that's where we should choose points in order to explore. So by having Gaussian processes we get a very nice mechanism for trading off exploitation and exploration, which is one of the fundamental problems in AI; I think one of the central problems, if you want to make progress in AI, is to nail this down.

In this simple case, it's possible to automatically tune a system to produce a certain kind of animation, but this is far more general than animation. Whenever you have a system that has parameters, and there is no mathematical model that allows you to tune those parameters easily, this applies, and computer science is full of such systems: ad centers have parameters that people tune by hand, and optimizing compilers are another big area where people spend a lot of effort tuning parameters. If we have a way of trying some parameters and building a function, like a GP, that tells us which parameters are good, we will be able to optimize compilers, ad centers, information extraction architectures in language, computer vision algorithms, interfaces, and I could go on. At UBC we have a lot of people who work on this, and I'm already discussing it for research projects. Holger Hoos and Kevin Leyton-Brown have been pushing this idea, not using Gaussian processes but using random forests, which we will discuss next week and which are very similar; they allow you to do some of the same things. They've been exploring this idea of tuning algorithms automatically and have had incredible success. In fact, Holger has gone beyond this and argues that it's not just a question of tuning parameters: when you write code and introduce all these choices, you should make those choices explicit, and they have an algorithm that will make those choices intelligently for you. So this is very critical for increased automation in society: if we want systems that will automatically tune interfaces, systems, code, and so on, this is one of the important building blocks. As for the challenges: I think the video mentioned there were just a few parameters; there are many challenges, one of them being doing this well in very high dimensions, and there's been a lot of progress on that recently. So this is still an area of research; it's very new and it's very hard. There'll be lots of course projects based on Bayesian optimization, I imagine; certainly, if you come to office hours I'm happy to discuss some of these with you, and hopefully we've covered the techniques early enough that we can get started with the projects if you already want to explore this direction. All right, so on Tuesday, I mean on Thursday, this is what I'm going to do: I'm going to go over how to use...
Info
Channel: Nando de Freitas
Views: 71,968
Rating: 4.9700375 out of 5
Keywords: nando de freitas, machine learning, empirical Bayes, Bayes, Gaussian processes, nonlinear regression, confidence intervals, active learning, Bayesian optimization, animation, graphics, automatic configuration, prediction
Id: MfHKW5z-OOA
Length: 77min 28sec (4648 seconds)
Published: Tue Feb 05 2013