Gaussian Processes 1 - Philipp Hennig - MLSS 2013 Tübingen

Captions
So, thank you all very much for being back. It's Monday afternoon, you've all had lunch, you're all in your food coma, so settle in for something that's hopefully going to be easier on your brains than some of the previous lectures — maybe elementary, but not necessarily simple. Peter and Francesco, the practical teachers, actually told me that they've unearthed a few surprising holes in some of your knowledge of linear regression, so let's see whether we can fix those today. Much of what I'm talking about is going to be something you've heard before in other parts of the summer school, but typically in a very different mindset, seen from a very different direction, and I'm curious to see whether you can catch the connections.

The entire next hour or two will be about this guy and about this curve. This is what our German money used to look like before the monetary union in Europe: a ten-Deutsche-Mark note showing Carl Friedrich Gauss, and on this banknote you can see the famous bell curve that you all know about. This is what the one-dimensional form looks like: a probability distribution over a variable x, which has a mean and a variance (or a standard deviation); the mean is usually called mu and the standard deviation is usually called sigma. Over the course of the past week, almost every single course mentioned the Gaussian distribution at some point — I think the only exception was Bernhard last week. Everyone else mentioned Gaussians, whether they're Bayesians or frequentists. Why is that? What's so important about the Gaussian distribution? Why does it show up so much in this field of mathematics, or computer science, or whatever we want to call it, even though you hardly ever hear about it in real life or in school?

Here's the multivariate version, which you've also heard about several times now. Just to be very clear, it's a very straightforward way of turning the one-dimensional distribution you just saw into a multi-dimensional distribution; here it's the two-dimensional version, which is really the only interesting one to plot — apart from Ralf, very few people care that much about the one-dimensional one. It's almost exactly the same as before: there's a normalization constant, and then there's the business end. There is the variable we're interested in, x, which in this case is a two-dimensional vector; there is a mean, which is now a vector, and that mean is the center of the distribution, where the density of probability mass is highest; and there is a slightly more complicated object called the covariance matrix, whose inverse shows up in this multivariate version. For a scalar you can just write a fraction; for matrices you have to be more careful. This matrix has to be positive semi-definite — you've also heard that term a lot before — meaning it's a symmetric (Hermitian) matrix all of whose eigenvalues are non-negative. The covariance matrix determines the shape of this cigar-shaped object: if it's a unit matrix, the contours are circles, and if it has larger off-diagonal terms, you get elongated ellipses, which you can rotate around by changing the entries of that matrix.
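For reference, here are the two densities being described, written out in LaTeX. These are the standard formulas; the notation is my own choice rather than copied from the slide.

\mathcal{N}(x;\mu,\sigma^2) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

\mathcal{N}(x;\mu,\Sigma) \;=\; \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right),
\qquad x,\mu\in\mathbb{R}^{D},\;\; \Sigma\in\mathbb{R}^{D\times D}\ \text{positive semi-definite.}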
So, again: why is this distribution so important? Why do we constantly hear about it? Great answer — she's just taken away all my answers, fair enough. Usually people then say: because most things are sort of Gaussian distributed. That's completely wrong. Here is just one data set out of uncountably many in the real world: this is me saying "machine learning summer school", and this is a histogram of the amplitudes of that recording. The black line here doesn't look anything like a Gaussian, but it is a distribution that has a center somewhere, has some tails, and is sort of unimodal, so there are various ways of approximating it with Gaussians; here are two, and you can come up with others. We've already heard about various ways of approximating one distribution with another, in terms of Kullback-Leibler divergences and other objects. Nevertheless, I think almost nothing in the world is actually Gaussian distributed.

Usually at this point somebody says: but there is one thing that is Gaussian distributed, and that's the sum of iid random variables. That's true — that's what the central limit theorem says — though it's only exactly true in the limit. Personally I think it's actually the other way around: it's not that Gaussians are important because we have the central limit theorem; the central limit theorem is important because we use Gaussians for everything. And why do we use Gaussians for everything? You're exactly right: because they make computations extremely easy. That's the only reason I can really come up with for why we use these distributions all over the place. It's the exact same reason why in physics, and in other parts of quantified science, we use linear functions everywhere: because we know how to do linear algebra. Almost nothing in the real world is a linear function — not even in classical mechanics, where acceleration (the change of velocity) is a linear function of force; even there, if you look closely enough, nothing is actually linear. Gaussians are similar: nothing in the real world is really Gaussian, but most things are sort of almost Gaussian, and if you assume things are Gaussian, then a lot of computations become extremely easy, and the computations you have to do are linear algebra computations.

Here are some of them. One of the most amazing properties of Gaussians — Ralf has already mentioned this, Chris has already mentioned it, and Zoubin has already mentioned it — is that the product of two Gaussians is itself a Gaussian: the Gaussian exponential family is closed under multiplication. If you have one Gaussian distribution, here in green, and another Gaussian distribution, which I'm plotting in blue, and you literally multiply the densities at every single point, then what comes out is a new distribution which, as it happens, is itself a Gaussian, times a normalizing constant — and the normalizing constant, as it happens, also has the algebraic form of a Gaussian. What do the parameters of this new distribution look like? Well, there's a new mean and a new covariance.
The new covariance is literally just a linear algebra computation: you take the covariance matrix of the first Gaussian and invert it — that's the precision matrix that was talked about before; you take the covariance of the other Gaussian and invert it — that's the other precision matrix; and you just add those two precisions. That gives you the new precision, the inverse covariance, so if you invert it again, you get the new covariance matrix. The new mean is a different computation, but it involves the same kinds of objects: inverses of matrices, sums of vectors multiplied by matrices. That's linear algebra.

Another great property of Gaussians is that they are closed under linear projections. If you take a Gaussian distribution like this one, decide on a particular linear subspace — a matrix A through which you're going to map this Gaussian — and project the entire distribution down to the lower-dimensional subspace defined by that matrix, then you get a new distribution which is again Gaussian, and this new distribution has parameters which are again simple linear algebra computations on the original ones: you multiply with the linear map from one side for the mean, and from both sides for the new covariance.

It's not just projection in general, by the way; we can choose a particularly convenient projection, the one onto one of the coordinate axes. That's called marginalization — getting rid of the other variables — and then this simple formula becomes even simpler. If you have a joint distribution over various variables x and y, maybe z and a, b, c and so on, with a complicated mean vector and a complicated covariance matrix, but all you actually care about is the variable x, then probability theory says you have to marginalize over all the other variables. For Gaussians that's extremely easy — so easy it's almost difficult to write down: you just select the element of the mean that corresponds to the variable you're interested in, and from the covariance matrix you select the sub-matrix that corresponds to that variable. That's an amazing property of Gaussians, because it means that if you have a billion variables you might possibly be interested in — like all these players on Xbox — but at the moment you only care about one of those players, one person, then you don't need to spend computation of cost one billion to get an answer out; you spend computation of cost one. You just select that one element from the mean and that one entry from the covariance. It's also a horrible property, because it means that if you have a billion objects that all have something to do with each other, and every single one of them tells you something about the other ones, but you only care about one of them, then by using a Gaussian model you're throwing away all those relationships to the other variables, precisely because you only care about this particular single one.

What's also nice about this is that it is one of the two basic rules of probability: in this Gaussian kind of algebra we can encode the sum rule in very simple linear algebra terms — in fact in a trivial linear algebra term, which is just the selection of individual elements.
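As a reference, here are the closure properties just described, written out in LaTeX. These are standard identities; the symbols are my own choices, not necessarily those on the slides.

\mathcal{N}(x;a,A)\,\mathcal{N}(x;b,B) \;=\; \mathcal{N}(x;c,C)\;\mathcal{N}(a;b,A+B),
\qquad C = \left(A^{-1}+B^{-1}\right)^{-1},\quad c = C\left(A^{-1}a + B^{-1}b\right)

p(x)=\mathcal{N}(x;\mu,\Sigma) \;\;\Longrightarrow\;\; p(Ax)=\mathcal{N}\!\left(Ax;\,A\mu,\,A\Sigma A^{\top}\right)

p(x,y)=\mathcal{N}\!\left(\begin{bmatrix}x\\y\end{bmatrix};\begin{bmatrix}\mu_x\\\mu_y\end{bmatrix},\begin{bmatrix}\Sigma_{xx}&\Sigma_{xy}\\\Sigma_{yx}&\Sigma_{yy}\end{bmatrix}\right)
\;\;\Longrightarrow\;\; p(x)=\mathcal{N}(x;\mu_x,\Sigma_{xx})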
What's the other basic rule of probability, beyond the sum rule? The product rule — wonderful. The product rule also preserves Gaussianity; that's the final great property of Gaussians. If you have a Gaussian distribution over variables x and y, and you actually care about the value of x given a particular value of y, then you apply the product rule and you get out a new Gaussian conditional distribution over that variable. So Gaussians are closed under linear projection — that's what this red line is — but also, if you make a cut through the distribution — that's what conditioning is — you again get a Gaussian distribution, and that Gaussian has new parameters which are, lo and behold, the result of some linear algebra computation. So all you need to know to compute with Gaussians is linear algebra, and I hope all of you have had linear algebra, maybe in high school or at least as an elementary course at university. All you need to do is invert some parts of the covariance, multiply with some matrices, and subtract some means from observations. I'll talk about this equation more as we progress.

Okay, so what does that mean? Here's the most basic plot — Zoubin has already talked about this, Chris has already talked about this, but let me repeat it: we can do closed-form probabilistic inference within the Gaussian algebra, within the set of Gaussian distributions. Let's say — and I'm going to make up yet another silly example application for machine learning methods — I'm going to the shop, and there are two kinds of things I would like to buy: butter and milk. You don't know what these things actually cost; they're measured in some weird form of currency, and you don't really know how much a packet of butter costs in Germany, or how much a litre of milk costs. Maybe you have some rough initial guess of what their possible values might be: you might say a litre of milk costs something like a euro, and a packet of butter maybe costs half a euro, 50 cents — all the Germans in the room will say that doesn't really work out anymore, but okay. You're also uncertain about what the values could be. Ideally you'd like to use a distribution which encodes all the complicated information you have about these things — you know that everything has a positive price, nothing has a negative price — but you don't really know how to do that, and it's a complicated computation, so instead you just use a Gaussian. This sounds like a drastic simplification, but it's actually what we do in machine learning algorithms all the time: we just use Gaussian distributions even though they are not really what we believe. So you could say: I'll use a Gaussian; I don't actually know either of these two prices, and I assume they're completely separate objects, so I have a diagonal prior covariance over the price of these two objects, with an error bar of something like three euros to either side — relatively uncertain. Now I go to the shop and come back with one litre of milk and 600 grams of butter, and I tell you: I actually forgot what the prices of the individual things were, but I paid 6 euros — the sum of those two things was 6 euros. Now, what do we actually know about the individual prices?
Well, it's just probabilistic inference. Here's a likelihood: I observe that there is a variable y which is 6 euros, and that variable is the sum of the two objects I just bought — well, the objects have units, so you have to say how much of each you bought: one litre of milk and 0.6 kilograms, 600 grams, of butter. And I've also forgotten a little bit: I don't really know whether it was exactly 6 euros; maybe it was 5.80, maybe 6.20, so it's 6 euros plus or minus something. That's what this distribution looks like: it's also a Gaussian distribution, but a rank-deficient one — a distribution that's only determined along the subspace orthogonal to this line; along this direction I really don't know anything.

Now Bayes' rule says: to get a posterior distribution encoding how much you know — according to this weird, made-up model of probability — about the price of milk and butter, you start with your prior over these two variables, that's p(x), this big Gaussian up here; then you multiply by the likelihood, which is also a Gaussian, and the product of two Gaussians is a Gaussian; and then you normalize by the normalization constant, which is a very simple constant, because it's exactly the Gaussian term that was on the slide two slides ago. What comes out is a posterior distribution which is itself a Gaussian — this red object up here — and it's literally just multiplying the blue likelihood with the green prior to get the posterior. That's what Bayesian inference is; that's the entire idea, and that's what the computation does. We just plug in all this linear algebra — really just inverting some matrices, which is straightforward for two-by-two matrices, so nobody worries about that — and out comes a posterior which, interestingly, has posterior correlation. We started out assuming these two things are entirely independent of each other; once we've observed a weighted sum of the two prices, the posterior distribution is correlated, because it could be that milk is actually quite expensive, but if that's the case then butter has to be quite cheap — or it could be that butter is quite expensive, but then milk has to be quite cheap — because I told you something about the sum of the two.

This is called explaining away, and it's the V-structure that Chris Bishop talked about in graphical models. Is this completely obvious to everyone? No one's actually nodding. So remember those graphical models from before: this is the price of milk, this is the price of butter, and I've observed the variable y, which is a weighted sum of the two. If you observe the child in a graphical model with a V-structure — two arrows pointing in — then the parents of that node become correlated in the posterior. That's exactly what's happening here. So if I now told you, you know what, milk was actually quite expensive, I paid five euros for the milk, then you would suddenly know that the butter had to be quite cheap. Right — yes, you're right, that's definitely correct. Okay, so that is Gaussian algebra; that's everything I'm going to mention about this.
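Here is a small MATLAB sketch of this two-variable example, written by me to make the arithmetic concrete. The prior means, prior error bar, and amounts are taken from the spoken example; the observation noise of 20 cents is my own assumption, standing in for "maybe 5.80, maybe 6.20".

% Gaussian inference for the milk-and-butter example (a sketch, not the slide code).
mu    = [1.0; 0.5];            % prior mean over [price of milk; price of butter], in euros
Sigma = 3^2 * eye(2);          % diagonal prior covariance: independent, ~3 euros uncertainty

a      = [1.0; 0.6];           % amounts bought: 1 litre of milk, 0.6 kg of butter
y      = 6.0;                  % observed total price
sigma2 = 0.2^2;                % assumed observation noise variance (not from the slides)

% Gaussian conditioning, i.e. Bayes' rule in closed form:
G       = a' * Sigma * a + sigma2;      % predicted variance of the total price
gain    = Sigma * a / G;                % 2x1 "gain" vector
mu_post = mu + gain * (y - a' * mu);    % posterior mean over the two prices
S_post  = Sigma - gain * a' * Sigma;    % posterior covariance: now negatively correlated

disp(mu_post); disp(S_post);

The off-diagonal entries of S_post come out negative, which is exactly the explaining-away effect described above: if milk turns out to be expensive, butter has to be cheap.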
The whole point — my whole point — is this: the main reason to use Gaussians is that all computations involving Gaussians are linear algebra computations. They make everything really easy, and they make all the theory you need very nice and elegant, because we know so much about linear algebra. They are not necessarily very good models for the real world, and in fact we're going to spend at least the next hour thinking about what a good model for the real world actually is, how we can get away with using Gaussian distributions and still get reasonably good models out, and how much we have to think about these models to make sure they actually are good models.

So let's talk about a slightly more realistic problem in the real world, one that you've already seen before — you've done your practicals on it, and you've seen it over and over in your undergraduate courses and maybe in high school: linear regression. I always have trouble coming up with good explanations for the examples up here, but Stefan Schaal talked about this, so let's say this is really data from an experiment with a robot, and we're trying to identify the dynamics of the robot. Say there's only one axis we're interested in, one joint of the robot, and what I've plotted here is this: we've applied certain amounts of torque to this joint — some negative torque, some positive torque, that's acceleration in one direction or the other, the arm moving up and down — and for each of these torques we've measured the acceleration. How do you measure these accelerations? I actually don't know; you should ask some of the people from Stefan Schaal's department. I think there are sensors in these robots that measure acceleration, and these sensors aren't perfect: they make little errors when they measure. As a good scientist, you draw these nice little error bars around your measurements. What does it actually mean to draw an error bar around a location? For most people, it means they are assuming there is Gaussian observation noise on the values they've observed. The real value at this point is some true function value f(x), but you don't get to see what that function value is; what you get to see is an observation y, and that observation is the true function value plus some random variable epsilon, and that epsilon comes from a Gaussian distribution with zero mean and some small variance sigma squared, and we assume they are all drawn independently. So there are lots of assumptions going on here. Another way of writing this probabilistically is as a likelihood: the probability of the observation, given the function value and given this variable sigma, is a Gaussian over y centered at f(x) with this variance. This is actually a big modelling assumption — something we've already talked about: the likelihood here is a massive assumption. We're assuming Gaussian noise, independent across observations, with a particular variance which I assume I know. But let's just say we know all of these things.
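In symbols, the observation model being assumed here is the standard one (written out by me for reference):

y_i \;=\; f(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0,\sigma^2)\ \text{i.i.d.}
\quad\Longleftrightarrow\quad
p(y \mid f, \sigma) \;=\; \prod_{i=1}^{n}\mathcal{N}\!\left(y_i;\, f(x_i),\, \sigma^{2}\right)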
Now of course the question you're asking is: there's something going on here, some regularity behind this problem; there must be a function explaining what's going on, and maybe it's a particularly simple function — a linear function, a straight line that has an intercept and a slope. How can we write functions like that? Linear functions look like this: the function f(x) is a constant, which I'll call w1, plus a linear term with another constant, call it w2, times x. Does that sound like a linear function to you? There's very little nodding going on — a little bit more nodding would be helpful. Okay, good.

Now, probabilistic inference says: I would like to know what these two variables w1 and w2 are, to get a good fit, a good explanation of what this function looks like. To do that in a probabilistic sense, I only need two things: a likelihood and a prior. First I'll write down my prior over these two weights. Because everything is nice with Gaussians, I'm going to use a Gaussian — in fact a particularly simple Gaussian: the mean is zero and the covariance is just a constant times the unit matrix, and in this particular case the constant is 1. This is what that prior looks like. The space of the weights is a two-dimensional space, because there are two weights, w1 and w2. This axis is w1: if a point is far over here, it's a function which sits high up on the y-axis. And this axis is w2: if a point is high up here, it's a function which is particularly steep, and if it's further down, it's a function that's steep in the other direction, falling very quickly.

Now, what are the values of this function at various points? This is a description of the parameters that describe the function — what are the actual function values? Well, Gaussians are closed under linear maps, so if I can come up with a linear map that describes this function in terms of the parameters w, then I get a Gaussian belief over the function value at every point. Here is such a linear map, and it's a very simple one: it's a vector that just contains the number 1 and the variable x at every point. You can call it a feature function if you like, but it's just a linear map. So another way of writing the function up here is: f(x) is the vector phi(x) times the weight vector w — an inner product; you've already heard lots about inner products from Bernhard earlier on. This is my maybe slightly special notation: this is how I write the value of the feature function phi at location little x. Okay, so hopefully it's clear that this is the same thing — an inner product is just a sum.
So, because Gaussians are closed under linear transformations, if I have a Gaussian prior over the weights — the parameters w that describe the function — then I also have a Gaussian belief over the function values at every point. And this Gaussian belief is obtained by literally applying the rule for closure under linear maps that I had on the second slide or so: the function values are a linear map of w, phi times w, and the mean of the mapped Gaussian is simply the projection applied to the prior mean, while the projection is applied on either side of the covariance matrix. In fact, I can do this for all x at the same time: instead of writing it down point by point, I can write down a matrix, capital Phi, which consists of phi(x1), phi(x2), phi(x3), phi(x4) and so on, for lots and lots of x — I've chosen x to be an entire set of many points. Then the rule still applies; it's just that the sizes of these objects are a little more complicated: this is now a vector whose length is the number of points at which I evaluate the function, and this is a matrix of size (number of points) by (number of points) — a two-dimensional matrix in the middle, with this huge projection matrix Phi on one end and its transpose on the other.

The whole thing that comes out is a matrix of size, let's call it n by n — but what is its rank? Say it out loud: not n. It's an outer product, and there's a rule for rank which says that the rank of a product AB is at most the minimum of the ranks of A and B. So the rank of this object is actually two: it's a rank-2 object, which is saying that this complicated distribution actually has just two degrees of freedom — because that's what a linear function is: whatever the answer is, it has to be a function that goes up and down like a straight line.

Here are three samples from this prior — three samples of the weights. This one here is a function which is relatively steep, because it sits over here, and quite far up, because it's over here; that's this function here. This one is a function which is even steeper, but it's closer to the origin — it passes through the origin — that's this function here. And this one over here is a function which actually has a negative slope and sits a little below the origin — that's this function here. The joint distribution over all of these function values is this big n-by-n Gaussian, which has two degrees of freedom, and what you see here in greenish is the probability density at every single point for the marginal distribution: at every point on the x-axis I've asked what the marginal distribution over its individual function value is. That marginal is a one-dimensional Gaussian and is very easy to evaluate: you just take the element of the mean vector that corresponds to that function value and the diagonal term of this big matrix at that point, and evaluate at each point; that gives me this little picture here. Now, I've drawn these three samples here, but I could obviously have drawn lots of other samples from this Gaussian — I could just have repeated this over and over again and would have gotten different samples.
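To summarize the construction up to this point in symbols (my notation, following the verbal description; the slide itself is not reproduced in the captions):

\phi(x) = \begin{bmatrix}1\\ x\end{bmatrix}, \qquad
f(x) = \phi(x)^{\top} w, \qquad
p(w) = \mathcal{N}(w;\,\mu,\Sigma)\ \ \text{(here } \mu = 0,\ \Sigma = I\text{)}

\Phi_X = [\phi(x_1),\dots,\phi(x_n)] \in \mathbb{R}^{F\times n}
\;\;\Longrightarrow\;\;
p(f_X) = \mathcal{N}\!\left(f_X;\; \Phi_X^{\top}\mu,\; \Phi_X^{\top}\Sigma\,\Phi_X\right),

a Gaussian over n function values whose covariance matrix has rank at most F = 2.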
In fact, it actually doesn't really matter where along one contour line of this distribution I draw my samples. I could also have rotated this entire picture around by 45 degrees or so, and then I would have gotten different functions — but from the point of view of the prior they would be completely equivalent: they would have exactly the same probability density. Those three samples, and a 45-degree-rotated picture of three more samples, have exactly the same probability. So I might as well consider all of the samples along a circle like this, and that's what I'm doing here. For the rest of this talk I'm going to show you plots like this, with animated samples running around a circle. Unfortunately, from the next plot onwards I'm not going to be able to show you this weight-space plot anymore, because we're going to have more than two parameters, but what you can still see is the samples from the distribution. The point of these plots is that every single frame of the video has exactly the same probability density as every other frame, so they are all equivalent under the prior (or posterior) distribution.

What does the posterior distribution look like? Well, we go back four slides, look at what posteriors of Gaussians look like, and compute them. This is the posterior distribution over the weights: a new distribution over w which is the product of this green prior with the likelihood, which is still blue — you may notice that in all the plots the prior is always green, the likelihood is always blue, and the posterior is always red. This likelihood is actually a product of lots of different Gaussians: every single one of these data points is like one of those Bayesian inference slides I had previously. For every individual point — let's take this one: if x is 0, then the function value is typically large, so in here there is one degenerate Gaussian "sausage" lying up here; and then there's another one which says that if x is quite large, then y also has to be quite large, so there's another Gaussian sausage over here which looks like this. I have, I think, 20 of those; multiply them all together, and this is what the posterior looks like. The posterior has this exact algebraic form: this is the mean, slightly complicated, and this is the posterior covariance — there's actually a typo in there; this bit shouldn't be there, so as written it doesn't quite make sense. That's the posterior distribution over the weights, this little thing up here; you notice that the weights are correlated, just as in the previous plot. And this is the posterior over the function values. How do I get that from this distribution? Gaussians are closed under linear projections: I literally just multiply the mean from the left with the projection, and I multiply the covariance on the left and the right with the projections, to get the posterior distribution. So let me write this down; I might need some space here on the blackboard.
Having observed function values Y at locations capital X, the posterior distribution over the function values at some test locations little x is a Gaussian with the following mean: it's the feature function at the locations I'm interested in — lots of points along the x-axis — times the prior mean vector, plus the same feature function applied to the prior covariance, times the observed features, times the inverse of an object I'll come back to (I'm literally just writing down something that was on previous slides, plus a little bit of noise), applied to the observations Y minus this kind of projection. And the posterior covariance is this object minus a similar construction. All of this is just linear algebra: multiplying vectors with matrices, inverting them, multiplying again.

Here's a way of thinking about what's going on. Think about the mean first. This object here is what I've observed — the function values Y — minus what I would have predicted under the prior, the mean value for these function values under the prior distribution. That's what we thought the function values should be; this is what we observed; so this is a kind of residual: how far you are away from your best prediction, how surprised you are. And how surprised you are depends on how much variation you actually expected in these function values — that's what this thing encodes: the prior covariance over those function values, the scale on which we expected function values to vary. So we take the residual and normalize by how big we expected those distances to be, and then we multiply this object — what we've observed — with this object over here, which is the covariance between the things we have seen and the things we're actually interested in: we've observed some function values, and we care about some other function values. The posterior covariance is a little more difficult to understand; I don't really have a similarly nice interpretation of it, but: this is the uncertainty you had over the function values before you got to see anything, and now you're becoming more certain, so you're subtracting something — the error bars shrink. By how much do they shrink? By an interesting algebraic object, an outer product, which in the center has how uncertain you were about the observed function values in the first place, and projections to the left and right which correspond to the covariance between the observations and the things you're actually interested in.
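Written out (my transcription of the blackboard formula, using Phi_X and Phi_x for the stacked features of the training and test locations, and sigma^2 for the observation noise):

p\!\left(f_x \mid Y, X\right) \;=\; \mathcal{N}\Big(f_x;\;
\Phi_x^{\top}\mu \;+\; \Phi_x^{\top}\Sigma\,\Phi_X\left(\Phi_X^{\top}\Sigma\,\Phi_X + \sigma^{2} I\right)^{-1}\!\left(Y - \Phi_X^{\top}\mu\right),\;
\Phi_x^{\top}\Sigma\,\Phi_x \;-\; \Phi_x^{\top}\Sigma\,\Phi_X\left(\Phi_X^{\top}\Sigma\,\Phi_X + \sigma^{2} I\right)^{-1}\!\Phi_X^{\top}\Sigma\,\Phi_x\Big)

Here Y - \Phi_X^{\top}\mu is the residual relative to the prior prediction, \Phi_X^{\top}\Sigma\,\Phi_X + \sigma^{2} I is the scale of variation expected in the observations, and \Phi_x^{\top}\Sigma\,\Phi_X is the covariance between the values we care about and the values we observed — exactly the three ingredients described above.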
Okay, so we have two different matrices here. Phi of little x contains the features at all the locations little x that I'm interested in — in this particular case I've chosen little x to be a lot of points along the x-axis, I think 150 of them, so many that you don't actually see them anymore — and capital X contains the locations at which we've actually seen function values. So if you've seen n function values and you're interested in the function values at m locations, you might write Phi of little x as an m-by-F matrix, where F is the number of features — in this case two, the feature 1 and the feature x — and Phi of capital X as an n-by-F matrix, which could be a reasonably large matrix. But in my notation it's actually the other way around, with a transpose: F by m and F by n. So Phi of capital X is a 2-by-n matrix, Phi of little x is a 2-by-m matrix with m much larger, and objects like Phi_x-transpose Sigma Phi_X are m by n. A bit of nodding, a bit of sleeping — yes?

So, the variables Y: those are actual numbers, the actual observations we've made, at locations that I call capital X. These are literally capital X1, X2, X3, X4, X5 and so on, and this is y1, y2, y3, y4, y5, y6; I stack all the Y's into one vector and all the X's into another vector. Actually, the X's I don't even care about at all: the X's don't show up in the final computation; all I care about is the features of those x's, and each of these x's has two features: the feature 1, which is just a constant, and the feature x, which is just x itself. So at each of these points I get a two-dimensional vector, and I stack all of those up; that gives the feature matrix for the data. For the predictions I need similar objects, just of size 2 by m. Once I have these matrices, I just do linear algebra.

Let's see what that actually looks like in code. This is MATLAB code — if you don't like MATLAB, you get to choose your favorite language — and this is the entire code for the plot that you saw on the previous slide, apart from the animation; so there's something for you to try out afterwards. Here is literally the entire computation, and here's what's happening. I define the prior on the function values. First of all, I have to define the feature functions; this is maybe the most complicated line, and if you don't know MATLAB it's going to be a bit painful, so let me briefly take two minutes and explain it. I decided I'm going to use two features, F is 2, and those features are the feature 1 and the feature x. That's like taking x to the 0th power, because x^0 is 1, and taking x to the first power, because x^1 is x. And there's a cool, under-appreciated function in MATLAB called bsxfun — binary singleton expansion function is, I think, the abbreviation. What it does is this: this here is a lambda-calculus-style anonymous function in MATLAB, a bit of functional programming. It's a function phi which takes in a variable a; that variable is a vector, of length n or m or whatever. And bsxfun takes a second input, this object here, which is a MATLAB array containing the numbers 0, 1, 2, 3, up to F minus 1 — in this particular case F is 2, so it's literally just 0 and 1. It then takes the vector and applies the 0th power and the first power to every single element, and stacks the results up into a matrix. So what comes out is a matrix of size length(a) times the length of that array — in this case an n-by-2 or m-by-2 matrix. That's the hardest part; everything else is easy.
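Spelled out, the feature line being described is essentially the following (my reconstruction from the spoken description; the variable names are assumptions):

F   = 2;                                 % number of features: the constant and the linear term
phi = @(a) bsxfun(@power, a, 0:F-1);     % maps a column vector a to [a.^0, a.^1, ..., a.^(F-1)]
% example: phi((1:3)') returns [1 1; 1 2; 1 3], i.e. one row [1, x] per input point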
This is the prior mean of the weights — we had it on the previous plots, it's this green object down here — and that prior mean is 0, so it's just a vector containing zeros. Then there's the prior covariance, which is just the identity matrix — that's just MATLAB's way of writing a 1 in two dimensions. That's it for the prior over the weights. Now we just do linear algebra: we construct the prior on the function values. These first four lines up here give this green object; now we construct this green object over here. For that we need to decide how many points we're interested in — I'm going to take 100, distributed evenly across the space from minus 6 to 6. Then I call the feature function and say: give me the features for all these points; that's a 100-by-2 matrix. Now I can use this object to construct all the parameters of the prior distribution. That prior distribution has a mean, which I call m, which is the feature map applied to the prior mean, and it has a prior covariance, which I call kxx because it's the covariance between little x and little x; that's literally just this inner part, where Sigma is my prior covariance, sandwiched between a long, skinny matrix and a fat, short matrix. It's a rank-2 object of size n by n.

Now I also need to draw samples, because I wanted to make the plot with these green lines — well, this one, which isn't animated. How do you draw from a Gaussian distribution? I'm not going to go into details; if you want to know more about how to draw from a Gaussian, ask me after the talk — I have slides for that as well, but they don't fit in here for time reasons. (What is the time? Good.) I'll just tell you what to do. Say we want to draw from a normal distribution over x with some mean mu and covariance Sigma — and let's see, what letter haven't I used so far? n, m, F are all taken... okay. First you draw independent Gaussian variables; for that there is code, very smart and efficient algorithms, and you don't really want to think about how they work, because they are numerically quite tricky. That's what this line does: it's a random matrix which just contains iid standard normal variables, of size n by 3, because I want to draw three functions. Call that matrix u — a random draw from a Gaussian with zero mean and unit variance. Then you take this u and you scale it by the Cholesky decomposition of Sigma, transposed — again, don't ask me why now; ask me afterwards. That's what I do here: I take the prior covariance kxx, and I actually add something slightly odd, a tiny bit of jitter to the diagonal. This is something people just need to do to keep the algorithm stable, because this kxx is actually of rank 2, and the Cholesky decomposition doesn't work on rank-deficient matrices. I'm putting this in here not to confuse you, but so that you can actually copy this code afterwards and make it work — otherwise you'd come back to me and say it doesn't work, there's a bug in there. So it's an annoying technicality that you really shouldn't care about. I take this Cholesky factor, apply it to the vector u, and then I just add the mean.
So what's happening here is: I draw random points from a circular Gaussian, then I transform that circular Gaussian into a correlated Gaussian — that's what applying the Cholesky factor does — and then I add a constant term, the mean, to get the distribution I want; these are the coordinate axes. Now we have samples from a Gaussian distribution with the correct mean and the correct covariance. And again I do that for all the dimensions: the mean is a vector and I need to add it to every single column of this matrix, which is why I use bsxfun again. I also want the error bars at every point, so I just take the diagonal of the covariance matrix — that's the marginal variance — and take the square root of it. Done.

So far there was no data. Notice that I haven't loaded any data yet — that's what the prior is: you write it down before you get to see the data. Now we get to see some data; that means I just load it from somewhere. To finish our computations we need all the objects in this algebraic construction. We need Phi of capital X; I construct that by just calling my feature function phi. That can then be used to construct the prior mean over the function values at the locations capital X — this — and the prior covariance of the function values at locations capital X — this. Then I use these to construct this matrix here — which is missing an inverse in my copy — the matrix I call G, and I store it; the computations become easier afterwards if I already construct the Cholesky decomposition of G. You don't need to understand that at this point; it just makes the computation faster. Then I need these objects here: this is the covariance between the function values at little x and the function values at capital X, so it's called k little-x capital-X, and again it's just multiplying matrices and vectors. And then I just evaluate these expressions; to do that, I store things in just the right way so that I don't have to recompute them all over again. The rest is just multiplication, and then there are a few lines of plotting missing down here, which you'll have to come up with yourself. So that's the entire thing — that's Gaussian inference for Gaussian regression.
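Here is a sketch of the code just described, reconstructed by me from the spoken walkthrough rather than copied from the slide; the variable names, the jitter value, and the data file are my assumptions.

% Bayesian linear regression with explicit features (a reconstruction, not the slide code).
F     = 2;                                   % number of features
phi   = @(a) bsxfun(@power, a, 0:F-1);       % feature function: [1, x]
mu    = zeros(F, 1);                         % prior mean over the weights
Sigma = eye(F);                              % prior covariance over the weights

% prior over function values at n test locations
n    = 100;
x    = linspace(-6, 6, n)';                  % test locations
phix = phi(x);                               % n x F feature matrix
m    = phix * mu;                            % prior mean of the function values
kxx  = phix * Sigma * phix';                 % prior covariance (rank F)

% three joint samples from the prior, via a jittered Cholesky factor
u       = randn(n, 3);                       % iid standard-normal draws
L       = chol(kxx + 1e-8 * eye(n))';        % lower-triangular factor: L*L' ~ kxx
samples = bsxfun(@plus, L * u, m);           % correlated samples plus the mean
stdpi   = sqrt(diag(kxx));                   % prior marginal error bars

% the data: observed locations X, values Y, noise std sigma ("data.mat" is a placeholder)
load('data.mat', 'X', 'Y', 'sigma');
phiX = phi(X);                               % n_obs x F
M    = phiX * mu;                            % prior mean at the observed locations
kXX  = phiX * Sigma * phiX';                 % prior covariance at the observed locations
G    = kXX + sigma^2 * eye(numel(Y));        % covariance of the observations, incl. noise
R    = chol(G);                              % upper-triangular factor: R'*R = G
kxX  = phix * Sigma * phiX';                 % covariance between test and observed values

A     = kxX / R;                             % kxX * inv(R)
mpost = m + A * (R' \ (Y - M));              % posterior mean:       m   + kxX * G^-1 * (Y - M)
vpost = kxx - A * A';                        % posterior covariance: kxx - kxX * G^-1 * kxX'
stdpo = sqrt(diag(vpost));                   % posterior error bars
% plotting of m, stdpi, samples, mpost, stdpo omitted, as in the lecture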
And we don't just have to use it for simple problems like this linear regression; we can just as well apply it to slightly more complicated problems, like this nonlinear regression. Let's say we've done another experiment with a more complicated robot. That robot has all sorts of complicated effects: there's stiction — friction from staying still; I don't actually know another way of explaining that in English — and there are nonlinear effects: if you move over here it flattens out, and over here it becomes steep. That's a more realistic data set, and I think it's the kind of thing we see in machine learning problems all the time. Now what do we do? This is nonlinear, and I just said Gaussian distributions are for linear problems. Can we still work with this? Yes. And how? A mixture of Gaussians? Some nonlinear mapping? Mixtures of Gaussians actually make for very difficult inference; you could come up with some way of using them here — I'm not saying it's impossible, just algebraically very challenging. But the other idea is very good: let's use a nonlinear feature set.

So far I've just written the function as linear in the features. It's a linear function not just because phi contains linear features; it's also a linear function because it's linear in w — it's just a linear map. So maybe we can come up with other features phi which are nonlinear and describe a nonlinear function. What kind of features could we come up with? Polynomials — okay. That's saying that instead of using just 1 and x, we could also use x squared, x cubed, x to the fourth, x to the fifth, and so on. So what do we need to change in our code? Almost nothing — just this first line up here: we literally just set 2 to 3 or 4 or 5, and everything else stays exactly the same. On the next slide I'm going to show you various feature sets, and I'm just going to show this one line of code; everything else stays the same — I literally generated these plots using this code and changing this one line.

Here's what happens with polynomial features. I've moved from a feature function containing constant and linear features to quadratic and cubic features, and you see those features as the gray lines in this plot: there's a constant feature, the function that returns 1 everywhere; a linear feature, the function that returns x; a quadratic feature, the function that returns x squared; and a cubic feature, the function that returns x cubed. Samples from this distribution look like this, and this is what the posterior looks like. This is better, because it's a better description of the data — it captures some aspects of the data. There's no strict mathematical statement here; it's just somehow better, it gets closer to the true data than a linear function would. But it's not good. In particular, it's really certain that the function value should be down here somewhere, even though the measurements are very far away from that prediction. This is a feature of Gaussian inference: a Gaussian doesn't actually care about how far the observation is from the predicted point; it just becomes more certain with every single observation. This is something that has been mentioned by several previous speakers and maybe slid under the radar a little bit: Ralf this morning talked about becoming more and more certain with every game that someone plays in TrueSkill. There's some subtlety because they're using EP, but up to that, you're essentially becoming more and more certain in a deterministic way, independent of how surprised you are by what you observe. (With EP that's not quite correct — you actually stay a little bit more uncertain.)

Okay, so cubic features are not good. What should we use instead? (Sorry — am I moving way too fast?) Let's come up with a few more features. We will get to Gaussian kernels. Cosines? Cosines are good. Any other ideas? I was hoping somebody would say: why don't you use a seventh-order polynomial, or a tenth-order polynomial? We can do that. This is what the prior looks like — just more polynomials — and this is the posterior. You're getting a better description of the function, but it has really weird behavior in the corners; this thing just becomes really stiff. It's also still not very good.
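For concreteness, here are drop-in replacements for that single feature line, roughly one per feature family discussed in what follows. These are my own sketches of plausible definitions, not the code from the slides; the centres, frequencies and length scale are assumed.

% polynomial features up to cubic (constant, linear, quadratic, cubic)
phi_poly = @(a) bsxfun(@power, a, 0:3);

% Fourier features: a constant plus cosines and sines of increasing (assumed) frequencies
w        = 1:8;
phi_four = @(a) [cos(bsxfun(@times, a, [0, w])), sin(bsxfun(@times, a, w))];

% step features: one 0/1 step at each of a set of assumed thresholds
c        = linspace(-8, 8, 16);
phi_step = @(a) double(bsxfun(@gt, a, c));            % use 2*double(...)-1 for -1/+1 steps

% Gaussian-bump ("squared-exponential") features with assumed centres c and length scale
ell      = 1.5;
phi_rbf  = @(a) exp(-bsxfun(@minus, a, c).^2 / (2*ell^2));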
Okay, cosines — very good, let's try that. So here are cosine and sine features; again you see the features down here. What I've done is take eight features per family: there's a cosine with frequency 0, which is this constant function up here, then eight different cosines which you can see here, and then eight different sines. This is what the prior looks like — a very interesting kind of structure: the functions are smooth, and they move up and down in quite complicated ways — and this is what the posterior looks like. Hmm, maybe this is a bit better; I'm not sure. It's still not very good around here, but the behavior in the corners is maybe better. Also, if I let you enlarge the x-axis, you would notice that these are periodic functions, of course, because sines and cosines are periodic features: if you do Fourier regression, which is what this is, you get periodic functions out.

Okay, more features? Ah, okay, you're all over-educated. Let's try something crazy: let's try some step functions. Here are some step functions — and actually there are two different kinds. If you think about it, what actually is a step function: a function that goes from minus one to plus one, or a function that goes from zero to plus one? It sounds like a minor point, but it makes a major difference to what your functions look like. This is the case where I've chosen step functions which start at plus one and then drop to minus one — or, by symmetry, you can do it the other way around, starting at minus one and going to plus one; it doesn't matter, because the Gaussian prior is centered at zero, so it's symmetric under changing the sign. So in this case we have a whole set of these step feature functions, and this is what the prior looks like. Each of these feature functions is a little step. The gray lines are the feature functions, and they overlap, so it looks like a weird ladder, but actually there are just 16 different functions plotted on top of each other: there's one function that comes along here, drops down, and stays down forever; another that comes along here, drops down, and stays down; and so on. Now, interestingly, if I observe that the function value is large up here, then not only do I know that the function value has to be large somewhere here; it probably also has to be low somewhere over there — because it means that this feature is probably large, or maybe this one, or this one, and whichever feature it is, if I scale it up and give it a large weight, it's also going to contribute a big negative value over here. Those are the kinds of effects you get when you do parametric modeling that are not entirely obvious when you write down your model. So if I observe function values like these, out comes this kind of posterior. It looks really ugly, but it's actually really cool, because it can do useful things: it gets this step right here; it's really flexible; it moves up here; and it continues in an interesting way — it doesn't bend back down to zero (actually it does, but only very far away over here); it doesn't do something crazy like the polynomials did. And there are other kinds of step functions, too.
You could also use a feature that starts at zero and then moves up to one — so it doesn't start at minus one and move to plus one; it sits at zero and then jumps up to plus one. That means that if I observe a large function value here, then it likely comes from one of the later features, not from one of the earlier ones. It also means that there's an endpoint somewhere over here where I simply know the function value has to be zero, because there are no features over there anymore. Yes — no, I'll come to that later; that's a very good question. And there are yet more step functions you could use. Let's see what happens in this particular case: this line up there is straight — it just stays at its final value for all eternity; the uncertainty becomes broader and broader, but the mean stays at whatever that value is, forever. And over here it has to go back to zero, because there are no features there: there is no way to express that it's not zero, because there are simply no features that would allow me to do that.

By choosing your feature set, you're choosing a language in which you express your problem. If you choose a bad language, a bad encoding, you'll need a very complicated code to express what you're doing — if you think in terms of coding theory — and if you choose a language which is not fully expressive, which can't express every possible function, like these ones, then there are certain things you simply cannot express. That's parametric regression: there are only 16 different floating-point numbers in here that we're trying to learn, so the degrees of freedom of this model are 16 — but the previous model also had 16 degrees of freedom, and it gives a completely different answer. You have to choose your model to get a good answer out, and if you don't know anything about where these function values come from, how should you know how to choose the features?

Here's another set of features. I don't need to use features which are piecewise constant step functions; I could use features which are piecewise linear. These are functions which start over here; the first one bends over here, the second one looks like a V over here, then there's a V over here — they're all sort of stacked Vs, lots of features. Again, run the same code; out comes a new posterior, which looks like this: a very different kind of prediction, same number of degrees of freedom. You need to choose your features to get meaningful answers.

Okay, here's a crazy set of features, just to really drive home the point. Maybe you know, for some reason, that your function is only defined from minus 8 to 8 — maybe there's some kind of periodic boundary condition, or maybe there's a box in which this whole problem lives, because at either end the robot runs against a wall, I don't know. Then you could use feature sets which are only defined on a finite domain. In this particular case I've chosen some that are called Legendre polynomials — because I'm a physicist and I like these polynomials, and because MATLAB has them built in. This is what the prior looks like — these are all the polynomials up to degree 13 (and my battery is running out) — and this is what the posterior looks like: more complicated, and only defined from minus 8 to plus 8; there's no way of extending it beyond 8, that just becomes an ill-defined problem. Okay, yet more: I could say that maybe the features I'm using are little Eiffel Towers.
Yet more: maybe the features I am using are little Eiffel Towers — exponentials of a triangle function, if you like; Randall Munroe, the xkcd guy, claims that the Eiffel Tower in log space is a triangle. I am not sure that is true, but if it is, then these features are little Eiffel Towers. Looks kind of interesting — it is easy to bump somewhere — and this is what the posterior looks like; maybe that is quite interesting. Somebody said before: squared exponentials, or RBF functions, or little Gaussian functions, whatever you like. So you could also use those features. I am not saying they are the right choice; they are just features that people like to use because they think they are very good. It is just another set of features: every single grey function here looks like a little Gaussian function — not a Gaussian distribution, a Gaussian function, because they are not normalized — and there is no particular reason to prefer them over the other ones; I am just taking them because someone said I should. One thing people really like about them is that they are all smooth functions, which is particularly nice, and the posterior is then also made of smooth functions. That seems pretty cool: still 16 degrees of freedom, just like all the previous models, and I hope by now I have driven home the point that every single one of these four models looks entirely different from the previous ones. How should I choose between them? People usually don't — they just choose one. And that is essentially what half of you have already done in your practical on kernels, except that you haven't chosen features, you've chosen kernels. So what is this whole kernel business? We have half an hour left.

Just a few notes on the side first. I have shown one-dimensional plots because one dimension is easy to plot, but nobody says the input domain has to be one-dimensional. If you have a two-dimensional input domain, you just line up your features across a two-dimensional space — there are various ways of doing that; here are another 32 or so, or 64, of these little Gaussian bumps — and you get two-dimensional priors over functions. Nothing is different: exact same code, exact same computation. You could also turn it around: that was multiple inputs and one output, but you can also have multiple outputs and one input. Say we have one input, a variable t, time, running from 0 to 1 or so, and two different outputs — let's call them y1 and y2 so you don't get confused — and in each output dimension I have lined up lots of little Gaussian features. Then I can also draw from these 32 random variables; it is just a bunch of 32 variables that are jointly Gaussian distributed, and the code does not really change — you just have to be a little more careful with the matrix algebra. If I take this three-dimensional plot and rotate it so that you look at it from the top, so that you don't see the input variable anymore, then what this thing looks like is a curve. So this framework can be used to learn curves; there is no problem whatsoever.
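As a rough sketch of that curve construction (feature locations, widths and weights here are assumptions for illustration, not taken from the slides), in MATLAB:

% Two outputs y1(t), y2(t), one input t: each output is a weighted sum of Gaussian
% bump features with jointly Gaussian weights, so a random draw is a random curve.
F   = 32;  cs = linspace(0, 1, F);  ell = 0.05;                % assumed bump centres and width
phi = @(t) exp(-0.5 * bsxfun(@minus, t(:), cs).^2 / ell^2);    % N x F feature matrix
t   = linspace(0, 1, 400)';
w1  = randn(F, 1);  w2 = randn(F, 1);                          % Gaussian weights for each output
y1  = phi(t) * w1;  y2 = phi(t) * w2;
plot(y1, y2);                                                  % viewed 'from the top': a curve

Plotting y1 against y2, dropping the t axis, gives exactly the rotated top-down view described above.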
Good — so now, the kernel business. A lot of you have already seen this, so I am going to go through it relatively quickly; let's see how much we can get through in five minutes. This is the posterior distribution over feature function values that we had already: the posterior covariance is a matrix of size (number of points at which I care about the function value) by (number of points at which I care about the function value), and this is the posterior mean vector. Now, what people noticed long ago — and what you have by now heard several times, in particular from Bernhard — is that if you stare at this equation for long enough, you see that it does not actually contain any individual feature vectors as such. It only contains two kinds of objects. It contains this kind of object, the inner product of the prior mean over the weights with the feature function — sometimes called the mean function; it is just a function that returns a number at every input point, and it shows up twice. And it contains these objects, which are inner products between pairs of feature vectors — the term we have heard so often by now. In general they are inner products in a space scaled by this positive definite prior covariance matrix; for simplicity you could assume that the prior covariance is just a unit matrix, and then it is literally the standard inner product, phi transpose times phi. So we do not actually need to write down the feature function explicitly. All we need in the code is a function which, for pairs of inputs — lowercase x and capital X, x and x, or X and X — returns these inner-product objects. This function is called a kernel function, because it is the kernel of your code, if you like; there are various other ways of thinking about why it is called a kernel. And you don't actually need to change much in the code, because I had already written it in a way that essentially presupposes the existence of a kernel: when I wrote this little k_xx I called it a covariance, but of course it is exactly the kernel from x to x, and likewise the kernel from capital X to capital X and the kernel between lowercase x and capital X. So what I use here is a function that takes two inputs a and b, of length n and length m, and returns a matrix of size n by m — or takes two inputs of size n and returns a matrix of size n by n. What changes in the code is mostly two lines up here, where I define this covariance function, this kernel function k; after that I just replace wherever I previously called feature functions, and I can even simplify the code a little to make it shorter. Up here I have defined this kernel function, which simply takes the two feature functions and computes their inner product — other than that I haven't changed much. So the plots I showed you before were, if you like, already kernel regression or Gaussian process regression plots; they were just parametric ones, with finitely many features.
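Here is a minimal sketch of that kernelised version (illustrative only; kernel parameters, noise level and data are assumptions, not the lecture's code). The only model-specific objects the regression needs are a mean function m and a kernel k returning the matrix of inner products for two sets of inputs:

% Kernelised Gaussian regression: everything is expressed through m(.) and k(.,.).
ell = 1.0;  theta = 1.0;                                                     % assumed kernel parameters
k   = @(A, B) theta^2 * exp(-0.5 * bsxfun(@minus, A(:), B(:)').^2 / ell^2);  % an RBF kernel, n x m output
m   = @(A) zeros(numel(A), 1);                                               % zero prior mean function

X = [-4; 0; 3];  y = [1; 0.5; -1];  sigma = 0.1;                             % toy data, assumed noise
xs    = linspace(-8, 8, 200)';                                               % test inputs
kXX   = k(X, X) + sigma^2 * eye(numel(X));
ksX   = k(xs, X);
mpost = m(xs) + ksX * (kXX \ (y - m(X)));                                    % posterior mean
Vpost = k(xs, xs) - ksX * (kXX \ ksX');                                      % posterior covariance

Swapping the handle k for any other valid kernel changes the model; the rest of the code stays exactly as it is.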
Now, the main big kernel-trick idea is that there is a way of doing this without ever writing down the feature functions explicitly in the first place, which allows you to use a very large, or even infinite, number of features. Here is one way of thinking about it — a slightly different way from how Bernhard presented it, though it leads to a very similar conclusion. Let's go back to the RBF, squared-exponential features we used a couple of slides ago, the ones that gave us those nice smooth functions. The features we use are these RBF features down here: every individual feature looks like a little Gaussian bump — not a normalized Gaussian, not a distribution, just a Gaussian function, a little bell curve. It is just one choice; you could pick lots of other features, but I am going to use this particular set. I am also going to make life easy for myself and say that the prior covariance between those features is a scalar matrix: a unit matrix times some constant, and that constant is the difference between the two boundaries of the plot divided by the number of features, so that as I add more and more features this number becomes smaller and smaller, and as I enlarge the plotting area, making c_max larger and c_min smaller, it becomes larger and larger. There is also a constant out in front, which is just a constant you can choose however you like. Then the inner product, the thing we need for the code, is simple: because the prior covariance is a unit matrix, the double sum for the inner product collapses into a single sum, and the constant just comes out in front of the sum. Now all that is left is a little algebra. Write down the actual Gaussian features: the inner product is a constant times a sum over products of Gaussian features evaluated at the two inputs — sorry, not products of different features: the same Gaussian feature, evaluated at the two different inputs. For this particular set of features the product simplifies nicely, because the product of two Gaussian terms is again a pair of Gaussian terms with rearranged parameters: one term that does not contain the centre of the feature at all, and one term that does contain the centre of the feature, together with a linear combination of the two evaluation points. Now I can simply increase the number of features more and more. What I have done here is place Gaussian features regularly across the domain from c_min to c_max — here is the first feature, the second, the third, the fourth, the fifth, and so on, up to the sixteenth, or in general the F-th. Now increase F: we get more and more of these features, they get ever denser, and at the same time each one gets lower and lower, because as F rises the constant in front becomes smaller. Eventually I have a dense set of features, and the density of that set is the number of features divided by the width of the box I put them in. So the sum, together with its factor (c_max − c_min)/F, is really a Riemann sum — each term weighted by a little width Δc — and all that happens is that the sum turns into an integral; you just replace the sum with an integral, that is the entire thing. This term down here just stays, and this one, if we let c_max and c_min go to plus and minus infinity, becomes an indefinite integral over a Gaussian function — and we know what that is: it is the normalization constant of a Gaussian.
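To make the hand-waving a bit more concrete, here is that algebra in compact form, with an assumed bump width λ and overall scale θ (my notation, not necessarily that of the slides):

k(a,b) \;=\; \lim_{F\to\infty}\; \theta^2\,\frac{c_{\max}-c_{\min}}{F}
\sum_{f=1}^{F} \exp\!\Big(-\tfrac{(a-c_f)^2}{2\lambda^2}\Big)\exp\!\Big(-\tfrac{(b-c_f)^2}{2\lambda^2}\Big)
\;=\; \theta^2 \int_{-\infty}^{\infty} \exp\!\Big(-\tfrac{(a-c)^2+(b-c)^2}{2\lambda^2}\Big)\,\mathrm{d}c .

Completing the square in c, \;(a-c)^2+(b-c)^2 = 2\big(c-\tfrac{a+b}{2}\big)^2 + \tfrac{(a-b)^2}{2}, so the part that does not depend on the feature centre comes out of the integral and what remains is just a Gaussian normalization constant:

k(a,b) \;=\; \theta^2 \exp\!\Big(-\tfrac{(a-b)^2}{4\lambda^2}\Big)\int_{-\infty}^{\infty}\exp\!\Big(-\tfrac{(c-\frac{a+b}{2})^2}{\lambda^2}\Big)\,\mathrm{d}c
\;=\; \theta^2\sqrt{\pi\lambda^2}\;\exp\!\Big(-\tfrac{(a-b)^2}{4\lambda^2}\Big).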
So this infinite sum over features is actually a function that can be evaluated very cheaply, even though it is effectively a sum over infinitely many features — and this function is just the famous RBF kernel that everyone has already been shouting out, because they have heard about it so often. Here it is again graphically, in case you found the algebra complicated: here is a finite bunch of Gaussian features, I think ten of them; here are thirty of them, scaled down by the factor 1/F, the number of features — I only went up to thirty because it already looks very smooth; and in the limit it becomes this somewhat weird Gaussian distribution over function values for which I cannot plot the features anymore, because there are infinitely many of them. But the distribution over functions has changed very little: it has become a little broader at the ends, yet it is now a distribution with infinitely many parameters, because I have let the number of features go to infinity. That is confusing — by the way, this is what the posterior looks like — so what has actually happened here? I started with finitely many features and added more and more of them, until I noticed that I can do the sum in a much more efficient way: certain terms drop out, and I am left with a covariance function that amounts to using infinitely many features. Maybe I was a little too hand-wavy at that point, and a little maths is in order. Conveniently, I don't have to do that maths myself, because Bernhard already did it for us last Thursday. There is this concept of a kernel: a function that takes two inputs and returns a real number, and it is called a Mercer kernel — a positive definite kernel, or a positive semi-definite kernel, depending on whom you ask — if it has the property that when you evaluate it "squarely" on a set x1,…,xn against that same set, the matrix that comes out, whose elements are the kernel evaluations at all pairs of points, is a positive semi-definite matrix. That is a very complicated way of saying — and, as it turns out, this is entirely non-trivial and non-obvious — that such a function can also be written as the inner product of certain feature functions. In fact it can be written in various ways, as various kinds of inner products, and that is where a lot of the subtlety of these models comes from. We have just discovered one such way for this particular RBF kernel: start with finitely many Gaussian features, add more and more of them to the sum, and you end up with an integral, an infinite sum over infinitely many RBF features.
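In symbols, and in my notation rather than the slides', the property just described reads: a function k on pairs of inputs is a (Mercer, positive semi-definite) kernel if for every finite set x_1,\dots,x_n the Gram matrix

K_{ij} \;=\; k(x_i, x_j) \qquad\text{satisfies}\qquad v^\top K v \;\ge\; 0 \quad \forall\, v \in \mathbb{R}^n,

and the non-trivial consequence is that any such k can be written as an inner product of feature maps in some (possibly infinite-dimensional) space,

k(x, x') \;=\; \langle \phi(x),\, \phi(x')\rangle .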
At this point I have stopped talking about the plot that used to be on the left-hand side when we did linear regression: the distribution over the weights. You could ask what has happened to that distribution. In plain linear regression there were just two weights, the intercept and the slope; in the subsequent plots there were more and more features, so there was a 16-dimensional Gaussian, then a 32-dimensional Gaussian, and so on, and the corresponding distribution over function values was essentially a Gaussian of rank 16, or rank 8, or rank 2 for the linear case. So what has happened to this Gaussian distribution in the process of moving to infinitely many features? It has become an infinite-dimensional Gaussian distribution. That is a weird concept, which is why someone came up with a name for it: it is called a Gaussian process. And all I am going to say about the concept of a Gaussian process today is that it is the thing you essentially have to assume exists if you want to think about distributions over function values that are infinite-dimensional. A Gaussian process is a probability distribution over the values of some function f, which maps from some input domain to the real line, such that every finite restriction to function values f_X — a vector of function values at x1 through xn — is a Gaussian distribution with some parameters. Those parameters have to come from somewhere, so we give them names: this thing is called the mean function, just a function you evaluate at lots of different locations x, and this thing is called the covariance function, or the kernel, and it has to be a function that returns positive semi-definite matrices for choices like this — which is exactly a Mercer kernel, exactly the object Bernhard talked about last Thursday. So there is this infinite-dimensional object, it is called a Gaussian process, and that is all I am going to say in that direction for now. Computationally you do not need to worry at all: we just use the code from the previous slides; there is nothing infinite, nothing complicated or intractable — it is the same 20 lines of MATLAB.
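In symbols, the definition just given (again in my notation) is:

f \sim \mathcal{GP}(m, k) \quad\Longleftrightarrow\quad
\forall\, X = (x_1,\dots,x_n):\;\; f_X = \big(f(x_1),\dots,f(x_n)\big)^\top \sim \mathcal{N}\big(m_X,\, k_{XX}\big),

with \;(m_X)_i = m(x_i)\; and \;(k_{XX})_{ij} = k(x_i, x_j).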
Yes — exactly: I could also just continue doing these computations the finite way; it would simply get ever more complicated as I increase the number of features. And am I just lucky that this works? From one perspective you are not lucky at all — there is a lot of structure behind it — but from our perspective of deriving it, we really were just lucky that this particular sum turns into an integral that happens to be something I know how to evaluate. This one I do not even need to look up in a book, because I learned it in first-year calculus; but maybe there are other integrals, for other choices of features, where you are just as lucky. In fact, here are some examples. Let's go back to some of the features we had before: here are our step functions again, the type that starts at zero and moves to plus one, so the kind of function that has to be zero over here — that is the only value the function can take, because there is no feature over here; they all start at this point. This is what it looks like for four features, so just four steps; and this is what it looks like when I change a single number up here from 4 to 20, so we have 20 different features: the distribution becomes a little more rugged, but it also approaches a limit which looks like this little trumpet opening to the right. Here it is with a hundred features — it already looks like a very nice, smooth object. Have you seen this kind of structure in a plot before, in the past week? Maybe in the previous talk: the previous speaker had exactly this kind of shape in his uncertainty plot. Why? Because this is a model which assumes that at every time step — at any of these infinitely many time steps — there is an ever so tiny random step up or down, because a tiny step-function feature value is being added, and that step up or down is independent of all the previous steps. As it happens, this corresponds to the minimum kernel from Bernhard's kernel quiz, and the resulting model is called the Wiener process. It is the generative process behind the Kalman filter, because this particular kernel has the property that future function values are independent of previous function values given intermediate function values: it is a Markov process — if I fix this node, then this function value and this function value become independent. Why are people so keen on this? There is a really cool consequence: because of this property, the kernel Gram matrix k_XX has some very interesting algebraic structure that makes it very easy to invert, and one way of writing down that inversion, that inference process, is the Kalman filter. That is why people like to use it: it makes computations extremely easy — inference in this model is linear in the number of points along the line, rather than cubic, which is what you would generally expect if you have to invert a matrix. This is what Wiener processes look like: they have to start at zero somewhere — often you don't see that part in plots, because it is far out on the left end of the x-axis — the mean function in between, this red line, is piecewise linear up to the evaluation points, the extrapolation beyond the data is a constant prediction, and the uncertainty always has this trumpet-like, square-root-shaped behaviour. If you see that in a plot like the one from before, it is because exactly this model is being used.
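A tiny sketch of drawing sample paths from that prior directly through its kernel, k(x, x') = min(x, x') — the grid, jitter and number of samples here are arbitrary choices for illustration:

% Sample paths from the Wiener-process prior via the min kernel.
t = linspace(1e-3, 10, 500)';                      % start just above 0, where the process is pinned
K = bsxfun(@min, t, t');                           % Gram matrix of k(x, x') = min(x, x')
L = chol(K + 1e-10 * eye(numel(t)), 'lower');      % small jitter for numerical stability
s = L * randn(numel(t), 3);                        % three independent sample paths
plot(t, s);

The samples come out rough and non-differentiable, while the posterior mean between noise-free observations would be piecewise linear, as described above.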
Okay, yet another feature set: the other step functions, the ones that start at plus one and move to minus one. By the way, I obviously did not come up with all these features myself; I am mostly taking them from a wonderful book by Grace Wahba and from various other people — the Wiener process, obviously, is from Norbert Wiener himself — and other people have come up with various other cool kernels. So let's do these other step functions. Again we have four — or actually five, the fifth one is over here. Now we have 20 step functions that move from plus one to minus one; you can still see them, and you still get this kind of step-shaped up-and-down in the samples. With 100 you cannot actually see the steps anymore, because the number of features is now higher than the resolution of the plot. As I increase the number of features further, this kind of irregularity in the samples becomes ever more drastic, and eventually I have functions that are no longer differentiable almost anywhere — they are very, very rough — but they are always continuous, because the number of features contributing to a function value grows continuously. This is what the prior looks like under this particular kernel. It happens that you can also solve this integral over pairs of these feature functions, and this is what the kernel function looks like: a kernel built on the absolute distance |x − x'|, if you like. It is a famous kernel, because the posterior means are linear splines — you may have heard about splines in, I don't know, high school. It is the very straightforward idea of just drawing straight lines from one data point to the next, assuming you have observed without noise. Interestingly, though, if you look at the samples from this distribution, they don't look anything like straight lines: the mean is piecewise straight, the samples are not at all. We will need to think about that more tomorrow — there is something subtle going on here — but this is the model behind linear splines. One more: I said before that we don't have to use piecewise constant functions, we could use piecewise linear ones, and we can increase the number of those features as well. More and more piecewise linear features eventually become very dense; the samples here, you can see, move up and down quite a lot, and the interpolants actually become more than linear: they become third-order polynomials, called cubic splines. That may look odd, but it is just the result of solving this integral again — computing the infinite sum over infinitely many features of this V-shaped form; the inner product involves some quadratic terms, and when you integrate them you get cubic terms, and those things are called cubic splines. If you know something about computer graphics, you know a lot about cubic splines. And again: the cubic splines are the central line, the posterior mean; the samples around it are not cubic splines — they are something else.
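As one concrete, hedged version of that integral — assuming hinge features φ_c(x) = max(x − c, 0) on [0, ∞) with uniform prior weight, which is one standard route to cubic splines, not necessarily the exact V-shaped features on the slide — the same sum-to-integral trick gives, for 0 ≤ x ≤ x':

k(x, x') \;=\; \int_0^{\infty} (x - c)_+\,(x' - c)_+\,\mathrm{d}c
\;=\; \int_0^{x} (x - c)(x' - c)\,\mathrm{d}c
\;=\; \frac{x^2 x'}{2} - \frac{x^3}{6}
\;=\; \min(x,x')^2\Big(\frac{\max(x,x')}{2} - \frac{\min(x,x')}{6}\Big),

which is the covariance of an integrated Wiener process; its posterior mean interpolants are piecewise cubic.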
At this point I am going to speed up. You can do this with various other kinds of features as well; here are just a few examples. You can do it with polynomials, though you have to be careful about how you weight the individual polynomials, and you get yet another kind of prior and another kind of posterior out. The reason I am showing you all these different plots is that I am trying to point out that there is more than just one kernel. I hear, for example from Peter and Francesco, that in the afternoon sessions — which you are going to very soon — people tend to use just one kernel, because that is the one kernel they have learned about and now use all the time; but there is an enormous, in fact uncountable, number of kernel functions out there that you can use to build more expressive models. I hope you have noticed that all of these posterior distributions look different: if I change the kernel function, I get a different prediction out, and that is something you have to choose to get a good model — not something you can just rely on to work. You can also do this with Fourier functions and get yet another distribution. And now we come to the question from up there in the last row — ah, you've stopped listening and are looking at your computer, that's unfortunate — okay, here is your answer anyway: step functions which are just hats. This is the almost-final example I am going to show. From the previous plots you might think: how do I design features so that this model becomes very flexible and can deal with all sorts of changes in the function? There was this annoying bit we saw that is very hard to get right — what can I do to get a more expressive model? Well, maybe I just use these little hat-like features, and then use more and more of them while making them ever smaller, until eventually every function value becomes independent of the previous one. That makes sense, right? We get lots of flexibility in the model, it's great, so then we can learn every function, because everything becomes independent of everything else. And there is indeed a "kernel" — in quotation marks — that does that: the white noise process. It does not really look like anything; you cannot see it, it is just dust — every single function value is independent of every other one — and people argue over whether it is actually a kernel or not; it is very difficult to say what that object really is. It is called the white noise process, and it is more of a conceptual device that people use to derive certain methods; it is also, for example, the derivative of the Wiener process. It looks like this, and there is a problem: if I try to learn a function with this kind of object, I don't learn anything. Why? Because the observations I have just made are independent of all the other function values, so the function value at a point arbitrarily close to my evaluation is entirely independent of the function value I observed there. In this framework there is really no point in doing that. That is a feature of these kinds of models, and it is an instance of the no-free-lunch problem that various other speakers have mentioned: if you try to make no assumptions, you end up not learning anything, because you are too uncertain. Finally, before we stop, just for fun: after Bernhard mentioned that the greatest common divisor is also a kernel, I thought, let's try it and see what it looks like. This is what the prior looks like. Maybe I shouldn't have used this animation, because it gives you the feeling that there is some regularity in there; it is actually a very complicated model, with lots and lots of irregularity. The point is that these samples have a very complicated shape: the mean under the prior looks very boring, the prior uncertainty has some interesting shape — I'll leave it to you to think about what that shape is — and the samples have a very complicated, irregular correlation with each other. And this is what happens if you observe just one function value, at location five.
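A tiny toy check of that kernel (my own illustration, not from the slides): build the Gram matrix of the gcd kernel on the integers 1 to 20 and look at its spectrum.

% Gram matrix of the greatest-common-divisor kernel on 1..20.
n = 20;  K = zeros(n);
for i = 1:n
    for j = 1:n
        K(i, j) = gcd(i, j);        % k(i, j) = greatest common divisor of i and j
    end
end
min(eig(K))                         % non-negative (up to rounding) if the Gram matrix is PSD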
Why does this happen? Five is a prime, which means that its greatest common divisor with all the previous numbers is one — to some extent that is a matter of definition, you have to decide what the greatest common divisor of two coprime numbers should be, but it is one — which means the evaluation we made here tells us something about the function value at every point, just a little bit, because with all those points the greatest common divisor is one. And then there are all the multiples of five, for which the greatest common divisor is five, so we learn a lot more about those function values; that is why we are now much more certain at the multiples. That actually looks kind of interesting, and I'll leave it to you to think of a good application for this kind of framework. If you observe function values at various odd combinations of primes and composites it looks very complicated, because all of these effects add up in highly non-trivial ways. The reason I wanted to show this picture is to give you a feeling that there is a lot more to these kernels than the fact that they are kernels. In Bernhard's kernel quiz you could get the feeling that "oh, if it is a kernel, it is a valid model, I can use it for everything"; but you also need to think about what a kernel actually means in terms of the assumptions you are making about the function — this one gives you a very complicated kind of shape. Before I take your question, let me just finish, because my time is up. I started by saying that Gaussians are important not because they are somehow fundamental to nature, but because they are fundamental to algebra: they map linear algebra to inference. Performing probabilistic inference with Gaussians corresponds to doing linear algebra, because Gaussians are closed under linear projection and marginalization — the sum rule, which is all the same thing — and under linear restriction, conditioning, the product rule. So they allow us to use the entire framework of probabilistic inference and stay within this family of Gaussians: they provide a linear algebra of inference, because they map inference to linear algebra. And we noticed — obviously three quarters of the room already knew this, I am just pointing it out again — that by choosing nonlinear feature maps we can map a regression problem into a nonlinear space and use Gaussian algebra to learn nonlinear functions. Tomorrow we will talk more about how this connects to other ways of doing regression. In fact it turns out the number of features can be infinite — I am not telling you anything new here, Bernhard did this last week — and what comes out is the kernel framework of regression, or, in the probabilistic formulation, nonparametric Gaussian process regression. That was this afternoon, and it was mostly lots and lots of pretty pictures; tomorrow I will have to tease you a bit more with maths, and we need to think more about what properties these models actually have: how do we design them well, how do we get them to work on general machine learning problems, and there is also the theoretical question of how powerful they actually are — what is the set of functions they could possibly learn? Alright, thank you.
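For reference, the closure properties that summary leans on are the standard Gaussian identities (in generic notation):

\text{if } x \sim \mathcal{N}(\mu, \Sigma), \text{ then } A x \sim \mathcal{N}\big(A\mu,\; A\Sigma A^\top\big) \quad\text{(linear projection / marginalization)},

\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix}\mu_x\\ \mu_y\end{pmatrix},
\begin{pmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{pmatrix}\right)
\;\Rightarrow\;
x \mid y \;\sim\; \mathcal{N}\!\big(\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y),\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\big) \quad\text{(conditioning)}.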
Okay, let's take a couple of questions. (To a question from the audience:) No — because the function value at every point on the x-axis, the marginal distribution at every single point in X, has to be Gaussian, and Gaussians have full support. There are various approximate tricks for doing that kind of thing, which I am not going to talk about; people have tried it, but it is always approximate. (Another question, about the gcd kernel:) Well, you have to compute the greatest common divisor just to get an answer, so — I am also not sure; I don't think so. Actually, if someone comes up with a good use for this I would be very interested — I am sure Bernhard has thought about it as well — but at the moment I am not claiming there is one; it is probably something complicated and number-theoretical. There is a question in the observation room — yes, go ahead. "Hello, thank you. So basically the kernel function encodes your prior over the space of functions you are going to use for your inference, right? But the kernel encodes that prior in a rather complicated way, so it might be more difficult to specify the prior than in, for example, the finite-dimensional case. So what do we do in practice in order to see what the prior is?" — Very, very good question, and one we will talk about tomorrow: essentially almost all of tomorrow is about how to design these models well, so I am going to carry this question over to tomorrow, but it is exactly the obvious question we need to address. (Someone is asking me to repeat the question, maybe for the recording: the question was, how do you choose the kernel, how do you choose the prior?) And actually I am very glad someone is asking this, because all of these plots were meant to evoke exactly that question: it is obvious that there are all these features to choose from, they all produce different answers — so which one should we choose?
Info
Channel: Max Planck Institute for Intelligent Systems
Views: 17,019
Keywords: Machine Learning (Field Of Study), Gaussian Processes
Id: 50Vgw11qn0o
Length: 90min 31sec (5431 seconds)
Published: Fri Dec 20 2013