Sparse Gaussian Process Approximations, Richard Turner

of themes through the literature that enable you to deploy Gaussian processes, in particular areas, at a larger scale than the vanilla method allows. So here are a couple of motivating applications. This first one is an example of audio modelling from my group, where we are trying to use Gaussian processes to model audio time series data. This might be regularly spaced data that you pull out of a wav file, and it has a large number of data points, ten to the five to ten to the seven: you are sampling at 16 kilohertz and you have several seconds or maybe some minutes of data, so you quickly get up to tens of millions of data points, and we would like to train Gaussian processes on this data. This is an example of some spoken sentences; this is the waveform, and we're going to use a Gaussian process to model that waveform, learn the statistics of the utterances from a particular speaker, and use it to impute missing regions. It may be hard to see, but in the plot below there is about 15 milliseconds of missing speech, and the Gaussian process model, which is fit to the whole speech waveform, is able to impute a guess (the red line) at what the missing speech looks like in that section; the ground truth is shown in black, and it's doing a pretty good job. Here's another section where 50 milliseconds are missing, and it's able to reconstruct that section reasonably, and a third case below where it's perhaps doing less well. So we'd like to train these Gaussian process models on large time series data sets.

Here's another example in the same vein: a Gaussian process model trained on sounds recorded from a camping trip. This is the sound of synthetic fire, generated by a model trained on fire sounds; then you can hear my footsteps, because it's been trained on my footsteps; we go past a stream; then suddenly the wind gets up and it starts to rain; I go into my tent and do up the zip, and now you can hear it raining on the tent. Some people have come up with alternative explanations for what happens at the end there, but I'll leave your minds to figure that out for yourselves. The point is that we're taking a Gaussian process model, training it on environmental sounds, textural environmental sounds you might call them, where each one of those sounds has of order ten million data points, and then generating new synthetic sounds purely from the learned Gaussian process generative model, and they sound like real audio textures that you might encounter in the environment. You can use this to do smart things like removing noise from speech or removing wind noise on mobile phones, and so on. So that's one motivating example, and at the moment, given the tool set you have from the summer school so far, you can't scale Gaussian processes to data sets that big, so we're going to need to introduce the methods for doing it.

Here's another set of motivating applications from a completely different domain: nonlinear regression, something which neural networks are really good at, but which Gaussian processes are really good at too. Here I'm going to show a plot of the performance of various methods on lots of different data sets. Down here is one of the standard data sets, called Boston; it has about 500 data points with 13-dimensional inputs,
and over here is a data set called Year, which has half a million data points and 90-dimensional inputs. We're going to do regression on these sorts of large data sets, and I'm going to compare how Gaussian processes do against neural networks. This is the best-performing deterministic neural network we could find on each of these data sets, and this is the held-out log likelihood score, so a measure not just of how accurately you predict but also of how well you get your error bars. Here's the state-of-the-art non-deterministic neural network method, one which uses sampling or Monte Carlo in computing its estimates; these tend to do a little better than the deterministic ones, but not in all cases. And here's a vanilla Gaussian process, just a GP applied to these data sets. Notice that on some data sets GPs can do better than neural networks: these are multi-layer neural networks, but a single GP can do better. Of course, this involves scaling Gaussian processes to half a million data points with 90-dimensional input spaces, which is not trivial. In fact the purpose of this work was to compare all of these to deep Gaussian processes, that is, a Gaussian process whose output gets fed through another Gaussian process, which goes through another Gaussian process, and so on: the Gaussian process analogue of a deep neural network. As you can see, the deep GPs do better on many data sets, six or seven out of the ten here. But fundamentally, one of the main challenges in getting all of that to work is how to get these Gaussian process methods to scale to such big data sets. So that's the goal for the next hour and a bit. There is a whole slew of methods for doing this, and I've picked out probably the two most influential ones in the literature, which we'll go through in some detail. Most of this is not my own work; it's work done by other people, so this is my take on their work. OK, everyone happy so far, motivation-wise? Any questions at this point? Do ask questions all the way through, because it will help me calibrate to what you do and don't know; we have a very diverse audience, and I'm sure you've been told this before, but do not be afraid to put up your hand and ask a question.

So, we're going to start with Gaussian process regression. Everything I say can be generalised to other settings; we'll start, as we always do with Gaussian processes, by looking at regression, and at the end I'll talk about how the techniques can be generalised to scale other problems, like classification. Here is an example Gaussian process regression problem, and I'm just going to get my notation sorted out, because you've probably had to see five or ten different people's notations by now. This is the notation I'll be using: y is the output, and x, which in general is a vector, is the input of each of the data points; the crosses here are the data plotted in this one-dimensional case. One of the fundamental things we want to do with Gaussian processes is to predict, at an unseen input location, what the value of y is likely to be, and of course it has been discussed at length how you do that with a Gaussian process regression model: you put a prior directly on the underlying function from which you imagine the
data were generated; that's the purple curve behind the crosses here. Then you imagine that the data are generated by taking that function and sprinkling some noise onto it according to a likelihood function: that's p(y_n | f, x_n), the probability of y_n given the underlying function f, the input location x_n, and possibly some hyperparameters which control things like the observation noise. The covariance function depends on hyperparameters as well, things like the length scale in a squared exponential kernel. What we ideally want to do, of course, is inference and learning; these are the two fundamental operations we want in pretty much any model: infer the underlying function given the observed data, and compute the marginal likelihood for estimating hyperparameters. Notice that I've set up prediction slightly differently from the way some people present it: usually you see separate steps for prediction and inference, but here I'm going to talk about inferring the whole underlying function in one go, which automatically does inference at the data points and prediction for the function everywhere else. And, as we've said, this is intractable at scale: it has order N cubed cost in general, because of the matrix inversions or Cholesky decompositions you have to compute. There might also be analytic intractabilities, which we'll talk about handling as we go; in particular, the likelihood function might not be a nice Gaussian, you might have heavy tails on your noise distribution, maybe a Student-t likelihood for instance, so that's another source of intractability even in the regression case.

The methods I'm going to talk about today use a really simple idea, which maybe was obvious when we looked at the data we had here. The observation is that these data are, to use a signal processing term, oversampled: they are really close together, and it's kind of a join-the-dots exercise to figure out where the underlying function is. So maybe we can summarise a ginormous ten-to-the-seven data points by a much smaller number of effective data points, the green points here. We could take every tenth one, or just make sure they're spread throughout the space, in order to summarise the big data set, and then fit a GP through the small data set. The mean would be roughly right, but if I plotted a Gaussian process prediction through the green points the error bars would likely be a bit too big, because the model thinks it has only seen 10 data points when really there are 100; so if we recalibrated the error bars, shrank them down a bit, maybe that wouldn't be such a bad approximation to the underlying Gaussian process. That's the main idea behind the approximations we're going to talk about today: summarise the data set by a small number of pseudo data points (capital M throughout will be the number of pseudo data points). The methods I'm going to show you come up with really smart ways of effectively choosing where these points should live in input space and where they should live in output space, and smart ways to calibrate the uncertainties, so that the uncertainty you end up with reasonably well matches the uncertainty of the underlying GP.
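As a quick aside (not part of the talk), here is a minimal sketch of the vanilla GP regression computation whose cost we are trying to escape; the squared exponential kernel, the toy data and the function names are all illustrative choices of mine. The Cholesky factorisation of the N by N matrix is the order N cubed step.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared exponential covariance: k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2)).
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def exact_gp_predict(x, y, x_star, noise_var=0.1):
    """Vanilla GP regression: the N x N Cholesky below is the O(N^3) bottleneck."""
    K = rbf_kernel(x, x) + noise_var * np.eye(len(x))
    L = np.linalg.cholesky(K)                          # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_star = rbf_kernel(x_star, x)
    mean = K_star @ alpha                              # predictive mean
    v = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(x_star, x_star) - v.T @ v         # predictive covariance of latent f
    return mean, cov

# Fine at N = 1000; hopeless at the 10^7 points in the audio examples above.
x = np.linspace(0, 10, 1000)
y = np.sin(x) + 0.3 * np.random.randn(len(x))
mu, cov = exact_gp_predict(x, y, np.linspace(0, 10, 50))
```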
These methods will be more principled than just randomly selecting some subset of the points, putting a GP through it, and then squidging down the error bars; they do all of that automatically under the hood, in a theoretically well-grounded way. But that is the motivation for what we're going to do.

The literature here is a little hard to parse. There has been a huge amount of work, because this is one of the central limitations of GPs that needs to be addressed in practical situations, so there is a huge literature on so-called sparse GP approximations, and, unsurprisingly, over the last 15 to 20 years the perspective of the field on those approximations has changed quite radically. As with everything Gaussian there are multiple different perspectives too, just as there are multiple ways of thinking about Gaussian distributions, so reading and navigating the literature can be quite challenging, with each paper capturing the spirit of its time, and the spirit has changed over time. So I'm going to try to walk you through, in big-picture terms, the trajectory of this part of the Gaussian process literature.

One of the first ways people tried to attack this scalability problem is a bit counterintuitive. They said: we have a nice Gaussian process model that we'd like to apply to our data, but it's computationally burdensome, so here's a great idea: let's forget about the model we actually believe in and write down a simpler model with some special structure in it, i.e. sparsity, which can be tractably fit to the data set we care about, but let's calibrate that approximate model so that it's roughly similar to the actual model we care about. This is a slightly funny spirit; we often do this kind of thing implicitly, but here it's made an explicit way of going about things: you've got the model you care about, you write down some approximation to that model, and in particular we'll talk about how you can calibrate the approximate generative model q to the true model you think really underlies the data by minimising some distance measure between those two objects, and then you carry out exact Bayesian inference in the approximate generative model. That was one of the first veins in which people approached this problem, and there are many ways of doing it. Not all of them involve pseudo data, this idea of using a small number of data points to summarise the big data set, but there are approaches, which we'll talk about, that do both of these things: they use pseudo data and an approximation to the true generative process in order to come up with scalable algorithms for deploying Gaussian processes. In particular, in the first half of this talk I'm going to walk you through a famous approximation called the FITC approximation, due to Ed Snelson and colleagues; that's one half of the main research in this area. And there is a really nice review paper which takes this perspective, "A Unifying View of Sparse Approximate Gaussian Process Regression" by Joaquin Quiñonero-Candela and Carl Rasmussen, which summarises a large number of models, sitting in the middle of this diagram, that take this attitude to scalability.
I recommend reading that paper; it's a really, really good one.

The other side of the coin is a slightly different approach, probably more along the mainstream lines of standard approximate inference. Here the philosophy is different: we're not going to approximate the generative process. We just write down the assumptions which we think generated the data, and then carry out all of the approximation at inference time. So we keep the model, which encodes the assumptions we think generated the data, sacrosanct; we don't mess around with it or approximate it. We then try to form the posterior distribution, the probability of the underlying function values given the data, and we approximate that object using divergence minimisation: we come up with some approximation to the posterior distribution, rather than a proxy for the generative model. That is much more in the line of standard approximate inference; you may have heard of things like Markov chain Monte Carlo, variational methods, EP and so on, and they are all doing this sort of thing, making the approximations at inference time rather than messing around with the model. The most influential work in this area was done by Michalis Titsias in 2009: the FITC paper was from about 2005, and some four years later Michalis, working in close collaboration with Neil, applied variational free energy methods to come up with a method for scaling GPs, and that is really one of the methods driving a lot of the advances in the field of Gaussian processes at the moment. That will be the second method I talk about today, and we'll compare and contrast it with the FITC method. I won't mention what these other ones are, but I will flag one thing up. Many of you are coming tomorrow; who put your hand up? Excellent. This talk is part of a double bill: today I'm going to tell you that this first approach is rubbish and doing something stupid, and then tomorrow I'll say, ah, actually there's a completely different way of viewing these approximations, namely as preserving the exact generative model and carrying out approximate inference at inference time, so it turns out you can also derive these models from an approximate inference perspective. Tomorrow I'll talk about that alternative view, which perhaps wasn't known until quite recently by most practitioners in the field; a few people knew about it, but not many.

So, questions at the moment? If you have questions about the philosophy, now is a great time to ask, because the philosophy here really is important, as is the distinction between the two approaches; if that distinction is not clear in your mind, do ask, and if you disagree with what I'm saying, that's also fine. [Audience question.] Yes, there are definitely methods like that: some of the material covered yesterday can be viewed as approximating the generative process of the GP without involving pseudo data, so there are various methods which approximate the true generative process but don't use pseudo data, and there are certainly things which exist in that part of the space.
There are also methods that live over here which don't use pseudo data but do use approximate inference. I don't know how well you know the literature, but if you take the original work on Gaussian process classification, the goal of that work was to deal with the intractable likelihood function that classification brings, and it used EP to approximate that likelihood function, but it didn't use any sparsity to scale things up. So there are a bunch of methods in here as well; we could populate the whole space. [Question about compressed sensing.] I don't know of work linking this directly to compressed sensing; other people can dive in here if they want. There is definitely work linking it to random matrix theory and to random feature expansions, orthogonal random features, which are kind of linked to compressed sensing as well, and that would fall in this part of the plot: there is the Rahimi and Recht line of work on using Bochner's theorem to approximate the spectrum of GP kernels, for instance, which falls in here. I've not seen compressed sensing itself linked to GPs, although there is definitely Bayesian compressed sensing, so that's a completely valid interpretation, but I don't know of it in GP land. Any other questions? There are literally hundreds, possibly thousands, of papers on this topic, so it's hard to give a completely comprehensive view.

OK, so I'm going to tell you about the FITC approximation first. Here's the flow, just to recap: I'm going to talk about this one from the approximate generative modelling perspective, and the second one from the approximate inference perspective; that's what I'll do over the next hour. In order to talk about FITC it's really helpful to know about a concept called factor graphs (it is generally good to know about factor graphs), so I'm going to give you a two-line introduction to them, and then I'm going to ask you to solve a quick problem based on factor graphs and Gaussians, so pay attention, otherwise you'll just have to copy your neighbour when we try to do the problem.

A factor graph is a way of viewing a function, and in particular the structure that is in a function; and because probability distributions are functions, factor graphs can be used to view the structure in probability distributions. So here is a simple probability distribution that is a function of variables x1, x2 and x3. Let's imagine for a minute that it involves a single term in which all three of these variables interact with one another, so we can write it down as a single function g of x1, x2 and x3. The corresponding factor graph representation is to draw the three variables with circles around them; these are called variable nodes. So there are x1, x2 and x3, the three variables that appear on the left, and then we draw a single factor node representing the function that appears in the probability density; here there was just one function, g, so we draw one factor (I could have put a little subscript g on it to show that it represents the function g) and draw a line to each of the variables that appears inside that function. Because all the variables appear together, all three are joined up. Does that make sense? Here's an alternative.
Maybe someone says: actually, I don't need to group them all together in a single function; maybe I can separate them, so that the probability of x1, x2 and x3 can be written as the product of a function g1, which depends only on x1 and x2, and a function g2, which depends on x2 and x3. The factor graph shows that structure visually: write down the three variable nodes as before, x1, x2 and x3, draw one factor representing g1 and another representing g2, and put in the appropriate connections. Is it clear how to map from this decomposition to this graph? Great.

Now, the nice thing about factor graphs is that you can read off the conditional independencies between variables directly from the graph. The rule is as follows: if you can pick a set of nodes, say x2 here, which separates the other nodes in the graph, so that there is no way of following the edges from one node to the other without going through the selected nodes, then those two variables are independent given the nodes you selected. Here x2 blocks all paths from x1 to x3, trivially, because there is only one path, and that tells you that x3 is independent of x1 given x2. So you can translate quite quickly from the factor graph representation to the conditional independencies present in the distribution on the left-hand side; it gives a nice way of reading them off. Questions at this point? No? OK.

So far this is all a bit murky and mysterious: why is it useful? What I want you to do now is solve this problem: figure out what the minimal factor graph for a multivariate Gaussian is. Minimal here means that, in the example above, I could have lumped the two functions together, called it g, and drawn the fully connected factor graph, but that would not be minimal, because the distribution actually decomposes as the product; the two-factor graph is the minimal factor graph for that example. So chat to your neighbours and sketch on a pad; you've all got Gaussian process summer school paper that needs to be put to good use, and here is a great use for it. We're going to think about this Gaussian: x is four-dimensional, with a mean and a covariance; the mean can be arbitrary, and the covariance and precision matrix take the forms shown on the slide. I'm giving you those two pieces of information, which is enough to solve the problem: you should be able to write down a factor graph which represents the structure of this distribution. I'll give you a couple of minutes; any questions before we start? I'll walk around the room, so grab me if you have questions. [Walking around.] Where do you start? Think about the form of a Gaussian distribution: it looks like e to the power of a quadratic, and that quadratic is roughly x transpose times Sigma to the minus one times x. If we wrote out that quadratic form, each term in it depends, because it is a quadratic form, on one x, another x, and the corresponding element of the precision matrix Sigma to the minus one. So we can write the exponent as a sum of terms, each being a pair of x's multiplied by the corresponding element of Sigma to the minus one.
And because e to a sum of terms is equal to the product of e to each term, that means you can write the Gaussian as a product of factors which each involve a pair of variables. It's a bit like case two up there: you never get all three of the x's interacting together, so at most we will have factor nodes (the dark square nodes) which connect to just two variables. Now the task is to figure out which pairs of variables are connected, because not all of them are, and that's why I've given you Sigma to the minus one: there are some zeros in it, so some of those factors drop out. Does that make sense? [Student: so the inverse covariance matrix tells you who the neighbours are?] That's right, exactly: the precision tells you about the conditional independencies in your model, and you can read them off directly from the precision matrix; it's a good way of getting intuition for what precisions mean. Do you need more time? Who's finished? All right, another minute. Who has questions, who wants me to come and talk to them? [More discussion; note there is more information on the slide than you need.] OK, let's break there; some people have got it and some are on the right lines. Who wants to volunteer an answer? Great, fantastic, here's the pen. [A volunteer works at the board.] And where does that come from? Just a bit more detail: you've got your eye on this matrix here, haven't you? This is the precision, the inverse of the covariance matrix. Could you say why we're interested in this one? That's unfair, putting you on the spot; carry on and I'll fill in that detail in a minute. So we go down this column, or equivalently along this row, and we say there is a factor node between variables one and two, and between one and three, and then there's a zero here, so we do not draw a line between x1 and x4; and we've already taken care of the remaining entries because we've done all the others. Great, this is the correct factor graph; thank you very much, perfect.

Let's talk a little about why that was the right approach. It distils down to looking at the precision matrix; this is the covariance matrix, and this is the precision, its inverse. The rule we came up with, which is the correct rule, is to look down each column and, whenever there is a nonzero entry for a pair of variables (this is the x1 column and this the x2 row), to put an edge, with a factor node in the middle of it, between x1 and x2, then between x1 and x3, and so on across all the variables. Why is that the right approach? Let's write out what this Gaussian density is as a product of factors. So p(x) is proportional to (forgetting about the normalising constant, since we're only interested in the dependence on x) e to the minus one half of (x minus mu) transpose Sigma inverse (x minus mu).
Now I can expand this out using index notation: it looks like exp( -1/2 Σ_{ij} (x_i - μ_i) [Σ⁻¹]_{ij} (x_j - μ_j) ), summing over i and j; that is just rewriting the matrix algebra in index notation. Everyone happy with that? And I can pull the sum out as a product: this equals (with apologies to everyone over there who is probably struggling to see this) the product over i and j of exp( -1/2 (x_i - μ_i) [Σ⁻¹]_{ij} (x_j - μ_j) ), and each of those terms is one of the g's that appeared up here, so I can write the whole thing as a product over i, j of g_ij(x_i, x_j). Are you happy with that? So what have we learned? For a Gaussian, because it looks like e to a quadratic, we can decompose it as a product of factors g which each involve just a pair of variables: there are no factors involving three variables, and you cannot get factors involving three variables from a Gaussian. Only pairs are involved, and which pairs appear depends on the values of Sigma to the minus one: if [Σ⁻¹]_{ij} is zero, that factor becomes a constant (it no longer depends on x_i or x_j), so every zero in the precision kills the corresponding factor in the factor graph, and you do not have to have all of the pairwise connections appearing. Roughly makes sense? Great.

There are a few reasons for doing all this. Firstly, factor graphs make it easy to read off conditional independencies, which is useful to know about because they let you carry out computations more quickly. Secondly, it gives an interpretation of the precision matrix: you now know that if a precision matrix has zeros in it, there are conditional independencies in your model; it is not a completely general description of the variables, because you are assuming some variables are independent of others given the rest. And thirdly, and this is what we're going to do in a minute, we're going to take covariance matrices which don't have sparse structure, no zeros, and approximate them using precision matrices which do have zeros, a sparse structure, and that sparse structure lets you handle these matrices efficiently, which saves on computation; and we now know how to construct such matrices by building conditional independencies in. So we're going to flip this on its head: we'll build conditional independencies into our approximate generative model, that will leave sparse structure in the inverse covariance matrices, the precisions, and that will lead to efficient computation. Roughly makes sense? Any questions? Great; either you understand it perfectly, or you're so bewildered you can't be bothered to ask a question. That was the solution.
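As a small illustration (my own example; this is not the matrix from the slide), here is how the pairwise factors can be read straight off a sparse precision matrix: a chain-structured precision gives a chain-structured factor graph, even though the corresponding covariance is dense.

```python
import numpy as np

# Illustrative 4 x 4 precision matrix with chain structure x1 - x2 - x3 - x4.
precision = np.array([[ 2., -1.,  0.,  0.],
                      [-1.,  2., -1.,  0.],
                      [ 0., -1.,  2., -1.],
                      [ 0.,  0., -1.,  2.]])

# A pairwise factor (an edge through a factor node) exists wherever an
# off-diagonal precision entry is non-zero.
edges = [(i + 1, j + 1)
         for i in range(4) for j in range(i + 1, 4)
         if not np.isclose(precision[i, j], 0.0)]
print(edges)  # [(1, 2), (2, 3), (3, 4)] -> x1 independent of x3 given x2, and so on.

# The covariance, by contrast, is dense: zeros in the precision, not in the
# covariance, are what encode conditional independencies.
print(np.round(np.linalg.inv(precision), 2))
```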
OK, great, so let's go back and carry out this process. We take our original Gaussian process, and here is its factor graph. Remember, the full, general Gaussian process has all pairs of variables interacting, so every variable node is connected to every other in a pairwise way; that is what we just figured out. Here is the factor graph for a Gaussian process projected down to just three variables, and I'm going to call the number of variables T, because I'm still thinking of the time series example where we're trying to scale Gaussian processes to long time series; that's why it's T rather than the usual N. We're going to carry out this sparsification in two steps. The first step is to augment the model with M pseudo data points: these are the additional points we're going to use to summarise the actual data, and we use fewer of them than the total amount of data, because the whole point is to summarise the data with a small number of pseudo points, so M is smaller than T. We write down the multivariate Gaussian over f and u under the Gaussian process prior; for simplicity I'll imagine the prior has zero mean, as most of the material you've seen probably does. So we have a multivariate Gaussian over the combined variables f and u, with the covariance between the f's, the cross-covariance between u and f, and the covariance over the u's, and these are all completely correlated in a normal Gaussian process. This is exactly like the usual Gaussian process regression context: you might have T observations of the underlying function and want to make predictions at M other points, and to do that computation you would write down the joint multivariate Gaussian over the T observed function values and the M points at which you want to predict. This is exactly the same object: the Gaussian prior over all of those points. Everyone happy so far? Great.

Now we introduce sparsity, because we want to speed up the computation. We go in and delete some of the edges, or in effect introduce conditional independencies into the model, and in particular we delete all of the direct connections between the f's: all of those links in here, we just remove, and we end up with something like this. The reason for doing that is that now the f's are all conditionally independent of one another given the whole set of u's at the top: if I take all of the u's, that blocks any path between the f's, obviously, because we've deleted the direct connections, so any path has to go through the u's, and that means there will be sparse structure in the resulting precision matrix, which will facilitate fast computation. Notice too that this accords with the idea that the pseudo data points u are going to summarise the data: because the f's can't connect directly to one another, they have to go through the u's, so the u's have to capture all of the interesting correlations and dependencies between the f's; that's how they end up summarising the data. This is high-level and a bit wishy-washy, but I'll show you the equations in a minute.

Now we need to carry out the calibration step. We've come up with a functional form that approximates the original Gaussian process: the original, fully connected u's, and then f's which depend only on the u's. We need to take that family of distributions and make it similar to what we started with.
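In symbols, reconstructing what has been described so far (zero-mean prior, T function values f tied to the data and M pseudo points u), the true prior and the sparsified family are

```latex
p(\mathbf{f}, \mathbf{u}) = \mathcal{N}\!\left(
  \begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix};\;
  \mathbf{0},\;
  \begin{bmatrix} K_{\mathbf{ff}} & K_{\mathbf{fu}} \\ K_{\mathbf{uf}} & K_{\mathbf{uu}} \end{bmatrix}
\right),
\qquad
q(\mathbf{f}, \mathbf{u}) = q(\mathbf{u}) \prod_{t=1}^{T} q(f_t \mid \mathbf{u}).
```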
One way of making one distribution similar to another is the KL divergence. Does everyone know about the KL divergence? Put your hand up if you don't. OK, somebody needs to remember the number eighteen, so we can find our way back in the slide deck. All right, this is going to be very fast, apologies: a quick introduction to the KL divergence, in the discrete setting. Imagine you have a pair of distributions over a discrete variable z, where z can take one of K settings. The KL divergence between distribution one and distribution two is the sum, over the settings of the discrete random variable, of the first distribution evaluated at that setting times the log of the ratio of the two distributions: KL(p1 || p2) = Σ_k p1(z = k) log( p1(z = k) / p2(z = k) ). That's the definition, and it has a number of important properties. The first is that it is always greater than or equal to zero, with equality exactly when p1 equals p2. You can see part of that from the definition: if I let p2 equal p1, we get a log of one appearing, and log of one is zero, so we end up with a sum of zeros; so it trivially attains its minimum value where p1 equals p2, and you can prove it is bigger than zero everywhere else either by taking first and second derivatives and showing positive curvature, or by using Jensen's inequality, which I've referenced here; I'll take that as read. The other thing to notice is that it is non-symmetric: it is not true that the KL divergence between p1 and p2 equals the KL divergence between p2 and p1; you can't just flip the arguments. So although it has attractive properties which make you think of it as something like a distance between two probability distributions (it is zero when the distributions are equal, it gets bigger the more different they are, and it is always non-negative), it is not symmetric, so it is not really a distance function, although you can think of it a bit like one. [Question.] Right, it does not obey the triangle inequality either, so strictly it is not a distance measure; there are divergences which are distance measures, but this is not one of them. [Question about uniqueness of the optimum.] It depends exactly what you mean: if you optimise p2 unconstrained with p1 fixed, there is a unique optimum, and similarly the other way around; if you have constraints, then in general there will not be a unique optimum. Does anyone else want to clarify or add a comment? [Question: what if p2 takes the value zero where p1 does not?] Great question. The questioner is imagining that p2 might take the value zero in a region where p1 is nonzero, or vice versa, and from that you can see different properties of the KL divergence. If p2 is zero, or really small, where p1 has mass, you get the log of something really small, a large negative number entering with a minus sign, and in the limit, if p1 puts mass where p2 has no mass at all, you get an infinite KL cost.
So if p1 puts mass where p2 doesn't, that is penalised infinitely. It is not true the other way around: you do not get an infinite cost where p1 has zero density and p2 is nonzero, because the zero in p1 multiplies that term and kills it. That is the source of the asymmetry I've plotted at the bottom, which I won't go through in detail for lack of time. In a minute we're going to be minimising KL divergences in order to make one distribution similar to a target distribution, and you get different answers if you minimise the KL divergence one way round versus the other. One important property is the zero-forcing property: if p2 has zero probability mass somewhere and you optimise p1, then p1 also has to put zero probability mass there; if you do it the other way around, you don't get that property. That was a lot of material, fast and furious; are the people who didn't know about KL divergences somewhat happy? [Question: continuous variables?] You just replace the sum with an integral. I should also say that this quantity is the mainstay of information theory: if you code a data stream distributed according to p1 using a code built for p2, this is the extra length your messages will have for coding with the wrong density, the price you pay in bits per message. It shows up all over the place; it is really fundamental and really important. Any other questions? No? Cool. And what was the number? Eighteen, brilliant; if you remember one thing from this talk, it will be where we broke off to go to the back of the slide deck.
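Here is a tiny numerical illustration (my own, with made-up distributions) of the definition, the asymmetry and the zero-forcing behaviour just described.

```python
import numpy as np

def kl(p1, p2):
    """KL(p1 || p2) = sum_k p1(k) log(p1(k) / p2(k)); terms with p1(k) = 0 contribute 0."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    support = p1 > 0
    if np.any(p2[support] == 0):
        return np.inf               # p1 puts mass where p2 has none
    return np.sum(p1[support] * np.log(p1[support] / p2[support]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl(p, q), kl(q, p))           # two different numbers: not symmetric

r = np.array([0.5, 0.5, 0.0])
print(kl(p, r))                     # inf: p puts mass where r has none
print(kl(r, p))                     # finite: the zero in r kills that term
```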
Good. So here's what we're doing now. We've taken our true generative distribution, which was just a Gaussian process prior over the observed function values f and the inducing or pseudo data points u, and we write down our approximate model. The approximate model has the factor graph we just drew, which I can express in terms of its conditional independencies: all the u's are correlated together, because they keep all of their pairwise connections, so that part is written simply as q(u); and all the f's are conditionally independent given the u's, because of the structure we built in. Hopefully the work we did at the start convinces you that this is the correct way of writing down the joint distribution: q(u) times a product over t of independent terms q(f_t | u). Now we optimise the KL divergence between the true model and this family, to find the best-fitting member of the family with this sparsity structure. You do the maths, using calculus of variations to sort it out, and hey presto, you get the result you probably assumed you would get in the first place: the optimal prior over the u's is the true prior, and the optimal way of specifying the f_t's given the u's is just their conditional probability under the original model, that is, the conditional of the multivariate Gaussian of f_t given u, exactly the conditionals you would have in the original model. Questions? No? OK.

So we've introduced the sparsity structure by removing some of the dependencies, and we've calibrated things to arrive at this new model. Let's talk it through a little more and put the observations in, which I haven't done yet. At the top is the graph we had before; it gives a prior that replaces the Gaussian process prior with something more tractable and easy to handle, and now we hang the observations off the bottom. These could be regression observations or classification observations, and, as is typical in Gaussian process regression, each y attaches only to the function value at the input where it was observed: there is a single factor node between that y and that f, and it is not connected to any of the other function values; similarly for all the other observations.

There is an alternative way of thinking about this model that perhaps makes it obvious why learning and inference have become simpler. All of these f's are Gaussian variables, and in the Gaussian process regression case everything is Gaussian, so we can marginalise the f's out analytically. Integrating out these intermediate variables gives a graph which looks like this, and pushing things up we end up with a graph which says the observations y1, y2, y3 are generated conditionally independently given the u's. So rather than having to deal with T Gaussian-distributed function values, as in the original model, we only have to deal with the u's, of which there are M, and all the covariance matrices involved will be M by M rather than T by T.

Let's go through that in a bit more detail. We said that q(u) is just the prior, a Gaussian over the pseudo observations with zero mean and covariance K_uu, exactly as it was to start with. Now think about the form of the conditional: what do people think the expression for the probability of f_t given u looks like? This is an object similar to one you have met time and time again in the previous lectures. What kind of distribution is it, what family does it belong to? Gaussian, brilliant; Gaussian is, at the Gaussian process summer school, always the answer, and I've set the bar incredibly low by asking. So it is a Gaussian distribution, and, to remind you, it is just the conditional distribution of f_t given u under the Gaussian process prior. Imagine making a prediction from a Gaussian process: I'm interested in the value f_t, and I give you the values u; what is the form of the predictive distribution of f_t given the u's? Do you remember roughly what the mean looks like? Maybe I'll just reveal it; it should be very familiar from Gaussian process prediction. Here's a helpful reminder of how you make predictions.
The probability of y1 given y2, when y1 and y2 are jointly Gaussian with zero mean, is a Gaussian in y1 with mean Σ12 Σ22⁻¹ y2 and covariance Σ11 − Σ12 Σ22⁻¹ Σ21; remember that formula. We can just plug into it. Σ22 is the covariance of y2, and here the u's play the role of y2, so Σ22 becomes K_uu. Σ12 was the cross-covariance between y1 and y2, so here it is the cross-covariance between f_t and u, K_{f_t u}, and y2, playing the role of u, appears as u. Similarly for the covariance: Σ11 was the prior covariance of y1, which becomes the prior variance of f_t, so we substitute that in; that is how uncertain we would have been about f_t had we seen no data, and we then get a bit more certain because we see the u's, so the subtracted term is the bit we've learned from the u's, the bit that reduces our uncertainty, and it has the form Σ12 Σ22⁻¹ Σ21, which here is K_{f_t u} K_uu⁻¹ K_{u f_t}. So it is just reading off the Gaussian process prediction formula. [Question.] That's right: u is M-dimensional and f_t is one-dimensional, so K_{f_t u} is a row vector, and the term with the indices flipped on the other side is a column vector. And the result is univariate, because each of the f's is conditionally independent, so we only deal with one-dimensional quantities here. That matters, because the hardest thing to compute in these expressions is the inverse of K_uu: we only ever invert matrices which are M by M, not T by T; there is no inversion of a T by T matrix anywhere. We haven't finished yet, but let me do some bookkeeping and call this conditional variance D_tt, for reasons that will become apparent: D_tt, a scalar, is the uncertainty in f_t once we see the u's.

That nearly finishes the specification of the model. We have the prior on the u's and the probability of the f's given the u's, and now we just write down the probability of the y's given the f's, taking the usual (in this case Gaussian) noise assumption: the data are generated as the underlying function value plus Gaussian noise, whose variance I call sigma_y squared. That specifies a new generative model; these are the new assumptions under which we pretend the data were generated, and we will now do exact inference in this approximate model. That is the approximate-generative-model view. It turns out the cost of computing the likelihood is order T times M squared, because we never invert a T by T matrix, but we do have to carry out this per-point computation T times. (For those who know about factor analysis, there is a connection to factor analysis here; only one person, so I won't go into it, but I can talk to you about it later.)
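Gathering the specification of the approximate model in symbols (a reconstruction of the spoken description, using the K and D notation from the talk):

```latex
q(\mathbf{u}) = \mathcal{N}(\mathbf{u};\, \mathbf{0},\, K_{\mathbf{uu}}), \qquad
q(f_t \mid \mathbf{u}) = \mathcal{N}\!\left(f_t;\; K_{f_t \mathbf{u}} K_{\mathbf{uu}}^{-1} \mathbf{u},\; D_{tt}\right),
\qquad
D_{tt} = K_{f_t f_t} - K_{f_t \mathbf{u}} K_{\mathbf{uu}}^{-1} K_{\mathbf{u} f_t},
\qquad
p(y_t \mid f_t) = \mathcal{N}(y_t;\, f_t,\, \sigma_y^2).
```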
If we write out the likelihood, it has this form, and since we have time, let's quickly go through it; it will give you a bit more of a flavour of what's going on under the hood. If you wanted to do hyperparameter optimisation, this is the object you would now optimise with respect to the hyperparameters to do learning, and I'm going to talk you through where this form comes from, because it will sharpen up your Gaussian maths a bit, which is a useful exercise. This distribution has to be Gaussian: everything above was Gaussian, and the only dependencies between the different levels of the generative model were linear; the f's depend linearly on the u's, and the y's depend linearly on the f's, so there is nothing non-Gaussian here, just linear transformations of Gaussian variables. [Question.] That's right; you can always integrate f out of the generative model, it just gets trickier in the non-Gaussian case, so maybe I'll defer that discussion to offline, but everything I say here can also be used to scale Gaussian process classification models.

So everything is Gaussian and linear, which means the distribution over the y's is also Gaussian. The mean of the u's is zero, and f depends linearly on the u's, so what is the mean of the f's? Zero, because a linear transformation of a zero-mean variable is still zero-mean; and the y's are the f's plus Gaussian noise, again a linear transformation, so the mean of the y's is zero too. That is why this term is zero; everyone happy? Now let's do the hard part and figure out the covariance. If you can do this sort of maths it makes everything easy for you, and it is one of the pieces of mathematics you need to be comfortable with to do GPs. We know the mean of the y's is zero, so the covariance of the y's equals the expected value of y y-transpose, where the expectation is over q(u), q(f | u) and the noise; that is the calculation we're about to do. And that is the shortcut: you do not want to write down integrals over Gaussian densities and try to solve them by completing the square; that would take three weeks and be a nightmare. The quick way is to say: this distribution is Gaussian, so compute the first moment (we just did; it is zero), compute the second moment, and then we are done. So: compute the expected value of y y-transpose.

Let's handle the first layer. y is equal to f at the corresponding input location plus sigma_y times some noise: an equivalent way of writing the likelihood is y = f + sigma_y epsilon, where epsilon is drawn from a standard normal, N(0, 1). Are you happy going between those two representations? A few blank faces.
Questions? No questions; I almost believe you. All right, everyone is getting hungry, so: I can substitute that expression for y and write the covariance as the expectation of (f + sigma_y epsilon)(f + sigma_y epsilon) transpose, averaging over u, f and epsilon, and that equals the expected value of f f-transpose plus sigma_y squared times the identity. Why did the cross terms go; why is there no expected value of f times epsilon-transpose? Because they are independent: the epsilons are white noise, independent of f, so those terms vanish. Happy with that? Great. So now we focus our efforts on the expected value of f f-transpose, and the claim is that this gives the remaining term. We use the same trick again (these are the tricks you need to get good at): each f_t is given by its conditional mean plus noise, so I can write f_t = K_{f_t u} K_uu⁻¹ u + sqrt(D_tt) epsilon-prime, where epsilon-prime is a standard normal variable independent of the earlier epsilon. So f is just a linear transformation of the u's plus independent noise with variance D_tt; again, using this representation rather than the density makes life much easier. Plugging that into the expectation of f f-transpose, and letting me skip a small step, we get K_fu K_uu⁻¹ E[u u-transpose] K_uu⁻¹ K_uf + D: the cross terms vanish because u and epsilon-prime are independent, and the diagonal matrix D comes from the noise term. Laborious, but hopefully it gives you a feel for the mathematics. Now take the expectation: what is the expected value of u u-transpose? Louder? K_uu, great. So replace E[u u-transpose] with K_uu, and hey presto, one K_uu⁻¹ K_uu cancels. The reason for going through this is that there are really long ways of doing these calculations that get you stuck, and there are shortcuts like this that get you there much quicker; hopefully it was helpful. [Neil comments.] Yes, that's probably true; I think Neil is saying I could have composed all of these linear transformations and done it in one go, which is effectively what I'm doing, so I don't think we're saying anything different. This was obvious to everyone? Great, I'm so happy, that's fantastic; crystal clear. So at the end, after the cancellation, we have the result.
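Written out, the result of that calculation (reconstructed from the description above) is

```latex
p(\mathbf{y}) = \mathcal{N}\!\left(\mathbf{y};\; \mathbf{0},\;
  K_{\mathbf{fu}} K_{\mathbf{uu}}^{-1} K_{\mathbf{uf}} + D + \sigma_y^2 I \right),
\qquad
D = \mathrm{diag}(D_{11}, \dots, D_{TT}).
```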
It does not involve any T-by-T matrix that you have to invert at any point, and so the cost is order T M^2. Practically, you can use the matrix inversion lemma here to handle the inverses computationally quickly, and that's where this form comes from — it's all due to the sparse structure that we've been talking about. Great. Any questions on that? I'm going to move on quickly to the second approximate likelihood and give you a demo. This is from Ed Snelson, who came up with this — maybe the second person to come up with it, but he wrote a lot about it in his thesis. Here's an example: the purple dots are our observations, with inputs along the bottom and observed values up the side. We've initialized the Gaussian process with some hyperparameters: the vertical amplitude is set really big, the length scale a bit too long, and the noise a bit too small, which is why the error bars look small and the fit looks too stiff at the moment. And here is something I've swept under the rug so far: you have to choose where to put your inducing-point locations — these u's live in function space, and we have to choose the x's associated with them. The tactic is going to be to splodge them down randomly to start with and then optimize this quantity here — the probability of the data given the hyperparameters and the input locations of the u's — to make the data as probable as possible. We optimize it with respect to the hyperparameters and with respect to the inducing-input locations, which I'll call x-bar. To begin with they're all bunched up, and what you're going to see is that as we optimize them they magically spread out and cover the space, the amplitude comes down, the length scale gets a bit smaller, and we get a nice GP fit through the underlying function. It looks sane and sensible, but we haven't had to do anything cubic in the number of data points — we just had a relatively small number of these new variables to play with. Questions? [Question.] Yes, good question — this is a very high-dimensional optimization problem, because in general you have to optimize each one of these inducing inputs, and each might itself be multi-dimensional if you have a multi-dimensional input problem, so you're talking about a big optimization: relatively few hyperparameters, but lots of inducing-point parameters, which might get you worried. So there are potential optimization problems associated with this. You can always initialize quite well by taking a subset of your data as your starting point, but there are local optima and it's definitely not convex; in certain circumstances subsets of data works fine and actually optimizing the locations can make things worse, but empirically, in most settings, I think it's fair to say optimizing the inputs makes the predictions better. There are problems, though, that we'll come to. [Question about how to choose the settings.] Again, it's a bit of a dark art: you look at your compute budget, figure out how long you're prepared to wait between fits, and go from there. [Checks with the chair whether there's time for one more question.]
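Before moving on, here is a hedged sketch of how that O(T M^2) cost can be realized in practice, using the matrix inversion and determinant lemmas so that no T-by-T matrix is ever formed or inverted. Again, this is not the lecturer's code: the function name, the Cholesky-based organization and the jitter value are my own illustrative choices, and it assumes the same Kfu, Kuu and Kff_diag quantities as the sketch above.

import numpy as np

def sparse_gp_log_marginal(y, Kfu, Kuu, Kff_diag, sigma_y):
    # Evaluates log N(y | 0, Kfu Kuu^{-1} Kuf + D + sigma_y^2 I) in O(T M^2).
    T, M = Kfu.shape
    Luu = np.linalg.cholesky(Kuu + 1e-8 * np.eye(M))
    A = np.linalg.solve(Luu, Kfu.T)                      # M x T, so Qff = A.T @ A
    lam = Kff_diag - np.sum(A**2, axis=0) + sigma_y**2   # diagonal D plus noise variance
    Asc = A / np.sqrt(lam)                               # A Lambda^{-1/2}
    B = np.eye(M) + Asc @ Asc.T                          # I + A Lambda^{-1} A^T  (M x M)
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, Asc @ (y / np.sqrt(lam)))
    # log|Sigma| = log|B| + sum_t log(lam_t)             (matrix determinant lemma)
    logdet = 2.0 * np.sum(np.log(np.diag(LB))) + np.sum(np.log(lam))
    # y^T Sigma^{-1} y = y^T Lambda^{-1} y - c^T c       (matrix inversion lemma)
    quad = np.sum(y**2 / lam) - np.sum(c**2)
    return -0.5 * (T * np.log(2.0 * np.pi) + logdet + quad)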
[That's the chair — they're going to kill me.] I can very quickly answer that question with a plot. I don't want to go through this too much, but here's your question: how do we select the number of pseudo-data? Here's an example with a very, very long time series, and we're going to try to pick the number of pseudo-data. Here's what happens if we choose four pseudo-data points at these input locations and optimize them: it looks dreadful. At the bottom I'm plotting the error versus the compute time, and the blue curve is the exact GP that we managed to fit. So now we've gone past ten inducing points, coming up to — I don't know how many now — loads of inducing points. Do you see what's happening? To start with, with too few inducing points, we get very long length scales: we get the trend right but not the wiggles. Eventually, as we go past hundreds of inducing points, we start to get the fine-grained structure, and around about here we have enough inducing points to really trace out the function. So if you're just interested in accuracy, then roughly speaking you need about the same number of inducing points as there are typical wiggles in your function: for every typical wiggle you need an inducing point, or maybe two. So for a long time series, if your characteristic length scale is tau, you'll need about T over tau inducing points, and in that setting M has to scale with T — so you're sort of implicitly back to order T cubed. You might be able to recover the original GP by reusing some of this maths, I don't know. Does that make sense to people? All right, let's do variational methods then, at breakneck speed — we've got a decent amount of time. So, the fully independent training conditional (FITC), before we go to the variational method: it is now parametric — we are now effectively fitting a parametric model. That was our p(y): this thing down here is a parametric model involving all of the inducing-input locations, like we talked about. So we've sort of lost the GP here; we're fitting a parametric model to the data. It's a clever parametric model, because it's a bit like the GP, but it is a parametric model. If I see more data, am I allowed to add new pseudo-data? From the generative-modeling perspective that's a bit weird, because you're supposed to specify what you believe the generative process for the data is, commit to it, and then do inference; whereas here we sort of want to change it if we see more data — we'd probably need more pseudo-points, because more structure has been revealed by the observation process. So this is a bit weird: we've lost the separation between the model, which encodes what we believe, and inference, where the approximations are made. There are some extensions that I won't talk about. The next method I'm going to talk about fixes those problems — the first two bullets up here — in a really elegant way. This is what Michalis Titsias did, with massive contributions from Neil, and it relies on a thing called the variational free-energy method. This is a general method — I think it was originally developed by Feynman and colleagues — and it has since been brought over and become very influential in machine learning generally.
I think the most successful application of variational free-energy methods is exactly this one: it's the setting where variational free-energy methods work best of all, I think, though Dustin might disagree with that. For GPs it works spectacularly well. So the idea is as follows. Let's take the log probability of our data given our parameters, and come up with a lower bound on this quantity. This is the log marginal likelihood; I'll write it in red because it's intractable — we've got too many data points — and we want to avoid dealing with it directly. The way to compute it is log p(y | theta) = log of the integral dF of p(y, F | theta): you integrate out the latent function values. Throughout everything I'm about to write, remember what we said at the start: curly F is the entire process, the entire collection of function values at all possible input points — it's actually a function. So really I should be using measure theory and so on to write this down, but I'm not going to; you can just think of this as an integral over the uncountably infinite set of function values, in a way that can be made to make sense. Now I can multiply and divide this by an arbitrary distribution q that's non-negative — non-zero, in this case, as well. Multiplying and dividing by the same function q, I can use Jensen's inequality — using the concavity of the log — to take the q outside and replace the log of an average by the average of a log. It turns out that this gives a lower bound on the log marginal likelihood, and I'll prove that to you in just a second using the properties of the KL divergence; if you haven't seen this before, take it as read and I'll prove it on the next line. Schematically, if we make a plot as a function of theta of the true log marginal likelihood, L(theta), and this quantity F, which is called the free energy, it looks like this: at any point in theta-space, for fixed q, the free energy is a lower bound on the true log marginal likelihood — a useful property. Let's see that explicitly, as promised, by rewriting this in a slightly different way. Notice the form of it: it's the log of the joint distribution over y and F, divided by q. I can take that joint distribution and rewrite it using the product rule as p(y | theta) times p(F | y, theta) — nothing more than the product rule, p(a, b) = p(a) p(b | a). The reason it looks different in this context is that this factor is the true posterior, and this one is the marginal likelihood, the thing we started with. The marginal likelihood doesn't depend on F at all — that was the whole point — so I can pull it outside the integral. Let me write it out and then see if I can explain it. So we've got the integral dF of q(F) times the log of p(F | y, theta) p(y | theta) / q(F), and I'm going to use the additive property of the log: the log of a product is the sum of the logs of the terms.
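For reference, the bound being derived can be written out roughly as follows (my compact restatement of the steps just described):

\log p(\mathbf{y}\mid\theta)
  \;=\; \log \int q(F)\,\frac{p(\mathbf{y},F\mid\theta)}{q(F)}\,\mathrm{d}F
  \;\ge\; \int q(F)\,\log\frac{p(\mathbf{y},F\mid\theta)}{q(F)}\,\mathrm{d}F
  \;=:\; \mathcal{F}(q,\theta),

by Jensen's inequality (the log is concave), with the joint then split using the product rule as p(\mathbf{y},F\mid\theta) = p(\mathbf{y}\mid\theta)\,p(F\mid\mathbf{y},\theta).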
So I've just taken that factor out: it does not depend on F, so the integral of q(F) in front of it is one and I can pull it outside. And then everyone will remember the slide we had on the KL divergence: the remaining term is minus the KL divergence between q(F) and p(F | y, theta). The KL divergence is always non-negative — that's what we talked about at length — so this proves that F is a lower bound on log p(y | theta), because we're subtracting the KL from it, which brings it down below that line. There we go; that proves the required property. Again, I'm being a bit naughty here: this is now a KL divergence between two stochastic processes, between q(F) and the true posterior, so to do it rigorously you need to invoke more serious mathematics than I'm capable of, but it all goes through. Right, we're getting somewhere now — almost done. So far I've just written down something very general. In fact, the optimal solution to this problem, if we now optimized F(theta) with respect to q, would give us back the true posterior, because the KL divergence is minimized when q equals p(F | y). So far we've done nothing useful. We're now going to do something useful by introducing some special structure into q(F) that gives us sparsity through the idea of pseudo-points, leading to efficient computation and good ways of summarizing big datasets. This is what Michalis did with Neil, I think. So let's look at q(F). Take our uncountably infinite set of function values on the Gaussian process and split it into a finite-dimensional subset u — the pseudo-data, which are eventually going to summarize our real data — and all of the other function values. We've just partitioned the function into two disjoint sets: the u's and the function everywhere else. Now we can use the product rule to write down the q distribution like this, and now we make our assumption — so far we've made none. We assume that the function values away from the pseudo-data, given the pseudo-data, just sit at their prior conditional. This is a bit like what we did in FITC, where each of the f's was conditionally independent given u; here we assume that all of the function values other than the u's are given by their Gaussian process prior conditional, and then we let q(u), the distribution over the inducing variables, be free, and we optimize q(u). That's going to be our free parameter to optimize, and because u is just an M-dimensional variable we only have to invert M-dimensional matrices to calculate this. Notice this is not exact. Some people look at this and ask: have you really made any approximation here? It looks like it should be exact. The optimal setting for q(f_{not u} | u), if we carried out the optimization, would be to set it equal to p(f_{not u} | u, y). In particular, what we've done is strip off the dependence on the data here: in the approximate q distribution, this conditional cannot directly depend on the data — the data has to enter through q(u). So q(u) is going to summarize the data points, giving an approximate posterior distribution over the u's, and then we just use the prior conditional to predict out to the rest of the function.
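Putting those two steps together, the identity just proved and the structured approximation just introduced are, roughly:

\mathcal{F}(q,\theta)
  \;=\; \log p(\mathbf{y}\mid\theta)
  \;-\; \mathrm{KL}\!\left[\,q(F)\,\middle\|\,p(F\mid\mathbf{y},\theta)\,\right],
\qquad
q(F) \;=\; p(f_{\neq u}\mid\mathbf{u},\theta)\,q(\mathbf{u}),

where q(\mathbf{u}) is the free, M-dimensional factor that gets optimized.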
So the u's act as a bottleneck for soaking up everything the data tells us about the function, and that's how they end up summarizing the dataset — they're forced to by this approximation. This is going pretty fast now, and we're running a little short of time, but let's have one more look at this with a couple of different pictures. Here's one way of thinking about what's going on. We've got this free energy F(theta), which is the log marginal likelihood minus the KL divergence between our approximation to the posterior over the process and the true posterior over the process. The true posterior is on the right, shown in red — maybe this is what the true posterior distribution looks like. On the left is our approximation, where we take q(u) and then predict out from it using the conditional. When we optimize F, as we're about to do, with respect to q(u) and the inducing-point locations, what that effectively does is change the process on the left so that it looks as close as possible to the one on the right. It tries to make this approximate process, given the constraints we have up here, as close to the true posterior as possible in the KL sense. One intuitive way to think about this, in terms of the points u, is that we're wiggling around the input locations of the pseudo-data points in such a way that when we fit a Gaussian process to them — again in inverted commas — it looks as close as possible to the thing on the right. Going back to the motivation we started with — that we want to summarize the posterior with a small number of data points — here we've codified what "summarize" means, in terms of minimizing the KL divergence between the true posterior and the one obtained from this small number of pseudo-points. Great. Notice also that the inducing inputs — the input locations of the pseudo-data — are what we call variational parameters: they only show up in q(F). They're not anywhere else; they're not in an approximate generative model. The generative model is the true generative model, containing all of the f values, and these inducing-point locations — the locations of the u's — are just parameters that affect q(F), in particular this conditional distribution taken from the prior. So optimizing them can never do anything terrible: it can only try to make q(F) more like the true posterior. In the original FITC case that wasn't true, because there we wrote down an approximate generative model, and it could have tried to overfit the data by moving the input locations around; whereas here the only goal of the inducing-point locations is to approximate the posterior process, so in a sense it is protected against overfitting by this variational KL divergence. Any questions at this point? I've got one more slide left facing me. Nope? OK, this is just the slide we had before, and I want to show you that the free energy is now tractable. One thing I haven't shown you — just to keep the context in mind — is that this particular choice of q distribution leads to a nice, tractable F that we can optimize the hell out of using your favorite gradient-based optimization algorithm, which I'm very quickly
going to do before we break for lunch. All we're going to do here is substitute this form into this expression: we take this and plug it in for the q(F) here and here, and this is where the magic happens. On the right we get the integral dF of q(F) times the log of the probability under the model, and notice what I've done for the probability under the model: I've written it as the probability of the u's times the probability of the f's other than the u's, given the u's — just splitting apart the Gaussian process prior, like we did in FITC, actually — and then we multiply by the probability of y given f and theta, our likelihood terms, and substitute in for the q below. And hey presto, the magic happens: this prior conditional cancels with the prior conditional in the numerator. The prior term that we forced into the q distribution cancels with the prior term in the model, and that is the infinite bit — the bit involving the uncountably infinite set of variables — so it's great that the infinite bit has gone away. Those two things cancel and we end up with something beautiful and tractable. You can work this through and, with that cancellation, you get something of this form: F is the average of the log likelihood over q(F), minus the KL divergence between q(u) and p(u). That's just a KL divergence between multivariate Gaussians, which is nice and simple, and if you're doing regression the log probability of the data given the f's is just a quadratic form, so its average under a Gaussian is analytic and easy to compute. You can go further and optimize q(u) analytically — it has this form — and curiously it relates to some of the approximations we've seen earlier: it relates to the FITC model without the diagonal contribution. But I won't go much further into this, because you now essentially know enough to go and read all the literature about variational free-energy methods in GP land. So, to summarize the variational free-energy method: we now have much better guarantees. It can never take the pseudo-inputs to silly places, because they're variationally protected. We're not approximating the generative model; we're making all of our approximations at inference time, approximating the posterior. So we've separated modeling assumptions, which are sacrosanct, from approximation methods, which are all done at inference time — which is fantastic. And there's a useful rule of thumb, which we'll return to later: if you want better estimates of the mean function, use the variational free-energy method, for reasons I'll come back to tomorrow; FITC tends to give better error bars — it cares more about getting the error bars correct. And I already answered that question. OK, a few minutes over, but I almost brought you in on time. Enjoy your lunch — or if you have any questions, grab me. No questions? Brilliant. Enjoy your lunch, have a good time.
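For completeness, the collapsed form of the bound just described, for the regression case after optimizing q(u) analytically, is F(theta) = log N(y | 0, K_fu K_uu^{-1} K_uf + sigma_y^2 I) - tr(K_ff - K_fu K_uu^{-1} K_uf) / (2 sigma_y^2). Below is a rough numpy sketch of that objective; it is my own sketch of the standard collapsed bound (following Titsias, 2009), not code from the course, and the names and jitter value are illustrative.

import numpy as np

def collapsed_free_energy(y, Kfu, Kuu, Kff_diag, sigma_y):
    # F(theta) = log N(y | 0, Qff + sigma^2 I) - tr(Kff - Qff) / (2 sigma^2),
    # with Qff = Kfu Kuu^{-1} Kuf, evaluated without forming any T x T matrix.
    T, M = Kfu.shape
    Luu = np.linalg.cholesky(Kuu + 1e-8 * np.eye(M))
    A = np.linalg.solve(Luu, Kfu.T)                      # M x T, so Qff = A.T @ A
    Qff_diag = np.sum(A**2, axis=0)
    B = np.eye(M) + (A @ A.T) / sigma_y**2               # I + A A^T / sigma^2
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, A @ y) / sigma_y**2
    logdet = 2.0 * np.sum(np.log(np.diag(LB))) + T * np.log(sigma_y**2)
    quad = (y @ y) / sigma_y**2 - c @ c
    log_gauss = -0.5 * (T * np.log(2.0 * np.pi) + logdet + quad)
    trace_term = 0.5 * np.sum(Kff_diag - Qff_diag) / sigma_y**2
    return log_gauss - trace_term

Maximizing this with respect to the hyperparameters and the inducing-input locations, which enter through Kfu and Kuu, is the training procedure sketched in the lecture.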
Info
Channel: Michael Smith
Views: 6,116
Id: sQmsQq_Jfi8
Length: 90min 22sec (5422 seconds)
Published: Mon Sep 18 2017