Deep Gaussian processes

...Neil Lawrence, who will be speaking to us today about deep GPs. I'm really pleased to be able to introduce him, because he's someone who doesn't really need an introduction. I knew him originally from when I started at Sheffield, where he was leading the machine learning group, which he led for many years, and that's where he started the Gaussian Process Summer School, which is still going strong all these years later. After that he moved to Amazon for a few years before returning to academia and taking up the DeepMind Chair at the University of Cambridge. I'm really glad that he still comes back to the Gaussian Process Summer School every year to make sure we're doing it right. So, this is all about his work on deep Gaussian processes. Thank you very much, Neil.

Thanks, Mike. I should add that I'm still a visiting professor in Sheffield as well. (Yes, of course.) It's nice to see you all; I'm sad that you can't all be there this year, but actually you couldn't all have been there anyway, because there are about 155 of you and we normally only have capacity for far fewer.

Thanks to Will, who has put up a link to where you can find the notes for this talk. Let me do some screen sharing; hopefully you can see that. Will has just provided the link, and when you click on it, it'll take ages to load because there's so much HTML in there, but while it's loading you can see there's a link at the top to a Google Colab notebook. I spent a bit of time this morning trying to make sure that works, so if you want to follow along, those very same notes are also available as a Colab notebook, which will also take time to load, and you can follow what I'm saying and rerun some of the examples there. It will load eventually; it's just very big. Great, I'm going to kill that tab, because it's just wasting processing.

Right, so you actually heard a little bit about deep Gaussian processes already, from Carl Henrik yesterday I believe: composite Gaussian processes, composite stochastic processes. I think you also heard something about the sparse approximations to Gaussian processes for making them efficient. We're going to do a little bit of review of those, but I checked their slides, so what I'll be saying is the same stuff in a different way, which I think is one of the things we always try to do at the GPSS.

So, deep Gaussian processes. On the notes: just to say that you'll be using GPy in the examples, plus a piece of software called PyDeepGP that Andreas Damianou wrote when he was developing deep Gaussian processes as part of his PhD. It's all there to install via git, and hopefully it should all run on Colab.

One of the things I like to say is that the universe isn't as Gaussian as we might like it to be, but it once was. As far as I think the standard laws of physics understand it, it was fully Gaussian; in fact I think the standard model of physics suggests it was fully Gaussian, and people keep looking for non-Gaussianity in the early universe to try to find new physics, but my understanding is that at the moment they haven't found it. This is an image of the cosmic microwave background, which is, I think, the moment when the universe stops being a plasma and condenses, so light is free to travel and we can see the echo of that; due to rapid inflation some of that light is still reaching us, and this is an image of what that light looks like across the sky. When you study this signal, you find that at that moment, when the plasma condensed, the universe was a Gaussian process, and the parameters of the covariance function were things like how much matter there was in the universe, how much antimatter I suppose (I'm making this up a bit because I'm not a physicist), how much dark matter, and so on, and that kind of dictates the universe we see today. But of course it is non-Gaussian today, and the reason it's non-Gaussian is that at that moment nonlinearities entered the physics, in terms of asymmetries and all sorts of effects, which meant that things condensed in a non-Gaussian way. So the universe we observe today, say our Earth, is actually just a nonlinear function of some latent Gaussian process back in history.

The idea in a deep Gaussian process is: well, if this function is itself a stochastic process, so it could be a Gaussian process, then this would be a Gaussian process that is a function of a Gaussian process. The input to the world we see today is given by a function whose input is itself a stochastic process; this means the result is a stochastic process in the end, and if this f is also a stochastic process we get this idea of composition of stochastic processes, which is the key idea behind a deep Gaussian process. Indeed, we could say this function is itself a deep Gaussian process. Perhaps it's a little too simple to assume that all of physics is well represented by one function that is itself a Gaussian process, but it might be conceivable that we could get more complex functions by composing those Gaussian processes together, and that might represent physics. So the argument I'm basically making is: yes, sure, the universe is no longer a Gaussian process, although it once was. Try to find a period when it was a deep neural network: it never was. It once was a Gaussian process, as far as we can tell, and today you could argue that what we see is a Gaussian process. It isn't, of course, but that's how we're going to approximate it.
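As a concrete reference for the composition idea, a two-layer composition of processes can be written as follows (generic notation, not the notation on the slides):

y = f_2(f_1(x)) + \epsilon, \qquad f_1 \sim \mathcal{GP}(0, k_1), \quad f_2 \sim \mathcal{GP}(0, k_2),

and even though each layer is a Gaussian process, the composed y is no longer Gaussian.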
Now, you've heard a little bit already about how we do that approximation. One of the challenges with Gaussian processes is the computational demand of the models in their standard form, and you've heard a bit about what are sometimes called sparse Gaussian processes, though a better term is low-rank Gaussian processes, because they're not sparse in the strict sense; they're low-rank approximations to the covariance function. This always reminds me of a cartoon that Kai Arulkumaran put together (adapted). At the time it came out I was at Amazon, with several other people, working on Gaussian processes, and it reminded me of Amazon meetings and of what companies were doing: people asking for new machine learning projects in the age of hype. I feel like the person on the end saying "let's do sparse Gaussian processes"; apparently they're not the fashionable thing, and you feel like you're being turfed out of the window when you mention such uncool things in companies who have read Wired magazine articles about how AI is changing the world. But actually there's an enormous number of relations between a sparse, or low-rank, Gaussian process and single-layer neural networks, as we'll elucidate a little bit.

You've seen this already: we've got this challenge of complexity being O(n^3) and storage being O(n^2). With the low-rank Gaussian process we get to the point where we can have complexity of order n m^2 and storage of order n m, and you may also have heard about stochastic variational inference for Gaussian processes, which does slightly better: each iteration is of order m^3, and then it's just becoming like stochastic variational inference. What that shows is that we can now run these Gaussian process models, in theory, at the same type of scale as neural networks. There are some prefactors, because their algorithms are very simple and ours are a bit more complex, so we're still a bit slower, and we're not doing billions of data points or the equivalent of GPT-3, where the GP does not stand for Gaussian process.

The approach we take is something that Michalis Titsias worked out when he was in Manchester with the group, which at the time also included Mauricio, and we call it variational compression. You'll have seen a bit about it and about the use of inducing inputs, which can be viewed as sort of pseudo-data, so I'm not going to spend too much time introducing that (my slides aren't advancing, come on), because you've seen it from Zhenwen Dai, who was actually just referenced on that previous slide. I don't know why my slides are moving slowly; maybe it's the screen sharing.

In your earlier introductions to Gaussian processes, one of the things they may have shown you, and I'm assuming they've shown you what's at the top of this slide, is how you go from parametric models to non-parametric. What do I mean by this? One of the beautiful things, and I think one reason I'm so focused on Gaussian processes, because to me this field is so important, is the non-parametric nature of the model. If we're in a world where we're absorbing new data over time and we only have a fixed number of parameters in our model, then in the standard Bayesian approach we use Bayes' rule: here our parameters are w (if you're a statistician, assume this is a beta), and our data is some labels and some inputs if we're doing regression, and over time we build up this posterior distribution over our parameters that we obtain from our training data. Now, if we've got a fixed training set, that's great, because we know how much data is there and we spend a bit of time designing our model around how much data we have. But in the real world you could keep on absorbing more information, and what tends to happen with these posterior distributions when you have a fixed number of parameters is that they collapse in certainty: as you absorb more data, if your likelihood is log-concave, the entropy of the posterior distribution reduces every time you observe a data point, which means it tends to approach something like a delta function.
Now, when you're then making predictions on your data, this is the thing, on the left, that I always like to highlight: this object here is the thing we always care about, because this object is the probability of a test label given a test input and all the training data we've observed, and that's the fundamental probability distribution of all predictive work. If we want a predictive probability, we're always trying to construct this distribution. People talk a lot about priors and likelihoods (and apologies, that's a spare bracket, ignore it), but really what's going on in Bayes, to my mind, is just a way of constructing this distribution: you have a likelihood of a data point given your parameters and your input data, the classic likelihood, and then you have a posterior from your training data, which packs all the information about your training data into this parameter vector w. The way you make this prediction is that you choose to decompose your probability distribution in this way. We talk a lot about priors and likelihoods and whatever, but this is explicitly what you're saying you believe to be true: you are explicitly saying that there is a parameter vector w of fixed length that is always sufficient to take all the information from your training data to make a prediction about your test data.

I'll pause there to let that settle in. What we're saying is that we believe there is a parameter vector w of fixed length, where that length is sufficient to take all the information about the training data and pass it to the test data. If we're working in limited circumstances, that's perhaps a viable, feasible idea, but it just makes no sense in general that, at the design time of your model, you could decide what that parameter vector length is going to be, and that as you observe more and more data it's constantly going to be sufficient, that fixed-length vector and the posterior over it, to represent everything you know about the universe. What's going on in deep learning, of course, is that the length of that vector is really, really big, and that's why you're able to use data sets with many billions of values; so they're going in the same direction as what non-parametrics says, which is that effectively there is no such vector, there is no way of decomposing this joint distribution with a fixed vector in this manner.
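For reference, the decomposition being described can be written out as follows (generic notation, not the slide's own):

p(y_* \mid x_*, \mathbf{y}, \mathbf{X}) = \int p(y_* \mid x_*, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) \, \mathrm{d}\mathbf{w},

so the fixed-length vector w is the only conduit carrying information from the training data to the test prediction.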
Will asks: "Is the bottleneck also applicable in the case of non-independent training data, like when the likelihood does not factorize over the data points?" Yeah, absolutely; that's a great question, Will. If the likelihood doesn't factorize over data points (we're just looking at one data point here), that just means that when you compute this posterior you're dealing with non-factorized likelihoods. You still have this decomposition problem, and of course you're going to jointly predict: if the likelihood doesn't factorize over data points, then if I've got some of those training data in here, I'm going to have to include a few of those training data on this side. Let's step back from that statement: if it doesn't factorize at all, then it's something like a Gaussian process, and then this doesn't necessarily apply, because you can no longer write this down. That's what I thought you meant initially, but as I'm thinking about it, consider something like a ranking likelihood. A typical likelihood that doesn't factorize is one where there's some relationship between different outputs, so that I might observe two things that are effectively about the same quantity. For example, I might have an observation of a person's height that's very noisy, and then another observation of whether or not someone has classified that person as tall; I have the classification "yes, I think they're tall", and a sense of what tall means, say over six foot, or one metre eighty-something, but I may also have a noisy observation, with Gaussian variance, of their actual height. Those two observations are no longer independent. So yes, in the extreme case of what you're saying we basically get non-parametrics, Will, but in the other case, where there's just some relationship between these variables, this is still true. Was that clear?

There's a follow-up. (Mike: I should say these aren't my questions, I'm curating them from the audience, so I don't want to take too much credit.) "In this setting we're assuming that there's a single true w, which I don't know. Are there ways, then, to assume that there's an ensemble of true w's?" Lovely question. It's so implicit in Bayesian inference that people even forget it. In Bayesian inference you're putting a prior over the parameters that basically says: look, there's a single true w, and I only put that prior in once; it only appears once. If I think about what my likelihood typically looks like, I've got p(w) and then a product over the likelihoods, n equals one to N. That means that, in the sampling game, at the start of the whole game I sample one parameter vector and then use that vector to sample data points independently. You're entitled to have several parameter vectors, and then your parameter starts looking like a latent variable. We did work with Magnus Rattray on models that look like this in genomics, where the idea was that you sample a different weight vector for every linear regression in effect: every time you want a linear regression, you sample a different weight vector. But now it starts looking like a latent variable model, not a standard Bayesian model; you sample a new w for every single data observation, but it's being sampled from a probability distribution that is not a simple zero-mean, spherical-variance one. "So I'm assuming, in the Bayesian paradigm, that this is one sample of w, right?" Yes. There's something interesting about latent variable models and whether you consider them non-parametric; they sit in a kind of interestingly non-parametric place, in a way that's interesting but perhaps a bit too simple for our needs.

But once you go non-parametric we get this result that I think you've already seen, and this is what's beautiful: you get these covariance functions. If you assume the Gaussian relation, so if we assume second-order-only relations between the observed data points, then when I build my joint model over the data I'm going to assume it's jointly Gaussian.
Then we can go non-parametric, and within that framework you can also express parametric models; they end up looking like this degenerate covariance function. You'll have seen that if you take a parametric model and express it as a Gaussian process you get a covariance function like that, and that's why you get these advantages in terms of computation. The most well known is the linear kernel, obviously.

Now, in non-parametric prediction, I'm moving from y to f, and I hope that's OK, but I see y as a corrupted version of f. f is what I call the fundamental variable; y is my observed variable. For example, in the height example, f is the actual height of the individual, and the two different observations were the observed height with some corruption noise and a thresholding of the height as to whether they're tall or not. So f is somehow a fundamental variable, but it can be corrupted in our observation. What we do in order to make predictions about new points of the fundamental variable involves combining the training data with the covariance function, and potentially a mean function, as you'll already have seen.

"Could you just clarify what's degenerate?" A degenerate covariance function is one that is not full rank, not rank n at all times. If you have a parametric model with k parameters in w, then the rank of the covariance matrix will be at most k, but you could observe more data. It will be full rank as long as n is less than k, but as soon as n goes over k it isn't, and it's that degeneracy that shows how this bottleneck problem manifests.

Great, so where were we. Parametric is the special case where the conditional prediction of f-star given f can be summarized in a fixed number of parameters. Parametric is the special case, and it's a massive assumption; non-parametric is the general case. The nice thing about the parametric model is that its complexity remains fixed regardless of the training data set. So you have to remember, when people say "oh look, the Gaussian process has all these computational problems": yes, it does, because it's not making this horrific assumption about your data, and if you don't make that horrific assumption you don't get to say "well, I can represent the whole world in a fixed number of parameters" any more. What I view inducing point methods as doing is reinserting that assumption, but not as an assumption: as an approximation, which I think is a better thing, and that's why I love those methods. You're no longer saying "this is an assumption"; you're saying "I'm going to approximate what's actually going on with a fixed vector", and that turns out to be the inducing vector.

Just to highlight that: what goes on when we introduce inducing variables is that we augment the variable space. I'm not going to introduce the notation for Gaussian processes, because I'm assuming you've seen it. Whereas before I had the fundamental variables f, I like to think of u as standing for the variables that you are going to store about the situation.
f is the fundamentals, the things you care about that are going on in the world; u is what you are going to store about the world. So you introduce these variables, and the nice thing is that you decide how big u is; we tend to denote that with m. This means we have a new joint density. We also have the situation where we don't observe the fundamental variables directly; we observe some corrupted form. That means what we actually observe, the joint distribution, looks like this, where we're now integrating not over w but over the fundamental variables, and this is a big covariance function over f. (One of the questions Will highlighted, the non-independent case, is covered by this as long as we have fixed clusters of dependencies; it's just simpler to look at the independent case, but what we're saying now does apply even if it's not fully independent, if there are some dependencies.)

So in the auxiliary-variable approach we introduce this new set of variables u. I can write the joint distribution in any way; I've chosen to decompose it in this way, and what's kind of fun about decomposing it this way is that I've explicitly represented u by introducing it, and now I'm integrating it out. Integrating it out leaves me with this interesting object, which is the probability of y, the actual observations, given what you are choosing to store, what you will keep in your head, no longer given the fundamental variables. And then what you have is a prior over what you are choosing to store.

Why is this interesting? Because this itself looks incredibly similar to your classic parametric model. Your classic parametric model would say that y is going to be some (think of this again as beta if you're a statistician) w-transpose-phi plus some corruption; we're assuming Gaussian corruption in this case, though in the general case it could be non-Gaussian, which would give us headaches when we're doing our linear algebra. The joint distribution of y is then the integral of the probability of y given w times the prior distribution over w, if we're doing this in a Bayesian way as we described before.

"Sorry, what are the parameters that we're talking about here? Are they features from the data?" No, the parameters are the weights. This is a generalized linear model setup, so these are basis functions and the parameters are the weightings of the different basis functions. The basis functions could be quadratics; phi could be a set of bases which is 1, x, x squared, x cubed, and then the weights would be the polynomial coefficients, but in general it could be any set of basis functions.

Very often we would choose a Gaussian prior p(w), and then the likelihood is being expressed by this generalized linear model; this would be the classical statistical setup for a generalized linear model, or not even generalized, just a linear model, and then, as we said before, we've got the bottleneck. But look how similar that is to this: here, the prior over u, what you're choosing to store; and here, the prior over w, which is also what you are choosing to store, because it's the parameters you set. But importantly, u is a set of variables, and w is not a variable; it's a parameter.
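Side by side, the two marginals being compared are (generic notation):

p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{u}) \, p(\mathbf{u}) \, \mathrm{d}\mathbf{u}, \qquad \text{where } p(\mathbf{y} \mid \mathbf{u}) = \int p(\mathbf{y} \mid \mathbf{f}) \, p(\mathbf{f} \mid \mathbf{u}) \, \mathrm{d}\mathbf{f},

versus the parametric

p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{w}) \, p(\mathbf{w}) \, \mathrm{d}\mathbf{w}.

The forms match, but u is a set of (inducing) variables whose size m you choose, while w is a parameter vector fixed at design time.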
That's actually what the next slide says: u is not a parameter. This looks like a really cool way forward, because now we seem to be able to compress things into u. Of course there's no free lunch; you don't get a win here. We didn't suddenly solve the problem by doing a bit of product rule in the maths, because what actually happens is that p(y given u) is itself intractable. It's intractable because p(y given u) actually involves all these f's that we've integrated out, and integrating out those f's, and there are n of them, gives us the same complexity as if we'd just tried to make the prediction directly. So although this object is a very cool and interesting-looking object, you haven't won anything here, because you've got a load of work to create it.

The whole idea, to me, of inducing variables is that you approximate that object. What you do is build a variational bound on p(y given u), and you've seen these variational lower bounds; Carl Henrik's slides did a nice job of doing Jensen's inequality on them. In effect you say: OK, instead of computing this thing exactly, the integral over f, which is tractable but is going to give you all those complexities, you choose to variationally approximate it. That's the core idea behind Michalis's paper, which he wrote when the bulk of the group was in Manchester, including Carl Henrik and Mauricio and others. I'm not going to go through the details of that, because I want to get on to deep GPs, but this trick is at the heart of deep GPs; this trick is what enables the approximations to work for deep GPs downstream. It's a sort of beautiful trick that enables a thousand things, and the reason it enables those things is that it takes what was an intractable non-parametric model and turns it into something that looks like a parametric approximation. It's not a parametric model; it's a non-parametric model, but it's something that has the form of a parametric model and is an approximation to the non-parametric one.
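For reference, the bound in question has the following Titsias-style form for a Gaussian likelihood with noise variance sigma squared (notation chosen here, not taken from the slides):

\log p(\mathbf{y} \mid \mathbf{u}) \;\ge\; \log \mathcal{N}\!\left(\mathbf{y} \,\middle|\, \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1}\mathbf{u},\; \sigma^2 \mathbf{I}\right) \;-\; \frac{1}{2\sigma^2}\,\mathrm{tr}\!\left(\mathbf{K}_{ff} - \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1}\mathbf{K}_{uf}\right),

which can be evaluated in O(n m^2) rather than O(n^3).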
"Neil, could you just say what's the benefit of u being a variable rather than a parameter?" OK, so the most beautiful benefit is that I can change the size of that vector at runtime. What I mean by runtime is that once I've decided what my model is, there's nothing to stop me changing the dimensionality of u while I'm in the middle of absorbing data. I can have my model, absorb a thousand data points from the world, and be making predictions with u set to a hundred, and then think "u is looking a bit small, it's not big enough for me to represent what's going on in the world any more, let me expand u to 200", and carry on. Because this is an approximation to the non-parametric truth (where "truth" isn't real truth, it's a modelling truth), I can get away with that. It's still a lower bound; it's just a tighter lower bound when I increase u. I can't reduce the quality of the bound; I can only increase it. So I can make decisions about how much compute I want to use and how much I want to store about the data I'm seeing, live, while the data is coming in. Now, it's not apparent why that's useful if the whole world is about training: I train, I finish training, I never look at my model again, I then use the model in practice. But that's not how the world is; the world is a dynamic, moving place where you're absorbing new information the whole time. So that's massively important, and very few people seem to get that point; they just look at the maths. I think that's because they don't think about what it would mean to be running a model in practice. In a parametric model you'd have to add in new parameters and think of some heuristic way of dealing with them. There is a model that sort of does that, called the RVM*, by Joaquin Quinonero Candela and Carl Rasmussen; it introduces a new parameter every time there's an observation, in an effort to get good error bars for a fixed basis function model, but it leads to a lot of headaches. This is a cleaner way of doing that.

I'm not going to say too much more about that, because what I wanted to do there was just highlight that these models are giving us an ability to do something new. I actually had a nice phrase here: they're like pseudo-data, these u, these new variables; you can do all sorts of imaginative things with them, and this is a totally unexplored area. (Sorry, everything's going very slowly; I'm not actually clicking, the slides are just moving of their own accord.)

What I wanted to say there was that, in particular, when you've got this representation you can all of a sudden apply this beautiful idea from Matt Hoffman, David Blei and others called stochastic variational inference. That's the "Gaussian processes for big data" paper that we wrote with James Hensman and Nicolò Fusi in Sheffield, and that's a massive thing; everything kind of stems from that notion that you can combine this inducing-variable idea (well, the inducing-variable idea isn't originally due to Michalis, but he noticed that there's a variational interpretation), this variational interpretation of the inducing variables, with the nice idea of stochastic variational inference from the team at Columbia. Then you get stochastic variational inference for Gaussian processes, and all of a sudden you can do things at enormous scale, and then there's a host of papers saying "oh look, we could do this and we could do that", but those are the key ingredients; there are principled ideas behind what's going on. That's super important, and 2013, the year when James was doing that in the lab, was also the year when we were working on deep Gaussian processes, so everything comes together. There are some examples of how these inducing variables work in the notebook, which comes, I think, from an original set of examples by James that I've recreated, and you can play with those to try to get a sense of what inducing variables are doing.
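A minimal sketch in the spirit of those notebook examples, using GPy's sparse GP model on made-up toy data:

import numpy as np
import GPy

# toy one-dimensional regression data
np.random.seed(0)
X = np.linspace(0.0, 10.0, 500)[:, None]
Y = np.sin(X) + 0.1 * np.random.randn(*X.shape)

kern = GPy.kern.RBF(input_dim=1)

# full GP regression: cost grows as O(n^3) in the number of data points
m_full = GPy.models.GPRegression(X, Y, kern.copy())

# low-rank ("sparse") GP with m = 20 inducing inputs: cost is O(n m^2)
m_sparse = GPy.models.SparseGPRegression(X, Y, kern.copy(), num_inducing=20)
m_sparse.optimize(messages=False)  # optimizes kernel parameters and inducing inputs

# the objective being optimized is a variational lower bound on the marginal likelihood
print(m_sparse.log_likelihood())

Increasing num_inducing tightens the bound at extra cost; in practice you would rebuild the model with a larger num_inducing rather than growing u in place.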
Since then, Thang Bui, Rich Turner and others have written a beautiful paper, which I recommend everyone reads, that also looks at these frameworks for pseudo-point approximations, or inducing-variable approximations, shows the relationship to other approximations, such as the FITC approximation, and explains what that relationship is. We're not going to talk about that any further, although one could, and Rich has in the past; you can find videos of him at the GPSS talking about it. And we're going in this direction: Andreas Damianou, who was a PhD student in Sheffield with me, worked on one of the ideas that stems out of all this maths, which is deep Gaussian processes. We're at half an hour or a little over, so let me just pause there and check if there are questions before we move on to the deep Gaussian processes; I may even stop sharing for a moment so you can see.

Mike: There's one question on deep GPs, but I'm going to save that for later, when you've actually introduced them. The other question is: could you use this kind of approach for modelling time-dependent distributions, such as diffusion in physics?

Oh wow, it's hot today with the questions. You can use them, but they're kind of limited for that, and actually Rich Turner has worked on this quite a bit. Why are they limited? Because these time-dependent things aren't really low rank. There are a number of ways of looking at it. OK, I get made fun of for saying this, but I'll say it anyway: most things that a lot of people are doing can be viewed as a Gaussian process, but they are different ways of making the Gaussian process efficient. Things like Kalman filters are just Gaussian processes, but the reason they're efficient is that you're exploiting the sparse structure in the inverse covariance of the Gaussian process. The standard parametric generalized linear models are also Gaussian processes, if the likelihood is Gaussian, but the reason they're efficient is that the covariance is naturally low rank, because it's degenerate. So you can try to use these inducing-variable methods for time, but the interesting structure in time is actually the sparse inverse: the inverse covariance function, the precision matrix, is sparse because it's representing conditional independencies, and these diffusion processes often have Markov properties, limited dependence over time, and that's the property everyone exploits when they fit diffusion processes. There's really great work by Arno Solin and Simo Särkkä on how you convert a Gaussian process to those structures and fit them extremely efficiently. The challenge is making these two approaches compatible. We can do time efficiently, we can do spatial processes efficiently, but what you really care about... my goodness, if we could do large-scale efficient approximations to spatio-temporal Gaussian processes: think what's going on with COVID right at this moment; that would be a latent function for estimating R, and if you could do that efficiently you could get really great estimates of COVID on a local region. But people don't; they keep it separable, they do the time separately from the spatial.
The reason they do that is that there are nice approximate methods for time and nice approximate methods for space, and they're not necessarily compatible. Other questions, or should we push on?

"Do I get the same result if I start out with m inducing points and then add one, as opposed to starting out with m plus one?" Oh wow, that's a good question. It's going to depend on how you introduce the inducing variables, because you've got an optimization problem for them. In theory you can get the same result, but in practice you're probably going to have a bunch of heuristics for how you choose to introduce and optimize the new inducing variable, where you may not want to re-optimize all the previous inducing variables. So the maths says yes, it's the same, you've got the same representative power as long as the nature of the inducing variable is the same, but in practice the reason you're probably doing this is that you've got a bunch of heuristics you're trying to use to make your whole system run efficiently, and probably those heuristics mean you don't find the absolute best solution for the inducing variables; you just find a better solution than you would have done. One thing that depresses me a bit is that we're not at the stage with Gaussian processes where it's easy for people to try out their heuristic ideas for how to introduce new inducing variables, and that's one thing limiting our ability to deploy these things at scale. Maybe some of that's going on when people use Gaussian processes for machine-learning-type tasks, like trying to use a Gaussian process to recognize ImageNet or something, but actually I think the more interesting domain is when you're trying to use a Gaussian process for more traditional tasks, like estimating the amount of COVID across the UK, or the amount of malaria in Uganda, and those models are far more widespread than people give credit for, just because we don't talk about them at NeurIPS, I guess.

I'd better move on, otherwise I'm not going to get through it all. As I said, a few people were asking for references, so at the end I'll direct you to the references. All right, let's carry on; let me share my screen again.

So, this thing. When I joined the machine learning community back in 1996, I used to see talks about Gaussian processes and they would blow my mind, because I was at Aston, where Chris Williams is, who's cited in this paper by David MacKay. This is a quote from a tutorial David MacKay gave at NeurIPS in 1997 on Gaussian processes, which I went to. I didn't understand Gaussian processes, but they blew my mind: everything they were capable of doing. At the time I was working on Bayesian neural networks, or starting to, and the capability of these models to do the things we were desperate to do with the neural networks I was working on, to be honest, totally blew my mind.
So David gave this talk. I knew about these models and was already interested, but it kind of blew my mind, and then at the end he said this thing: well, have we thrown out the baby with the bathwater? What he says is that neural networks are supposed to be intelligent models that discover features and patterns in data (bear in mind this is 1997, so you can see how little is new in terms of what people are saying about neural networks; people have been saying that since 1987, as he says here), whereas Gaussian processes are, in contrast, simple smoothing devices. How can GPs possibly replace neural networks? What's going on? I think the answer, and this is what David's hinting at, is that they're not doing everything that, say, a layered neural network does, and we'll see some of that coming up. In particular, this point about them being smoothing processes is something that is very obvious, and it's a limitation of the Gaussian process formalism. One of the things we're going to see when we introduce deep Gaussian processes, and one of the things I really like about them, because it shows that something really different is going on, is that they allow non-parametric discontinuity.

Why do I say non-parametric discontinuity? Because I can easily introduce discontinuity in my Gaussian process; that's no problem. I can introduce a single basis function in the mean function which is a step, and put a Gaussian prior over the location of that mean function; I can say that at some point there's a threshold across which the processes become independent; that's fine. But every time I do that I have to introduce a parameter to represent that discontinuity. Pausing again to let that sink in: what I have to do is say that there's going to be a discontinuity at some time tau, and I end up with a tau in my covariance function and I maximize the likelihood with respect to tau. That's great, but it means that a priori I have to know how many discontinuities there are going to be in the system, so I'm back in this world of a fixed parameter vector: how many discontinuities there are going to be. If we take something like the financial markets, that's like saying that before I model the financial markets from now until the end of time, I'm going to know how many crashes there are going to be, because I have to have a parameter for every crash. It doesn't make sense.

The non-parametric processes that allow for jumps, well, the tractable ones that we can use, are Markov jump processes, and if you combine Markov jump processes with Gaussian processes, if you add them together, you get something called a Lévy process that allows for both. So why aren't we using Lévy processes? Because Lévy processes, apart from the specific cases where they are Gaussian processes or Markov jump processes, are intractable: you can't write down the likelihood in closed form, you can only represent them in terms of their characteristic function. Deep Gaussian processes are an attempt to get some of those characteristics. Again, they're not tractable, we can't write down their likelihood, but it turns out we can approximate it, we can lower-bound their likelihood, and that's a major motivation for stochastic process composition. So it's coming really from David's NeurIPS tutorial in 1997: the guy blew my mind and then said, oh, and there's a limitation.
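One simple way to write down the parametric changepoint construction just described, with a single discontinuity at a location tau that becomes a covariance parameter (a sketch in notation chosen here, not taken from the slides):

k(t, t') = \begin{cases} k_1(t, t') & t < \tau \text{ and } t' < \tau, \\ k_2(t, t') & t \ge \tau \text{ and } t' \ge \tau, \\ 0 & \text{otherwise,} \end{cases}

so the process is independent on either side of tau, and every additional discontinuity needs its own tau.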
OK, so deep Gaussian processes try to address that limitation. Let's look at deep neural networks. This is just a fully connected deep neural network, and I've put the inputs at the bottom, which upsets Yoshua Bengio; the first time I did it I didn't know that was against the standard, but once it upset Yoshua, I like to tease him, so I do it all the time. So here are the inputs to the neural network at the bottom; this is often an image or something like that, and in a real model those would be a load of convolutional layers if it were an image, but here it's fully connected.

Let's get the maths up. What's going on at h is that you have a set of activation functions. You have x at the bottom and you multiply x by a matrix W. If we're looking at h1, the lowest row on the screen, h is given by a set of activation functions. I think of them as basis functions, but these would be ReLUs, rectified linear units, in neural network parlance; they're just simple basis functions that we can build Gaussian processes out of as well. The input x is multiplied by a matrix W to give the input to the activation functions, and the output of these activation functions, h, is a nonlinearity applied to those inputs. Then we feed that output in as the input to the next layer: instead of taking x as input, the next layer takes h1, and we just do this one row after another until we get an output. I'm doing a regression model, so my output is just a linear weighted sum of these basis functions. The standard model I was showing you before would just be this layer plus this layer plus a set of inputs; those top three things are the model I was talking about before.

So that's what's going on mathematically, and the reason the number of parameters goes up so quickly in a neural network is that each of these matrices W is the size of the hidden layers, and they use very large ones. If I have a thousand hidden nodes in this layer and a thousand in this layer, this matrix W is a thousand by a thousand, so it's a million, and they do use dimensionalities of a thousand in these places, so they get an enormous number of parameters very quickly. And they kind of want that, because (I don't think they view it this way, but from a modelling perspective) they've got the same challenge we have with the non-parametrics, and they're solving it in a very interesting way: they say, we're just going to have an enormous number of parameters, way more parameters than we need, and then this whole problem I talked about before effectively doesn't come up. But that's odd, because you need some sort of regularization, and there's all sorts of really interesting work around how these models are regularizing and how they're fitting, and this concept known as double descent; if you're interested, I urge you to look at it. One of the things it turns out these models do, when they are fitting well and generalizing well, is that these matrices W tend to be low rank.
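Written out, the layered computation just described is, up to notation, of the form:

\mathbf{h}_1 = \phi(\mathbf{W}_1 \mathbf{x}), \quad \mathbf{h}_2 = \phi(\mathbf{W}_2 \mathbf{h}_1), \quad \mathbf{h}_3 = \phi(\mathbf{W}_3 \mathbf{h}_2), \quad y = \mathbf{w}_4^{\top} \mathbf{h}_3 + \epsilon,

where phi is the elementwise basis function (a ReLU, for example) and each W_i has as many entries as the product of the adjacent layer widths.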
So it turns out that the effective number of parameters the model uses when it has a good fit is not the full rank of the matrix W. What does that mean? It means that, in effect, you can parameterize the solution with the singular value decomposition, where you're setting lambda to be much less than k1 and k2, the sizes of the hidden layers. That's the singular value decomposition, but you can conceptually just combine the square root of lambda into either side and write it as a U times a V. The reason I like to write it like that is that you can see how the matrix is being generated: I've coloured these matrices (they're actual numbers), but I've coloured them to give you the idea that it's just a low-rank approximation to W. W is not a symmetric matrix, so you need two matrices; the type of low-rank approximations that come up in sparse GPs use K_uf, the covariance between u and f, to generate this thing, and they generate symmetric matrices, but here we don't.

Of course, one of the funny things about neural networks, and this was true in the 1980s and early 90s and it's true again now, is the way that people ignore the fact that it's linear algebra underlying it, and then they talk about bottleneck layers. One of the things they'll say in neural network parlance is "let me introduce a bottleneck layer", and a bottleneck layer is the same as one of these layers but with a linear activation function, like a linear neural network, and that effectively turns out to be just this: a low-rank constraint. So when you're introducing bottleneck layers, you're placing a constraint on your deep neural network that basically says you've got a low-rank matrix here. By the way, I also learned this trick in David MacKay's group meetings back in 1998, in Cambridge, in the Cavendish Lab.
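A small numerical sketch of the low-rank / bottleneck point (the sizes and the random matrix here are made up for illustration):

import numpy as np

k1, k2, r = 1000, 1000, 50            # layer widths and an assumed effective rank
W = np.random.randn(k1, k2)           # stand-in for a learned weight matrix

# truncated singular value decomposition: keep the top-r singular values
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r = U[:, :r] * np.sqrt(s[:r])       # absorb sqrt(lambda) into each factor
V_r = Vt[:r, :].T * np.sqrt(s[:r])

W_lowrank = U_r @ V_r.T               # rank-r approximation: k1*r + k2*r numbers, not k1*k2

# a linear "bottleneck layer" computes the same thing in two steps
h = np.random.randn(k2)
z = V_r.T @ h                         # project down to r dimensions (linear activation)
out = U_r @ z                         # project back up; identical to W_lowrank @ h
assert np.allclose(out, W_lowrank @ h)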
So we can actually write the maths for what we're doing in this way: if we substitute U V in for W, we can rewrite the maths and introduce this new variable z, where z is V-transpose applied to h, and the next h is the activation functions applied to U times z; substituting one term into the other, we recover what we had before, with the low-rank approximation built in. Mathematically, that can be written in this form, and this is interesting because, if I look at any of these particular parts, this z being multiplied by V, going into h, and coming out to a set of variables z, this little triplet is a generalized linear model on its own, of the classic type that's used in statistics. (By the way, I talked about this formulation in 2013-14, and I'm staggered by how slow people have been to realize how close this makes deep neural networks and deep Gaussian processes; more recently a lot of papers have, I think, twigged the same thing and started talking about the relations using these formalisms. I guess we made the mistake of never writing a paper about it, because it seemed so obvious to me that it didn't seem worth writing a paper about. There you go, stupidly.)

And if I take this (next slide, I suspect), I can view this as a cascade of neural networks, where I have the slightly odd starting point that I take my inputs and multiply them by a matrix (but that matrix could be the identity if you don't like that), and then I go through a series of neural networks, and at the end I take the output and multiply it by a vector, because I want just a one-dimensional output (and I can combine W4 and V3 if you don't like that). So what you're seeing is that a deep neural network, if I introduce these bottlenecks, is just a cascade of single-layer neural networks. Pausing there: we already knew back in the day, and it's in MacKay's tutorial and everywhere else, that Gaussian processes have an interpretation as neural networks where you take the number of basis functions in the hidden layer, or neurons as they would say, to infinity, which is why I was so into them. I was like: why am I messing around with these neural networks when I can have an infinite-width neural network? At the time, in the 1990s, we were mainly working with single layers, and I could just go for the Gaussian process directly and get all the advantages of the non-parametric nature. The deep Gaussian process is just the same idea: instead of a cascade of neural networks, we just replace each neural network with a Gaussian process. And therefore... well, I don't want to go through those deep learning slides, because I probably haven't got time.

"You referenced it; whose tutorial was it, sorry?" David MacKay's. It's in my citations at the end of the slides; you can get the slides online, where you'll see the citations at the end of those notes. In fact, let's just review that, because those things will have come up now. If you clicked on the Colab you should be getting, with a bit of ugliness I haven't sorted out because these slides are auto-generated, cells for downloading all the bits of software you need to run the examples, and you'll see me saying in text, with citations to the papers, pretty much what I'm saying here now. For some reason, with the way I generate the citations, the year doesn't come out for some of them, but hopefully, and I haven't double-checked this, at the bottom we should find them. OK, for some reason Colab likes to hide cells, so you just have to click on them; I don't know why it's chosen to hide these. So this is where we just got to; this is what you're about to see coming up, although I'm going to skip this section, and at the bottom of these slides, a long way down, you should find the references. OK, other questions at this point, before I skip the next section?

Mike: Yes, there's a question. In practice, in cases where the data dimension is high, is it good to have a deep kernel rather than a plain GP?

Yeah, it can be; it's a major motivation, actually, like the bottleneck in the neural network. Great question; whoever asked that, 12 points. One of the motivations for why you might want to be doing this is... I mean, the curse of dimensionality is a crock; it never exists in real data. It exists mathematically, obviously, but the simplifying assumptions you make, like all your features being independent and all your data points being independent: if that's true, we call that noise. Pause on that: if your features are independent and your data points are independent, you've got noise. Now, we definitely know that noise exists (whether it exists with that level of independence, I'm not quite sure), but we're using that type of term to mop up the bits of the model we don't care about.
So in high-dimensional data, intuitively, two things could be going on with the structure, which relate quite closely to what we just said before. Either you're going to have conditional independencies in that high-dimensional space, which is the sort of thing people try to address with graphical models: you've got some high-dimensional data, but there are some conditional independencies, like in a time series. A time series is very, very high-dimensional, because time is infinite, but the structure is that you've got independencies between different time points, and that's one way of dealing with it. The other is the manifold assumption: you believe that, while the data is high-dimensional, there's some true underlying (mathematicians hate that type of term, so you have to be careful) embedded manifold where the data is sitting, which is of lower dimensionality, and the data fills the space a bit because there's some noise. When you're building a model like this, you can argue that what you're doing, say with a thousand-dimensional input, is projecting it down to a four-dimensional input. At this layer here, ignore this bottleneck and just look at this one: you're basically saying, I've got a nonlinear function mapping me down to four dimensions, and then I'm going to fit in that space. That's a very, very sensible approach to high-dimensional data, and when you do it with a Gaussian process you're doing it non-parametrically, and for me that was another major motivation for these types of models.

Other questions at that point? (Nothing specific, so just move on.) Cool. OK, I'm going to skip this, which is a little bit of deep-learning motivation, apart from just saying, because I'll have these plots later, that you can think of deep learning as a bit like pinball. What do I mean by that? Well, imagine I've got a one-dimensional input, so this is my x, and the dimension of this space is one, so this is all going to be one-dimensional and a bit boring; I can have a bottleneck layer going down to a one-dimensional output. Conceptually, each of these gaps in the pins is going to be a one-dimensional output of the previous bottleneck layer, with a sort of neural network in between. I like to describe things like this because everyone says "oh, it's magical and you can't understand deep learning", but really this is kind of all that's going on. Then of course here I'm thresholding the output, because I'm classifying, and of course the start point here causes the ball to fall in a certain way. (By the way, I've tried simulating this, and I can't get a pinball machine to actually look like this, but conceptually.) Then the idea is that the fitting of these deep neural network models is the movement of the parameters, which in their case involves moving the basis functions around: in each layer they move the weights to change the locations of these basis functions, which in my case are pins. And of course I've got bottlenecking at each layer here, which they don't tend to do; they stay high-dimensional between these layers, but conceptually you can think of it a bit like that.
But of course what's really going on is that it's not a one-dimensional input; it's a massively high-dimensional input, like an image with millions of dimensions, and it's going through this set of basis functions that you can't think of as pins. (The pin is, if it's a ReLU, the hinge point of the ReLU.) But you can think of them as high-dimensional hyperplanes; so deep learning is like playing pinball in hyperspace, I kind of think, with this cascading effect going on. But that's not what you must say to mathematicians; do not say that. What you can say is that it's a composite multivariate function, which is basically what it is. Then any mathematician will look at that and say: well, what's the point of that, why did I just write down g? The point is that I can get a very complex function g by composing simpler units.

Now, in the neural network case, the trick is that all these simpler units are differentiable and have these particular computable properties that I can put through a GPU: give Jensen Huang enough money to buy the most important semiconductor company in the United Kingdom, because he's got a simple enough processor that does this computation very, very well. (It's a bit depressing now, isn't it.) Whereas in the Gaussian process case, what I care about is that I can put uncertainty in each of these layers. So there are two related things going on behind why we like the composite multivariate nature. One is that I've got these units where I can do automatic differentiation, I can do the chain rule to propagate through the multivariate function, I can do my fitting that way, and I can get a more complex class of functions to fit that I can compute relatively efficiently. In the Gaussian process case, it's: well, look, if these are stochastic processes, this is a crazy stochastic process that I can't compute analytically, but if I have approximations that allow me to compose these stochastic processes together, I get some very interesting structure in my final stochastic process. So it's a much richer idea, I would claim, obviously much richer: it has all the facets of deep learning, plus I'm going to push uncertainty through this thing and therefore get some funky uncertainty out on the far side. Its disadvantage, of course, is that algorithmically it's a lot harder to do; that's why we struggle to scale to the size of data sets that the neural network people can.

I should pause and say something briefly about Bayesian neural networks, because I worked on Bayesian neural networks back in the 1990s. I'm not ruling out the idea that Bayesian neural networks end up offering us something useful, that we get there through them, but the errors they're falling into in that space are ones we established were problems in the 1990s, and they keep on hiding what's going on. The main problem is this: it's a different approach. In the Gaussian process approach you say, wow, look at my wonderful uncertainties, I care about them, I'm not going to throw them out with the bathwater, I'm going to keep them and compose the uncertainty together. The Bayesian neural network approach is more like: I'm going to throw out that uncertainty, try to use these nice algorithms, and reinsert it later. And that has problems.
I should pause and say something briefly about Bayesian neural networks, because I worked on Bayesian neural networks back in the 1990s. I'm not ruling out the idea that Bayesian neural networks end up offering us something useful and that we get there through them, but the errors people are falling into in that space are problems we established in the 1990s, and they keep on hiding what's going on. The main problem is that it's a different approach. In the Gaussian process approach you say: look at my wonderful uncertainties, I care about them, I'm not going to throw them out with the bathwater — I'm going to keep them and compose the uncertainty together. The Bayesian neural network approach is more like: I'm going to throw out that uncertainty, use these nice algorithms, and reinsert it later. That has problems, and most of the problems are around how you reinsert that uncertainty and whether doing so gives you the nice effects — we'll see an example of that later. One thing you really have to look at in papers on Bayesian neural networks is not just the quality of the error bars, which are often shocking, but also, when the error bars are good, the quality of the fit — because what tends to happen is that when the error bars become good, the fit becomes shocking. With Gaussian processes you don't tend to see that. And you can tell the Gaussian process is still the dominant formalism, because people keep inventing new methods and keep comparing them to Gaussian processes. If there were another dominant formalism doing what GPs can do, they would compare to that; there isn't, so they keep using Gaussian processes as the baseline and claiming to beat them. You can tell they're not really doing better, because nothing new has replaced the baseline. Certainly they create some very interesting models, and I'm not saying dismiss it all, but basically the hype is well in excess of what has been produced, which I think is not leagues ahead of where we were 20 years ago.

Okay, after that little rant, a question: am I saying that the influence of the prior will rapidly decrease in a deep GP, and would I consider that an advantage or a disadvantage? If we mean the Gaussian process prior, that's actually not what happens, and people keep assuming it is. What this construction does is create a much more structured prior. A Gaussian process prior is, as we know, a universal function approximator — it can fit anything — it's just that there's a bunch of interesting functions it doesn't like to fit. What actually happens in these models, and you have to think about it quite carefully, is that the space of functions the prior supports — for certain parameterisations, I should say, if you don't change the parameters — is a somewhat more restricted space of functions, but some of those functions can have things like discontinuities in them. So be careful: it's a bit hard to conceive, but when you're composing probabilities — or maybe it's easier to see it this way — there's a sort of product of probability distributions going on. If you put something down in one distribution and weight it by these other distributions, and they're log-concave, the final product has lower entropy than the starting point. That tends to be the case, but it's not universally true — it can't be, because each of these components is quite complex: each is a Gaussian process, or what I would call a GP-LVM. So I think people get the wrong idea when they see this and think the prior is becoming more diffuse. It's just changing — it's not even necessarily becoming more concentrated; it could be, but mostly it's just changing. Sorry, that probably wasn't a great answer.

Structurally, as probability distributions, we can write these models down as a chain — this is the way we would have thought of it in the days when graphical models were popular: all you're doing here is writing down a chain relation,
where each of those conditional distributions is itself a Gaussian process. Now, the reason I write it like that to form the deep GP is to highlight that this plain chain is the least interesting thing you can do with a deep GP. If we turn it around to make it look like a deep GP — to keep Yoshua happy, we put the x's back at the top — well, I've talked about that already, but it's there in case I'd forgotten to.

What I really wanted to highlight is on my next slide, and it's the simplest extension I might imagine. I've got one observation y which goes through the whole deep Gaussian process, but then I've got some side observation y2 hanging off the side, conditioned on earlier latent variables. To me this is the sort of structure you want, because it combines two worlds: it has the advantages of graphical models, where we put in conditional independencies — this is hinting back at the conditional independence structure — while inside we've still got the low-rank structure (you can't see that any more in this representation). We're saying that y2 depends on latent variables that sit earlier in the chain, closer to x, than the ones y depends on, so we're saying something about the probabilistic relationship between these two observations. I still like deep GPs, but what I really enjoy is when other people work on them and come up with creative things, and I'm hoping people do things like this. One area where you do see it is multi-fidelity learning in uncertainty quantification — there's quite a lot of work emerging on that, including work from people who have attended the Gaussian process summer school in the past.

But this is a bit of a nightmare, and here's why. If we take some x and put it through a Gaussian process to some y — here a two-dimensional x going to a three-dimensional y through the non-linearity (and there's another paper, David MacKay's density networks, which was the first paper I saw doing this sort of thing) — then a Gaussian distribution on this grid becomes non-Gaussian in the output space, and that becomes intractable. That's the challenge; that's why these things are intractable. Here's another image to highlight it, in the 1D case — like one layer of the pinball machine. If I drop the ball in at the top layer and put it through a Gaussian process to get the position of the ball at the next layer — this axis is the location at the input and this is the location at the next layer — then with just a non-linear function in between, the distribution becomes massively multimodal and complex. This is a little bit of what's going on with GANs and all these generative models: you've got a Gaussian that you map through a non-linear function, and if that function is non-parametric, I've already got a non-parametric probability distribution.
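To make that concrete, here is a minimal sketch — not from the talk's notebook, just plain NumPy under the assumptions stated in the comments — that draws one random function from a GP prior and pushes standard Gaussian samples through it; the histogram of the outputs is typically multimodal and clearly non-Gaussian.

```python
import numpy as np

# A sketch: one random draw from a GP prior, used as a fixed non-linear function.
rng = np.random.default_rng(0)

# Evaluate an RBF-covariance GP prior draw on a fine grid.
grid = np.linspace(-4.0, 4.0, 400)
lengthscale, variance = 1.0, 1.0
sqdist = (grid[:, None] - grid[None, :]) ** 2
K = variance * np.exp(-0.5 * sqdist / lengthscale**2) + 1e-6 * np.eye(grid.size)
f_grid = np.linalg.cholesky(K) @ rng.standard_normal(grid.size)

# Push Gaussian inputs through that fixed non-linear function
# (interpolating the gridded draw gives the function value at arbitrary inputs).
z = rng.standard_normal(10_000)
y = np.interp(z, grid, f_grid)

# The histogram of y is typically multimodal and heavy-tailed, i.e. non-Gaussian.
hist, edges = np.histogram(y, bins=50, density=True)
print(np.round(hist, 3))
```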
And the point of deep Gaussian processes is that you're going to do that multiple times. So, quickly, let me show what one has to do — and this is why I reviewed the low-rank Gaussian process approximations. Suppose I take a variational approach: here z is the input to the Gaussian process, because that's what's going on in the hidden layers, and p(y|z) is my full-covariance Gaussian process. If I put a prior over z, I can build a variational lower bound using Jensen's inequality — the standard variational thing you might try — but it fails. Bear in mind we need the expectation of log p(y|z) under q(z), which we would typically assume to be Gaussian. And what is log p(y|z)? It's the classic quadratic form of the Gaussian process that you all know and love by now, after three days — or maybe you're sick of it, but it's the most wonderful thing — and it involves the inputs appearing inside K_ff. Worse than that, they appear inside K_ff after an inverse. If each element of the matrix K were just a function of a pair of inputs, that might be fine, but once the inverse happens the inputs get distributed across the whole matrix, and the expectation becomes extremely difficult to compute. So that's why the standard variational approach fails.

But we can go for the collapsed bound — and remember I said the beauty of the collapsed bound is that it makes everything start looking like a parametric model again. One way of writing it, which I didn't get up to but which is in other papers, is as a new likelihood whose mean is the expectation of f under the conditional distribution: it looks like a likelihood of our data y given that expectation of f, with Gaussian noise, plus a prior over u. Now it turns out that this expectation of f is linear in u — that's the important bit — because it's the mean of a Gaussian process posterior. So it's easy to deal with, and the whole likelihood on the left is linear in u. So what I do is: don't integrate u out yet — stop, don't do that integration — introduce z, and integrate over z first. If I introduce the z that I want to get rid of, I can then do the variational lower bound — I won't go through the details, they're in the notes — and a number of terms come out. That's why the inducing-variable trick is so important. When we did this originally with Andreas — and Michalis Titsias was definitely involved and fully aware of it — it turned out that you need certain expectations of the covariance function, which all turn out to be computable; we call these the psi statistics. We did them analytically in the early days, but nowadays people mostly handle them through doubly stochastic methods — Michalis is, I think, one of the proponents — by sampling.
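Just to put symbols on the step that fails — this is a reconstruction of the argument above in the notation used elsewhere in the school, not a formula from the slides — the naive bound is

```latex
\log p(\mathbf{y}) \;\geq\; \mathbb{E}_{q(\mathbf{z})}\!\left[\log p(\mathbf{y}\mid\mathbf{z})\right] - \mathrm{KL}\!\left(q(\mathbf{z})\,\|\,p(\mathbf{z})\right),
\qquad
\log p(\mathbf{y}\mid\mathbf{z}) = -\tfrac{1}{2}\,\mathbf{y}^{\top}\!\left(\mathbf{K}_{ff}+\sigma^{2}\mathbf{I}\right)^{-1}\mathbf{y}
-\tfrac{1}{2}\log\left|\mathbf{K}_{ff}+\sigma^{2}\mathbf{I}\right| - \tfrac{n}{2}\log 2\pi,
```

and because K_ff is built from the hidden inputs z, those inputs sit inside an inverse and a log determinant, so the expectation under q(z) has no closed form; the collapsed, inducing-variable route is what gets around that.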
Andreas did suggest the sampling route in his thesis, but he didn't go deeply into it. Hugh Salimbeni and Marc Deisenroth had a lovely paper on it, which makes these methods work without you needing to do anything analytically — you can put in any covariance function you like, which is quite nice. I'm rushing through a little to make sure I get to some of the examples. There are other approaches too: I actually worked on these models much earlier with a guy called Andy Moore — not either of the two famous academic Andy Moores; maybe he's famous in another way. People have also looked at them with Hamiltonian Monte Carlo and with expectation propagation — that's Thang Bui, who does a lot in this space, I think with José Miguel Hernández-Lobato and maybe Rich Turner.

But let me highlight something, because I said this before, and this figure — taken from one of those papers — makes me a little cross. Look very carefully at these fits. This is a Gaussian process fit, and notice something about it: the functions go beautifully through where the data is, very tight around the data. This other one is, I think, the standard method that isn't doing the new thing in the paper — a Bayesian neural network approximation — and the standard weakness of these methods is that they don't go to high variance outside the region where the data is. Everyone freaks out about that and says, I need new methods; and they get so obsessed with pushing the error bars up out there that they don't seem to notice the model has stopped fitting the data. This is a really important point: you gain essentially nothing in your likelihood for getting those error bars right. When you're doing a variational approach, it's all about optimising the lower bound, and you might think that getting the error bars right gives you a tighter bound. It does, slightly — but you gain almost nothing, because those error bars arise from the KL divergence term: as soon as you leave the region where there's data, everything is dictated by the prior, and that term is very non-influential in the whole objective. The amount you score from it is small — it's at the scale of one data point. So always remember this intuition: getting those error bars right is equivalent, at best, to getting about one data point right, in information-theoretic terms.
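One way to see that — a rough recap in the sparse variational notation used elsewhere in the school, not a formula from the slides — is to look at the two parts of the bound being optimised:

```latex
\mathcal{L} \;=\; \sum_{i=1}^{n}\mathbb{E}_{q(f_i)}\!\left[\log p(y_i\mid f_i)\right] \;-\; \mathrm{KL}\!\left(q(\mathbf{u})\,\|\,p(\mathbf{u})\right).
```

The data-fit term grows with the number of observations n, while the behaviour away from the data only really enters through the single KL term, which is worth on the order of a handful of nats — roughly one data point's contribution — so the optimiser has very little incentive to get those error bars right.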
Yet what these approximations all visibly do, in order to get better error bars out there, is destroy the fit where the data is — and pretty much the only thing you actually know about calibrated error bars is what they should be in regions where you've seen data; everything else is your prior. You see this error again and again with Bayesian neural networks, and reviewers don't pick it up, and it bugs the hell out of me. Yes, we would love to get those error bars right, but please look at what's going on in the region of the data when you've modified your method. Look at this one here, where we've got error bars that are way above where they should be in the region of the data, and way below — and what you'd see if you could zoom in (which you can't in this case) is that the mean function has stopped fitting. That's what's happening here too. This is the most chronic case, and I don't think it's the method actually being proposed, but in every single case apart from the Gaussian process you are making severe compromises to your data fit to try to improve the quality of the error bars outside the data region. Until you fix that, the number-one approach is still to do something with Gaussian processes. So I want to pay attention to that characteristic as we move forward with deep GPs, and show that the composite uncertainty in a deep GP continues to behave well. If I sound slightly frustrated, it's because I am: this has been known and obvious for the last twenty years, and yet a large section of the community is ignoring these characteristics. I'm not saying these aren't methods you should pursue, or that they aren't interesting — they are — but please evaluate them correctly; don't go blindly after reconstructing this one piece. Why am I also frustrated about it? Because I went through the same process in my PhD thesis, trying to build variational approximations that could capture what a Gaussian process was doing, and I couldn't, and I knew very well why I couldn't — the reasons are pretty fundamental and quite hard to get around, and I think people aren't acknowledging that. Sorry, big rant from me there.

Okay, I'm going to skip this next part to make sure I get to the examples. I would sometimes emphasise the way these models do abstraction — it's the sort of thing I care about, but the community seems to care about it a little less. I have to be a bit cautious about how fast I click, because everything's going a bit slowly — any questions while I'm clicking away? There's a quick one: since we were talking about error bars, a few people are curious about my opinion on visualising the marginals — Carl has strong opinions on this and talked about it yesterday. It's great that he has strong opinions, and I agree with him; he's right to have them. I'll show you how I visualise things. I think I probably made the error of finding it obvious that there's a problem with visualising these marginals, so I haven't emphasised it enough in the past, but yes, it's a big issue. The nice thing with these models is that you can sample from them, so what I try to do, to pick up on that criticism, is to put samples through as a reminder, rather than relying on the error bars. It's definitely something worth working on — I know Carl Henrik has done an amount of work on it — and it took me a while to realise that that's what people were complaining about. I don't know why the slides are so slow — it seems to be going in cycles — so I'll stop sharing for a moment and talk, and I may end up having to restart the slides.
Another question in the meantime: somebody's curious about what happens if your observations aren't normally distributed — do these models still work? Yes, they deal with non-normally distributed observations, but that doesn't mean you should just use a Gaussian likelihood; some interesting philosophical questions come up here, and it's partly there in some of the examples. (I'm partially sorting out — and restarting — my slides in the background, so I'll sound slightly distracted while I answer.) That slide where I showed the Gaussian going into the Gaussian process and a non-parametric, non-Gaussian distribution coming out on the other side highlights that it's possible to produce very non-Gaussian things with these models. But one caveat is that you should still, ideally, be handling this in your likelihood. If you've got a classification label, don't just say, I can use a deep Gaussian process with a Gaussian likelihood. Well, you can — do what you like — but there's an issue with saying, I'll use the deep GP to model the non-Gaussianity of that likelihood. It's an incredibly powerful way of building non-Gaussian distributions, but you know what your likelihood is — it's classification — so it seems more sensible to introduce that explicitly rather than rely on the deep GP to handle it for you. It just makes more sense to say it's a binomial likelihood. Other questions? I think the slides are working again now, so I'll re-share; the host is saving a few questions for the end, so back to the slides. Okay, I'm going to move fairly quickly through the rest, because I do like it when we get lots of questions. Have I shared the right screen — it says analysis of DGPs? Perfect; sorry, Zoom doesn't give you great clues sometimes.
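Going back to the likelihood question for a moment, here's a minimal sketch of what "say it in the likelihood" looks like in GPy, on a toy binary problem. The data here are made up for illustration, and this is plain GP classification rather than a deep GP:

```python
import numpy as np
import GPy

# Toy binary labels -- purely illustrative data.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
Y = (np.sin(X) + 0.3 * rng.standard_normal(X.shape) > 0).astype(float)

# State the Bernoulli likelihood explicitly rather than asking a
# Gaussian-likelihood (deep) GP to soak up the non-Gaussianity.
kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPClassification(X, Y, kernel=kernel)
model.optimize(messages=False)

probs, _ = model.predict(X)   # predictive class probabilities
```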
Okay. There's some nice work by David Duvenaud and colleagues — I hate the title, because the thing he talks about isn't a pathology. Ironically, I think it was those slides we just got stuck on, when I showed them in Engineering back in 2013, that inspired David to look at what he called the pathology; it's really a property of these processes, and I think he would accept that now. What they did nicely in that paper is show what happens to the derivative distribution of the process as you drop down the cascade. What do I mean by that? For a Gaussian process, the derivatives are themselves Gaussian — jointly Gaussian with the process, and therefore marginally Gaussian too — which means the derivatives have quite light tails, and that immediately tells you that you don't get discontinuities: the probability of a very large positive or negative derivative is very low. In their analysis — and the thing I object to in the paper is that they say this always happens; it doesn't, you have to set the parameters a certain way for it to happen — they show that under certain parametric assumptions the derivative distribution becomes more heavy tailed as you go through the layers. So you start getting the possibility of jumps, of steep derivatives, which is kind of something you want. It's not a pathology — it's an advantage.

There's another paper which does some analysis from the point of view of Markov chains, by Dunlop and colleagues — another great analysis. Again, be a bit careful, because both of these papers make simplifying assumptions: in this one they assume that every layer of the deep GP has the same structure, the same covariance function — the layers are identical — so you know the form of the probability distributions. That's interesting because it enables the analysis, but I've never seen it done that way in practice. Still interesting, though. The thing they look at is how deep you have to make one of these models before you've effectively approached a model of infinite depth, which is an interesting question, and the reason they can answer it is that they view the model as the stationary distribution of a Markov chain — which of course it is. That's the nice thing about these models: they're a bit more analytic.

So, to finish, I'm going to look at a couple of data sets with this formalism, to show how it performs, and I'll do it quickly so we get time for questions. First, the Olympic marathon data, which I like using: one-dimensional input, one-dimensional output, and it has these nice features — what happened here is that the marathon runners got lost. If you've seen me talk before you'll have heard that story, so I won't tell it now. But I like this question: Alan Turing was a spectacularly good marathon runner — his time is listed here, though you probably can't make it out: 2 hours 46 minutes, in the UK marathon championships in 1946. I can't run five kilometres at that pace; to me it's mind-blowingly fast, though other people I know perhaps find it less so. So if we do a Gaussian process fit to this data, what I'm interested in is this: in 1946 no Olympic marathon was run, so what's the probability Alan Turing would have won the gold medal? His pace, 3.95, is somewhere up here, and according to this fit there's some possibility he would have been the gold medallist, because the error bars allow winning times up there. But something is wrong: this outlier is pulling the error bars up, so the model isn't handling the outlier correctly — which relates a bit to that earlier question; we could of course put a heavy-tailed noise model in. But we can also fit the deep Gaussian process, and this is the deep GP fit, with one Gaussian process feeding another. Now it's pretty confident: there's some possibility, but it's not within the 95% bounds, that Alan Turing would have won in 1946. Such a nice example. The error bars have come in much tighter, and what's happening here is one of the things I find most interesting: the model has massive power to wiggle however it likes, but notice that it doesn't overfit. What it does instead is use uncertainty — the composition of uncertainty that Carl Henrik was talking about.
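If you want to reproduce something like this fit from the notebook's ingredients, the sketch below shows the general shape of it. Treat it as a hedged outline: it assumes the pods data utilities and the PyDeepGP package are installed, and the DeepGP constructor arguments follow the summer-school lab examples, so check the library's own documentation if they have drifted.

```python
import GPy
import deepgp   # Andreas Damianou's PyDeepGP
import pods     # data utilities; assumed to provide the marathon data

data = pods.datasets.olympic_marathon_men()
X, Y = data['X'], data['Y']          # year -> pace

# Shallow GP fit for comparison: the outlier drags the error bars up.
shallow = GPy.models.GPRegression(X, Y, GPy.kern.RBF(1))
shallow.optimize(messages=False)

# Two-layer deep GP: a single one-dimensional hidden layer between year and pace.
# Dimensions are listed output-first, as in the PyDeepGP examples (an assumption).
m = deepgp.DeepGP([Y.shape[1], 1, X.shape[1]], Y=Y, X=X,
                  kernels=[GPy.kern.RBF(1), GPy.kern.RBF(X.shape[1])],
                  num_inducing=50, back_constraint=False)
m.optimize(messages=False, max_iters=1000)

mean, var = m.predict(X)             # predictive mean and variance at the training years
```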
First of all, to the earlier point about marginals, here are the samples. It turns out these marginals are pretty good — not perfect, because we're getting rather more points down below here than we might expect, and given how far below they sit, this is clearly quite heavy-tailed: the model thinks there's some possibility that people could run quite fast, which actually makes sense, and it also thinks, perhaps incorrectly, that they could run quite that fast. How is it managing to do that? This plot is the input, year, against the latent variable z, which is one dimensional — like the first layer of the pinball machine — and it's basically linear: the latent variable increases with year (it happens to be anti-correlated in sign, but that doesn't matter), with some fairly fixed uncertainty, until about 1980, when the latent variable just flattens off; and then outside the data it's obviously returning to the prior, as you would expect. This next plot is the map from that latent variable to the actual pace, and it looks like a tidied-up version of the curve — with very low noise; this is just the uncertainty in the process, and the noise itself is low. What's happening is that all those runners from the 1980s onwards are being mapped to the same point in the latent variable, and that's how the flattening-out happens. And here, because the input uncertainty for those years is quite high, things are wiggling around, and because the curve is steep there, that leads to a lot of uncertainty in the output. Note that I'm only showing the uncertainty at each layer here, not the overall uncertainty — I should have been clearer about that.

So, pinball — what was the point? Here's the pinball plot: this is what's happening in the pinball machine. In the first row of pins, things fall almost straight through, with some changes, until you get to the most recent years, when the balls all fall into the same spot and then get bent around the machine like this. I'm also showing the uncertainties as the ball falls through, which I've been doing for a few years, and of course it's very reminiscent of Gaussian process flows — I can't remember the first-named author, but Samuel Kaski is one of the authors of those very interesting papers — which are like continuous deep Gaussian processes. The pinball plot here is like a discrete version of the GP flow, which is what these models generally are.

I'm going to skip the next example because it's not very interesting — it's mainly there in case I'd done one of the intros, which I didn't this year — and I'll just show two example results quickly and then stop, because we're over time anyway. So here's the step-function data. A step function fitted with a Gaussian process: it doesn't like it, it doesn't like the discontinuity, so it thinks the variance is high everywhere.
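For reference, here's a hedged sketch of how you might set up the deeper model used on this kind of data — again following the PyDeepGP lab examples, with synthetic step data standing in for the plot's data, and the layer sizes and constructor arguments as assumptions to check against the notebook:

```python
import numpy as np
import GPy
import deepgp

# Synthetic step data standing in for the example in the talk.
rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 100)[:, None]
Y = (X > 0).astype(float) + 0.02 * rng.standard_normal(X.shape)

# Four GP layers (three 1-d hidden layers); dimensions listed output-first.
layers = [Y.shape[1], 1, 1, 1, X.shape[1]]
kernels = [GPy.kern.RBF(1) for _ in range(len(layers) - 1)]

m = deepgp.DeepGP(layers, Y=Y, X=X, kernels=kernels,
                  num_inducing=20, back_constraint=False)
m.optimize(messages=False, max_iters=2000)
```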
And here's the step function fitted with a deep GP: it loves it. Now, to the earlier point about error bars, let's have a look at what's going on, because it looks all right — and the error bars are fine if you read them as a mean plus 95 percent confidence intervals, which is how I interpret them. Other people don't interpret them that way, and maybe we should plot them differently, but I'm always aware that the underlying distribution could look very different, as it should in this case. Here are the samples through the system — I think they look gorgeous. Outside the region where it has seen data, yes, the mean sits in the middle and the variance is high, but what's actually going on is that the model thinks the output is either zero or one. So with a deep Gaussian process — we've used four layers here — we've got a model that can fit what is essentially a non-parametric discontinuity. These points in the middle are pretty much where transitions are happening, or where it thinks transitions might be happening. Very nice. Looking at the different layers — I think Carl Henrik showed an example of this too — it's composing sigmoid-like structures together in order to do it, which makes a lot of sense. And I love the pinball plot: it's basically saying that wherever you land near the middle of the input, there's some possibility you could go either way. It looks a little like a Space Invader from a 1980s game I used to play, so I like it.

The final data set is the motorcycle helmet data, which is interesting because — relating to that earlier question about likelihoods — it's heteroscedastic, and the deep Gaussian process can handle that easily. With the regular GP, the variance is too high at the beginning. What's going on with this data set? It's a helmet on a rig that's travelling along and then hits an object. It has acceleration sensors on it, so in the first part there's no acceleration, because it's moving at constant speed; then it hits the object, the helmet oscillates back and forth, and the noise — everything — goes up. This is the deep GP fit. It doesn't quite know what's going on out here, but it's very flat at the start and then there's some underlying signal, and we'll see what the samples look like. You might worry about these error bars, so let's dig into that by looking at the samples: you can see there's actually no data up at the top here, so that part of the interval is artifactual — the samples are heavily skewed towards the bottom. This is exactly where what Carl Henrik says about error bars bites: these error bars are wrong — they're not really 95 percent confidence intervals, because they're symmetric and they shouldn't be symmetric. Well, maybe they should be; I'll have to come back to that — I made a statement without thinking it through. If we look at what the GP underneath is doing, it's something very similar to the Olympic case: there's a pseudo-linear function, but notice that here the curvy bit is at the beginning, giving that flat section at the start — it's using the same trick.
It's got these tricks up its sleeve: to get that flat bit at the beginning it maps a whole load of time points to the same place, and then it injects some uncertainty. And I really like the next layer — it looks just like a chirp, a very clean chirp. There's no guarantee it should do this, but I like that it looks more like what you might expect from a physics experiment than from a dirty real-world experiment with a rig and a helmet — this part cleans up and says, now it's just a physics experiment. The reason you're getting the skew is that there's input uncertainty on the timing: that uncertainty swings the whole chirp left and right, which gives a heavily skewed distribution — apparent in the samples, not in the error bars. The skew comes from that chirp being swung left and right to form all these data points. Okay, I'm going to stop there.

A quick question on this example: what are the likelihoods for the data here? All Gaussian — good question. The examples are all in the notebook; you can rerun them and they should work out of the box. There may be little errors, because I was fixing big errors this morning, so I may have missed small ones — and there's a lab as well, so you can have a play. They're all Gaussian. When it comes to non-Gaussianity: once we'd got this far with the model, and you knew you could do it with stochastic variational inference, I think everything becomes possible, and it's just a waiting game for everyone to try all the different combinations. The thing I'm a bit disappointed we're behind on is the quality of the software for cobbling these things together. We now have something like eight different software packages — everyone likes to write their own, and I get why, but it's extremely frustrating, because what we really need is one piece of software that gives you all these capabilities in the same place. Then we'd make rapid progress on what's actually useful in real applications, which is what we're still missing: an understanding of which tricks are useful to deploy in which applications. So I'm going to stop there; my conclusion is, that's a deep Gaussian process, and you've heard me rant about stuff.

There were two major themes in the questions, so let me combine them: a lot of people are asking about the pros and cons of deep GPs versus Bayesian neural networks, and about combining them — could you use kernel learning and the design aspects of GPs in Bayesian neural networks, and could you use Bayesian neural networks to design kernels for deep GPs or GPs? Well, you can do anything you like — you can do all of that, and I think we understood early on that you could. What I find disappointing about a lot of the papers that do it is that they don't explain why. You see so many papers that say, I'm going to put this into that — okay, it can be done.
I can certainly put neural networks into Gaussian processes, and I can certainly put kernels into neural networks — it's just composition. Once I accept that I can compose these things, there's some work to do on which formalism to use for the approximations, but that's it. Here's what I claim most of those works are. We're in Sheffield, virtually, and there's a wonderful museum on your way into Sheffield — just up the hill from that image Will has put up — which shows pieces of work made by apprentice knife-makers in Sheffield. They have very intricate metalwork on the knives, and it's very beautiful, but it's actually a museum of very, very impractical knives, because these were apprentices qualifying for their apprenticeship by doing intricate metalwork on a knife that was never required to be practical. I view a lot of those papers as pretty much that: someone has demonstrated the technical capability to make a very impractical knife. It's an impressive thing, and I'd like to hear a little about the troubles they experienced doing it, but is it a revolution? No. Has it changed the way we think about machine learning? No. What I would really like to see is that intricate metalwork being used to create a practical knife — the knives in that museum, you would cut yourself if you tried to cut anything with them — and what I care about is how we deploy in the real world. When people do these combinations, I want to hear why they're doing it, and I don't think "because I can beat this benchmark" is good enough — although the whole of NeurIPS has become an exercise in impractical knives, so apparently that is fine — but it's not good enough to get me excited unless you explain what it gets us that we didn't have before. One of the challenges is that it's a lot of work to make these impractical knives, as you can imagine, and you don't then have enough time both to make one and to think about the type of application you should be deploying your knife on. That's why we need a more unified tool set, so we can combine some of these things very easily. There are efforts in that direction — there's Pyro, there's Edward, there are various things — but they don't give us the full flexibility of the Gaussian process that, say, GPy does. You really want something that's a combination of, say, GPy and Pyro, because I think GPy cares more about your ability to express what you care about in the covariance function than a lot of those toolkits do. Sorry, long answer.

One more question: I mentioned GP flows earlier, and a few people see a parallel between the way I've introduced deep GPs and normalizing flows — how do they compare in principle? I think the big difference — and normalizing flows came after deep GPs — is that normalizing flows try to keep the mapping bijective, so it can be inverted; the deep GP doesn't care about that. They also tend to be deterministic functions applied to a distribution, which is also different. But yes, it looks similar in that way, and it's in some sense a shame that a lot of the literature around those things, and around GANs, doesn't know about, say, David MacKay's work on density networks, which is the work that inspired a lot of what I've done —
because yes, there are enormous numbers of overlaps, and there are great ways of combining these things. Normalizing flows have some very nice, cool new ideas in them, but a lot of the core ideas have been around for longer.

And that was that — thanks, Neil. We wish we could keep going, but people need a few minutes to stretch their legs; it was a great talk, and the passion really came through. I do try to add a bit of drama to it — in the end it's the audience I care about. Everyone, feel free to ask your questions on the app, because we can discuss them with Carl and Javier in the afternoon round-table session. It's frustrating to have to stop — the questions were great, which was so nice; it's hard to talk to a screen for so long without any feedback, and Will did a great job too. There were still 177 people watching out of the maximum 200, so well done for managing to listen in — unless, of course, you've all got it switched off. Hopefully you've copied the questions somewhere so we can discuss them in the afternoon. Right, we'd better stop. I should have some canned applause, but I haven't, so just imagine everyone cheering to say thanks. Thanks everyone for watching — we'll start again in 13 minutes with Javier talking about Bayesian optimization.
Info
Channel: Gaussian Process Summer School
Views: 2,614
Id: apq-hrzB-sM
Length: 105min 29sec (6329 seconds)
Published: Wed Sep 16 2020