[DeepBayes2019]: Day 4, Keynote Lecture 3. Deep Gaussian processes

Captions
Thanks for the introduction, and thanks Dmitry for inviting me. It's always a huge pleasure to be here in Moscow at this great summer school, and thanks for joining me today for this lecture on deep Gaussian processes. For those of you who don't know where Sophia Antipolis in France is, I put in this slide; I always like to include it to make people jealous. I lived in Scotland for many years, so this is my return after many years in the UK. I'm from Italy originally, but I work in France, and you can imagine where this place is: my institute is round about here. In case you would like to visit in the future, bear in mind that this is where you can come to visit me.

I always like to keep my presentations at a fairly basic level, because I want people to participate and understand what I'm talking about, and not sit through an hour and a half without understanding anything or blanking out. In order for us to understand deep Gaussian processes, I think it's very important to start from something simple, and for me the most basic thing I can think of is linear models. If we understand Bayesian linear models, we can understand much better what Gaussian processes and deep Gaussian processes are, because most of the techniques that have been developed to do inference and approximations for deep Gaussian processes are really the techniques that were developed for Gaussian processes. So the idea today is to give you a brief understanding of what Gaussian processes are (you've had a lecture this morning, but I'll try to cover the parts that were not in that talk), then to talk about the kinds of approximations we can do to make Gaussian processes scalable, and to give you a flavour of what Gaussian processes can bring today to the world of Bayesian deep learning. I'll try to motivate the challenges, talk about what we can do to solve them, and position the research we're doing on Gaussian processes in the context of Bayesian deep learning. To do that I'll talk briefly about inference and approximations for Gaussian processes, then about convolutional deep Gaussian processes, because I think that resonates with a lot of people working on image problems these days, and then about some recent trends, some cool things we're doing with Gaussian processes, and I'll try to draw some conclusions at the end. By all means, whenever you have any questions, feel free to interrupt me; I'll try to stop every now and then to checkpoint.

All right, let's start. I don't think I need to say too much about this, but I always like to start my presentations with some motivating examples. Imagine we have some data: a regression problem, with some inputs and some outputs, where the labels are continuous. I generated this data from a function that I know, and I generated noisy observations from it. Equally, here I have another dataset where I generated a function which is between zero and one, representing the probability of the class label "triangle": where it's higher, you have a higher probability of a triangle. The job of people like us is to try to reverse-engineer this process and figure out what kind of functions can explain the data that we observe.
And why do I think that doing machine learning in a Bayesian way is fascinating? Well, because the kinds of pictures we can get are very beautiful. We can get this sort of family of functions that fit the data we observe, and equally we can do the same for classification, where the functions are now between zero and one. I guess we're all here because of these sorts of pictures that we like very much. And this is important, because we want to communicate things to people, to medical doctors, environmental engineers, whoever, and we would like to be able to make predictions with confidence attached to them: not just point predictions, but distributions over our predictions, which are a richer way of communicating information.

So one possible way to do this task of Bayesian machine learning is to think about functions f(x), where I'm thinking of x as a multivariate object, take a family of basis functions, denoted here by phi, and weight them with a set of weights which I call W. For ease of notation, throughout the talk I'll try to be consistent: orange is things you observe, green is parameters of your model, and blue is anything that we consider a function. So we have a set of fixed basis functions, sines, cosines and so on, and imagine we have a regression task. What we can do is model the distribution over the observations, given the parameters, the inputs X and some noise variance lambda, as a Gaussian. What I'm doing here is evaluating the basis functions at the inputs I observe, which creates this Phi matrix, also called the design matrix: I have N inputs, and Phi evaluates the basis functions at all of them. So Phi W represents the mean of what I consider my underlying latent function, around which I assume there is some noise, and my task is to figure out the parameters W.

Again, to fix the notation and be happy with the colours I'm using: inputs for me are this matrix X made up of vectors (I like to think of N by D matrices); labels are laid out in a vector, and in this case I'm assuming univariate outputs, although you can equally extend this to multivariate; and we have D weights W. Of course we know the connection between loss and likelihood: when we think about a quadratic loss, for example, so we want to minimise some sort of sum-of-squares error, this is equivalent to maximising a likelihood. But instead of maximum likelihood, what we can do is be Bayesian. The nice thing, as in this illustration, is that we have some prior over the parameters, and as a result of observing data we constrain the possible values the parameters can take; the configurations where the probability of the parameters given the data is high are the configurations where the parameters explain the data well. Of course this picture is a cartoon, in the sense that we know that in Bayesian linear regression we don't get multimodal posteriors, everything is nice, but in general that is not the case.
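For reference, a minimal sketch of the model just described, in the talk's notation (this is the standard Bayesian linear regression setup; S denotes the prior covariance over the weights, which is introduced shortly):

```latex
% Likelihood with fixed basis functions and design matrix \Phi (N x D), noise variance \lambda:
p(\mathbf{y} \mid W, X, \lambda) = \mathcal{N}\!\left(\mathbf{y};\, \Phi W,\, \lambda I\right),
\qquad \Phi_{ij} = \phi_j(\mathbf{x}_i)

% Gaussian prior over the weights, and Bayes' rule for the posterior:
p(W) = \mathcal{N}(W;\, \mathbf{0},\, S),
\qquad
p(W \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid W, X, \lambda)\, p(W)}
                               {\int p(\mathbf{y} \mid W, X, \lambda)\, p(W)\, dW}
```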
When you work with deep learning, for example, you're going to get posteriors which are multimodal and high-dimensional, and we all know the problem with Bayesian inference: it relies on being able to estimate or compute exactly this integral, the model evidence, which for most models is not possible. But in Bayesian linear regression everything is nice and easy, because the likelihood is Gaussian and the prior is Gaussian, and therefore the posterior is going to be Gaussian. An easy way to see it: if you multiply the exponential of a quadratic form by the exponential of another quadratic form, you get the exponential of yet another quadratic form.

So, in order to get to Gaussian processes, one easy way (I've tried to find the shortest and easiest way to explain Gaussian processes, and this is just a refresher of what you've seen this morning) is the following. If I take a prior over W which is Gaussian with zero mean and covariance S, and I think of f, the realisation of the function given by my model, as Phi times W, then the prior over f has this structure, which is again Gaussian, because it is just a matrix of coefficients Phi multiplying a Gaussian random variable. K is the covariance of f, and it's pretty easy to compute: the mean is zero, because I chose a p(W) which is zero mean, and therefore the covariance is simply the expectation of Phi W W-transpose Phi-transpose. The expectation acts on W W-transpose, so we get something which is Phi S Phi-transpose.

So now we can think in two different ways. We can think of solving Bayesian linear regression by saying I have a likelihood over y given W, X and lambda, which is the kind of thing we've seen before, and a prior over W; or we can think of a likelihood which is not in terms of W but in terms of f: given that Phi W equals f, we can think of the observations as noisy versions of this latent function f, and the prior over f is going to be N(0, K), where K = Phi S Phi-transpose. So we have these two different ways to think about it; for those of you familiar with support vector machines, this would be thinking about the primal or the dual. We haven't actually done anything so far; we've just introduced this new quantity, which we call f, and rewritten the same quantities as before.

So why is this useful? Well, Bayesian linear regression is a solved problem: we know how to analytically compute posteriors, make predictions and so on, and I'll show you the equations in a couple of slides. So what's the deal? Why do we do Bayesian deep learning or anything else? The problem is that we need to specify the basis functions, and if you start thinking about the number of possible ways you can combine basis functions (polynomial, trigonometric functions and so on, for each dimension or for products of dimensions), then this becomes a combinatorially, exponentially complex problem of deciding which basis functions you should try.
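Written out, the weight-space to function-space step just described, and the two equivalent formulations of the same model (same notation as above):

```latex
% With W ~ N(0, S) and f = \Phi W:
\mathbb{E}[\mathbf{f}] = \Phi\,\mathbb{E}[W] = \mathbf{0},
\qquad
\mathrm{cov}[\mathbf{f}] = \mathbb{E}\!\left[\Phi W W^{\top}\Phi^{\top}\right] = \Phi S \Phi^{\top} =: K

% Weight-space view:
p(\mathbf{y}\mid W, X, \lambda) = \mathcal{N}(\mathbf{y};\, \Phi W,\, \lambda I),
\qquad p(W) = \mathcal{N}(\mathbf{0}, S)

% Function-space view:
p(\mathbf{y}\mid \mathbf{f}, \lambda) = \mathcal{N}(\mathbf{y};\, \mathbf{f},\, \lambda I),
\qquad p(\mathbf{f}) = \mathcal{N}(\mathbf{0}, K)
```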
Of course, you could do model selection, because in Bayesian linear modelling we can check the model evidence to determine which models are best, so we could compare different models; but the number of comparisons you would have to do is exponentially big. What we can use instead is another route. This realisation that, instead of working with the parameters, we can work with these functions, which have this covariance K, is quite interesting: instead of choosing the basis functions Phi first, which would then determine the covariance of f, we can go the other way around. What if we choose some function that acts on the inputs? Because this K matrix, if you look here, is the covariance of the latent variables f; so what if we choose a function that computes this covariance, call it k, with two arguments, because it is evaluated between pairs of inputs? Instead of choosing Phi, which determines K, we choose K first, and this induces some Phi. The cool thing is that we can choose a function k such that Phi is infinite-dimensional, and that is quite nice because it allows us to work with a potentially infinite number of basis functions without ever constructing them: the only thing we need to do is deal with computations that involve this k function, which only takes the inputs we care about. I think this is an easy way to establish what Gaussian processes are.

A super simple exercise is to show that for the Gaussian kernel it is actually true that the induced feature map is infinite-dimensional. The way to see this is that the exponential of this form here does not let you separate out the contributions from x_i and x_j; what I showed before is that you have a phi that enters in a scalar-product kind of fashion, and here you cannot directly disentangle the relationship between x_i and x_j. The way to do it is a Taylor expansion: if you Taylor-expand this function, you see pretty easily that you can separate things out and view it as a scalar product of feature vectors that are infinite-dimensional. This is the Gaussian kernel, also called the radial basis function kernel or the squared exponential, and it is not the only one: there are many other kernels that satisfy this property.

All right, these next two slides are just here so that when you want to re-derive Bayesian linear regression you can check that the derivation is correct, and that my intuition is correct too. In Bayesian linear regression everything is Gaussian: the likelihood is Gaussian, the prior is Gaussian, so the posterior must be Gaussian, with a certain mean and covariance. The form we get is pretty easy to interpret, because it looks very much like the least-squares solution, except that it is now regularised by the covariance matrix of the prior. You can also do predictions, and then, magically, you see that the parameters disappear. We know that this is good, and we know that this is possible thanks to the Bayesian treatment.
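Two pieces of this stretch written out for reference: the Taylor-expansion argument for the Gaussian (RBF) kernel, shown in one dimension for simplicity, and the standard weight-space posterior and predictive that those derivation slides refer to (standard Bayesian linear regression results, in the notation above):

```latex
% (1) The RBF kernel as a scalar product of infinitely many features:
k(x_i, x_j) = \exp\!\left(-\frac{(x_i - x_j)^2}{2\ell^2}\right)
 = e^{-x_i^2/2\ell^2}\, e^{-x_j^2/2\ell^2}\sum_{k=0}^{\infty}\frac{1}{k!}\left(\frac{x_i x_j}{\ell^2}\right)^{k}
 = \sum_{k=0}^{\infty}\phi_k(x_i)\,\phi_k(x_j),
\qquad \phi_k(x) = \frac{x^{k}}{\ell^{k}\sqrt{k!}}\, e^{-x^2/2\ell^2}

% (2) Weight-space posterior and predictive:
p(W \mid \mathbf{y}, X) = \mathcal{N}(W;\, \mu_W, \Sigma_W), \quad
\Sigma_W = \left(\lambda^{-1}\Phi^{\top}\Phi + S^{-1}\right)^{-1}, \quad
\mu_W = \lambda^{-1}\,\Sigma_W\,\Phi^{\top}\mathbf{y}

p(y_* \mid \mathbf{x}_*, \mathbf{y}, X)
 = \mathcal{N}\!\left(y_*;\; \phi(\mathbf{x}_*)^{\top}\mu_W,\;
   \phi(\mathbf{x}_*)^{\top}\Sigma_W\,\phi(\mathbf{x}_*) + \lambda\right)
```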
The predictions don't involve the parameters anymore, because we are now averaging over an infinite number of parameter configurations, weighted by this distribution here, which is the one compatible with the data.

If we think about f now, instead of thinking about W, then again we know that the posterior over f must be Gaussian, and if you do the calculation (take the prior and the likelihood, multiply them together and rearrange terms), what you get is that the covariance over f has this form, which is a bit ugly to look at, and the mean has this other ugly form here. But again, similarly to before, the labels appear in the computation of the mean, and if you look at the structure, with this Phi-transpose Phi plus S-inverse business and K being Phi S Phi-transpose, you can figure out that these quantities are the equivalents of those, just in this new representation. Same thing for predictions: in these expressions you have this inverse of K plus lambda I that pops up many times, whereas before what we had was the inverse of Phi-transpose Phi plus S-inverse.

One way to derive this, and I really like this visualisation of what happens with the matrices, is to establish the connection between the two derivations using the so-called Woodbury identity, which I never remember because it is a very long expression. But I like this picture, because it shows that you can turn the inverse of an N by N matrix which has this structure, a diagonal plus some sort of low-rank decomposition, into the expression on the right: the inverse of the diagonal minus the diagonal inverse times some other terms, and the key is that you turn the difficult N by N inverse into an inverse which is much cheaper. In this picture N, the number of observations, would be four, and the number of basis functions would be two. So if you work with f, you have to deal with matrices which are N by N, whereas if you work with W you deal with matrices which are D by D, where D is the number of basis functions.

Why is this useful? Because you can pick the complexity you want. If it is cheaper for you to work with the weight-space representation, that costs you D cubed: you have to invert a matrix, and inverting means a Cholesky decomposition and then back-substitution and so on. If you work with K, you have the opposite: N by N matrices, so N cubed. And this is already a hint that when you think about Gaussian processes, this complexity is going to hit you really hard. If you talk to people working purely in deep learning and you tell them that your algorithm scales as N cubed, they laugh at you; people don't even want to scale with N anymore, they just want mini-batches. So this is pretty useless unless we come up with ideas to scale it up.
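For completeness, the Woodbury identity and the function-space (GP) predictive that the slides compare; applying the identity to K = Phi S Phi-transpose shows how the N by N and D by D inverses are related:

```latex
% Woodbury identity: trade an N x N inverse for a D x D one.
(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}

% Applied with A = \lambda I_N, U = \Phi, C = S, V = \Phi^{\top}:
\left(\Phi S \Phi^{\top} + \lambda I_N\right)^{-1}
 = \lambda^{-1} I_N - \lambda^{-2}\,\Phi\left(S^{-1} + \lambda^{-1}\Phi^{\top}\Phi\right)^{-1}\Phi^{\top}

% Function-space (GP) predictive for regression, with \mathbf{k}_* = k(X, \mathbf{x}_*):
\mathbb{E}[f_*] = \mathbf{k}_*^{\top}\left(K + \lambda I\right)^{-1}\mathbf{y},
\qquad
\mathrm{var}[f_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top}\left(K + \lambda I\right)^{-1}\mathbf{k}_*
% Cost: O(N^3) for the N x N inverse versus O(D^3) for the D x D one.
```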
And this is really the content of today's talk: this is all good, Gaussian processes work and so on, but we need to do something more to really push them beyond their limits and to be competitive with deep learning.

Another way I like to introduce Gaussian processes is as follows. This is a result that was already proposed in 1996 in the thesis of Radford Neal, and it is a connection between shallow neural networks and Gaussian processes. You can think of a Gaussian process as this other kind of model: imagine you have a bunch of inputs (this is not a graphical model, this is a neural network diagram), and you have weights connecting them to an infinite number of hidden neurons, and then these infinitely many neurons are funnelled back into one output neuron by another vector of weights. Imagine that everything is Gaussian again, so all the weights have zero-mean distributions, with variance alpha_0 for W_0 and alpha_1 for W_1. Since we are taking a weighted sum of an infinite number of things, the central limit theorem implies that f must be Gaussian, and because the random variables multiplying the infinitely many hidden units are zero mean, f is going to have zero mean. If we want to compute the covariance, we have to solve this horrendous-looking expectation, but it is not so bad actually: the expectation under p(W_1) acts only on part of the expression, which makes alpha_1 pop out, and we are left with an expectation under p(W_0) of phi(X W_0) times phi(X W_0) transpose. The problem is that we now have something Gaussian, p(W_0), going inside things that are potentially nonlinear, because phi, if we think about neural network activations, can be ReLU, sine, cosine and so on. So in general this looks like a really tricky integral to solve, but it turns out that for a number of choices of basis functions there actually is an expression for what the covariance looks like. It is possible to show that there are choices of phi such that you retrieve the Gaussian kernel, or the Matern, or the arc-cosine, Brownian motion and so on. This is another way to think about an infinite number of basis functions that are somehow combined together in order to model a function f, and here again you see this aspect of infinity. In Radford Neal's thesis there is a chapter about this which is really beautiful, so if you're interested in Gaussian processes I would strongly encourage you to look into it.

Okay, so far so good. Any questions? Not too easy, right? Good. I was talking to a bunch of you yesterday at the posters and I realised that most of you are pretty advanced, but I also think it's good to at least cover the basics so that we're all at the same level.
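A small numerical sketch of Neal's argument (not from the talk; the tanh activation, prior variances and widths below are arbitrary choices for illustration): sample many random one-hidden-layer networks and watch the output mean and covariance stabilise as the hidden layer gets wide.

```python
# Sketch of Radford Neal's observation: as the hidden layer of a random
# one-hidden-layer network gets wide, the outputs at a fixed set of inputs
# behave like a draw from a zero-mean Gaussian process.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[-1.0], [0.0], [0.5], [2.0]])       # a few 1-D inputs
alpha0, alpha1 = 1.0, 1.0                         # prior variances for W0 and W1

def sample_outputs(width, n_draws=2000):
    """Sample f(X) for many random networks f(x) = sum_k w1_k * tanh(w0_k * x)."""
    W0 = rng.normal(0.0, np.sqrt(alpha0), size=(n_draws, 1, width))
    W1 = rng.normal(0.0, np.sqrt(alpha1 / width), size=(n_draws, width, 1))
    hidden = np.tanh(X[None, :, :] @ W0)          # shape (n_draws, 4, width)
    return (hidden @ W1)[..., 0]                  # shape (n_draws, 4)

for width in [1, 10, 100, 1000]:
    F = sample_outputs(width)
    emp_cov = np.cov(F.T)                         # empirical 4 x 4 covariance of f(X)
    print(f"width={width:5d}  mean~{F.mean():+.3f}  cov[0,1]~{emp_cov[0,1]:+.3f}")
# The empirical mean stays near zero and the covariance entries stabilise as the
# width grows, which is the central-limit-theorem argument from the lecture.
```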
All right, so what are the challenges for Gaussian processes? This is pretty much what I've been working on for the last, I would say, almost ten years, and it is really what the community has been trying to tackle for the last few years; things have developed in various ways, and I'll try to tell you the whole story as we go along.

The first problem is kernel design. Previously we had the problem of choosing the basis functions; now we've turned it into the problem of choosing a kernel, because the kernel determines the infinite set of basis functions you implicitly use. You also have parameters inside your kernel function, to give it more flexibility, and you may want to learn those parameters. So this is already one of the problems: we need to do good kernel design. In a lot of papers about support vector machines and about Gaussian processes, especially early on, people used the Gaussian kernel all the time, just because. There has obviously been work on kernel design, but it's a fundamental problem, because kernel design is critical for applications, and we know that: deep learning works because you learn representations.

Second, GPs might be too expensive. We talked about the need to factorise this covariance matrix K, which is N by N, so N cubed, and you also have to store K. We worked a lot on methods that let you avoid storing K and avoid being truly N cubed, with conjugate-gradient type ideas, but you still need to deal with these large matrices, and even with conjugate gradients each iteration is a matrix-vector product, so it's N squared anyway; still not something we really like.

Third, GPs might not even be tractable. Think about Bayesian linear models: if you have a regression problem everything is fine, the posteriors are analytically available, but if you have a classification problem you need generalised linear models, and generalised linear models are not tractable in a Bayesian way; you need approximations. The reason is pretty simple, and I can show it for the Gaussian process case: the likelihood p(y|f) is not Gaussian (it is Bernoulli, multinomial, Dirichlet, Poisson, you name it), while this guy here, the prior, is Gaussian. They don't talk to each other, there's no conjugacy, so we cannot solve this integral and we need to approximate. In the graphical model the structure is the following: we have the observed inputs X, hyperparameters theta for the latent functions f, and then the labels y, and it means we cannot analytically integrate out this level of latent variables in the model.

I also wanted to talk a little bit about kernel design. If you use a Gaussian kernel, for example, these are the kinds of functions you can draw from the prior: you get these sort of sinusoidal-looking things, and of course you can play around with the hyperparameters to make the length-scale longer, the amplitude larger, and so on. These functions look very smooth, infinitely differentiable. Then we have the Matern, for example, where you see the draws are already more rugged. And if you have a polynomial kernel, you get yet another kind of behaviour.
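A minimal sketch of drawing functions from a GP prior under a few of the kernels just mentioned (hyperparameter values are assumptions for illustration):

```python
# Build a kernel matrix on a 1-D grid and sample functions from N(0, K).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)[:, None]
d = np.abs(x - x.T)                                # pairwise distances

def rbf(d, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def matern32(d, lengthscale=1.0, variance=1.0):
    a = np.sqrt(3.0) * d / lengthscale
    return variance * (1.0 + a) * np.exp(-a)

def polynomial(x, degree=3, variance=1.0):
    return variance * (x @ x.T + 1.0) ** degree

def sample_prior(K, n_samples=3, jitter=1e-6):
    # Jitter (scaled to the kernel magnitude) keeps the Cholesky factorisation stable.
    scale = np.mean(np.diag(K))
    L = np.linalg.cholesky(K + jitter * scale * np.eye(K.shape[0]))
    return L @ rng.normal(size=(K.shape[0], n_samples))

for name, K in [("RBF", rbf(d)), ("Matern-3/2", matern32(d)), ("polynomial", polynomial(x))]:
    f = sample_prior(K)                            # each column is one function draw
    print(name, "prior draws, shape:", f.shape)
```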
You can really see that the number of possible kernels is pretty large; these are just three examples, but there are millions of kernels, and you can also combine kernels by summing them or multiplying them together and so on. So the problem we were trying to solve in the first place for Bayesian linear modelling has been pushed back into this problem of kernel design.

So what if we compose? This is a simple idea: let's take simple things and compose them, and maybe simple things composed together will give us something complex. This is really the idea of deep learning, right? You have parametric functions in layer one, the output of that is multiplied by other parameters and passed through nonlinearities, and so on; in mathematics this corresponds to composition of functions. So maybe we can compose Gaussian processes. What does that mean? Here I have an example with a Gaussian process f and a Gaussian process g. If we compose f and g, the graphical model becomes something like this: a hierarchy, a cascade of Gaussian processes, where the output of one Gaussian process becomes the input of the next. You see that by composing the two processes we start getting something that looks, I would say, crazier: here the uncertainty looks pretty much the same almost everywhere, but here you start getting regions with wide uncertainty and regions with narrow uncertainty, and the length-scale of the functions changes abruptly. So with this idea of deep Gaussian processes we increase the representational power of our Gaussian processes, but at the same time we complicate the model.

This was proposed at AISTATS 2013 by Damianou and Lawrence, and it was somehow already discussed in the thesis of Radford Neal, who was thinking: well, if I can have an infinite number of neurons, I can also have an infinite number of layers, why not? There is also a beautiful paper, with fantastic illustrations, by Duvenaud and others at AISTATS 2014 about what happens when you compose these processes many times; here I'm just giving a little preview of that. Deep is not necessarily good, and we know that, right? This is why people came up with ResNet. The idea is that if you keep composing these processes through the layers, you get some sort of saturation effect: in a deep Gaussian process with ten layers, at the tenth layer the distribution is very much saturated. So here you get functions that look reasonable, and here you get things that are not very useful as priors over functions. The idea, then, is to shortcut, to feed the inputs into the layers as well. By doing that we do what ResNet is also doing, in a way, and we start getting functions that look crazy but still make sense: you avoid the saturation effect. There is also other related work on this that is really good, by Alex Matthews I think, although I'm not sure; I should check the exact reference and where it was published.
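A short sketch of the composition idea in code (the kernel choice and length-scales are assumptions for illustration): draw a function from a GP prior, then use that draw as the input locations of a second GP, which gives a two-layer deep GP prior sample.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_prior_draw(inputs, lengthscale=1.0, jitter=1e-6):
    K = rbf_kernel(inputs, inputs, lengthscale)
    L = np.linalg.cholesky(K + jitter * np.eye(len(inputs)))
    return L @ rng.normal(size=len(inputs))

x = np.linspace(-3, 3, 300)[:, None]
f = gp_prior_draw(x, lengthscale=1.0)             # layer 1: f ~ GP(0, k(x, x'))
g = gp_prior_draw(f[:, None], lengthscale=0.5)    # layer 2: g ~ GP(0, k(f, f'))
# Plotting g against x shows the effects described in the talk: regions where the
# effective length-scale changes abruptly and where the variability widens or narrows.
```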
This idea of shortcutting the inputs to the different layers, to improve the way we model these functions, was also discussed in those papers, and it is similar to ResNet, I would say. But now here is the problem: we made the model more flexible, which is nice, but we pay for it, like everything in life; if you want something good, you have to pay for it, and this is the price. If you just want to compute the marginal likelihood to do model selection, say for a deep Gaussian process (now we're Bayesian, so we can compute the model evidence and do model selection), you would need to compute this integral here. And if this integral was hard before, now it is even worse, because the latent functions now enter the evaluation of the kernel at the next layer, and then you have to take an inverse, so these distributions here are pretty horrendous, and you have to integrate out the latent variables at all layers. If this integral was hard for one layer, it becomes even harder now. So no wonder not many people do deep Gaussian processes, because this is pretty hard.

Let me take a question first, and then I'll tell you how we can avoid having to deal with the integrals in the previous slide. [Question from the audience: in the earlier slide, why do the outputs get flatter and flatter; is it because the subsequent GP outputs start to look similar?] There are many effects that lead to that. One is the length-scale of the Gaussian processes: if you start composing things that have fast oscillations, then as you compose, the oscillations get even more frequent. Here, just for fun, I kept the same length-scale, the same amount of variability, at each layer, and if you keep composing, things end up flattening out completely. The other effect is that the distribution of the output gets farther and farther from Gaussian: if you slice this drawing at any point, at the first layer the distribution across the slice is Gaussian, but as you move through the layers it gets farther from Gaussian, and it can get skewed in all sorts of ways. Again, if you check that paper out, there are some fantastic illustrations that really show this effect. Thanks.

[Question from the audience about why the skip connections help.] Let me repeat the question: why does adding this skip connection help? I think it's similar to what they do in ResNet: you're enforcing your function to be not only a function of the previous layer, but a function of the previous layer and the input, so the input is always appearing there. By doing that you avoid this effect of compressing and changing the distribution from Gaussian to something completely non-Gaussian and crazy, because part of the function is always going to be due to the effect of this f(X) that is fed back in.
Any more questions? Can I go on? Because now comes the fun part. All right, so why do people do deep learning and not Gaussian processes? This is kind of interesting. Sorry, my microphone is maybe too loud. I had a PhD student who graduated this year; he started his thesis with me four years ago, and I remember that four years ago it was a different world: there was no TensorFlow. It seems like a small thing, but that really changed a lot, because these days you have automatic differentiation, and automatic differentiation really changed the way we think about developing new models. This was coupled with the fact that there have been significant advances in regularisation (dropout, for example, has been developed in a much more advanced way, I would say); GPUs have become really commonplace; the implementations of deep neural networks have become really efficient because it's all based on tensor products; TPUs have been developed, so you can go even faster on the kinds of computations you have to repeat over and over; you can mini-batch, so thanks to stochastic optimisation you can update your model by just feeding it batches of data; and of course there is the large representational power. The wow effect we got in the deep learning revolution has really been on image classification problems, where there was a boost in performance; there have been developments in other fields too, of course, but I think the fact that we could use a representation specific to the problem at hand is what really made the breakthrough for deep learning in image classification.

When we started that thesis, the Gaussian process community was at the beginning of this game of trying to catch up with deep neural networks, and I'm going to argue that today we have pretty much closed the gap. We now have implementations online (if you go to my web page, which I'll show at the very end, we have code online): implementations of deep Gaussian processes that can do mini-batch-based learning, that can exploit GPUs, everything implemented in Python, TensorFlow, you name it, and we can also do convolutional deep Gaussian processes. So it's been a game of catching up with that community, but we've done it. And now that we've done it, we don't know what to do anymore; I'm kidding, there's still a lot of work to do, of course, but it's interesting how, looking back only four years, it was a completely different world, because we had none of this in place.

All right, I shouldn't spend too long on this, because everybody knows stochastic optimisation: you define your lower bound and then take gradients with respect to the variational parameters, and you rely on the property that the expectation of your stochastic gradient is the actual gradient, that is, your stochastic gradient is unbiased. What I mean by this is that if you started from here, you would get there under exact gradients, but instead you have a stochastic routine.
Whenever you interrogate it, it gives you a different value, and the average of those values is the true value of the gradient. In that case, we know from a very old piece of work from 1951, the Robbins and Monro result, that if you run this stochastic optimisation, it is going to converge to a local optimum, provided that your step size decreases over time. This is a well-established fact, and we can apply it to stochastic variational inference: when we have a lower bound, we apply stochastic optimisation and we are going to reach a local optimum of the lower bound.

Just for fun, and because this is a PDF animation which I'm really proud of (it only works in Acrobat, so if you try it in Preview it doesn't work), what I'm showing here is that my implementation of stochastic variational inference works. Of course I tried it on a Bayesian linear model, because for that I know what the posterior looks like: the posterior is this black Gaussian here, and when I run the optimisation my approximating Gaussian homes in on the true posterior. We know that for Bayesian linear models this is all good, because the posterior is Gaussian, so my approximation, which is Gaussian, is actually going to land right on the posterior, and the lower bound will become tight, in the sense that it will be exactly the model evidence. This was just to show you that you can code this up in a few minutes; it's pretty easy to do.

But is there any hope for GPs and deep GPs? What I showed you before is that we have these complex integrals to deal with. What is the power of mini-batching? Whenever you have an objective which is a sum of functions that each depend only on the i-th data point, you can get a stochastic estimate pretty easily: you just subsample, take a subset of the terms, and inflate by the ratio of how many there are to how many you've taken, and this gives you an unbiased estimate of your sum. In Gaussian processes, even shallow Gaussian processes, this is basically impossible; I thought really hard about this, but it's not an easy problem, and the reason is that the prior over the latent variables has this form: the exponential of a quadratic form. You say, okay, a quadratic form is a double sum, a sum over i and a sum over j of the terms inside; but you see that f-transpose and f have K-inverse in the middle, which means that the coefficients you need in this sum are the entries of the inverse of K. So even if you want to mini-batch, you first need to invert K; there's no escape from that. Another thing you could think of is to take a mini-batch and do a GP just for that mini-batch, but in that case you're taking a small square block out of a big dense matrix, so you're throwing away a lot of the covariances across data points. No matter how you think about it, mini-batching like this is pretty hard. So the question we asked ourselves was: how can we actually exploit what made deep learning successful for practical and scalable learning of deep Gaussian processes?
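The contrast just described, written out: a sum over data points admits an unbiased mini-batch estimate, while the GP log-prior couples all points through the inverse of K:

```latex
% Mini-batching works when the objective is a sum over data points:
\mathcal{L}(\theta) = \sum_{i=1}^{N} \ell_i(\theta)
\;\Longrightarrow\;
\hat{\mathcal{L}}(\theta) = \frac{N}{|B|}\sum_{i \in B} \ell_i(\theta),
\qquad
\mathbb{E}_{B}\big[\hat{\mathcal{L}}(\theta)\big] = \mathcal{L}(\theta)

% The GP log-prior does not have this form: the quadratic form couples all points
% through K^{-1}, so you cannot sub-sample it without inverting K first.
\log p(\mathbf{f}) = -\tfrac{1}{2}\,\mathbf{f}^{\top} K^{-1} \mathbf{f}
 - \tfrac{1}{2}\log|K| - \tfrac{N}{2}\log 2\pi,
\qquad
\mathbf{f}^{\top} K^{-1} \mathbf{f} = \sum_{i=1}^{N}\sum_{j=1}^{N}\big[K^{-1}\big]_{ij} f_i f_j
```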
In the interest of time, but also in order to give you the main ingredients, I decided to have a short part where I tell you a little bit about the main ingredients of inference in Gaussian processes. The idea is to show you how people came up with approximations and inference methods for shallow Gaussian processes, and after this we'll move on to how to apply them to deep Gaussian processes.

The simplest idea is the following. Start from the posterior: in general the posterior can be non-Gaussian, because it is the product of a prior and a likelihood, and the likelihood might not be Gaussian, so the posterior might not be Gaussian. What do we do in this case? We approximate it with a Gaussian. This is the nicest thing to do, because once you have a Gaussian you can make predictions easily, you can sample from it, everything is nice and easy. I don't know exactly which was the first paper on the Laplace approximation for GPs; I'm sure there is one on multi-class classification, so already beyond binary classification, and probably in the statistics literature this was done in the seventies or so. The Laplace approximation is pretty simple: you take your posterior, locate the mode, and then approximate the shape of the posterior around it with a Gaussian. The matching is based on the Hessian, and everything is easy: you take the log of your posterior (which of course you don't know in full, but you know the prior and the likelihood, so you can take the log of those two), you locate the maximum first, then you take the Hessian and invert it, and that gives you the covariance you should take for the Gaussian that approximates the posterior in the best possible way, in this sense.

Then there is another technique for computing our q, called expectation propagation. One of the first works on this for GPs is a technical report by Matthias Seeger, who I believe has been one of the most influential researchers in Gaussian processes; his thesis is a gold mine, full of good stuff. In that report he was trying to adapt expectation propagation, which had been published by Minka shortly before, around 2001, to Gaussian processes; it is still a technical report, he never published it. The idea is pretty simple: your posterior is proportional to the likelihood times the prior; you keep the prior the way it is, but you introduce pseudo-likelihood terms that approximate the individual likelihood terms, and, guess what, they are Gaussian. There is a technique to update each of these Gaussian factors in turn, based on KL divergences and so on; I'm not going to go into the details, but the gist of it is that you approximate each of the factors of the likelihood in a Gaussian fashion.

Then there is work on the variational approach, by Opper and Archambeau I think; they might not have been the first, but it is definitely one of the most complete works on this. There, instead, you try to minimise the KL divergence between q(f) and your posterior.
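For reference, the two Gaussian approximations just mentioned, in the usual GP notation (K is the prior covariance of f):

```latex
% Laplace: match the mode and the curvature (Hessian) of the log-posterior.
\hat{\mathbf{f}} = \arg\max_{\mathbf{f}}\Big[\log p(\mathbf{y}\mid\mathbf{f})
 - \tfrac{1}{2}\mathbf{f}^{\top}K^{-1}\mathbf{f}\Big],
\qquad
q(\mathbf{f}) = \mathcal{N}\!\big(\hat{\mathbf{f}},\,(K^{-1} + W)^{-1}\big),
\quad W = -\nabla\nabla_{\mathbf{f}}\log p(\mathbf{y}\mid\hat{\mathbf{f}})

% Variational: choose the Gaussian q that minimises the KL divergence to the posterior,
% equivalently maximise a lower bound on the log marginal likelihood.
q^{\star} = \arg\min_{q}\,\mathrm{KL}\big[q(\mathbf{f})\,\|\,p(\mathbf{f}\mid\mathbf{y})\big],
\qquad
\log p(\mathbf{y}) \;\geq\; \mathbb{E}_{q(\mathbf{f})}\big[\log p(\mathbf{y}\mid\mathbf{f})\big]
 - \mathrm{KL}\big[q(\mathbf{f})\,\|\,p(\mathbf{f})\big]
```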
This variational approach is very akin to what people do with variational inference in neural networks. Of course, people have also done Markov chain Monte Carlo; we have a sort of review from 2013 about this. That's what I was doing before moving on to deep Gaussian processes: a lot of Markov chain Monte Carlo for GPs, and my life was miserable, because these things were very slow, taking weeks to converge and so on, so I'm not going to tell you what my life was like back then. But it's worth noting that you can do it, and there are actually fast algorithms if you only care about sampling, about getting a Markov chain Monte Carlo sampler for f.

So that was about approximations that let you deal with the non-Gaussianity of the posterior. But now let's say we want to scale these models up to lots of data points, because so far we were just talking about q(f), and we still have to deal with this inverse of K. Can we avoid the inverse of K? Of course we can, if we make approximations. Sparse Gaussian processes use the so-called Nystrom approximation (some people pronounce it differently; I have no idea how to pronounce it). You introduce a number of latent variables, which we call u, at new input locations, which we call Z, and the idea is that you can do a low-rank approximation of your covariance by considering the covariance between the inputs in your dataset and the inputs you just introduced, then the inverse of the covariance evaluated among the inducing points, and then again the covariance between Z and X. In the kernel literature this has lots of names; in Gaussian processes it is also called the fully independent training conditionals (FITC) approximation. Of course, if you do this approximation the diagonal of K is not going to be correct, and you can correct the diagonal with the partially independent training conditionals (PITC) variant; I'm not going to tell you how it's done, just so you know it exists.

Then, later on, came what I think really changed the game for Gaussian processes: the work by Titsias at AISTATS 2009, the variational sparse Gaussian process. This is a pretty cool idea: instead of introducing approximations only for f, or only for u, he introduces a distribution over f and u jointly, a joint q over f and u, and the key is to keep the conditional distribution of f given u; because everything is Gaussian, this conditional is also Gaussian, so everything is nice and easy, and then the focus is on this q(u). The beautiful result is that if you work out the lower bound, there are some cancellations of terms that allow you to express the variational lower bound for this sparse GP in a very computationally efficient way, so you end up with things that only involve matrices of size M by M. [Question from the audience: is this the inducing-points paper?] Yes, that's the inducing-points paper.
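The approximations just described, written compactly (Z are the M inducing inputs and u the corresponding latent values):

```latex
% Nystrom-style low-rank approximation of the covariance:
K_{XX} \;\approx\; K_{XZ}\, K_{ZZ}^{-1}\, K_{ZX},
\qquad K_{XZ} \in \mathbb{R}^{N \times M},\; K_{ZZ} \in \mathbb{R}^{M \times M}

% Variational sparse GP (Titsias, AISTATS 2009): keep the exact conditional p(f | u)
% and learn a Gaussian q(u); the bound only requires M x M inverses, so the cost
% drops from O(N^3) to O(N M^2).
q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u}),
\qquad
q(\mathbf{u}) = \mathcal{N}(\mathbf{u};\, \mathbf{m}, \Sigma)
```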
There are also other techniques, like random feature expansions, which have been very popular in the kernel literature; we're going to talk about those in a second. There is also a lot of work on exploiting structure, like Kronecker structure: if your inputs are on a grid, you can do things very fast. There are connections with stochastic differential equations (in one dimension, for example, you can solve things very fast), and people have tried to extend this to multiple dimensions. There are tensor-structured GPs, where you exploit structure so that you can use tensor factorisations, which are fast, and so on. I have a long list here, but it's not exhaustive, of course.

So here is the roadmap, really, of what happened in the literature on deep Gaussian processes: people took the various approximations that were proposed for GPs and ported them to deep Gaussian processes. If you do variational inference plus the Titsias sparse GP, you get a series of works proposed mainly by Neil Lawrence's group, and recently there was the work by Salimbeni and Deisenroth, which I think is one of the best-performing inducing-point approximations for deep Gaussian processes. Then there is expectation propagation plus FITC, the Nystrom-style approximation we talked about before, which was at ICML 2016. People have also combined Markov chain Monte Carlo with that, at NeurIPS 2018. And we've done some work on variational inference with random-feature-based approximations. I'll try to briefly guide you through these, although I'm not going to go too much into the details, because it's very technical; I just want to give you an overview of what's going on in the literature.

This is the kind of graphical model you get when you have, say, a Gaussian process with two layers plus the output layer: we introduce new variables, which we call Z, at each layer, we have hyperparameters for the kernel, and now we also introduce these latent variables, so the model looks pretty horrendous. But it turns out you can actually do things pretty fast with this. You can think of the variables Z as things that we optimise (that's why I colour them the same as the kernel hyperparameters theta), because they are extra variables you can optimise. You can then target your q(u), and you can assume (this is a rather technical point) that your q(u) is proportional to the prior times an average data factor. If you remember what I told you before about expectation propagation, the posterior is the prior times the likelihood, and the likelihood is approximated with a product of Gaussian terms; here you have a single average data factor, g to the power of N, which reduces the memory and allows the objective to factorise, so what you get in the end is mini-batchable, and you can implement it in frameworks like PyTorch and so on.

For the inducing-points approach, what we talked about before was this idea of keeping the conditionals exact and focusing on q(u).
So the target is really the joint distribution of f and u at each layer, given what happens in the layer before. Again, if you go through this treatment and derive the lower bound (it's pretty messy), you end up with something where the lower bound factorises across training points; and the other nice thing is that it also factorises in the sense that the marginal depends only on the i-th data point across all the layers. This is another nice property that has been exploited to really go fast. Any questions on this? I've gone through it quickly, just because I don't want to bore you with super technical details; I just wanted to give you the pointers in case you're interested.

All right, now I'd like to spend a few minutes on this work on random features, because I think it's quite cool, and it opens up a lot of opportunities that I'd like to talk about later. Imagine you have a kernel which is shift-invariant. In that case, Bochner's theorem allows us to write this kernel as the Fourier transform of a positive measure, which means that, if you look at it, it is some sort of expectation under a distribution over frequencies. So we can think of doing a Monte Carlo estimate of that expectation, and you see where I'm going, maybe, because then K is written as a scalar product of basis functions. So if I have a Gaussian process with a covariance which is shift-invariant, say the Gaussian kernel, whose spectral measure is again a Gaussian, then I can draw a bunch of coefficients Omega from a Gaussian, and these become the coefficients of my basis functions. In other words, I can linearise my Gaussian process thanks to this Fourier-transform relationship. In the Gaussian case you get trigonometric functions: you would get a cosine and an imaginary sine, but projecting everything onto the real axis, you realise that the features you need have to contain a cosine and a sine.

To make it simple, what I can do now is apply this idea at each layer of a deep Gaussian process. If I do that, I apply cosine and sine element-wise to my input multiplied by these Omega coefficients, and so on, and remember what we talked about at the beginning: we can think of these functions as combinations of basis functions, weighted by a W which has an identity prior. So effectively, if we do this random-feature approximation for each Gaussian process at each layer, what we get looks like a neural network diagram, what we think of as a Bayesian deep neural network. Here I have the inputs, I multiply by some Omegas, which are random and drawn from a Gaussian, say; I get the hidden units; and then I combine them back with W.
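A minimal sketch of the random-feature idea for a single RBF kernel (the sizes and length-scale below are arbitrary): sample frequencies from the kernel's spectral density, build cosine and sine features, and check that the feature inner products approach the exact kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D_in, D_rf, lengthscale = 5, 2, 2000, 1.0
X = rng.normal(size=(N, D_in))

# Exact RBF kernel on the inputs.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * d2 / lengthscale**2)

# Random features: Omega ~ N(0, 1/lengthscale^2), Phi = [cos(X Omega), sin(X Omega)] / sqrt(D_rf).
Omega = rng.normal(0.0, 1.0 / lengthscale, size=(D_in, D_rf))
XO = X @ Omega
Phi = np.concatenate([np.cos(XO), np.sin(XO)], axis=1) / np.sqrt(D_rf)
K_approx = Phi @ Phi.T

print("max abs error:", np.abs(K_exact - K_approx).max())   # shrinks as D_rf grows
```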
That combination gives me my first layer of latent functions, and then I can go on: I can re-expand them using this trick of the Fourier transform of the covariance, and so on, and each of these Omegas is controlled by the length-scale of the Gaussian process at that layer, by the parameters of the kernel. If you do this, you end up with something amazing, which is an approximation of a deep Gaussian process as a Bayesian deep neural network. And of course we know how to deal with Bayesian deep neural networks; we have fantastic tools for that, stochastic variational inference and so on, so maybe we can apply all those ideas here, and this is really what we did.

So here are a bunch of results, from 2017, so from about two years ago, but still you can see the comparison. This was the competitor, the deep Gaussian process with expectation propagation; this is a variational Gaussian process; this is a plain deep neural network; and these are deep Gaussian processes with the RBF covariance, which gives you sine and cosine activations, and with the arc-cosine kernel, which gives you ReLU activations (that's the connection there). You see that we achieve lower error rates and better test likelihoods on several datasets, and what is more striking is the time: by the time we had converged, the competitor could only do something like two iterations. So this was the state of the art for deep Gaussian processes as of a year or so ago. This was on MNIST, but with no convolutions yet, so you see the error rate is still pretty high.

Then we also did the variant of MNIST with eight million images, and back then this was a big deal, because there were not many works that could get above 99% accuracy on MNIST without convolutions. We did it, and on eight million data points, where a plain GP is basically impossible; at the same time we were also developing other approximations to actually do that, which we published at UAI 2017. This is just to say that this was really one of the breakthroughs in this field, I would say. Also, because we're Bayesian, we can do model selection, so we can use the lower bound (although some people have discouraged this practice of using the lower bound for model selection) to determine the number of layers. We were running on five million observations with up to thirty layers; that might not sound like a big deal, but being able to run an approximate deep Gaussian process on that many data points with that many layers is something that had not been done before. And you see that if you look at the test error rate and the mean negative log-likelihood, the best performance is obtained with ten layers, and the negative lower bound seems to agree with that. Questions? So effectively, with this technique, if you think of Gaussian processes as having an infinite number of neurons funnelled back into f, as I showed you at the beginning, all we do is a truncation, and the truncation is driven by this Fourier-transform idea.

All right, I wanted to draw your attention now to this other piece of work. Gaussian processes are cool: you get all these distributions, these nice outputs, nice functions and so on, and people always take for granted that they are good for quantification of uncertainty.
All right, so I wanted to draw your attention to this other work that we did. Because Gaussian processes are cool — you get all these distributions, these nice outputs, these nice functions and so on — people tend to take for granted that they are good for quantification of uncertainty. But what does "good quantification of uncertainty" mean? That's the question, and it's one of the questions we were asking ourselves. We have been working with Gaussian processes for a long time, and at some point you start wondering: is it really true? You take these things for granted because everybody says Gaussian processes are good for quantification of uncertainty — this morning you did Bayesian optimization with them, why? because they give good quantification of uncertainty — but what does that mean? So we asked ourselves the question of calibration, which I think is one of the many possible ways to assess whether we are doing a good job at estimating probabilities.

So here is a standard convolutional net — I think it is a LeNet-style architecture — where you put some fully connected layers at the end and you get the output. There is a 2005 work by Rich Caruana and colleagues showing that neural networks were well calibrated — I'll explain in a second what I mean by calibration — but then there was this other work by Guo and others in 2017 showing that, with the increase in depth of modern convolutional networks, we have lost this property of good calibration. What happens is that even if you are using the right loss for your problem — cross-entropy for classification, quadratic loss for regression — you can still get problems with calibration: your predictions can be too overconfident, so you assign probabilities of 0.9 to things that shouldn't be 0.9. This is due to the fact that these models are getting deeper and deeper, so regularization has to be handled in a different way and becomes something important to look at.

So what do I mean by calibration? This is my picture of reliability diagrams, which are one way to analyze the calibration properties of classifiers. Imagine you have a classifier that spits out probabilities of class labels — for now let's think about binary classification, plus one or minus one. What you can do is take, for example, all the predictions for which the classifier said 0.8. If that probability is a correct estimate of the probability of class label one, then you would like it to be reflected in the test set: if you measure the fraction of positives among all the predictions where you said 0.8, you would like that fraction to be 0.8. That is what we mean by calibration. You can of course do this exercise for all values from zero to one. What can happen is that your reliability diagram looks like this: when you say 0.8, the actual fraction is something higher, which means your classifier is under-confident; or you can get shapes the other way around, which means your classifier is overconfident — and that is roughly what the reliability diagrams of modern convolutional networks look like.
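As an illustration of the reliability-diagram idea just described, here is a small sketch (my own, not the paper's code) for a binary classifier: bin the predicted probabilities of the positive class and compare each bin's average confidence with the observed fraction of positives.

```python
import numpy as np

def reliability_diagram(probs, labels, num_bins=10):
    """Bin predicted probabilities of class 1 and compare each bin's mean
    confidence with the observed fraction of positives in that bin."""
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    which_bin = np.clip(np.digitize(probs, bins) - 1, 0, num_bins - 1)
    confidence, accuracy, weight = [], [], []
    for b in range(num_bins):
        mask = which_bin == b
        if not mask.any():
            continue
        confidence.append(probs[mask].mean())   # e.g. predictions around 0.8
        accuracy.append(labels[mask].mean())    # fraction of positives in the bin
        weight.append(mask.mean())              # fraction of points in the bin
    return np.array(confidence), np.array(accuracy), np.array(weight)

# A calibrated classifier has accuracy ~= confidence in every bin,
# i.e. the reliability curve sits on the diagonal.
```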
During this summer school there has also been discussion of problems like fairness — calibration is important, of course, but it is not the only thing; there are other issues to think about, fairness for example, and I think that makes a good point about what we use to assess the quality of a classifier.

Now, of course, you could say: this is a pretty simple problem — I just leave some data aside, use it to assess the calibration, and then recalibrate: I apply a one-to-one transformation that bends my reliability diagram back into a straight line. This is the common practice, this is what people do all the time. But imagine your data set is not very big: you have to cut out part of the data for the test set, and then cut out another part for this validation set because you need to recalibrate, and so on — so you end up using less data for your actual task. So maybe it is a good question to ask what a principled way is to develop classifiers that are calibrated by construction. That is the kind of question we were asking. We thought: we have Gaussian processes, let's just put a Gaussian process on top of a convolutional net and everything is going to be fine, because people told us, reliably, that Gaussian processes do good quantification of uncertainty. So let's take a convolutional net and put a Gaussian process there — and again, I would like to draw your attention to the fact that five years ago this was really hard to do and nobody had done it; the reason it is now pretty easy to take all these modules and put them together is automatic differentiation and so on. So this is really the stuff we have been doing recently: do we get better quantification of uncertainty? Let's even put a deep Gaussian process there, because with our approximation it ends up looking like a Bayesian deep network. The answer is no. This was maybe surprising, or maybe not, because the real issue is the regularization of the filters in the earlier layers.

So here we tried a bunch of combinations of convolutional nets and GPs. At the time we did this work — this year, last year, let's say — there was this work on hybrid Gaussian process / deep neural networks, which is essentially taking a convolutional network and, instead of the fully connected layer, putting a Gaussian process with an inducing-point approximation; this was work by Bradshaw, Alexander Matthews and Zoubin Ghahramani. Then we took stochastic variational deep kernel learning (SV-DKL) by Andrew Wilson and others, and the convolutional GP by van der Wilk and others in 2017. These are the reliability diagrams of all these methods — this is a plain convolutional network and these are the other three models — and you see that they are all not very well calibrated; none of them hits the diagonal line there. So what we did was ask: is a Bayesian convolutional network going to fix the problem? The answer is yes. I am showing you a lot of plots, but one of them really says it all: by being Bayesian about the filters in the convolutional network, you recover calibration. So the story here is that it is important to regularize the filters, and the way to do it back then was Monte Carlo dropout, because there were not many ways in which you could do Bayesian convolutions when we wrote this paper.
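Monte Carlo dropout at prediction time is simple enough to sketch: keep sampling dropout masks at test time and average the class probabilities. This toy version uses a two-layer fully connected network just to show the mechanics; `W1`, `W2`, the shapes and the keep probability are hypothetical placeholders, and the actual models discussed here are convolutional.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, keep_prob=0.9, num_samples=50, rng=None):
    """Monte Carlo dropout for a toy two-layer network: sample dropout masks
    at test time (in the same place dropout was used during training) and
    average the softmax outputs over the samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = []
    for _ in range(num_samples):
        h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
        mask = rng.random(h.shape) < keep_prob       # sampled dropout mask
        h = h * mask / keep_prob                     # inverted-dropout scaling
        logits = h @ W2
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs.append(e / e.sum(axis=-1, keepdims=True))
    return np.mean(probs, axis=0)                    # averaged predictive probabilities
```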
I remember that for this AISTATS 2019 paper, when we were writing the related work, I could not find any work on Bayesian convolutional networks before the one on Monte Carlo dropout. I spoke with my colleague at Columbia, John Cunningham, and he said: "that's impossible, I don't believe there is not a single paper on Bayesian convolutional networks prior to that" — and believe it or not, there is not a single implementation of Bayesian convolutional nets before that, at least to the best of our knowledge; I scoured the literature and found nothing. And the problem is pretty hard, because now you have millions of parameters, you have to put priors on them, you have to compute posteriors, and with stochastic variational inference — for those of you who came to my poster yesterday — with so many parameters it over-regularizes and you get solutions that are useless. So Bayesian inference for convolutional networks is pretty hard; we are trying to tackle this with various ideas that we talked about yesterday. But this is just to say: if you want a classifier that is calibrated by construction, you should use the right loss and regularize well — for example by being Bayesian, why not?

So what we did was to replace the last layer of a convolutional net with our deep Gaussian process, and be Bayesian about everything — the filters, all the parameters, and so on — and we also now get a deep Gaussian process that is okay in terms of calibration. Then I have a lot of results which I will probably skip; we also achieve state-of-the-art convolutional-net results and so on. I wanted to keep the last fifteen or twenty minutes to give you the opportunity to ask questions and to talk about some recent trends which I think are cool. Would anyone like to ask a question now? Yes.

So, here what we show is what happens when you subsample the data set. One of the reasons to be Bayesian is that you want to be able to do good quantification of uncertainty no matter how large your data set is — especially for small data sets it becomes more and more important to characterize uncertainty well — and what we are trying to show is that this property of calibration is preserved across different sizes of the data set. The histogram here shows the fraction of predictions that fall into each bin; for this classifier, for example, a lot of test examples are classified with high probability, and the histogram spreads out, with fewer and fewer predictions, all the way down to 0.3. The error bars: we repeated this experiment many times and reported the mean and standard deviation, I believe. The fact that you see very small error bars means it is consistent across repetitions — we repeated the experiment with different subsamples, one quarter of the data set, one half and so on, and also with different initializations. Sorry, what do you mean — here? Good question: I think there are still some parameters that we optimize in the model, and we have lots of them, because we also have all the kernel parameters of the deep Gaussian process and so on, and I think the fact that we don't regularize those affects the calibration a little bit — we become slightly overconfident there. That's my speculation, at least. And here again you see that we are using one tenth of the data set. So, can I go on? All right.
There are many things I wanted to say about this. One thing I should mention, because I think it's pretty cool: in Gaussian processes, when you have this random feature expansion, what you have is a multiplication of your inputs by this random Gaussian matrix. There is actually a company in France producing hardware that does this by exploiting the properties of light and the scattering of light — and it also applies a non-linearity, by the way. So at the moment we have a way to generate these random features at the speed of light, in hardware, which is pretty cool; I got a grant to work on this and a couple of people to help me with it. Watch this space, because I think it is really cool — it is a completely different way of thinking about hardware. People have pushed transistor-based technology, developed TPUs, pushed the limits of what we can do with tensor products; of course this is a very specific kind of operation, but for Gaussian processes it is just perfect — we can do it at the speed of light.

Another thing you can do: this is a product with a matrix with Gaussian random entries, so what if we use a pseudo-random matrix instead? There is this work by Felix Yu and others at NIPS 2016 where they replace this random Gaussian matrix with a product of matrices of the following form: you have a Hadamard matrix — a very special, full matrix with entries plus and minus one — then a diagonal matrix, then another Hadamard, another diagonal, Hadamard, diagonal. If the diagonal matrices have elements on the diagonal which are plus or minus one, and H is the Hadamard transform, what happens is that you achieve something which looks kind of Gaussian. There is a lot of interesting material behind this, but the cool thing is the following: if I now have to multiply Omega times a vector, multiplying by a diagonal matrix costs order D, which is very cheap, and applying the Hadamard transform to a vector costs D log D — and I don't even need to store H. So we can do the multiplication of Omega times a vector with only D log D operations, storing only D numbers, instead of allocating a random matrix with D-squared entries and paying D-squared for every matrix-vector product. I think this is quite cool, and it is one of the reasons why I prefer to work with random feature approximations rather than inducing-point approximations: we can leverage all these ideas. We also have results on this in the AISTATS paper, where we do convolutional networks and then, instead of a deep Gaussian process with plain random features, we apply this structured random feature approximation.
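Here is a small sketch of the D log D trick: a plain fast Walsh-Hadamard transform plus sign-flip diagonals, so the structured matrix is applied to a vector without ever being stored. The normalisation constant and the exact construction used by Yu and others differ in details (their method also includes scaling), so treat this as an illustration of the cost argument rather than a faithful reimplementation.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^k vector, O(D log D)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def structured_omega_times_v(v, d1, d2, d3):
    """Apply (H D3 H D2 H D1) to v using only the diagonal sign vectors
    d1, d2, d3: cost O(D log D), memory O(D), and H is never stored."""
    out = fwht(d1 * v)
    out = fwht(d2 * out)
    out = fwht(d3 * out)
    return out / np.sqrt(len(v))   # rough normalisation; conventions vary

rng = np.random.default_rng(0)
D = 8                              # must be a power of two for the plain FWHT
d1, d2, d3 = (rng.choice([-1.0, 1.0], size=D) for _ in range(3))
v = rng.standard_normal(D)
y = structured_omega_times_v(v, d1, d2, d3)
```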
Now, one of the cool things I am really excited about is some recent work with my PhD student Simone. We are thinking: imagine you have a deep neural network with big matrices of weights. If we think about Bayesian inference, we are going to have a distribution over each matrix. So what if we use this parameterization — the Hadamard-times-diagonal construction — to come up with a way to obtain a distribution over matrices? If we have a distribution over this diagonal, and maybe also this one and this one, somehow we can achieve something which is a distribution over matrices. Of course this is going to be degenerate, because we are only using D plus D plus D numbers to parameterize something which is D squared. Can this be useful for Bayesian deep learning? It turns out we can do some pretty cool stuff with it — competitive with state-of-the-art performance — with this parameterization of Bayesian deep networks. So again, watch this space, because this could be something quite cool.
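The following is only a hypothetical sketch of the idea just described, in the spirit of the Walsh-Hadamard parameterization: put a Gaussian over the three diagonals (3D parameters instead of D squared) and compose them with Hadamard matrices to induce a degenerate distribution over D x D weight matrices. The function names, the softplus trick and the normalisation are my own choices; the actual parameterization in the paper may differ.

```python
import numpy as np

def hadamard_matrix(D):
    """Dense unnormalised Hadamard matrix (D must be a power of two).
    Built explicitly only for clarity; in practice one would use the fast transform."""
    H = np.array([[1.0]])
    while H.shape[0] < D:
        H = np.block([[H, H], [H, -H]])
    return H

def sample_structured_weight(mus, rhos, rng):
    """Draw a D x D weight matrix from a degenerate distribution parameterised
    by three Gaussian diagonals (3*D numbers instead of D*D)."""
    D = len(mus[0])
    H = hadamard_matrix(D)
    W = np.eye(D)
    for mu, rho in zip(mus, rhos):
        sigma = np.log1p(np.exp(rho))                   # softplus keeps sigma > 0
        d = mu + sigma * rng.standard_normal(mu.shape)  # reparameterised diagonal sample
        W = np.diag(d) @ H @ W                          # compose diagonal-Hadamard blocks
    return W / D                                        # rough normalisation only

rng = np.random.default_rng(0)
D = 8
mus = [np.zeros(D) for _ in range(3)]
rhos = [np.full(D, -3.0) for _ in range(3)]             # small initial variances
W = sample_structured_weight(mus, rhos, rng)            # one Monte Carlo draw of W
```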
I like this idea because, if you talk to people in deep learning, they often have no idea what all these kernel tricks are, while in the kernel literature this is stuff people have been working on for a while — and of course, if you talk to people in kernel methods, they may have no idea what a convolutional network is. So I think it is really important to try to bridge these gaps and bring techniques developed in one field over to the other.

Some other interesting works: if you look at recent ICML, NeurIPS, AISTATS and so on, apart from the works I referenced before, there has been relatively little work on deep Gaussian processes — and I would not say we have exhausted the possibilities of what we can do with deep GPs, of course not. One of the recent works at ICML, for example, applies importance-weighted variational inference to deep GPs: it was developed for Bayesian deep networks and has been ported to deep GPs. Some really cool work, I think, is the deep convolutional Gaussian process — I believe this is the Aalto group with Samuel Kaski — which improves on the earlier convolutional GP. The idea — I remember talking about this in Moscow two years ago when I gave a talk at Yandex; there is a YouTube video where I say this — is to take a deep network where the mapping between one layer and the next is not a linear transformation but is itself a Gaussian process. Now you have something which is a huge model and pretty hard to train — we failed to actually make it work, and these guys have managed to do it, which impresses me, and it achieves reasonable performance. So it is possible to think in these terms too.

Of course, you can also think about deep GPs for autoencoders — people have done this in other groups and in ours as well. And we did something I really like: deep Gaussian processes where you constrain the dynamics. Imagine you have physics that tells you something about the functions — that they should have certain differentiability properties, or always be positive, or be monotonically increasing, or obey some differential-equation behavior. How do you combine physics with deep Gaussian processes? We have a formulation of variational inference that takes that into account.

All right, let me jump to the conclusions — I think I am doing okay with time, and I am very happy about that. The biggest question I still ask myself every day when I go to work is: why should we do deep Gaussian processes — what is the use of a deep Gaussian process today? We have now established so many connections with deep networks, so is it really worth thinking about deep GPs as different things from deep neural networks? In my head, today, the line between the two is very blurry: when I think about a Bayesian deep neural network I could equally think about a deep Gaussian process, and vice versa, and a lot of people have established further connections — limits of Bayesian deep networks when you have lots of neurons, ways to map one into the other, and so on. But I think one great use of deep Gaussian processes is the following: if you think about the way you can choose kernels, and how the kernels determine the basis functions and so on, maybe this gives you good ideas on how to determine priors for your Bayesian deep networks — because at the moment, I would say, it is not straightforward to define priors over millions of parameters. So maybe, using the connections with kernel methods and deep GPs, we have a better way to understand this. The same goes for understanding Bayesian deep learning: there have been decades of work on understanding kernel methods and Gaussian processes, and I think this might be a good springboard for making some advances in understanding Bayesian deep learning better.

Inference for deep Gaussian processes is hard — I tried to give you a brief overview of why — so we need model approximations and approximate inference, and at the moment it is pretty hard to determine anything about the impact of these approximations on performance. I am quite excited about borrowing and mixing ideas from GPs and deep learning — stochastic variational approximate inference, low-rank decompositions, algebraic tricks; I tried to give you some ideas about that — and if you are interested, there is code online: you can try things, break it, play with it. One of the slightly disappointing things I have noticed is that I was hoping that combining GPs with convolutional networks would give us something more, but probably the success of convolutional nets is simply because the convolutional filters are so powerful.
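One concrete way to see the "kernels as priors" point made above: with the random-feature view, drawing functions from a GP prior is just drawing Gaussian weights on top of the features, and changing the kernel hyperparameters visibly changes what kind of functions you are assuming a priori. A standalone sketch, purely my own illustration:

```python
import numpy as np

def sample_gp_prior_functions(x, lengthscale, num_features=500, num_draws=3, rng=None):
    """Approximate samples from a zero-mean GP prior with an RBF kernel,
    via random Fourier features: f(x) = Phi(x) @ w with w ~ N(0, I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    omega = rng.standard_normal((1, num_features)) / lengthscale
    z = x[:, None] @ omega
    phi = np.sqrt(1.0 / num_features) * np.hstack([np.cos(z), np.sin(z)])
    w = rng.standard_normal((2 * num_features, num_draws))
    return phi @ w                      # each column is one prior function on x

x = np.linspace(-3.0, 3.0, 200)
smooth = sample_gp_prior_functions(x, lengthscale=1.0)   # slowly varying prior draws
wiggly = sample_gp_prior_functions(x, lengthscale=0.2)   # much rougher prior draws
```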
So, sorry to disappoint you, but maybe we should have expected that. I am funded by a long fellowship — seven years — from the AXA Research Fund, and I also got a national fellowship that started in February, where we are doing these computations with light. I would like to thank you for your attention, and I will be happy to take your questions. Thank you. [Applause]

While someone gets ready to raise a hand, I will also show you that we have a tutorial we delivered a month ago at IJCNN in Budapest. We have gathered a lot of material about Gaussian processes — it is a big tutorial; you can check the slides, there are notebooks to play around with, and we tried to compile a long list of references on anything we think is useful for GPs. The link is on my webpage: just look me up and you will find it. The tutorial touches upon deep Gaussian processes — the notebooks are only about Gaussian processes, but we also have a section on deep Gaussian processes, because this was a tutorial at the International Joint Conference on Neural Networks, so we really had to make the connections with deep networks, and deep Gaussian processes are obviously the right thing for that. Yes?

"Hi — you said earlier that the overconfidence was due to regularization. Could you go into more detail about what it is about the regularization, or why it causes the overconfidence?" This is a great question. I can answer by elaborating a bit on the things we tried in order to understand why regularization is the problem. We also played around with changing the loss: if you have a classification problem and you apply the quadratic loss instead of cross-entropy, you can do it and you get results which are good in accuracy, but you lose calibration — so using the right loss for the problem you have is really important. Let me see if I remember the details — I wanted to tell you more about the loss, but your question is about regularization, so let me address that. The 2017 paper on calibration of convolutional networks reports an empirical observation: they take a VGG or ResNet, apply it to some data set, and show that calibration fails; then they change various things — the weight decay, I think, and a bunch of other parameters — and the only thing that really makes a big improvement is the regularization. One could object: but if you don't regularize properly, you also overfit, so you should have a problem with accuracy as well. That is, I guess, the most surprising thing: what we see is that we were affected not on accuracy but only on calibration. The theoretical reason — I don't think we have managed to establish it yet; I would love to be able to pin it down concretely. My speculation is that if you have a huge model and you don't regularize it properly, something bad is going to happen somewhere — and if it is not accuracy, then, in this case, you can also think of calibration as a measure of performance.

I did not go too much into the details, but you can measure the area that this reliability curve makes with respect to the diagonal — roughly, this area here and here — where zero is good and higher is worse; this is called the expected calibration error. Then there is another score, called the Brier score, which is a combination of the calibration error and some sort of error rate, so it is also a measure of how well you are doing, and again lower is better. So making the model poorly regularized is also going to affect performance through the Brier score, because the calibration is going to be affected. These are all measures of performance in some sense, but the expected calibration error looks exclusively at calibration, while the Brier score also reflects accuracy: if you have a classifier that always says 0.5 and it is calibrated, you get zero expected calibration error, but it is a useless classifier, and the Brier score is a better measure to balance the two. So if you don't regularize well, you are going to pay for it in performance, and I think this is reflected in this interplay between calibration and performance. I don't know if that explains it — this is my best intuition as to why regularization really affects you.
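To make the two scores just mentioned concrete, here is a small sketch for the binary case (my own illustrative implementation): the expected calibration error is a bin-weighted gap between confidence and observed frequency, and the Brier score is the mean squared difference between predicted probability and outcome.

```python
import numpy as np

def expected_calibration_error(probs, labels, num_bins=10):
    """Bin-weighted average gap between confidence and observed accuracy:
    roughly the gap between the reliability curve and the diagonal."""
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    which_bin = np.clip(np.digitize(probs, bins) - 1, 0, num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = which_bin == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome;
    it penalises both miscalibration and plain inaccuracy."""
    return np.mean((probs - labels) ** 2)

# The always-0.5 classifier from the answer above: near-zero ECE on a balanced
# problem, yet useless, and its Brier score of 0.25 reflects that.
```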
"Maybe not a good question, sorry — I was wondering: what if we use this kind of model for text, to see what the result or improvement would be, for text mining or other kinds of data sets? And the second question: when we apply Gaussian processes in the way you describe — learning a representation in the last layer of a deep network, let's say — can we think about interpretability with this kind of model?"

Okay, so about text: I would say I don't have much experience with text data sets — it is not an application I am very involved with. I am sure that, just as you can apply convolutional networks to images, there are equivalents for text, so I guess you can always find ways to mix and match. I don't think I have come across works that use deep Gaussian processes for text, but maybe I have missed something, because it is not really my area of application.

The second question is interpretability. I think this goes back to the question of choosing good priors. The connection between priors over weights and kernels, kernel parameters and so on, allows you to say something about some dimensions of the problem not being relevant: you can use a parameterization where you effectively throw away certain dimensions, and you can think of this as a way to move towards interpretability, in the sense of saying that these dimensions, these observations, don't really count in making a prediction — and this is learned automatically when you learn your model. I think this is the closest we can get to interpretability with Gaussian processes: you make a kernel with parameters that you can then read off, like people do in logistic regression — they say the weights are zero, so those features don't count. We can do something similar in Gaussian processes: we can have a parameterization of the kernel — for example, in the expression of the kernel you can put a scalar product with a diagonal matrix that controls these things — and then you can read off the importance of different features. I think this is the closest we can get to something interpretable; other than that, it is pretty hard.
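A minimal sketch of the kind of kernel parameterization just described: an RBF kernel with one lengthscale per input dimension (automatic relevance determination). After fitting, a very large lengthscale means the corresponding feature hardly influences the kernel, which is the "read off the importance of features" idea; the names and values here are illustrative only.

```python
import numpy as np

def ard_rbf_kernel(X1, X2, lengthscales, variance=1.0):
    """RBF kernel with one lengthscale per input dimension (ARD).

    A learned lengthscale much larger than the data scale flags that
    dimension as effectively irrelevant to the predictions."""
    Z1 = X1 / lengthscales
    Z2 = X2 / lengthscales
    sq = (Z1 ** 2).sum(1)[:, None] + (Z2 ** 2).sum(1)[None, :] - 2.0 * Z1 @ Z2.T
    return variance * np.exp(-0.5 * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
# Third feature made almost irrelevant by its large lengthscale.
K = ard_rbf_kernel(X, X, lengthscales=np.array([0.5, 1.0, 50.0]))
```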
I think we are running out of time — thank you. Thanks a lot. [Music]