Variational Inference: Foundations and Modern Methods (NIPS 2016 tutorial)

Captions
Okay, good morning everyone, we're going to get started with our first tutorial in this room. This morning we have Variational Inference: Foundations and Modern Methods, with three really amazing speakers. First we have Dave Blei, who is a professor of statistics and computer science at Columbia University; we also have Rajesh Ranganath, who is a PhD student at Princeton University; and Shakir Mohamed, who is a senior research scientist at Google DeepMind. I don't want to eat into their time any more, so without further ado, let's get started.

Thanks, Tamara, and thanks for the opportunity to tell you about variational inference. Rajesh, Shakir, and I have prepared this tutorial, Variational Inference: Foundations and Modern Methods — we prepared about a seven and a half hour tutorial, so I'm going to get started. I also want to mention that two of us are jet-lagged and Shakir is not a morning person, but I imagine many of you are in the same boat. Also, the echo is distracting, but I'll get used to it.

We'll start with some pictures — pictures of things you can do with variational inference. This is a picture of overlapping communities discovered from a huge network. This is a picture of topics — we'll talk about topics in more detail in a little bit — found from two million articles from the New York Times, using a laptop. This is a picture of using variational inference to learn about scenes and to do control and reinforcement learning. This is a genetics analysis: a large-scale, important model in population genetics, the Pritchard-Stephens-Donnelly model, fit with variational inference at large scale. This is a neuroscience analysis of a large fMRI data set. And finally, here are some pictures representing how we might do compression or generate content using variational inference. I know I didn't say much about any of those pictures, but I want to give you a sense of the breadth of applications of variational inference. Oh, and there's one more: this is an analysis of 1.7 million taxi trajectories, using a probabilistic programming system called Stan, with variational inference.

All of those pictures go through what we like to call the probabilistic pipeline. In this cartoon of the pipeline — can you see that little green dot? you can't see it over there — we take our knowledge and the question we want to ask about some data, use them to make assumptions about the data, and turn those assumptions into a model. The model has hidden quantities, represented as hidden random variables, and observed quantities. Then we combine our data and our model to discover patterns in the data, and we use those discovered patterns to form predictions, explore the data, and answer the question we started out with. I like this picture because customized data analysis has become important to many fields, and this pipeline separates the key activities: making assumptions, doing computation, and applying the results of those computations. That separation makes it easy to collaborate with domain experts on problems in statistics and machine learning.

You can see from this picture that the key algorithmic problem is this: given a model with hidden and observed variables, and given data, how do we uncover the values of the hidden variables from the observed variables? That's inference, and that's the subject of this morning's tutorial.
Inference answers the question: what does the model I developed, based on my knowledge and my question, say about the data? Our goal in building probabilistic machine learning as a field is to develop general and scalable approaches to inference. We want to lubricate this pipeline so that we can try many, many different models, answer many different kinds of questions, and analyze many different kinds of data. To do that we need easy ways to compute with each model on our datasets, and we need to compute at scale. Moreover, this picture is actually hiding a loop: once we've lubricated the pipeline, what we really want to do is go back and revise our model — to work with data and models in a loop where we criticize the model, revise it, do inference, look at the results, criticize, revise, and so on. This is our vision for probabilistic machine learning.

So first I want to give some main ideas and historical context for variational inference — the problem of general and scalable inference with probability models. The basics, which I'm sure many of you are familiar with: a probability model is a joint distribution of hidden variables z and observed variables x, written p(z, x). Inference about the hidden variables happens through the posterior, the conditional distribution of the hidden variables given the observations, p(z | x). As you all know, that is the ratio of the joint and the marginal probability of the observations, p(x). For most interesting probabilistic models that we study in machine learning, that denominator p(x) is intractable — we can't compute it exactly, and so we can't compute the posterior. That's why we appeal to approximate posterior inference; that's why we use things like MCMC and variational inference.

You'll see this picture several times this morning. Variational inference, in a nutshell, turns inference into an optimization problem. Remember, our goal is to compute p(z | x). The way variational inference works is that first we posit what's called a variational family of distributions over the latent variables, denoted q(z; ν). Here ν are the variational parameters; they index this family of distributions over z. I've represented that in this picture as a set: this circle is a set of distributions over z indexed by ν, and each point in the circle is a different setting of the variational parameters. The goal of variational inference is to fit the parameters ν so that q is close, in KL divergence, to the exact posterior. Just for kicks I'll walk over here. What we want to do is start at some value ν_init and search through this family of distributions, using an optimization procedure, to get to ν*, where ν* is close in some sense to the posterior distribution we care about, p(z | x). In this tutorial, "close" means close in KL divergence: we want a ν for which the KL divergence between q(z; ν) and p(z | x) is small, and that's the objective function. In this way we've turned inference into an optimization problem: while MCMC forms a Markov chain whose stationary distribution is p(z | x), in variational inference we posit this family of distributions and then search through that family to find the member that is closest in KL divergence to the exact posterior — the thing we care about.
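For reference, here is the setup just described, written compactly:

```latex
p(z \mid x) = \frac{p(z, x)}{p(x)}, \qquad
p(x) = \int p(z, x)\, dz \quad \text{(intractable)};
\qquad
\nu^{*} = \arg\min_{\nu}\; \mathrm{KL}\big(\, q(z; \nu) \,\big\|\, p(z \mid x) \,\big).
```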
Now, here we're going to talk about KL divergence, but there are other divergence measures. You could look at KL(p || q), or other divergences, and those correspond to other methods — EP, BP, and so on — which could also be broadly construed as variational inference. This morning, though, we're going to focus on KL(q || p). It's an interesting line of research to look at other divergences, but we won't do that here.

Here's a picture. These are data from a mixture of Gaussians, and as we run variational inference we find better and better approximations of, in this case, the posterior predictive distribution. You can see that we converge on a posterior predictive distribution that looks pretty good for a mixture of Gaussians with these data. We'll get to what these quantities are later, but this is something like the KL divergence — the KL divergence up to an additive constant — and you can see that we're getting closer and closer to the exact posterior. So stare at this picture: this, in a nutshell, is variational inference.

We thought we would say a little bit about the history of these methods, because they've become important lately. What variational inference really is, is an adaptation of ideas from statistical physics to probabilistic inference. It's hard to know when ideas start, but arguably this idea began in the late 80s with the work of Peterson and Anderson from 1987 — who I think were physicists — who used mean field methods to fit a neural network. The idea was picked up by Mike Jordan's lab at MIT in the early 1990s, by people like Tommi Jaakkola, Lawrence Saul, and Zubin Ghahramani, some of the pioneers of this method, who generalized variational inference to many probabilistic models; there's a nice paper by those four authors from 1999 that reviews that work. In parallel, Hinton and Van Camp also developed mean field inference for neural networks, possibly not aware of the Peterson and Anderson work, and they connected the idea to things like the EM algorithm that we all know, which led to variational methods for other kinds of models, like mixtures of experts and HMMs. One of the other pioneers here was the late David MacKay. There's actually a whole chunk of years missing from this history: in the late 90s and early 2000s a lot of machine learning researchers developed variational inference for specific models and generalized it in some sense — we'll get to some of those generalizations in a little bit. And now, today, there's a flurry of new work on variational inference: work about making it scalable, easier to derive, faster, giving it better fidelity, and applying it to more complicated models and applications. Modern variational inference, which is what we'll teach you about this morning, touches on many important areas in machine learning — probabilistic programming, reinforcement learning, neural nets, convex optimization, Bayesian statistics, and of course many applications. So that's our goal today: to teach you the basics, to tell you about some of these newer ideas, to situate them in context relative to each other, and to suggest open areas of new research — although I'll tell you, I don't believe we made that slide.
Okay, so here are the next three parts of the tutorial. In part two, the next part, I'll give you the basics of variational inference, particularly around what are called conditionally conjugate models, and talk a little bit about how to make variational inference scalable with stochastic variational inference. Then Rajesh will talk about stochastic gradients of the variational objective, computed via Monte Carlo, which help us expand variational inference to many more types of models than we could handle in the late 90s and early 2000s. And finally, Shakir will talk about going beyond the mean field — I know I haven't defined mean field yet — where those same stochastic gradients enable us to think about very complicated and expressive variational families. That's the summary of this tutorial.

Here's the picture of variational inference again: variational inference is about approximating difficult quantities from complex models by turning inference into optimization. The running theme of this tutorial is that variational inference and stochastic optimization are a powerful combination. With stochastic optimization we can scale variational inference to massive datasets, we can enable variational inference on a wide class of difficult models, and we can enable variational inference with elaborate and flexible families of approximations — those q distributions. That's the summary.

So with that, let's begin with the simple ideas: mean field variational inference and stochastic variational inference. It's useful to have a concrete model in mind when discussing these ideas, so the motivation here is going to be topic modeling. Topic models use posterior inference to discover the hidden thematic structure in big, unstructured collections of documents — an unsupervised learning method, and one of the earlier examples of variational inference applied to a real problem. The example model we'll look at is latent Dirichlet allocation, or LDA, and I'll go over it quickly. LDA is a document model built on the intuition that documents exhibit multiple topics. Here's an article about determining the number of genes an organism needs to survive, from Science magazine, and I've highlighted different words with different colors: words like computer, analysis, predictions, and computational are blue; words like genes and genomes are yellow; words like organisms, life, and survive are pink. These are all different topics. This article could be seen as combining data analysis, genetics, and evolutionary biology, and the intuition behind LDA is that if I bothered to highlight every word in this article and you squinted at it, you would say: oh, this combines those topics. What we want to do is build that intuition into a probability model.

So here's the probability model. Latent Dirichlet allocation, as a generative process, assumes the following. First, each topic is a distribution over words — this is going to be like a mixture model where the mixture components are distributions over words; words like gene and DNA have high probability in this one, and so on. Then each document is generated by first choosing a distribution over those topics; then, for each word, choosing a colored button from that distribution over topics, and choosing the word from the corresponding distribution over terms — the corresponding topic.
So this is what's called a mixed membership model in statistics, and it's an example of a Bayesian model for which we cannot compute the normalizing constant. You can see it as a mixture model where each document comes from a mixture: the mixture components are shared across documents, but the mixture proportions are unique to each document — they're drawn fresh for each document. I turn the page of Science and I generate an article about, say, data analysis and neuroscience. So this is LDA, but more generally this is mixed membership modeling, an example of a class of models for which it's difficult to do posterior inference.

Now, what is posterior inference here? Of course we don't get to observe all the structure I described in explaining the model; we just observe the documents, and we want to fill in all of those hidden random variables. That is a posterior distribution: the probability of the topics, the proportions, and the assignments, given the documents. And notice — we'll get to this later — we want to do this with millions of documents, which means billions of latent variables: if there's a latent variable for every word, we have billions of latent variables.

In the modeling process we represent our models as graphical models. These encode assumptions about the data by factorizing the joint distribution of the hidden and observed variables; the graphical model connects both to the assumptions we're making through the model and to the algorithms for computing with data under the model, and it defines the posterior through the joint. Here's the graphical model for LDA. Remember: nodes are random variables, edges denote dependence between random variables, shaded nodes are observed, and unshaded nodes are hidden. You can think of this as a picture of the posterior, because I've observed the words of each document and everything else is hidden. In this picture the topics — the distributions over terms — are denoted by beta, and there are K of them. The rectangles are plates, and they denote replication. So we have K topics; then for each document I choose a distribution over topics, theta; then for each word in each document I choose a colored button, z, and I choose the word from the corresponding topic. This picture tells us the factorization of the joint distribution and connects to algorithms. Here's the posterior for this specific model: it's the joint divided by the marginal distribution of the words, and again we cannot compute that denominator, so we appeal to approximate inference.

Here's an example of doing inference with that model on a large data set, with the kind of algorithms I'm going to tell you about. We feed in 1.8 million New York Times articles, we look at the posterior over those distributions over terms, and we see words that go together in coherent topics — words like children, school, family, parents, or stock, percent, companies, market, and so on. So how did we get there — from assumptions and a posterior we couldn't compute, to an algorithm that approximated the posterior we cared about?
In the next three sections of the talk I'm going to first define a generic class of models that LDA is an instance of, then derive the classical mean field variational inference algorithm for that generic class, and then derive stochastic variational inference, which lets us scale that algorithm to large data sets — keeping LDA in mind as our running example.

Here's the generic class of models, again using graphical model notation. We have observations x = x_1, ..., x_n; we have what are called local variables z — there's a z_i for every x_i; and we have global variables beta. The difference between local and global variables is that the i-th data point x_i depends only on its local variable z_i and the global variables beta. Those criteria are encoded in this graphical model and in the corresponding factorized joint distribution. Our goal, of course, is to compute the posterior: the probability of beta and z given x.

Let's define a few more terms. First, the complete conditional: the conditional distribution of a latent variable given all of the other latent variables and all of the observations. If you know the Gibbs sampler, the complete conditional is what you sample from when you define a Gibbs sampler. We're going to assume, in this class of models, that each complete conditional is in an exponential family; here I've written those complete conditionals in their natural exponential family form. Start with the global variables: the complete conditional of beta, given all of the local latent variables z and the observations, is in some exponential family, where the natural parameter of that family is a function of whatever I'm conditioning on. That makes sense — it's a conditional distribution, so the natural parameter is a function of what I'm conditioning on. And we're being general here: this could be any exponential family — gamma, multinomial, categorical, Gaussian, whatever it might be. The complete conditional of a local variable, first of all, depends only on the global variables and its own data point — a consequence of the independence assumptions this model makes — and it too is in an arbitrary exponential family. That's the restriction we're placing on this class of models.

Now, when we make this assumption, the theory around conjugacy — here we're citing the Bernardo and Smith Bayesian theory book; this goes back to Diaconis and Ylvisaker in the 1970s — tells us something about the global random variable: the natural parameter of the complete conditional of the global variable has a particular form. It's the hyperparameter plus a sum of sufficient statistics applied to each data point and its local variable. We'll come back to this form of the natural parameter of the complete conditional later.
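Written out — following the notation common in the stochastic variational inference literature, so the symbol names η_g, η_ℓ, t, and a here are conventions rather than verbatim from the slides — the assumptions are:

```latex
p(\beta \mid z, x) = h(\beta)\,\exp\{\eta_g(z, x)^{\top} t(\beta) - a(\eta_g(z, x))\},
\qquad
p(z_i \mid \beta, x_i) = h(z_i)\,\exp\{\eta_\ell(\beta, x_i)^{\top} t(z_i) - a(\eta_\ell(\beta, x_i))\},
```
```latex
\eta_g(z, x) = \alpha + \textstyle\sum_{i=1}^{n} t(z_i, x_i).
```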
So we've just defined a class of models — a graphical model in which each complete conditional is in an exponential family — and that class actually describes most of the models you might have read about in the machine learning literature of the 90s and 2000s: Bayesian mixture models, time series models, factorial models, matrix factorization models, mixed membership models like LDA. A variety of models all fall into this category; you can take any of them and funnel them into this framework.

So what we're going to do is variational inference on that large class. Again, here's the picture of variational inference: define a q, and optimize the parameters of q to make it close in KL divergence to the posterior we care about. This is an important slide: that KL divergence is intractable — we actually can't compute it. Somebody asked me about this in line for a lanyard. They said, hey, how do you do variational inference? You can't compute the KL divergence, because it requires knowing the posterior itself. Not the typical lanyard-line conversation, but this is the answer to that question — if you're here, Vicente, this is the answer. What we do is work with what's called the ELBO, the evidence lower bound. It's a lower bound on log p(x), the problematic denominator of the posterior, and, more importantly, maximizing the ELBO is equivalent to minimizing the KL divergence — it's easy to see. So this is our objective function for this whole talk: the ELBO.

Now, the ELBO has two terms. It's a function of the variational parameters, which makes sense, because we're going to optimize it with respect to the variational parameters. The first term is the expectation of the log joint. If, in your mind, you use the chain rule to divide the log joint into log prior plus log likelihood, you can see that if I only had that term and I was optimizing with respect to q, I would want q to place all of its mass on the MAP estimate of the latent variables. The second term is the entropy of q — the negative expectation of log q. So while the first term wants q to place all its mass on the MAP estimate of the latent variables, the second term says q should also be diffuse. These terms trade off, and in a sense the second term regularizes the first. One caveat: the ELBO is not convex. It's true that our goal is to optimize it, but it's also true that it's not a convex function, so we're going to find a local optimum of the ELBO.

So that's our goal, to optimize the ELBO, but we still need to specify the form of q — what is this q(β, z) that we're going to optimize? That's where another important idea comes in: the mean field family. If you've heard of mean field variational inference, this is about the family we optimize over. The mean field family says that all the latent variables are independent, each governed by its own variational parameter. Here q(β, z) has variational parameters λ and φ: q(β) is governed by λ, and, independently, each q(z_i) is governed by its own φ_i. I might have ten million latent variables, each with its own variational parameter, and I'm tweaking all of those parameters to make the corresponding distribution close to the exact posterior. Furthermore, to fully specify the family, each factor is in the same exponential family as the model's complete conditional: if p(β | z, x) is a gamma, then q(β) is a gamma, but with a free parameter λ that I can control.
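In symbols, the objective and the family just described are:

```latex
\mathrm{ELBO}(\lambda, \phi)
= \mathbb{E}_{q}\big[\log p(\beta, z, x)\big] - \mathbb{E}_{q}\big[\log q(\beta, z)\big]
\;\le\; \log p(x),
\qquad
q(\beta, z; \lambda, \phi) = q(\beta; \lambda)\, \prod_{i=1}^{n} q(z_i; \phi_i).
```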
Now, when I show this to statisticians they think it's crazy, because it's a bunch of disconnected variables, each with its own parameter, and there's no data in there — so how are we estimating it? The idea is that through the ELBO — through this optimization, through the KL divergence — we connect this family to the data and to the posterior we care about. That's how it all works; that's the full setup. We then optimize the ELBO, defined in terms of these variational parameters, and the nice result of Ghahramani and Beal from 2001 is that if we iteratively optimize each parameter in turn, holding the others fixed, the coordinate updates have a clean form: λ is set to the expectation of η_g(z, x), and each φ_i to the expectation of η_ℓ(β, x_i). The mean field assumption ensures that the right-hand side of each update is independent of the left-hand side. This is how variational inference in the 90s and 2000s worked: we set up variational parameters and then marched through them, iteratively updating each one. That gets us to a local optimum of the ELBO. And notice the relationship to Gibbs sampling: in Gibbs sampling we iteratively sample from the complete conditionals; in variational inference we set each parameter equal to the expectation of its natural parameter.

Now let's go back to LDA quickly, because I'm trying to be sensitive about the time for Rajesh and Shakir. In LDA the mean field family says everything is independent — the topic proportions and all the topic assignments — and from that we can run coordinate ascent variational inference. We can take our article and look at its topic proportions: that's an example of what we do with a fitted variational parameter — we take it and look at it to explore the corpus. We can see that this article exhibits only a handful of topics, and we can look at the most frequent words from the most frequent topics; again, through the variational parameters, we get something interpretable. So this is classical variational inference in a nutshell: we start with our data and a model, and we repeatedly update each variational parameter using these coordinate ascent updates until the ELBO has converged. That recipe gives you variational inference for a large class of models — and for a huge class of models, Rajesh will talk about that.
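To make the coordinate updates concrete, here is a minimal CAVI sketch for a conjugate toy model — a Gaussian with unknown mean and precision, which is my own illustrative choice, not a model from the tutorial. Each update sets a variational parameter to an expected natural parameter under the other factor, exactly as in the λ and φ updates above:

```python
import numpy as np

# Toy model (illustrative assumption): x_i ~ N(mu, 1/tau),
# mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0). Mean field: q(mu) q(tau).
rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=200)
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                                   # initialize E_q[tau]
for _ in range(50):
    # Update q(mu) = N(mu_N, 1/lam_N), holding q(tau) fixed.
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N), holding q(mu) fixed.
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                      + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
    E_tau = a_N / b_N                             # expected natural statistic

print(mu_N, 1 / np.sqrt(E_tau))   # variational mean of mu, estimate of sigma
```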
There's a problem, though: classical variational inference is inefficient. Take LDA as an example. We start out with random topics — garbage topics that have no meaning — and then we painstakingly analyze every article according to those topics, because we have to; that's what the recipe dictates. We have to march through all of our latent variables, and we have latent variables for every document, before we can update the topics again. This can't handle massive data, and that's where stochastic variational inference comes in. Stochastic variational inference scales variational inference up to massive data, and it only uses the mathematics from the first part of the talk. Here's the cartoon: instead of marching through the whole data set before updating my topics, I repeatedly subsample a small piece of the data, infer its local hidden structure, update my idea of the global hidden structure, and repeat. Going through this over and over gives very fast convergence of the ELBO.

So here's another important idea in this tutorial: stochastic optimization. That's the key idea that turns classical variational inference into stochastic variational inference. This guy is Herb Robbins. He invented stochastic optimization — also empirical Bayes, also bandits — and he did this back in the day, so whatever he smoked, you want to smoke it too. What's the idea behind stochastic optimization? This is Robbins and Monro, 1951. The idea is that we replace the gradient in an optimization with a cheaper, noisy estimate of that gradient, and this is guaranteed to converge to a local optimum in an objective like the ELBO. This single idea really has enabled modern machine learning. Why are we speaking to however many hundreds of people about variational inference? Because of stochastic optimization. I feel like we owe the success of our field to stochastic optimization.

And the idea is very simple. When I teach stochastic optimization, I tell a story. Say Tamara wants to go from here, Barcelona, to Berlin. How far is Berlin — anybody know? Say it's a thousand kilometers; I don't know how far it is. She wants to go to Berlin, and everybody's drunk — here in Barcelona, and really all over Europe, everybody's drunk. So Tamara wants to go to Berlin, everybody's drunk, and remember, I'm explaining stochastic optimization. What does she do? She asks Shakir: how do you get to Berlin? Shakir points in some random direction and falls over. A normal person would ask somebody else, but not according to stochastic optimization: Tamara should walk 500 kilometers in whatever direction Shakir pointed. Now she's in the middle of Europe somewhere — let's hope not in the middle of the ocean — and everybody's still drunk. She asks Thomas Hofmann: hey, how do I get to Berlin? He's German; she thinks maybe he'll know. But he's drunk too, and he points in some random direction, and she walks 250 kilometers that way. She runs into Andrew Ng — he's drunk. She runs into all kinds of people, all drunk, all pointing in random directions. If she takes smaller and smaller step sizes on her way to Berlin, and if — were we to magically revive Shakir and ask him where Berlin is — he would on average point exactly at Berlin, then Tamara will eventually get to Berlin.

And it works just like that. That is stochastic optimization: in mathematics, the hatted gradient represents a noisy gradient, and if I follow the noisy gradient with a decreasing step size ρ_t, I will get to an optimum. This requires two things. First, unbiased gradients: there are a lot of expectations floating around here, and this one is with respect to the randomness in the gradient — the expectation of the noisy gradient must equal the true gradient. If I could revive Shakir, he would point right at Berlin; we needed that for this to work in 1951. Second, the step size sequence ρ_t must follow what are called the Robbins-Monro conditions, which say that we need to be able to get to Berlin, but we need to slow down at a rate such that we will eventually get there, even though we have noisy gradients.
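Here is a minimal sketch of that procedure — stochastic gradients with a Robbins-Monro step size sequence, on a hypothetical one-dimensional objective of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(nu):
    # Unbiased but noisy gradient of f(nu) = (nu - 3)^2: the true gradient
    # 2*(nu - 3) plus zero-mean noise (the "drunk directions" of the story).
    return 2.0 * (nu - 3.0) + rng.normal(scale=5.0)

nu = 0.0
for t in range(1, 10_001):
    rho = 1.0 / t          # Robbins-Monro: sum(rho_t) = inf, sum(rho_t^2) < inf
    nu -= rho * noisy_grad(nu)

print(nu)                  # ends up near the optimum nu* = 3
```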
Now, lately in machine learning we've played with those Robbins-Monro conditions — innovations like AdaGrad and RMSProp play with them — but this is the idea in that old paper, and it's going to be a key idea throughout this talk: stochastic optimization.

So, in this first setting of conditionally conjugate models, the natural gradient of the ELBO — the natural gradient is a type of gradient — looks like this: it's α plus the sum of expected sufficient statistics, minus λ, where the expected sufficient statistics are taken with respect to the optimized local parameters. Therefore we can construct a noisy gradient by sampling a data point at random and computing a scaled version of its contribution to that natural gradient. This is a good noisy gradient — and, by the way, a vanilla application of stochastic optimization. As you can probably see, its expectation is the exact gradient, because I'm choosing a data point uniformly at random, so it's unbiased; and it's cheap, because it only depends on the optimized local parameters of one data point. So rather than marching through all the documents, to get this noisy gradient I just pluck one document, find how it exhibits the local structure, and proceed. That gives us stochastic variational inference. I'm getting signs that I'm running out of time, so I'm going to skip the detailed slide, but more importantly the algorithm looks like this: I subsample from my data; I infer its local structure — in other words, I optimize its local variational parameter; I update the global structure by updating the global variational parameter according to the scaled natural gradient; and then I repeat. For LDA: sample one document, estimate how it exhibits the topics, and then update the global topics based on those estimates.
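Here is a minimal SVI sketch on a conjugate toy model of my own choosing — a Gaussian mean with only a global variable, so the local step is trivial. It shows the key move: a noisy natural gradient built from one subsampled data point, scaled as if that point were seen N times:

```python
import numpy as np

# Toy model (illustrative assumption): mu ~ N(0, 1), x_i ~ N(mu, 1).
# In natural parameters the exact posterior is alpha + sum_i t(x_i),
# with sufficient statistics t(x) = (x, -1/2).
rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(2.0, 1.0, size=N)

alpha = np.array([0.0, -0.5])      # natural parameters of the N(0, 1) prior
lam = alpha.copy()                 # variational natural parameters of q(mu)

for t in range(1, 5001):
    i = rng.integers(N)                            # subsample one data point
    lam_hat = alpha + N * np.array([x[i], -0.5])   # "as if seen N times"
    rho = (t + 10.0) ** -0.7                       # Robbins-Monro step size
    lam = (1 - rho) * lam + rho * lam_hat          # scaled natural-gradient step

post_mean = -lam[0] / (2 * lam[1])
print(post_mean, x.sum() / (N + 1))  # compare with the exact posterior mean
```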
Why is this a good idea? Here's a picture. On the x-axis of the top plot we have the number of documents seen, on a log scale — here I'm fitting an LDA model, and this is how many documents I've had to do some kind of inference about. On the y-axis is perplexity, a measure of model fitness where lower numbers are better. First look at the line labeled batch 98K: that's roughly a hundred thousand documents, and it's about what we could do with classical mean field variational inference. You can see that before we even get one point on this estimate of model fitness, we have to have observed a hundred thousand documents — the whole collection — and then at each iteration we process the whole collection again. With stochastic variational inference we can analyze the whole collection of three and a half million documents, and by the time we've processed the same number of documents that what's called batch inference processes in just one iteration, we're already at a much better place. More importantly, this is what lets us do things like estimate topics from millions of New York Times articles on a laptop. And again, this is general for that whole class of conditionally conjugate models: you can take any of those models from the 90s and 2000s and scale them up to very large data sets using stochastic variational inference. That's how we got those pictures. So that's roughly where things were a couple of years ago, and now I'll turn it over to Rajesh.

All right, thanks Dave. I'm going to get into the question: did we really hit that promise? We talked about how we have a question and some knowledge, we want to use probability models to express that knowledge, take our data, find the patterns, and then predict and explore. For conditionally conjugate models we saw a pretty good way to do this: stochastic variational inference scales, and it works for a pretty large class of models. But one question we might have is: what about the general case?

To get into this, let's go over the variational inference recipe — if you sit down with a model, what do you do? Here we have our little tired, maybe jet-lagged, PhD student. He's thought of a model, which is a joint distribution over the latent variables z and data x. The next thing that happens is that you choose a variational approximation, a distribution over the latent variables with some free parameters ν. Then we get an objective: the evidence lower bound, a function of ν, where the actual dependence on ν comes from the expectation. To optimize it, we need to take that expectation. For example, after taking the expectation we might get a function that is explicit in terms of ν, and to optimize it we take our standard approach: compute its gradient and optimize with some form of gradient descent. This recipe is fairly straightforward — work from the outside in and then put it into an optimizer — and we can summarize it with this picture: I have a model and a variational approximation; compute the expectation, take a derivative, and then optimize.

But let's see how this works for a fairly simple model: Bayesian logistic regression. If you're familiar with logistic regression — a binary label prediction problem — the Bayesian version puts a prior on the regression coefficients. Here x are the covariates, y are the binary labels, and z is the regression coefficient, which has a Normal(0, 1) prior. Let's make this even simpler, so we can actually see what happens: assume we have only a single data point, assume the covariates are scalar, and choose the approximating family to be a normal distribution — it's nice, and it has properties we can exploit. Now let's write down the evidence lower bound. To follow the recipe, we have to compute the integral — take the expectation. In the first step we just write it out; in the next step we use properties of Gaussian distributions, expanding the first term to get the expectation of the square, while the second term is the entropy up to some constant C. Next we expand out the likelihood of the data — the Bernoulli likelihood of a single data point. We can take the expectation of the first term, because we know the expectation of a Gaussian random variable, but on the last term we're stuck: we can't analytically take that expectation. We'd like to, because we want to make the dependence on the parameters explicit so we can use gradients to optimize — so we need something else.
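Concretely, with a single scalar data point (x, y), y ∈ {0, 1}, and q(z) = N(μ, σ²), the objective works out (up to constants) to the following, with the problematic term marked:

```latex
\mathcal{L}(\mu, \sigma)
= \mathbb{E}_{q}\big[\log p(y \mid x, z)\big]
- \tfrac{1}{2}\big(\mu^{2} + \sigma^{2}\big) + \log\sigma + C,
\qquad
\mathbb{E}_{q}\big[\log p(y \mid x, z)\big]
= y\,x\,\mu \;-\; \underbrace{\mathbb{E}_{q}\big[\log\!\big(1 + e^{z x}\big)\big]}_{\text{no closed form}}.
```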
There are some options. We can further bound that objective — analyze the function, find a lower bound, and get a consistent objective — but that's very model-specific, because the function looks different for different kinds of models. There are also more general constructions, but these still require computations tailored to the model. And while we might think, hey, it's one example, let's just work it out — here's a different list of models, all non-conjugate, meaning they don't fall into the class Dave described earlier: nonlinear time series models, models with attention, regression models, even fancier versions of the topic model, deep exponential families, Bayesian neural networks. Basically, the models from before were very limited, because they were defined with conjugacy in mind; now that we're free, this list can go on — we're still creating them. Because of that, we really need a solution that doesn't entail model-specific work, because this kind of derivation really slows down the process of developing models and figuring out the right tool for your data.

What we want is summarized in this picture: we want to take any model, massive data, and some reasonable facts about the variational families introduced earlier, feed them into a computation engine we'll call black box variational inference, and get a posterior distribution, or an approximation of it. So what, in the classical variational inference recipe, stops this from happening? It's the computation of that integral that makes the ELBO explicit in terms of the variational parameters ν — that's why it's highlighted in red; red is like a stoplight. But if we switch the order of the two steps — compute the gradient before computing the expectation, and then approximate the expectation — we might find success. Why might this work? For the same reason we saw earlier: stochastic optimization. But to do this, we'll need a general way to compute gradients of expectations.

I'm going to go over this slide carefully, because a lot of what I'll talk about next relies on it. Consider the term inside the ELBO, which is a function of the latent variables and the variational parameters: log p(x, z) minus log of the variational distribution. To compute its gradient without taking the expectation first, we write it in integral form and make the assumption that we can swap integration and differentiation, which holds in relatively general cases. The next line is just the product rule: we differentiate the first term, then the second. In the third line we rewrite the gradient of q with respect to ν using something called the log-derivative trick, and what this lets us do is introduce the density into both terms, so we can rewrite the whole object as an expectation. This is really the tool that makes the rest of the talk work.
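Written out, with f(z, ν) = log p(x, z) − log q(z; ν), the rule is:

```latex
\nabla_{\nu}\, \mathbb{E}_{q(z;\nu)}\!\big[f(z, \nu)\big]
= \mathbb{E}_{q(z;\nu)}\!\big[\, f(z, \nu)\, \nabla_{\nu} \log q(z; \nu) + \nabla_{\nu} f(z, \nu) \,\big],
```

using the log-derivative trick, ∇_ν q(z; ν) = q(z; ν) ∇_ν log q(z; ν).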
So, a roadmap: I'm going to cover two kinds of gradient estimators built from that differentiation rule, called score function gradients and pathwise gradients, and lastly I'm going to talk about how to make inference with large data even faster than we saw with stochastic variational inference.

First, score function gradients of the ELBO. In the very first term here, we've just written down the differentiation rule we derived. To simplify it, we use the fact that the second term is just the score function — this is how the score function estimator gets its name — whose expectation is zero. So we get the gradient as the expectation, under the variational approximation, of the model log probability minus the log of the variational approximation, times the score function. We call this the score function estimator, but it has other names: the likelihood ratio estimator, from the Monte Carlo literature, or REINFORCE, from reinforcement learning.

How do we use this? Now that the gradient is an expectation with respect to a distribution I know, the next step is Monte Carlo: I can construct noisy, unbiased gradients by sampling from q and averaging the quantity inside the expectation over those samples. And as we saw earlier, with unbiased stochastic gradients we get an algorithm that converges to a local optimum. Let's look at that algorithm — basic black box variational inference. You draw a bunch of samples from your approximation, you choose a learning rate from a Robbins-Monro sequence, and you update your current parameters with the learning rate times the Monte Carlo stochastic gradient we defined on the previous slide. That's really it; theoretically this works, given enough computation.

So what are the requirements — did we really meet the black box criteria? Looking at our formula: we need to be able to sample from the variational approximation, but we choose that, and it's not related to the model, so we have flexibility there. We need to evaluate the score function — again, we choose q, so we can derive it once, put it in a table, and reuse it. And we need to evaluate the log probability of the model, which is the same as specifying the model itself, plus the density of the variational approximation. The key thing here is that there is no model-specific work: our criteria really are satisfied — all I have to do is write down the joint distribution. So it really does look like the picture: I have a bunch of facts about my variational approximation — how to sample it, its score function, its density; I take in my model in the form of a joint distribution; I take in data; and I get an approximate posterior by running that algorithm.
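Here is a minimal score-function BBVI sketch for the running Bayesian logistic regression example, with a single hypothetical data point and q(z) = N(μ, σ²); note that only log_joint is model-specific, which is the black box point:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, y = 2.0, 1.0                        # hypothetical single data point

def log_joint(z):
    # log p(z) + log p(y | x, z): N(0,1) prior plus a Bernoulli likelihood
    # with sigmoid link, written in a numerically stable form.
    return norm.logpdf(z) - np.log1p(np.exp(-(2 * y - 1) * z * x))

mu, log_sig = 0.0, 0.0
S = 200                                # Monte Carlo samples per gradient
for t in range(1, 2001):
    sig = np.exp(log_sig)
    z = rng.normal(mu, sig, size=S)    # sample from q
    f = log_joint(z) - norm.logpdf(z, mu, sig)
    score = np.stack([(z - mu) / sig**2,              # d log q / d mu
                      (z - mu)**2 / sig**2 - 1.0])    # d log q / d log sigma
    grad = (f * score).mean(axis=1)    # noisy, unbiased ELBO gradient
    rho = 1.0 / (10 + t)               # Robbins-Monro learning rate
    mu, log_sig = mu + rho * grad[0], log_sig + rho * grad[1]

print(mu, np.exp(log_sig))             # fitted variational parameters
```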
But that algorithm doesn't work directly: variance can really be a problem. When you do stochastic optimization, your gradients are noisy, and the more noise in those gradients, the slower the optimization. This wasn't a problem in the classical setup, because you took the expectation first and then took the derivative, so you were noise-free. But now that we're sampling, the variance can be quite high. This picture gives some intuition: this is the score function of a Gaussian distribution, and this is the PDF of a Gaussian distribution. Intuitively, sampling rare values can lead to large scores, which can give you high variance, and this is a problem we need to address.

One solution is control variates. The idea behind a control variate, which comes from Monte Carlo estimation, is to replace the function whose expectation we're trying to compute with another function that has the same expectation but, hopefully, lower variance. One general way to create that function, f-hat, is to take the original function and subtract from it something with expectation zero: take a general function h and subtract its expectation, which leaves something with expectation zero. Here h is the control variate, a function of our choice, and you can choose a scaling factor to minimize the variance. This picture depicts what's happening: in red I have a Gaussian distribution, and in blue a function f whose expectation I'm trying to estimate — it's z + z². Say I use z² as the control variate, using the fact that I know the expectation of z² under a Gaussian. Then f-hat changes from this blue version to this green version, which has lower variability. If I take this all the way and set h equal to f, I actually get something with zero variance: you can see from the formula that the f's cancel, and f-hat just equals the expectation, which is exactly what we're looking for.

But we need a way to specify h, and we want to maintain our black box criteria, because we're after some level of genericness in our inference. That means we need a function with known expectation for a broad class of models — and we saw one already. If we set h to be the score function, we saw previously that it has expectation zero for a large class of q, so we can use it directly as a control variate to reduce the variance. There are many other techniques from Monte Carlo that apply here, and that are still being applied to variational inference: things like importance sampling, quasi-Monte Carlo, and Rao-Blackwellization, which is a kind of marginalization.
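A minimal sketch of the z + z² example just described, assuming z ~ N(0, 1) so that E[z²] = 1 is known:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(z)] for z ~ N(0, 1), f(z) = z + z**2 (true value: 1),
# using h(z) = z**2 as a control variate with known E[h] = 1.
z = rng.normal(size=10_000)
f = z + z**2
h = z**2
a = np.cov(f, h)[0, 1] / np.var(h)   # scaling that minimizes the variance
f_hat = f - a * (h - 1.0)            # same expectation, lower variance

print(f.mean(), f.var())             # plain Monte Carlo estimate
print(f_hat.mean(), f_hat.var())     # control-variate estimate
```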
Coming back to this list of models: with the algorithm we have, we can actually run inference for all of these models, and many more that we've yet to think of. The nice point to make here is that rather than designing models around inference, we can design them around the data we have and the problem we're trying to solve, and tailor them to that problem. So: our current set of black box assumptions are sampling from q, its score function, and evaluating densities. Can we make additional assumptions that make inference easier or faster? That gets us to the second estimator for variational inference, the pathwise estimator.

We'll make two assumptions. The first is that for our variational approximation — z is coming from the variational approximation — we can rewrite sampling in terms of a noise source that's parameter-free. That means: given some ε drawn from a distribution s with no parameters (so not dependent on ν), we can transform that noise source through a function that depends on the parameters, to get a random variable with the same distribution as the original. That's a mouthful, but we're familiar with several examples. A simple one: if you want a normal with some mean and some variance, one way to get it is to draw from the standard normal — there are no parameters there — and do a location-scale transformation: multiply by the standard deviation and add the mean. The second assumption is that the model and the variational approximation are differentiable with respect to the latent variables, and this smoothness assumption buys us something we'll see in the next slides.

To compute the gradient this way, recall our original ingredients: the score function times log p minus log q, plus the derivative term. We can rewrite this using our transformation, and when we do, we get a very similar form, except now we have the score function with respect to the parameter-free distribution, the ELBO evaluated at the transformation, and the gradient of the transformed ELBO with respect to its parameters. One thing to note: in the score function gradient estimator the second term was zero; here, because the noise distribution does not depend on the parameters, it's the score term that's zero, so we can simplify. Expanding the remaining term out and using the chain rule, we get the derivative of the model times the chain-rule term, minus the score function — and we can get rid of the score function term using the same fact we used earlier. This is also known as the reparameterization gradient, if you're familiar with that.

Why would we want to do this? We've limited the class of models and made some assumptions that may seem a little odd, but it really does buy us something. Here we have the variance of the gradient on the y-axis and the number of Monte Carlo samples used to estimate it on the x-axis. The basic estimator described has variance that's orders of magnitude larger than the pathwise estimator; it's also larger than the score function estimator with the control variate, and the pathwise estimator is smaller still. This is why this approach is really popular. Comparing the two: the score function estimator differentiates the density, while the pathwise estimator differentiates the function — and these are really the only two options, because if you remember what the integral looks like, it's the integral of a density times a function. The score function estimator works pretty generally: it works for a broad class of approximations, and we don't need q to be reparameterizable, but variance is really a big problem. The pathwise estimator is all about making an extra set of assumptions — differentiable models, and some amount of reparameterizability — to get generally better-behaved variance, so you can deploy it on new models more easily.
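A minimal pathwise sketch on the same toy logistic regression, with hand-written derivatives standing in for automatic differentiation; note the gradient flows through z = μ + σε rather than through log q:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = 2.0, 1.0                          # hypothetical single data point
u = (2 * y - 1) * x

def dlogp_dz(z):
    # d/dz [ log N(z; 0, 1) + log Bernoulli(y | sigmoid(z*x)) ]
    return -z + u / (1.0 + np.exp(u * z))

mu, log_sig = 0.0, 0.0
S = 10                                   # few samples suffice: low variance
for t in range(1, 2001):
    sig = np.exp(log_sig)
    eps = rng.normal(size=S)             # parameter-free noise source
    z = mu + sig * eps                   # the transformation z = t(eps; nu)
    g = dlogp_dz(z)
    g_mu = g.mean()                      # chain rule: dz/dmu = 1
    g_logsig = (g * sig * eps).mean() + 1.0  # dz/dlog_sig = sig*eps; +1 from entropy
    rho = 1.0 / (10 + t)
    mu, log_sig = mu + rho * g_mu, log_sig + rho * g_logsig

print(mu, np.exp(log_sig))               # fitted q(z) = N(mu, sigma^2)
```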
The last thing I'm going to talk about is amortized inference. Recall the hierarchical model Dave covered: a global latent variable beta shared across the data, local variables z_i, and data x_i, with one local variable per data point — that's why they're in the plate. The joint distribution is a product over those local factors, and as we saw earlier, we can define a mean field variational approximation and do stochastic variational inference. But there's a problem with stochastic variational inference in this setup: the expectations we require are no longer tractable. It's the same problem we saw earlier, when we couldn't get a nice analytic form for the integral. We could do the same trick of stochastic optimization inside an inner loop, but then the stochastic variational inference algorithm gets slow, because instead of just writing down an equation for an expectation, you run an optimization algorithm for every data point you see. So the idea here is to learn a mapping that goes from a data point to its local variational parameters.

Let's see how to do that. At the top we have the evidence lower bound for a general hierarchical model: it has the data terms, and it has the entropy part, which I've expanded so you can see the per-data-point variational parameters. The way amortization works with an inference network is to say: instead of having a separate φ_i for each single data point, make φ_i a function of x_i, with new variational parameters θ. This is called amortized because the variational parameters that used to be one per data point have been replaced by the parameters of a function that is shared across data points. Making this change lets us apply the black box inference techniques directly, to get something that scales across data and is general for a broad class of models. That's what this algorithm does: we sample the global latent variable from its variational approximation; we sample a data point; we sample its local latent variable given its parameters; and then we use stochastic optimization to update both the variational approximation on the global variables and the parameters of the inference network on the local variables — those are the two updates.

This is really a computational and statistical trade-off. What I mean is that by choosing the amortized family, we've shrunk the class of variational approximations, and how much we've shrunk it depends on the complexity of the function. One way to think about it: imagine the best case, where there are optimal values of φ_i for each data point and the function perfectly predicts them — that is the best you could do, which is why this is a smaller class than the original.

Let's look at an example of how this is used: a popular model called the variational autoencoder. From the generative standpoint, it puts a prior over a vector of latent variables — simply Normal(0, 1) — and generates the data with, say, a normal distribution whose parameters are functions that take the local latent variable as input; both functions are generally some form of deep network. To do inference, we use an inference network: we specify a mean field approximation where the parameters of the approximation — in this case a normal distribution — come from functions of the data, so both the mean parameter and the variance parameter are computed from the data. The picture on the left describes this process: the model is generative in the sense that you take a latent variable, pass it through a set of deep functions, and get a likelihood function from which you can sample data; for inference, the process is reversed — we take data and pass it through two deep networks to get the parameters of the variational approximation.
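A minimal amortized-inference sketch, with small linear maps standing in for the deep networks of a VAE, on a toy linear-Gaussian model of my own choosing (so the exact posterior is available for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative assumption): z_i ~ N(0, 1), x_i | z_i ~ N(w*z_i, 1)
# with w = 2. The exact posterior mean is w*x/(1 + w^2), so a* = 0.4 below.
w = 2.0
X = w * rng.normal(size=500) + rng.normal(size=500)

# Linear "inference network": q(z | x) = N(a*x + b, exp(c)^2), theta = (a, b, c).
a, b, c = 0.0, 0.0, 0.0
rho = 0.01
for t in range(5000):
    x = X[rng.integers(len(X))]          # subsample a data point
    mu, sig = a * x + b, np.exp(c)
    eps = rng.normal()
    z = mu + sig * eps                   # reparameterized local sample
    dlogp = -z + w * (x - w * z)         # d/dz [ log p(z) + log p(x | z) ]
    g_mu = dlogp                         # pathwise gradient w.r.t. mu
    g_c = dlogp * sig * eps + 1.0        # w.r.t. log sigma (+1 from entropy)
    # Chain rule through the shared parameters theta = (a, b, c):
    a, b, c = a + rho * g_mu * x, b + rho * g_mu, c + rho * g_c

print(a, b, np.exp(c))   # ideally near 0.4, 0, and 1/sqrt(5) ~ 0.45
```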
The other pictures show simulation results: how well can you generate, say, faces or house numbers. The last one is a different task: we take the original data, corrupt it — the corrupted version is what the model sees — and try to do inference to recover the original values based on what the model believes; this too is just a posterior calculation.

So in the last part of my segment I'm going to go over some rules of thumb. The first important ones concern how to choose between these two kinds of estimators. If your model is differentiable, the first thing to try is a variational approximation that is reparameterizable, because variance is an issue in the other case and the pathwise estimator is well behaved here. If your model is not differentiable, the process is a little slower: you'll use the score function estimator, because that's all you can use; you should use it with a control variate right away, and based on experimental evidence you'll likely have to add further variance reductions. What I mean by that is you'll plot the variance of your gradients, you'll see how much progress you're making, and if you're making too little progress you'll probably have to adopt one of the other techniques I listed earlier.
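As a minimal sketch of the control-variate advice (toy integrand, not from the tutorial), here is the score-function estimator with a simple baseline subtracted from f; the estimator stays unbiased for any fixed baseline because the score has expectation zero:

```python
# Score-function estimator with a baseline control variate (toy setup).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.0, 1.0, 10_000

f = lambda z: -(z - 3.0) ** 2               # same illustrative integrand as before
z = mu + sigma * rng.standard_normal(n)
score = (z - mu) / sigma**2                 # d/dmu log q(z; mu, sigma)

# In practice the baseline would be, e.g., a running mean of f; here we
# just use the batch mean. Unbiasedness holds for any constant baseline
# since E[score] = 0.
baseline = f(z).mean()
plain = f(z) * score                        # vanilla score-function estimator
controlled = (f(z) - baseline) * score      # baseline-corrected version

print("variance without baseline:", plain.var())
print("variance with baseline   :", controlled.var())
```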
There's some more general advice that is important for optimizing these problems, too: don't use the plain Robbins–Monro sequences; use something like RMSProp or AdaGrad, which do coordinate-specific learning rates. There is also annealing, which balances the cost between the regularization term — the entropy that Dave talked about — and the likelihood, and which can help avoid getting stuck in local optima early on. From a computational standpoint, the algorithms we described are really embarrassingly parallel across samples, so implementing them that way can make the entire inference much faster. For software, there are two kinds of systems that are useful here. The first kind have variational inference built in; these are probabilistic programming languages, and there are a lot of them — Venture, WebPPL, Edward, PyMC3 — and what they're really good for is trying out a broad class of models. The second kind are math libraries: they do differentiation and provide other utilities, like log probabilities, and they're useful for getting a faster implementation of an individual model, because you can take advantage of that model's structure in your implementation. Thanks.

[Applause]

I'd like to take you through a bit more of the recent work that's happened in variational inference in the past few years. To get to this point we've needed a number of ingredients. We began with Dave, who introduced us to the principles of probabilistic modeling, the principles of variational inference, and how we can scale our models using stochastic optimization. Rajesh then introduced us to black-box variational inference methods and how to automate the process of variational inference: how to extend it to non-conjugate models, use Monte Carlo gradient estimators, and use amortized variational inference. Without quite realizing it, we have now been empowered to answer one of the key questions in variational inference: how do you choose that distribution q? How can you get the best approximation possible? For the next thirty minutes we're going to explore different ways of doing that.

This was the variational inference picture we showed in the beginning: you need to find the best approximation to the true posterior distribution p(z | x), and what we need to do is specify some family of distributions, which we call q, with some variational parameters nu. Up to this point, everything we did used the mean-field approximation, sometimes called the fully factorized approximation. In my little cartoon here we have three dimensions of a latent variable z, and the latent variables have no connections between them; we assume they are independent Gaussians, for example. Once we make that assumption we can optimize our variational lower bound using the Monte Carlo techniques Rajesh just described. So the question is: is this a good idea?

The way we'll answer that is by exploring some real-world posteriors. Rajesh just described this model, the deep latent Gaussian model: it consists of a latent Gaussian variable that goes through a deep neural network — in the example we'll look at, two fully connected hidden layers — and we model MNIST digits with a Bernoulli distribution at the end. These two plots are for one of the MNIST digits, the digit five in this case, and we're looking at the true posterior distribution, in gray; the two plots show the same posterior at different zoom levels. There are interesting things to learn from this plot already, with just a two-dimensional latent variable. One is that this posterior looks quite Gaussian, so a Gaussian could do quite well, but it has a slight bit of correlation — there's a little tilt in that gray contour (ignore the blue curves). If we use the mean-field approximation, we only have axis-aligned Gaussians, which means we will never be able to model even that small amount of correlation. So a conclusion already, from simple plots like this: the mean-field, fully factorized assumption will usually not be sufficient.

Let's look at a few other diagrams. Here are some of the other MNIST digits; there are four plots, each with four subplots, but each is again the same posterior distribution at different zoom levels, and we'll focus on the gray contours. We see a lot of interesting things. In the first, there can be very strong dependencies — strong correlation — between the dimensions. In the second, the distributions can look somewhat spherical, but they aren't quite Gaussian. In the third, the distributions are somewhat multimodal, with some weak density connecting the two parts of the mode. And in the last, you can have very heavy tails with very sharp cutoffs. So the lesson is that the posterior distributions we see in the real world have complex dependencies, are typically non-Gaussian, and can have multiple modes, and those are the three things we're going to look for. This means we have two high-level goals: we want to build very rich posterior distributions — non-Gaussian, with complex dependencies, possibly multimodal.
But at the same time, everything we just learned from Rajesh and Dave about maintaining computational complexity and scalability, we need to keep. So what we have is a spectrum of approaches: on one end is the true posterior distribution, the best we could do, but the true posterior is unavailable to us; on the other end is the mean-field approximation, the least expressive, simplest thing we could do; and in between lies a whole range of ideas to explore. Everything you know about building models can now be applied here, because the problem of designing good posterior approximations is exactly the same as the way we think about building models themselves; we're going to use everything we know, in a somewhat different way.

The first way to think about this is: how can you improve the fully factorized approximation by introducing some structure? This is called structured mean field. In my example, instead of having no dependency between the individual dimensions of the latent variable, I'll have a dependency between z1 and z2, and between z2 and z3. Structured mean field is any posterior approximation where we introduce some form of dependency within the approximation, and this can be very general.

The first and simplest way of introducing dependency is to improve the diagonal Gaussian approximation we've been using and just use a correlated Gaussian. A correlated Gaussian is a distribution with variational parameters nu — the mean and covariance of that Gaussian — and now all the dependency structure lives within the covariance. So we can think about the different covariance models available: we start with the mean field, which corresponds to the diagonal Gaussian; we can add a rank-1 term, which captures one direction of correlation; and we can continue adding a few more directions at higher rank, up until we reach the full covariance Gaussian.

This little plot gives you an indication of what can happen by building better models and using better inference. At the top, the highest value is factor analysis, the simplest model: the linear latent Gaussian model with a factorized posterior gives a value that is quite high. Once we move to a nonlinear model, we can make significant gains in our understanding and explanation of the data; but the wake-sleep algorithm used to fit it doesn't have a unified objective, whereas variational inference gives us a principled way of deriving one, so even with the mean field we do better, and with a rank-1 approximation we do better still. That is the lesson of building better posterior distributions.

There are two limitations to this line of thinking. One is cost: moving from the mean field to rank-1 keeps the computation linear in the number of latent variables, but once we move to the higher-order covariance models we go from linear to cubic in the number of latent variables, and that cubic cost will not be acceptable, so we cannot actually use those models. The other limitation is that these posteriors will always be Gaussian, which is one of the things we did not want.
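Here is a minimal NumPy sketch (assumed parameterization, illustrative sizes) of the rank-1 family just described: the covariance is diag(d²) + u uᵀ, and reparameterized sampling needs only one extra noise scalar per sample, keeping the cost linear in the number of latent variables:

```python
# Rank-1 plus diagonal covariance Gaussian, sampled in O(K) per draw.
import numpy as np

rng = np.random.default_rng(2)
K = 5
mu = rng.normal(size=K)          # variational mean
d = np.exp(rng.normal(size=K))   # diagonal scales (kept positive)
u = rng.normal(size=K)           # rank-1 direction

def sample(n):
    eps = rng.standard_normal((n, K))   # one noise vector per sample
    eta = rng.standard_normal((n, 1))   # one extra scalar per sample
    return mu + d * eps + eta * u       # Cov = diag(d**2) + u u^T

zs = sample(200_000)
empirical = np.cov(zs, rowvar=False)
exact = np.diag(d**2) + np.outer(u, u)
print("max abs covariance error:", np.abs(empirical - exact).max())
```

A full-covariance Gaussian would instead need a Cholesky factor, which is where the cubic cost mentioned above comes from.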
So what can we do better? To move beyond the Gaussian, the simplest thing is to use one of the first models you probably ever learned about: the nonlinear autoregressive model. We can use it to build posterior distributions. In this example, look at dimension z4: z4 depends on all the latent variables that came before it — z3, z2, and z1. We introduce an ordering on the latent variables, and each of the connections can itself be a deep neural network, so this can be very flexible. Each of the conditional distributions is Gaussian, but the joint distribution over all of them is most certainly not Gaussian, so this can be a very good approximation.

To give you an idea of what can be done, the bar plot shows results on the MNIST dataset. The VAE we looked at before, with a mean-field approximation, gets around 86 nats, which is a really good number; but when we use a better posterior — this kind of nonlinear autoregressive posterior that can induce complex dependencies — we gain about 5 nats. I can't easily convey how much 5 nats is, but it's a lot. And if we look at the samples we can generate: instead of something very blurry from the VAE, we now get much more structure, more diversity of colors, and a bit more coherence. These are not perfect images, of course, and the most modern work can do much better, but this is the sense of what better posteriors mean. Again, the joint distribution is non-Gaussian, and because of the autoregressive structure the complexity stays linear in the number of latent variables.
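As a minimal sketch (assumed tiny networks with random weights; sizes are illustrative) of the nonlinear autoregressive posterior just described — each conditional q(z_i | z_{<i}) is Gaussian, but the joint is not, and both sampling and the log-density take one pass per dimension:

```python
# Nonlinear autoregressive variational posterior, ancestral sampling.
import numpy as np

rng = np.random.default_rng(3)
K, H = 4, 8
# One small net per dimension, mapping z_{<i} (zero-padded) to (mu_i, log_sigma_i).
W1 = rng.normal(0, 0.5, (K, H, K))
W2 = rng.normal(0, 0.5, (K, 2, H))

def conditional_params(i, z_prev):
    x = np.zeros(K)
    x[:i] = z_prev                      # condition only on earlier dimensions
    out = W2[i] @ np.tanh(W1[i] @ x)
    return out[0], out[1]               # mu_i, log_sigma_i

def sample_and_logq():
    z, logq = np.zeros(K), 0.0
    for i in range(K):
        mu_i, ls_i = conditional_params(i, z[:i])
        z[i] = mu_i + np.exp(ls_i) * rng.standard_normal()
        # Accumulate the Gaussian log-density of this conditional.
        logq += -0.5 * ((z[i] - mu_i) / np.exp(ls_i)) ** 2 - ls_i - 0.5 * np.log(2 * np.pi)
    return z, logq

z, logq = sample_and_logq()
print("sample:", z, " log q(z):", logq)
```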
So since we can use the nonlinear autoregressive model, what other models could we use as approximating distributions? One popular approach is a mixture model, which is a very good idea; or we can start with the mean-field approximation and use some form of binding function, this function C, to introduce dependencies. If you look at all the models you might use, and think about ways of building these better posterior distributions, a recipe emerges. It suggests that we should introduce some new variables into our approximation and use those variables to induce dependencies; and for every new variable we introduce, we have to think about how to keep the approximation tractable and efficient. We're going to explore this in much more detail.

Here is the general recipe for the rest of this part. We're going to introduce some new variables, which I'll call omega. These new variables are going to help us: they're something we can play with to build a much richer approximation. While we're interested in the distribution q(z | nu), we're going to form the joint distribution q(z, omega | nu); we should integrate omega out, but instead we're going to work with the joint, which is what gives us tractability. Because we now work with this joint, the bounds we already had might not apply, and we'll have to think about how to modify our bound. Usually the likelihood term is fine to handle, but the entropy term is the difficult one, and that's one of the things we'll think about quite a bit. At all points we'll be thinking about the computation — what it actually implies — and ensuring we stay linear in the number of latent variables.

There are two general approaches we'll look at. One is the change-of-variables method, where we'll explore techniques that go under names like normalizing flows, and look at the role played by invertible transformations and functions. The other is auxiliary-variable methods, where we'll look at ways of building entropy bounds and using Monte Carlo estimation.

So first, the change-of-variables methods. This is also one of the first things you learned in introductory probability: the rule for the change of variables of a probability distribution. We start with a simple distribution q0 — assume it's a Gaussian, for example. If we take samples from that distribution and transform them through some invertible function, then we know the distribution at the end, because we can apply the change-of-variables rule, which involves taking the original density and dividing by the absolute determinant of the Jacobian of the transformation. Here's the cartoon: we start with the distribution q0, a Gaussian; we transform it through this nonlinear function f, which must be invertible; that function transforms the density and gives us something a bit more complicated. And we can do this multiple times — we can apply as many functions as we like — to make the distribution as complicated as it needs to be.

There are two important properties of this process. The first is that sampling is very easy: to generate a sample from the final distribution at step T, we generate a sample from our initial Gaussian and push it through the sequence of functions, and what comes out at the end is a sample from the complicated distribution — exactly what we need for the stochastic optimization Rajesh described. The second is that we can compute the entropy, the term that is always in our bound, and the entropy is also easy: the log density under the transformation is just the log of the initial distribution, which we always know, minus the log-determinants of the Jacobians, which we can always compute because we get to design the function f. We call this a normalizing flow, because the initial distribution flows through this sequence of transformations, and it can be one of the most powerful ways of building distributions.

I want to give you an intuition for what this means. In the first column we have two initial distributions, a spherical Gaussian and a uniform distribution, and we've chosen one particular kind of simple function with random parameters; we'll look at what these transformations do. Each transformation can perform a number of operations on the initial density: you can contract the density, as happens in the first row for the Gaussian; you can expand it, as happens after two transformations, which then allows multimodality; and after ten of these transformations you have something very complicated and multimodal, with lots of structure.
The density mass gets allocated in lots of different ways, and the same happens for the uniform. These final distributions meet all the requirements we wanted: complex dependencies, multimodality, and non-Gaussianity. Now for some actual examples with real targets. Look at the first column: we have these two half-moons; after two normalizing-flow transformations you can see we've already learned the mean and that there are two different modes; we can apply as many of these functions as we need, and by step K = 32 we've characterized the true posterior very well. I think this image gives you a different way of thinking about what it means to build a richer posterior distribution. Another way to think about it: what can I do if I have more computation? Can I allow my system, given more computation, to learn a better approximation? This is a theme you'll see in many talks throughout the conference, especially in deep learning: using adaptive computation, applied on the fly, to learn richer things — in this case, posterior distributions.

The key question, then, is how to choose the function f. We can't use just any function: it must be invertible, and it needs to allow us to learn in linear time. The bound is easy to adapt, because the initial term is just an expectation under the factorized Gaussian, and the log-determinant term is also easy to compute, so we always start from a simple Gaussian. There are a number of functions we can use; here are three examples. The first is the one I used in the previous examples; we call it the planar flow, and it's a simple function that either learns the identity or a one-layer nonlinear transformation using, for example, a tanh layer. You may have seen this shape in other models: in a large-scale classifier, a function like this would be called a residual network, and the same thinking applies here. There are two other kinds of functions: one called the real non-volume-preserving transformation (Real NVP), and a more recent one called the inverse autoregressive flow. Each can be wired in different ways, but the key property of these two is that the Jacobians you end up with are triangular, and triangular Jacobians have the special property that their determinants can always be computed in linear time, because all you need is the diagonal. These can be very powerful; the last one especially, the inverse autoregressive flow, can be implemented very easily, combined with lots of convolutional networks, stacked, and extended with other methods. For any of these you get linear-time computation of the determinant, which is what you need to evaluate the lower bound, and linear-time computation of the gradients of the free energy — the variational objective — which is what you need to do learning.

Again, to compare: these are the two results we had before, with the autoregressive posterior, and if we add the results from the inverse autoregressive flow we do even better, and we can be more flexible depending on how much computation we're willing to spend. If we look at sampling, we go even further: even more structure, much more diversity of color, much more consistency between images, here on CIFAR.
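The planar flow named above is simple enough to sketch directly. Here is a minimal NumPy version (random parameters, 2-D, purely illustrative) that pushes a sample through a stack of flows while tracking log q via the change-of-variables rule — exactly what the lower bound needs:

```python
# Planar flow:  f(z) = z + u * tanh(w @ z + b)
# log|det J| = log|1 + u @ psi(z)|, with psi(z) = (1 - tanh(w @ z + b)**2) * w.
import numpy as np

rng = np.random.default_rng(4)

def planar_flow(z, logq, u, w, b):
    a = w @ z + b
    z_new = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w
    logq_new = logq - np.log(np.abs(1.0 + u @ psi))   # change of variables
    return z_new, logq_new

# Start from a standard Gaussian and apply T flows.
# (In practice u is constrained so that w @ u >= -1, keeping f invertible;
# the random parameters here are only for illustration.)
T, D = 10, 2
params = [(rng.normal(size=D), rng.normal(size=D), rng.normal()) for _ in range(T)]

z = rng.standard_normal(D)
logq = -0.5 * (z @ z) - 0.5 * D * np.log(2 * np.pi)
for u, w, b in params:
    z, logq = planar_flow(z, logq, u, w, b)
print("final sample:", z, " final log q:", logq)
```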
OK, there's a different strategy you can take, and it's a very popular one for model building in general. When you build a model of complex data, one of the questions you ask is: what happens if I use latent variables to make my model better? You can ask the same question here: can I use latent variables to make my posterior distribution better? That's exactly the approach we're going to explore — introducing additional variables omega that will help induce dependencies and build a better distribution. This is the approach of building a hierarchical model to represent your posterior distribution; we call these hierarchical variational models. Unlike the previous case, where the additional variables omega were deterministic — known transformations of previous variables — here the variables omega are stochastic. This addresses a limitation of the normalizing-flow approach: normalizing flows can only be applied to continuous distributions, because we needed invertible, differentiable functions; with this approach we can build much richer, complex distributions that can be discrete, continuous, or even a mixture of the two, and this is why it is very appealing.

The way to think about it is to ask: if these additional stochastic variables have entered my posterior distribution, what change would I have to make to my model so that, when I apply the rules and principles of variational inference, those variables omega would magically appear in my variational bound? On the left is our original model, the deep latent Gaussian model, with a latent variable z and an observation model for x. This model must always remain unchanged, because it is the model that, through our loop of thinking, is the one we are actually interested in. So the only modification we can make is to introduce a variable omega on the side, dependent on z and x; in this graphical model, the distribution of omega is this distribution I'm calling r.

There's something interesting here: we now call omega an auxiliary variable, and the graphical model tells you why. Observing omega can never give you any information about z or its dependency on x; because the auxiliary variables live outside, they play no role in building better models. The reason we're interested in them is the impact they have on inference: auxiliary variables are one of the most powerful tools we have for inference in general. What they give us is a very clever way of building a mixture model in our posterior: they introduce correlations, and effectively we build a mixture of distributions of z given x and omega; by varying omega we vary the mixture, which means we can adapt to whatever dependency structure is present in the posterior.

So let's think about how we'd have to adapt our variational method.
This is our original variational objective at the top. The first term we can usually compute quite easily, because we just need an expectation under some samples; but the second term, the entropy, is typically more difficult, because it involves this log q term. So, for our new auxiliary-variable model, since we need to build an inference network, we'll build two posterior distributions: a q of omega given x, and a posterior over the latent variables we're actually interested in, a q of z given x and omega — this is where you can see the mixture appearing. Then we introduce a new bound, which we'll call an auxiliary-variable bound. It's simple to get from the top: we form the new joint distribution, which includes this new distribution r, so a log r term appears, and we extend the entropy term to be an expectation over the joint. What turns out, as you'll see, is that this just subtracts a non-negative term from our original objective; and because we can choose r and q and learn them through stochastic optimization, we have the ability to make that second KL term close to zero, and that is how we will learn.
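In symbols (notation reconstructed from the description above; L denotes the usual evidence lower bound), the auxiliary-variable bound can be written as:

```latex
\begin{align*}
\mathcal{L}_{\text{aux}}
&= \mathbb{E}_{q(z,\omega \mid x)}\big[\log p(x,z) + \log r(\omega \mid z,x)
   - \log q(z \mid x,\omega) - \log q(\omega \mid x)\big] \\
&= \mathcal{L}
   - \mathbb{E}_{q(z \mid x)}\big[\mathrm{KL}\big(q(\omega \mid z,x)\,\|\,r(\omega \mid z,x)\big)\big]
\;\le\; \mathcal{L} \;\le\; \log p(x),
\end{align*}
```

which makes explicit the non-negative KL term that good choices of r and q can drive toward zero.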
So that's the key question for auxiliary-variable models: we just need to choose the auxiliary prior r and the auxiliary posterior q, and there are lots of ways to do that. One I want to point out: if you choose r as an independent Gaussian, with no connection to the data, you get what is called Hamiltonian flow. This connects to the previous way of thinking about normalizing flows, and it connects to auxiliary-variable samplers you already know: you can use Hamiltonian Monte Carlo within variational inference, which can be very powerful, and Langevin sampling is also in that class of methods. But you can obviously do better than an independent Gaussian with no dependencies: you can use a Gaussian that depends on the data, the autoregressive distribution we covered earlier, mixture models, a normalizing flow, a Gaussian process — all of these become available. Putting them together, we can build distributions that are as good as anything else we have, with the additional flexibility that this class of models handles both continuous and discrete distributions. So the conclusion: we still have easy sampling, easy evaluation of the bound and its gradients, always linear in the number of latent variables, and this is currently one of the nicest ways of doing this.

Looking at all the different ways of dealing with posterior distributions, we have this spectrum. On one end is the true posterior we're trying to get to, and we began at the very other end with the fully factorized mean-field approximation. We took steps to get closer and closer to the true posterior: we started with covariance models with simple structure, but those were always Gaussian; mixture models, which can be difficult to learn; nonlinear autoregressive models, which did much better; and then normalizing flows and auxiliary-variable methods, which let us use more computation on the fly to build better and richer posteriors. The question you're left with is: how do I actually choose my best posterior? And so we go back to the very beginning of the talk, to what Dave mentioned: Box's loop — the loop of thinking about your problem, building the simplest model, starting with the simplest kind of inference, understanding what's going on, and going around again, building more understanding and intuition, using richer and richer posteriors.

So we've reached the end of the talk, and this is our summary slide. We wanted to introduce you to variational inference, the approach of learning approximate posterior distributions within some family. We introduced stochastic optimization as the key tool that lets us scale variational inference to massive data sets and apply it to the widest class of problems, especially non-conjugate and nonlinear models. And we were able to use very flexible, rich posterior distributions. Together these give us a set of tools that let us really scale variational inference to modern problems, and we think this will be increasingly important as we look to build machine learning with higher impact and at larger scale. So on behalf of myself, Dave, and Rajesh: thank you all for coming this morning.

[Applause]

We have lots of time for questions; there are mics there and there if anyone has questions. Yes, over here.

Q: What do you see as the next step in approximating the true posterior — where is the effort going toward that goal?

A: I think looking at variational approximations that don't necessarily have analytic densities. We laid out these criteria: needing score functions, needing to be able to evaluate the density and sample from it. But there are a lot of distributions you can construct where you can only simulate from them, and working with those kinds of approximations would be pretty cool.

Q: A question on the last part that Shakir presented: the models with auxiliary variables — is this essentially an approach with an encoder and a decoder, in some way?

A: Yes, I described it in the framework of encoders and decoders, so it can be used exactly in that setting of amortized variational inference and the variational autoencoder framework. But it's also more general: I haven't yet seen an example of applying auxiliary variables to Bayesian neural networks, where you want to learn posteriors over parameters, but the approach is applicable in that setting too.

Q: Variational inference, at least when it started, was kind of an analytical alternative to sampling — if I quote David correctly, you used to say that variational inference is what you do while waiting for your samples to converge. But in this talk I saw that sampling is being drawn into the picture, even inside variational inference, to compute those gradients that aren't analytically tractable anymore. Is there some convergence between MCMC and variational inference? Would it be fair to expect that a combined method will emerge that nicely unifies the two?

A: That's a great question; a few remarks. It's true that the old joke was that you would derive a Gibbs sampler, start running it, and while it was converging write pages of math to derive your variational inference algorithm and implement it,
and if you were done with that before the Gibbs sampler had converged, you used whichever was done first. Now you can see from this tutorial that that's really changed. These methods have become much more generic: you write down a model, and if we put it into some conditionally conjugate form we can immediately write down the coordinate-ascent inference algorithm — for example, for the LDA model I talked about, we had a whole long appendix deriving that, and we don't need it now — or we can work with one of these other approximations, like the score gradient or the reparameterization gradient, without even having to do the conditionally conjugate analysis. So it's become easier. As you point out, though, sampling is now in the mix: sampling is part of this variational process. But there is a real distinction: there are two different philosophies of approximate inference — "philosophy" makes it sound more important than it is — two different approaches. One, as in Gibbs sampling, is creating a Markov chain whose stationary distribution is the target; here, we're using sampling as part of the optimization procedure. That said, there has been interesting work over the last couple of years — maybe you two remember the references better than I do — there's work by Salimans and Knowles that tries to bring together these two perspectives, and there's work in the context of stochastic variational inference by Matt Hoffman that uses MCMC to approximate intractable optimal variational distributions. So indeed, there are places where these two approaches to approximate computation are coming together.

Q: Thanks for the great tutorial. I have a question about the hierarchical model at the end, about the direction of the dependencies in that hierarchical structure: why doesn't the direction matter? It seems in the case of causal inference the direction does matter, and I wonder whether that means we need additional structure learning to find the hierarchical structure.

A: I think structure and causal inference are important, but when you build a hierarchical variational model, the idea is to condition the new things you're adding on the structure you already believe, so that if you marginalize out the extra stuff you still have the structure you actually care about; you're just getting better posterior approximations for, say, the parameters in your causal model.

Q: At the end, for the auxiliary variable, the direction for the auxiliary omega is actually from x to omega, which surprised me; I thought it might go from omega to x, as an observation.

A: I think you want it that way because that's what gives you the marginalization property: in any graphical model, the things hanging off the bottom can be integrated out to recover the original model, and that's what conditioning in that direction gives you. Your question is a good one, and it brings up a higher-level comment I want to make: especially toward the end, it starts looking like we're doing modeling at every stage of the process — both when we're building the model and when we're building the posterior — and you're asking about the structure of the model of the approximate posterior. You want to have different sensibilities when you're building these two types of models.
When you're modeling your data, you're really trying to simplify your data, understand it, form predictions, and generalize to new data. When you're building a model that represents your approximating family, you want something as flexible as possible, subject of course to the statistical–computational trade-off that Rajesh brought up. One of the open areas in variational inference is to think about variational inference as this estimation problem: what are we trading off, and how are these two different sets of considerations articulated — one where we want to simplify our data and predict the future, and the other where we want the most expressive class of approximate posteriors over which we can still hope to do some kind of variational optimization that gives meaningful results? That's a good question.

Q: Hi, I'd like to ask about using more complex approximating distributions than mean field. I think there are two ways to decrease the variational gap: one is to use a more complex approximating distribution; the other is to make the distribution that is actually being approximated easier to approximate. For example, in the variational autoencoder scenario, if we have a more complex generative model that yields a posterior that is more factorized — easier to approximate — is there any reason to prefer putting the complexity into a more complex approximating distribution rather than into making the generative model easier to approximate?

A: OK, we've agreed that I'll make a couple of comments about this. That's a really great question — it's kind of the age-old question of whether you want the right answer to the wrong question or the wrong answer to the right question, and I think you want the wrong answer to the right question, though this is a matter of debate. It goes back to what we were talking about earlier: when choosing a model, you want something that simplifies your data and predicts new data; you don't want to be hindered by things like conditional conjugacy — as in Rajesh's piece and in Shakir's piece — but at the same time, as you point out, that adds complexity to the subsequent computation, inference, and optimization. So there's no clean answer to when you should make your posterior more complex versus simplify your model. One thing your question brings up is that traditional methods of model selection kind of go out the window, because we are now evaluating our model and our approximate inference as a bundle, and that's important. A lot of the results Shakir showed do just that — some downstream prediction — because we can't separate the choice of model from the choice of approximate inference; they're connected in ways that are hard to understand, and that's another open area, of course. Thank you.

Q: Okay, well, I have a question. Variational inference is often criticized for underestimating the variance. Is there anything new that can get the right variances out of variational inference, or what could we do to address this issue?

A: Yes — Tamara has some nice work on using perturbation theory to find estimates of smooth functions, a covariance being an example, under the mean-field approximation. That being said, I think the richer approximations that Shakir talked about can help in this way too.
If you have a full-rank structure and you transform that structure, you can have high confidence that you'll capture the correlation structure, though at a computational cost relative to Tamara's work.

Q: Let me follow up on that one. You brushed a little quickly over how you were evaluating your different methods. Could you talk a bit more about what you're using to evaluate right now, what you think is the right way to evaluate, and maybe a third point: criticism, as separate from evaluation?

A: I'll start with what we currently do. This is the question no one can agree on, because we don't actually have a good way of evaluating these models: we typically want to use them for something else, and we often don't have an evaluation for that task either. Right now, all the results I showed report the variational bound, because the bound is at least good enough for model selection and is consistent in that sense. If we wanted to be very careful, we would also compute the true marginal likelihood by importance sampling under the model. There are different ways of doing that these days, and maybe the easiest is to switch to a different objective function: you can use a different variational objective, called the importance-weighted objective, or a more generalized variational objective, which allows you to use importance sampling; then you send the number of samples very large, and that gives you a better approximation of the true marginal likelihood.
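A minimal sketch (toy conjugate model, not from the talk) of the importance-weighted objective just mentioned: on a model where log p(x) is known exactly, the bound visibly tightens as the number of samples K grows, with K = 1 recovering the ordinary ELBO.

```python
# Importance-weighted bound:  log (1/K) sum_k p(x, z_k) / q(z_k | x).
# Toy model: p(z) = N(0,1), p(x|z) = N(z,1), so log p(x) = log N(x; 0, 2).
import numpy as np

rng = np.random.default_rng(5)
x = 1.5                                   # a single observed value
q_m, q_var = 0.0, 4.0                     # a deliberately crude q(z | x)

def log_norm(v, m, var):
    return -0.5 * ((v - m) ** 2 / var + np.log(2 * np.pi * var))

def iw_bound(K, n_rep=2000):
    vals = np.empty(n_rep)
    for r in range(n_rep):
        z = q_m + np.sqrt(q_var) * rng.standard_normal(K)
        # log importance weights: log p(z) + log p(x|z) - log q(z|x)
        log_w = log_norm(z, 0.0, 1.0) + log_norm(x, z, 1.0) - log_norm(z, q_m, q_var)
        vals[r] = np.logaddexp.reduce(log_w) - np.log(K)
    return vals.mean()

print("true log p(x) =", log_norm(x, 0.0, 2.0))
for K in (1, 5, 50, 500):
    print(f"K = {K:4d}: importance-weighted bound ~ {iw_bound(K):.4f}")
```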
Then, of course, there's the question of what you actually do with the model afterwards, which is why I showed a lot of samples and why we do a lot of inspection of the model — but Dave is probably going to add to that.

A: I think the idea of model criticism specific to a task is important, and there are lots of cool ways to do it. There's posterior predictive checking, which asks how well simulations from your model match the data, and I think in the future expanding this to inference too is important: am I capturing the correlations implied by my model in my approximation?

A: I don't think I was going to add much. How to evaluate probability models is really the question you're asking, and it's an issue that statisticians and machine learners have been discussing for thirty years. There's great work, for example in the 1970s, by Seymour Geisser on predictive sample reuse, looking at log predictive likelihoods of held-out data; that's my personal preferred way to evaluate models, to avoid things like comparing bounds and comparing approximate inference methods — it puts all methods on the same kind of scale. And Rajesh mentioned posterior predictive checks: we can go back to the picture with "criticize model" — that picture is from a beautiful paper by George Box called, something like, robustness and the statistics of science, from 1980. It's about how you check your model: if you condition on data you get a posterior p(z | x), and whether or not your model is right or wrong, you're going to condition on data and get that posterior. So Box said, at the highest level, if you want to understand whether your model is doing well, you need to step outside of its cage and ask yourself whether those posterior inferences are good. He did it in a certain way in 1980; nowadays, things like held-out likelihood and other measures of generalization error can take the place of that, and that's where cross-validation and held-out likelihood come in.

Q: I want to relate generative adversarial networks and variational inference. Do you think there is some underlying variational inference going on in generative adversarial networks when we train them?

A: There are a lot of different ways to think about this; this is the importance of interrogating your model. Let's see — where's the picture of a model? Here's one: z is the latent variable, and then you have x. The key part of this model is what you have specified: you've said there's some probability over the latent variable, and you've also made an assumption about what the probability of the data looks like — you've specified the likelihood function. Statistically, you would call these prescribed models, and in the class of prescribed models things like variational inference and maximum likelihood are applicable. When you go to the adversarial network setting you're in a different class of models, the models you call implicit models: they don't specify this likelihood function at the end; they only specify a data-generating mechanism. So the principle of inference you have to use is different. Here, the principle of inference is about estimating the marginal likelihood of the data and then using that likelihood to do other kinds of reasoning; in implicit models, and in GANs, the principle of inference is about comparison: can you compare two samples of data, as in two-sample hypothesis testing, and, given knowledge of how you think they're related, derive a loss function that helps you learn? So there are genuinely different principles of inference going on.

A: Shakir is too modest — he posted a very beautiful paper on the arXiv a few days ago that explains the intuition he just gave at the podium, so to answer your question I'd recommend looking at that paper, Shakir's paper on the arXiv.

Q: As you mentioned in the first part of the talk, optimizing the ELBO leads us to a local solution, and variational inference techniques are known to be sensitive to initialization. What if I want to be sure that I'm exploring different optima? Could you comment on whether there is a measure of coverage of the space of solutions? I suppose having stochastic gradient descent might help us cover more of the space, but how can we be sure we're really reaching different optima?

A: Actual assurance is pretty hard; if you want to know whether you're getting coverage, I don't think there's a great answer besides running a sampler — and even that isn't great, because the sampler will also get stuck. In terms of doing better and being less sensitive to initialization, there's a lot of work, like annealing and tempering, that helps you escape those optima, and there's newer work on regularizing the steps you take in stochastic optimization — the trust-region methods, where you only move within a certain region that is feasible given what you believe right now.
And I think you're right that stochastic methods in general seem to reach better local optima. We've seen that in many problems, more broadly than variational inference: when used with non-convex objective functions, stochastic optimization methods often get to better local optima, and people have some nice theories about why. Thank you.

Q: My question is essentially this: when you use something like a Bayesian neural network, because you want uncertainty estimates on your predictions, a lot of people do mean-field approximations on the posterior over the weights. That's fine when you want one really good sample of the weights, because you're doing mean field with the reverse Kullback–Leibler divergence; but the uncertainty estimates might be really bad because of the mean-field approximation. Are you aware of any work using richer, more complicated posteriors over the sets of weights for Bayesian neural nets?

A: Yes — I think this is one of the most interesting questions right now: how can you use these new approaches for uncertainty over parameters? The difficulty is that there are a lot of parameters: these global parameters are typically a million-dimensional, ten million-dimensional, and this is where I see the failure — learning the mean-field Gaussian is not good enough, because you basically just learn the mean and nothing else. Obviously you can do things like Monte Carlo sampling, but that doesn't really scale. Maybe the best way right now is what people have explored: building ensembles of these models and combining them, because you can parallelize that very easily and then combine the predictive probabilities. But from a probabilistic standpoint, doing a true Bayesian posterior analysis here is, I think, one of the really interesting questions, and I think we'll figure it out in the next few years.

Okay, I think we're all set for questions. If you have any other questions you can find our speakers during the conference, but let's give them one last round of applause.

[Applause]