Tamara Broderick: Variational Bayes and Beyond: Bayesian Inference for Big Data (ICML 2018 tutorial)

Captions
I mean, so, hello, we are going to get started. It is my great pleasure to introduce Tamara Broderick, who is going to talk to us about variational Bayes and how to apply it for big data. As most of you may know, she is an assistant professor at MIT, and she got her PhD from Berkeley. In principle she is not going to take a break during the talk, and we are going to postpone all questions to the end of the tutorial. You will have two mics, one that I'm going to move later on over there and another one over there, where you can queue to ask your questions at the end of the tutorial. Okay, help me welcome Tamara.

Okay, thank you. It's great to be here, and I'm excited to talk to you today about how we can perform Bayesian inference for our modern large data sets and large analysis problems. So let's start by talking about how Bayesian inference is being used in practice. You may have noticed that Bayesian inference has been used in some of the big scientific and related discoveries of the past few years. For instance, there was the TRAPPIST-1 exoplanet discovery, and in this case Bayesian inference was used to get point estimates and uncertainties about candidate observations and thereby decide whether these were really exoplanets. Bayesian inference was used in the gravitational waves discovery; in this case it was of interest to get point estimates and uncertainties about aspects of the black hole that was emitting gravitational waves, things like the mass and the spin, to know not just what do we know but how well do we know it. Tragically, in 2009 a passenger plane crashed over the Atlantic Ocean, and for years people were trying to find the wreckage of this plane and not having any success. Then, somewhat miraculously, they brought in Bayesian inference a few years into this search, and after they started searching again with this new statistical set of tools they were able to find the wreckage in about a week. If we look at transportation analysis: Dawn Woodard is the senior data scientist for maps at Uber, and she and many other people are looking at Bayesian inference for routing, for understanding what the times are for vehicles to get between different areas. This could be ambulance routing, this could be other types of vehicles, and in this case it's really helpful to know not just how long we expect a vehicle to take to get to a certain point but also how certain we are of that, what our uncertainty is about that timing, which is very important if we're doing some kind of planning. Rachael Meager is an assistant professor at the London School of Economics, and she's studying microcredit, these small loans to individuals in impoverished areas. A very important question is: is it actually helping? Is it helping bring people out of poverty, is it helping the situation? So there we want to know things like, well, how much is it helping, how much is it increasing business profits, and it's important to say not just how much it is increasing business profits but how certain we are of that value, what our uncertainty is about that value, especially again if we're going to make decisions based on this, like maybe some sort of philanthropic investment or something of that nature. Hamsa Balakrishnan and colleagues at MIT are interested in predicting how much fuel a plane is going to use based on publicly available information about that plane, and in this case again we want to know not just how much fuel that plane is going to use, but how certain we are, especially if we're going to make decisions based on that information.
In fishery management, Bayesian inference is very often used because we really care about risk management: overfishing is a major issue these days, it's possible to have population collapse in a fish population and then we don't have any fish anymore, and so when we're managing these fisheries we want to keep that type of risk in mind.

Okay, so in all of these cases there is a very similar set of analysis goals. We really care about not just what do we know but how well do we know it, so we want to get point estimates and we want to get uncertainties. Typically there are many unknowns in a given model, and so we want uncertainties that are coherent across those unknowns. Also, typically we're interested in pinpointing some particular aspect of this analysis and really understanding it, so we want interpretability in the sense that we can pinpoint: here's the quantity that we want to understand. We want to build up complex models of physical phenomena, or social, economic, or engineering phenomena, and the graphical models framework that Bayesian inference works in, the modularity of that framework, lets us build up these complex models. And finally, in all of these cases we see that we're working with experts: experts are working on these problems, they have a lot of domain knowledge that they can bring to bear and should bring to bear on these problems, and so it's nice that with Bayesian inference we can incorporate that kind of expert information.

Okay, so at this point we've gone over a lot of example applications where Bayesian inference is being used, but just to get a bit of a sense of scale of how often these methods are being used, notice that there are a lot of probabilistic programming languages these days, languages that let people use Bayesian inference in a relatively automated kind of way. Just one of them, among quite a number, is Stan. Stan works with Julia, it works with R, with Python, MATLAB, and so on. If we look just at R, and at just one download mirror, RStan is typically downloaded in a given month well over 10,000 times, and about 300,000 times, if not more, since 2015. So this is something that is being widely used in the sciences and engineering and the social sciences, as we said before, and we want these tools to be as useful as possible to people; we want them to be something that people can use, that they can rely on, that is easy to use. Now, unfortunately, a lot of existing methods are overly slow. You talk to people who are using Bayesian inference, and it can be taking days if not weeks for their approximate Bayesian inference methods to run, and so something that we really want to be able to provide to them is something that's much faster. For one, a lot of problems actually have a real-time component, but in general this is time that they could be spending on their actual research in the sciences or whatnot. Okay, so we want methods that are fast for these users; that's a big outstanding challenge here. And when I say that they're fast I mean two different things. Of course I mean computing time, I want these methods to run quickly, but I also mean user time: we don't want a user to have to sit down and tediously derive a bunch of equations and updates for their algorithm, or do a bunch of hyperparameter tuning at length. We want them to be able to focus on the biology problem, the chemistry problem, the economics problem, and not this type of problem. So we want something that's fast in user time.
Okay, well, there are plenty of algorithms that are fast in computing time and fast in user time that are just wrong; you could just return the number five if you didn't care whether it was right or not. So of course we also want it to be reliable, we want our algorithm to give something that we can actually trust at the end of the day. And so this is the challenge: we want something that's fast in computing and user time, that's easy to use, and that's reliable. Oh, and I'm just going to mention there's this fallacy that I think people fall into sometimes, this idea that we don't care about uncertainty because data sets are so large now. Sure, if you have one tiny little thing that you're trying to learn, and all of your data is informative about it, and you have millions of data points, that's true. But that's not typically the situation that we're in in these modern large-data problems. Typically we have a ton of data and we're trying to learn more as we get more data: more nuance about our data, more different types of things that are going on in our data. And so we still have uncertainties, and we still want to quantify those uncertainties.

Okay, so what do we do? How do we deal with this problem, how do we get fast Bayesian inference? Well, certainly something that people have increasingly been turning to over the past few years is variational Bayes. Why are they turning to variational Bayes? Well, it seems like it's addressing these modern problems, problems with a large amount of data and a large size of the problem, something like a large-dimensional problem, and in particular it seems like you can do this very quickly. So variational Bayes has been applied in a lot of areas. The canonical area where I think we see it being applied is text analysis: there's a ton of text data nowadays (even if you don't have access to proprietary text data, if you just look at what's online there's a ton of text data there; you could spend all day analyzing it), and it's a very large problem. There are many topics that we could pick up in this text data, we want to find the assignments of words to topics, we want to find all of these things, and variational Bayes has enjoyed a lot of success in this area. Now, variational Bayes isn't just used there. It's also used in things like discovering structure in networks or graphs, another type of data that's increasingly growing in popularity; maybe we want to find communities or some other type of structure. Variational Bayes is used in neuroscience and gene expression analysis and other areas of computational biology. So it's used in a lot of these important, cutting-edge areas where we want to do scientific analysis, and so it seems like it's worth studying what's going on with variational Bayes.

Okay, so in the rest of this talk we're going to first spend a little time recalling what the goal of Bayesian inference is, what we're trying to do with Bayesian inference, and in particular why we're going to approximate Bayesian inference: what's the goal there, what's our approximation goal. Along the way we'll start talking about what variational Bayes is, and what mean-field variational Bayes is, a particular type of variational Bayes that's very popular and very widely used in practice.
At that point we're going to focus on mean-field variational Bayes as a little bit of a microcosm: why use mean-field variational Bayes, why is it so popular and so widely used in practice? And then, can we trust mean-field variational Bayes, should we always be using it, is it some sort of panacea? And then finally, in the second part of the talk (so that's essentially the first part, and this is the second part), we'll ask where we can go from here. Probably unsurprisingly, any method that we have nowadays has challenges, has pitfalls; we'll go over those pitfalls and then we'll think about how we can address them.

So we're going to start by thinking again about Bayesian inference: what is Bayesian inference, and what are we trying to do with it? Okay, so with Bayesian inference, the first thing that we want to identify is our unknowns, our parameters. These are the things that we're really interested in, or that we don't know, about our problem. For instance, let's think about that microcredit example from before. Maybe we don't know how much microcredit is increasing business profit in each country that we look at; we don't know how much it's increasing business profit across all the countries; we don't know the base business profit in each country, or the base business profit across all the countries. There are plenty of other things that we don't know, and so we collect them all in this high-dimensional, or at the very least moderate-dimensional, parameter vector; let's call it theta. If we're taking a Bayesian approach, then we start off by saying that we know something, or maybe we don't know that much; either way, we express that knowledge or lack of knowledge in a probability distribution called the prior distribution over the parameter. Okay, now, something that we're going to deal with repeatedly in this talk is the fundamental limitation of our human eyesight: we can only see in two dimensions, and so it's hard to actually look at the high-dimensional distribution that we're dealing with. So here I have a cartoon of a prior, where we have a diffuse distribution over theta; you want to keep in mind that the type of distribution, the type of parameter, that we're actually dealing with is moderate if not high dimensional. Okay, so we have this cartoon of our prior; we've chosen our prior. Next we want to get our likelihood, or really, at the same time, we want to be choosing our prior and our likelihood. Our likelihood tells us how our data interact with our parameters. Here we're going to call our data y, and we're going to imagine that we have capital N points, y1, y2, y3, up to y capital N, and I'm going to abbreviate that collection of data as y1:N. Okay, so our likelihood, again, tells us how this data relates to our parameter theta, and together the prior and the likelihood make our model. This is something that has traditionally been called a generative model (I just want to point out that this is a different usage than you may often be seeing in modern machine learning), but this would often be called a model or a generative model. These two choices tell us how we could simulate our data. But of course we don't simulate our data; we actually observe our data.
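To fix notation for what follows (matching the talk's symbols, with y1:N abbreviating the full data set), the two model ingredients just described are:

$$\theta = \text{the unknowns}, \qquad p(\theta) = \text{prior}, \qquad p(y_{1:N} \mid \theta) = \text{likelihood}, \qquad y_{1:N} := (y_1, \dots, y_N).$$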
So what we do is notice that what we've established at this point is a marginal (the prior) and a conditional (the likelihood), and that just makes a joint distribution. Another conditional we could look at, coming out of this joint distribution, is the one where we condition theta on the y's, and that's called the posterior distribution. This is just a fact of probability that we've used, and this fact of probability is typically called Bayes' rule or Bayes' theorem. So what we have at the end of this is: what do we know about this parameter, this set of unknowns, now that we've observed the data? The typical cartoon we imagine is that we start off knowing relatively little about theta going into our problem, we apply Bayes' theorem after getting our data, and then we hope that we know more about theta; the way that we can express knowing more about theta is that the variance, the uncertainty, of this distribution has decreased.

Okay, so let's just summarize the steps of Bayesian inference. One, we build a model: we choose a likelihood and we choose a prior, and that's our model. Two, we compute the posterior: we apply Bayes' theorem. And at this point we recognize that it's going to be hard, especially if we have a high-dimensional parameter vector like theta, to actually plot that posterior, to really look at that posterior in any depth. So a typical thing that we might do at this point, step three, is to say: we've got to summarize this somehow, we've got to report this to people at the end of the day, and a typical summary might be something like a point estimate and an uncertainty. Now, there are a lot of different ways that we could report a point estimate and an uncertainty, but certainly a very popular method, one that has a lot of statistical and theoretical backing, is to report a posterior mean as our point estimate and posterior variances or covariances as our uncertainty. So this is what we imagine doing at the end of the day: making this report.

Okay, so now, something that's worth realizing is that each of these steps presents a challenge. We have to build this model first of all; we could have a whole tutorial about building models, but in this tutorial we're going to focus on the computational aspects. We're going to suppose that somebody has given us the model: a practitioner or somebody else has given us the prior and the likelihood, and we want to report back to them the mean and the covariance. This would be a typical perspective to take if, for instance, we were a probabilistic programming language. Okay, so I said that there was a computational challenge; where does it come in? It's in steps two and three. So let's take a moment to think about why steps two and three are difficult. Well, first of all, even though Bayes' theorem is just a fact of probability, that doesn't mean that we can get out a nice closed form for the posterior, some sort of distribution that's very familiar and easy for us to work with. In fact, in general, for any sort of moderately complex model, that won't be the case; we're not going to get that closed form. And now what we're left with is a moderate- to high-dimensional integration: if theta is high dimensional, then this is going to be a high-dimensional integration.
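In symbols, the two computational steps just described are Bayes' theorem plus the posterior summaries the talk says we want to report:

$$p(\theta \mid y_{1:N}) \;=\; \frac{p(\theta)\, p(y_{1:N}\mid\theta)}{p(y_{1:N})} \;\propto\; p(\theta)\, p(y_{1:N}\mid\theta),$$

and the report at the end of the day is the posterior mean and covariance, $\mathbb{E}[\theta \mid y_{1:N}]$ and $\operatorname{Cov}[\theta \mid y_{1:N}]$.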
Why do we have a high-dimensional integration? Well, remember, we're reporting a posterior mean or covariance, and to get a posterior mean means integrating theta over the posterior. Another way to look at this is that even to get the posterior we have a high-dimensional integration. In particular, let's look at that normalizing constant. Right now we just have "proportional to", but we can actually write out the normalizing constant here. The way we get this is we just say: this is a marginal times a conditional, here's another way to write a marginal times a conditional, and here's the normalizing constant. This is often given a name: it's called the evidence; the probability of the data is called the evidence. Well, how do we get this probability of the data, how do we get the evidence? Remember, the only thing that we start off with is the prior and the likelihood; that is the input, that is what we are given to start with, and so we have to get everything we do from that. In this case, this marginal is just the joint with theta integrated out, and the joint is just the prior times the likelihood. So that's where we have our high-dimensional integration over theta. That's a hard problem (integration in more than one or two dimensions is a hard problem), and that's fundamentally the issue that we're dealing with today: how can we do something like this integration?

Okay, so now let's start thinking about how we would actually deal with this; this is where approximate Bayesian inference comes in. Traditionally the gold standard for this has been Markov chain Monte Carlo. It's been called one of the ten most influential algorithms of the 20th century, it's extremely widely used, and in particular one of the reasons that it's widely used is that it's eventually accurate: if you run it long enough, we know that you're going to get the right answer for your integration. Of course, the problem is you might not have until eventually; if this takes a week, a month, the age of the universe, then that's going to be a problem for your analysis. And so increasingly people are saying: gosh, my method is running too long if I use Markov chain Monte Carlo; what else is out there to get this integration, this approximate Bayesian inference, to calculate these means and these covariances? One observation here is that something that has really sped up a lot of inference, a lot of machine learning procedures over time, is to use some form of optimization; how about we use some form of optimization here? So here's an idea, to take an optimization approach. Remember, the idea is that we say, okay, let's find some approximation to our posterior (the same thing we do with MCMC, we're finding some approximation), call it q star, and then we're going to use its mean and its variances as approximate means and variances for our exact posterior. Okay, but now we're going to say that we're going to find this q star with an optimization approach. So suppose there are a lot of distributions out there in the world, and there's a set of nice distributions. What do I mean by a nice distribution, what would make a distribution nice? Well, remember, we have a specific goal at the end of this: we want to report a mean and a covariance, and so if we can calculate a mean and a covariance for our distribution, that seems to make it pretty nice to us. So let's assume for the moment that we have this vague definition of nice: distributions for which we can calculate a mean and a covariance.
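Written out, the normalizing constant (the evidence) is exactly the high-dimensional integral in question:

$$p(y_{1:N}) \;=\; \int p(\theta, y_{1:N})\, d\theta \;=\; \int p(\theta)\, p(y_{1:N}\mid\theta)\, d\theta,$$

so that $p(\theta \mid y_{1:N}) = p(\theta)\,p(y_{1:N}\mid\theta)/p(y_{1:N})$; and the posterior mean itself is another such integral, $\mathbb{E}[\theta \mid y_{1:N}] = \int \theta\, p(\theta \mid y_{1:N})\, d\theta$.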
Of course, the whole problem, the whole reason that we're here, is that our exact posterior is not a nice distribution: we cannot easily calculate means and covariances for it, and that's why we're in a bit of a pickle. So we might say to ourselves: maybe the way to deal with this is to approximate our exact posterior, which isn't a nice distribution, with the closest of the distributions that are nice. And now we have something that's starting to look like an optimization problem: we want to find q star, out of all of the nice distributions (let's call them capital Q), such that it is the least far, where "farness" is represented by this f. We want to find the closest of the nice distributions. Now, there's a reason that I'm avoiding the word distance here, because distance has a very particular meaning: distance means that f is positive definite and symmetric, and actually we're not going to make that assumption. In fact, if we use variational Bayes, f is not symmetric: it's the Kullback-Leibler divergence in a particular direction. Okay, so that's an interesting choice; we're choosing a divergence rather than a distance, we're not choosing a metric here, so we're going to spend a little time thinking about why we would do that. There are plenty of actual distances between probability distributions: we could have chosen Wasserstein, we could have chosen total variation, we could have chosen all kinds of other things, and for some reason we went and chose the Kullback-Leibler divergence. So let's think about that for a bit.

First, at a very high level, why might we have done this? Well, one reason is that it seems like people are using it a lot, so maybe you should understand it just from that perspective. It seems to enjoy a reasonable amount of practical success: we seem to get point estimates and predictions that are pretty decent, and it's certainly very fast. One piece of work that happens to be ours (this is not the only piece of work demonstrating that it's fast) shows that we can make it streaming, we can make it distributed, we can run on millions of Wikipedia articles with 32 cores in about an hour, and this was back in 2013; there's been a lot of other work since then, and we can be even faster. This is a very fast method, and so that seems very promising, because that was the whole point, right? We found that our existing methods were too slow, we really wanted to speed things up, and so we asked ourselves what we could do to get faster. Okay, so that's at least a little bit of motivation to keep thinking about this, but let's think a little more deeply: why the KL divergence? Why can't we do this with other distances, or even other divergences? In order to start answering this question, let's actually write out the KL divergence; remember, it's not symmetric, so we're writing it out in the particular direction used by variational Bayes. The KL divergence is the integral, over our approximating distribution, of the log of our approximation divided by our exact posterior. Okay, now I'm going to ask you a very real question that I'd like you to think about to yourself for just a second. Suppose that I had a perfect optimization algorithm: if you give it the form of the optimization problem, it will return to you the exact optimizer. And I tell you: here's my KL divergence problem, tell me the optimal q. Can you think about why this is not practical at this point? Just think to yourself for a second.
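Written out, the "closest nice distribution" problem, with the variational Bayes choice of farness, is:

$$q^{*} \;=\; \operatorname*{arg\,min}_{q \in Q}\; \mathrm{KL}\big(q(\cdot) \,\big\|\, p(\cdot \mid y_{1:N})\big),
\qquad
\mathrm{KL}\big(q \,\|\, p(\cdot \mid y_{1:N})\big) \;=\; \int q(\theta)\, \log \frac{q(\theta)}{p(\theta \mid y_{1:N})}\, d\theta,$$

where $Q$ is the (still to be specified) set of nice distributions.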
Okay, so one thing that you may have noticed is that this requires knowing the exact posterior, and that's kind of a problem, because the whole point was that we don't know it, right? If we want to optimize the KL divergence to the exact posterior, the optimal value (because this is a divergence, it's positive definite) is in fact the exact posterior, but we don't know that. And in fact this is going to be a problem not just with the KL divergence; it's going to be a problem with any distance, any divergence we might choose. If we're trying to find the closest thing to the exact posterior, and we don't know the exact posterior, how will we ever have any hope? This seems like a chicken-and-egg kind of problem. So, in order to think about how we could possibly get around this, let's do a little algebraic manipulation. In particular, what I'm doing in this line is just using the fact that the posterior is a conditional, so we can write it as the joint divided by the marginal: I'm writing the posterior as the joint, p of theta and y, divided by the marginal, p of y. Now we're going to use some logarithm facts. Remember that log of a times b is log of a plus log of b, so we're going to get an integral over q of log p of y; but q is a distribution, so that integral is going to be one, and we're just going to be left with log p of y. Then, to get this other term over here, remember that log of 1 over a is just negative log of a. Okay, so with those facts we've written our optimization objective in the following form, and now let's focus on the two terms that we're summing in this objective. The first term is log of p of y. Remember, we are minimizing the KL divergence over q, and this term does not contain q, so if we're trying to find the q that minimizes the KL divergence we can just throw this term out, because it affects nothing. Okay, let's look at the second term, the thing that we're subtracting. Well, this is pretty nice: this second term contains only things that we know. The joint, p of theta and y, is just the prior times the likelihood; those are exactly the inputs that we put into this problem. So sure, we might still have the issue of how best to optimize this, but things are looking at least like they could be optimized. This has a hope, in a way that just having a general distance here seemed possibly hopeless. Okay, so this thing that we actually can optimize, instead of the KL, is so important that it's given a name: it's called the evidence lower bound, or the ELBO, the ELBO just being an abbreviation for evidence lower bound.

Okay, so something that I think is pretty useful if you really want to learn an area is to do some exercises, and I'm going to suggest a few along the way (I'll also suggest some reading later on). The first exercise I'll suggest is to actually show that the KL is greater than or equal to zero; in fact you can go further than that and show that it's positive definite: the KL is always greater than or equal to zero, and it's equal to zero exactly when the inputs are the same. If you have any trouble with this, just look at major textbooks in this area, like Bishop's Pattern Recognition and Machine Learning, but it's worth spending a little time at least trying to do it.
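The manipulation just described, written out (using $p(\theta \mid y) = p(\theta, y)/p(y)$ and $\int q(\theta)\, d\theta = 1$, and abbreviating $y = y_{1:N}$):

$$\mathrm{KL}\big(q \,\|\, p(\cdot \mid y)\big)
\;=\; \int q(\theta)\, \log \frac{q(\theta)\, p(y)}{p(\theta, y)}\, d\theta
\;=\; \log p(y) \;-\; \underbrace{\int q(\theta)\, \log \frac{p(\theta, y)}{q(\theta)}\, d\theta}_{\mathrm{ELBO}(q)}.$$

Since $\log p(y)$ does not depend on $q$, minimizing the KL over $q$ is the same as maximizing the ELBO, and the ELBO involves only $p(\theta, y) = p(\theta)\, p(y_{1:N} \mid \theta)$, i.e., only the prior and likelihood we were given.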
Now, once we know that the KL is non-negative, we can use the relationship we just described to show, and this should be immediate, that the log of the evidence is greater than or equal to the ELBO. That's just a way to explain where the ELBO name came from: it's clearly a lower bound on the log of the evidence. But the really important point here is that minimizing the KL, finding q star to be the minimizer of the KL, is exactly equivalent to finding q star to be the maximizer of the ELBO. We didn't make any approximations; this is essentially the exact same problem, at least in terms of the q star that we get out in the end. Okay, so now let's ask ourselves again: why KL, and why KL in this direction? Well, we had this super easy derivation where we were able to get rid of the thing that we hated, this normalizing constant, the thing that was really slowing us down, and it really depended on the form of the KL. That's not to say you couldn't do something else with another type of divergence; certainly people look at other divergences, maybe you could bound it, maybe you could get some approximation. But this is very convenient: it means that we can actually minimize the KL divergence exactly; at least up to this point we haven't had to make an approximation. And why KL in this direction? You can't pull this trick in the other direction: in the other direction the integral is over the exact posterior, which we don't know. So it's really this trick, this ease of use, that led us to choose KL.

Okay, so let's think about where we are at this point. We have this sort of not fully formed optimization problem: we said we wanted to find the closest of the nice distributions. So far we have defined a notion of close (we're trying to minimize the KL divergence), but in order for this to be a well-defined optimization problem we still have to say what it means to be a nice distribution. So let's think back: what was the goal of nice distributions, what were we trying to accomplish with them? We were trying to report means and variances. So what types of distributions let you report means and variances? Let's think about how we're going to choose these nice Q's. Okay, well, I think there are two answers that immediately come to mind when you ask what types of distributions let you report means and variances relatively easily. The first one is low-dimensional distributions: it's really the moderate- to high-dimensional distributions where we have trouble with integration; low-dimensional distributions aren't much of a problem. And so this is a way to motivate one choice that we might make for the nice distributions. When we make this choice, when we say that our Q, our nice distributions, factorize in this way, we say that we're doing not just variational Bayes anymore but mean-field variational Bayes. Okay, so what's going on here? We say that our parameter vector theta, which is maybe very high dimensional, breaks down into components: it has components theta 1, theta 2, theta 3, up to theta capital J. They don't have to be one-dimensional, but we imagine that they're smaller and easier to deal with, and we assume moreover that our approximating distribution q of theta factorizes over these theta 1 up to theta J. That is the mean-field variational Bayes assumption.
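In symbols, the mean-field family just described is:

$$Q_{\mathrm{MF}} \;=\; \Big\{\, q :\; q(\theta) = \prod_{j=1}^{J} q_j(\theta_j) \,\Big\}, \qquad \theta = (\theta_1, \dots, \theta_J),$$

with each block $\theta_j$ typically low dimensional (and, as discussed next, each $q_j$ often taken to be an exponential-family distribution).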
Now, I said that there are two answers to what makes a distribution nice, that is, one for which you can calculate means and variances, and I think another perfectly valid answer is exponential families. Exponential families are great: when you have exponential families you can do whatever you want, especially if what you want is reporting means and variances; that is exactly the thing that you can do, and quite easily. And in fact, often when people say that they're doing mean-field variational Bayes they mean the combination of these assumptions: they mean that we're going to assume that our distributions factorize, and that they factorize into convenient, familiar exponential-family distributions.

Okay, now, something I want to emphasize very strongly here is that this is not a modeling assumption. We went into this talk saying that we imagine somebody has come to us and given us a prior and a likelihood, that is our model, and we want an automatic way to report back to them a posterior mean and a posterior covariance for that choice, and we would not like to limit them in that choice, at least as much as possible. And this would be an extremely limiting assumption in actual models; it would basically restrict us to models that are super boring and that nobody uses. We do not want to make the assumption in our models that there's some independence between all of the parameters, and we're not doing that. We are only assuming that our approximation has this factorization. Now, we can still ask whether that's a good approximation, and in fact we will; we're going to spend a lot of time asking that question. But we are not making an assumption about the model itself: there is an exact posterior, and we are trying to approximate it. Okay, now, at this point you might be thinking to yourselves: this is still a really weird approximation, right? We are just assuming that everything is axis-aligned; is that ever going to work out? Again, we're going to spend some time thinking about that question, but just to assuage your concerns for a moment, here is a very simple example. Again, we can only see in two dimensions, so if I want to show you a distribution it's got to be in two dimensions. What we're seeing here is a case where we're looking at an exact posterior, with the contours of the exact posterior in green; it's a two-dimensional distribution. We can apply mean-field variational Bayes to this distribution, and the contours of the mean-field variational Bayes approximation are in red. I think the thing that's worthwhile about this picture is: hey, this doesn't look totally weird. It looks like the mean-field variational Bayes approximation is actually decent: the means probably aren't too far off from the actual posterior means, and the variances, the uncertainties, probably aren't too far off from the actual posterior uncertainties. So maybe this is something that could work; we'll keep asking that question.

Okay, so now we finally have a well-defined optimization problem: we're trying to find the closest of the nice distributions, where our notion of closest is KL and our notion of nice is mean-field variational Bayes. How do we solve it? In some sense, any way you want. At this point, once you have an optimization problem, you can use the latest optimization technology; you're not limited to anything in particular. We're going to talk about some methods to solve it, but not in a lot of depth; it's more that I want to say you could use whatever technology you want. What we really want to examine is the optimization problem that we've set up.
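Putting the pieces together, the well-defined problem is:

$$q^{*} \;=\; \operatorname*{arg\,min}_{q \in Q_{\mathrm{MF}}} \mathrm{KL}\big(q(\cdot) \,\big\|\, p(\cdot \mid y_{1:N})\big)
\;=\; \operatorname*{arg\,max}_{q \in Q_{\mathrm{MF}}} \mathrm{ELBO}(q),$$

and the report at the end of the day is the mean and covariance of $q^{*}$, used as approximations to the exact posterior mean and covariance.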
One way that you could solve it, a very classic way but certainly not the best way on its own, is coordinate descent: you just go through each one of the q's in turn and optimize it. Again, this isn't the best thing that you could do, it's just an option; there are many options, it's an optimization problem.

Okay, so let's review where we are at this point. We decided that we were going to do approximate Bayesian inference, and then we made a bunch of choices, and it's worth being cognizant, being really clear, about what those choices were, examining our assumptions, examining those choices. The first choice that we made is a very subtle one: we chose to approximate our posterior with another distribution. If we're just going to report posterior means and covariances, technically speaking, at least a priori, we didn't need to do this; we could have found some other way to get at those values directly. But this is the approach that we're taking here. Okay, the next choice was to take a certain style of optimization approach. We said: the way that we're going to find this q star that approximates our exact posterior is by minimizing some notion of divergence over nice distributions. At this point we hadn't said exactly what those are, but it's still a choice. We could have done Markov chain Monte Carlo, which doesn't fit into this framework; in fact there are other things that don't fit into this framework. If you're familiar with the Laplace method, it doesn't fit into this framework: we could have just found the maximum a posteriori estimate, asked what value of our parameters maximizes the posterior, and done a local perturbation around that to get a Gaussian approximation; that's the Laplace idea. So there are other methods of approximation we could have used; this was a choice. Then we said: let's use the KL divergence as our notion of divergence. We didn't have to do that; we could totally use other divergences. In fact, sometimes when you're reading papers you might find that people refer to the general optimization problem as variational inference; as long as you're clear about what you're doing in your paper, that's what matters. But here we decided to use the KL divergence. Okay, then we said: what are the nice distributions we can optimize over? We made a choice: we said we were going to use mean-field variational Bayes, this factorizing assumption, and typically also an exponential-family assumption. That was a choice; we could have considered other families to optimize over, one in particular being something like all of the normal distributions that are the size of our full parameter space. And finally, once we've made all these choices, we can choose how to optimize this problem, and there are many different choices we could make there. One is coordinate descent, but it's certainly not the only one. Something that people have noticed is that coordinate descent is actually still pretty slow, and an observation people have made in machine learning more generally is that stochastic gradient descent really seems to speed things up.
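For the coordinate-descent option just mentioned (equivalently, coordinate ascent on the ELBO), the standard update for one factor, holding the others fixed, has the well-known form (see, e.g., Bishop's textbook):

$$q_j^{*}(\theta_j) \;\propto\; \exp\Big( \mathbb{E}_{q_{-j}}\big[ \log p(\theta, y_{1:N}) \big] \Big),$$

where the expectation is taken over all the other factors $q_i$, $i \neq j$, held at their current values; cycling through $j = 1, \dots, J$ gives the coordinate-ascent algorithm.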
If you use stochastic gradient descent in a somewhat unusual way (you can't apply it directly in this case, it's a little bit unusual, but you can apply it), you get something called stochastic variational inference. Another issue, again, is not just computing time but user time. If we just do mean-field variational Bayes coordinate descent, it turns out we have to hand-derive an algorithm, much like Gibbs sampling: we have to figure out exactly what the steps are, and that can be a big pain for a user. So another option is to just assume that every marginal is Gaussian, that all the distributions are Gaussian, and then it turns out we can do this automatically with the magic of automatic differentiation; that's automatic differentiation variational inference. Okay, there are a lot of other choices we could make for the optimization here, but all of them, when applied to the mean-field variational Bayes problem, are still mean-field variational Bayes. So it's worth our time to understand not just the optimization procedures we're using, but what problem we're even trying to solve, and whether it's doing what we want. That's what we're going to spend some time on now.

Okay, so we talked about Bayes and approximate Bayes, and we defined variational Bayes and mean-field variational Bayes. Now let's think a little bit about whether this is something we should be using: does it ever work? It is a pretty restrictive assumption, this factorization assumption, and so we want to know whether it's ever going to give us something actually useful, especially in complex models. We're going to start off with a very simple model. Midges, you might be familiar, are small winged insects, and in general when biologists go out and study different species or genera, they might want to get a sense of what's going on with individuals within a species. Scientists have actually measured a bunch of midge wing lengths, and maybe we want to learn more about those lengths. So a model we might have is that these midge wing lengths are normally distributed with some mean mu and some variance sigma squared. We might think this is a reasonable model, maybe because we think there are a bunch of genetic and environmental factors that add up in some roughly linear way, and we get a normal distribution. So in this case we have a very simple set of parameters, mu and sigma squared; that's our theta in this problem. We have part of our model, the likelihood; remember there are two parts to our model, the likelihood and the prior, so to complete the model we have to add a prior. A particular conjugate prior that we could choose is an inverse gamma prior for sigma squared, with mu given sigma squared being normal. This is conjugate: what I mean by that is that if I have this prior and this likelihood, I can calculate the posterior exactly, and then I know the posterior and don't have to do any approximation. And so now a real question for you, that I'd like you to think about for just a moment: isn't this whole tutorial about approximate Bayesian inference? Why are we looking at a super simple model where we can calculate everything exactly? What's the point of that? A real question; think about it for a second. Okay, so maybe a deeper question that we should all ask ourselves is this: we're approximating a posterior mean and a posterior variance, and we don't know the posterior. That's the setup of our problem; the setup of our problem is that we don't know these values.
How will we know our approximation method works? That's a tough question, because the whole point is that we don't know the answers, we don't know the ground truth. So something that we could do is compare against an example where we do know the ground truth, and that's probably worth doing; that's why we're looking at this super simple problem that's not super realistic. But of course there's a problem with that: if we only look at that type of problem, we might be very biased, right? We might have a method that only works on super simple problems. So something that we're really going to be grappling with throughout this talk is: how do we evaluate, how do we know that we're doing the right thing? Keep it in the back of your mind as we go forward.

Okay, so something else that you should be very aware of, and definitely take the time to realize, is that the mean-field assumption does not hold in this model; it is not a modeling assumption. Here is already a very first, very simple case where you can immediately check that the posterior does not factorize into some function that depends only on mu times some function that depends only on sigma squared. That's not the case; you can just check. There is no mean-field assumption in the model; it's only in the approximation. Okay, so we are going to make this mean-field assumption in our approximation, and we're going to choose to do coordinate descent. That's not super important here (again, whatever optimization algorithm you want to use, you can use), but just for illustration's sake we're going to do coordinate descent. This is something you can derive; you can also check against Bishop. I will just write the results for you to check against, to give you a sense of what this is going to look like. We're going to have some distribution in mu that's defined by some parameters. Now, it's unfortunate, this overloading of the term parameters: these aren't parameters of the model in the sense we were using before, they aren't unknowns of the model, they just describe the variational distribution, so we might call them variational parameters to distinguish them. So we have some variational parameters for mu, and some variational parameters for the distribution over sigma squared, and the way we're going to do this is to iterate between finding the distribution over mu, then finding the distribution over sigma squared given the distribution over mu, and then back and forth and back and forth. That's one way that we could solve this problem, and we're just going to look at it for illustration's sake. Okay, so again, a big thing to note here is that we are able to plot the exact posterior. That's not something we expect to be able to do in any real problem; we can only do it here because we have this unnaturally simple problem. But we can do it, so here we have the contours of the exact posterior (it's a distribution over mu and sigma squared inverse, the precision), and we have to initialize our approximation somewhere; our approximation is of course itself a distribution. Now, initialization is a tough problem, and one we should think about, but we're not going to think about it too hard right now.
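To make the back-and-forth concrete, here is a minimal sketch of those coordinate-ascent updates in Python/NumPy for this conjugate Normal model. The parameterization (inverse-gamma prior on sigma squared, normal prior on mu given sigma squared) and update forms follow the standard textbook treatment (e.g., Bishop, Section 10.1.3); the data values and hyperparameters below are my own illustration, not the talk's exact numbers.

```python
import numpy as np

# Toy data standing in for the midge wing lengths (the real values are not in the transcript).
y = np.array([1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08])
N = len(y)

# Hypothetical prior hyperparameters: sigma^2 ~ InvGamma(a0, b0), mu | sigma^2 ~ N(mu0, sigma^2 / lam0)
mu0, lam0, a0, b0 = 1.9, 1.0, 2.0, 0.1

# Variational factors: q(mu) = N(m, s2), q(sigma^2) = InvGamma(aN, bN)
m, s2 = y.mean(), 1.0           # initialization of q(mu)
aN = a0 + (N + 1) / 2.0         # this shape parameter is fixed by the conjugate structure
bN = b0 + 1.0                   # arbitrary starting point for the scale parameter

for _ in range(50):             # coordinate ascent: update q(mu), then q(sigma^2), repeat
    E_prec = aN / bN            # E_q[1 / sigma^2] under the current inverse-gamma factor
    # Update q(mu): normal with this mean and variance
    m = (lam0 * mu0 + N * y.mean()) / (lam0 + N)
    s2 = 1.0 / ((lam0 + N) * E_prec)
    # Update q(sigma^2): inverse gamma, using expectations under the current q(mu)
    E_sq_resid = np.sum((y - m) ** 2) + N * s2      # E_q[ sum_n (y_n - mu)^2 ]
    E_sq_prior = (m - mu0) ** 2 + s2                # E_q[ (mu - mu0)^2 ]
    bN = b0 + 0.5 * (E_sq_resid + lam0 * E_sq_prior)

# Report the summaries the talk cares about: approximate posterior means and uncertainties
print("q(mu):      mean %.3f, sd %.4f" % (m, np.sqrt(s2)))
print("q(sigma^2): mean %.4f" % (bN / (aN - 1)))
```

In this conjugate case the same quantities are available exactly, so the only point of running something like this is to watch the coordinate updates in action and compare against the exact answer.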
Okay, so now the question for you to think about on your own for just a moment is: what is going to happen in the first step of this algorithm, supposing we're doing coordinate descent and the first step is that we update the distribution in mu? If we update the distribution in mu as our first step, what is going to change about this plot? Think about that for a second. Okay, so if we're updating the distribution in mu, things will only change horizontally; nothing is going to change vertically, because the distribution in sigma squared has not changed, only the distribution in mu has changed. On the next step we update the distribution in sigma squared, and this is what we get: now we only change things vertically. It's also worth noting that when we change things horizontally, the approximate distribution gets closer to the exact distribution, as desired, and the same when we change things vertically: it gets closer to the exact distribution. We can iterate back and forth and back and forth, and eventually we have our full mean-field variational Bayes approximation to the exact posterior; that's in red. In this case we happened to do coordinate descent, but I think what this illustrates in general is that whatever optimization procedure we use, what's happening is that we're changing our approximating distribution, hopefully to look more and more like our exact distribution, and that sort of visualization is very general. Okay, so this looks pretty good. As we said before, the means don't look too crazy, they look pretty similar, and the variances look pretty similar between the exact and approximating distributions. So let's go on to a more realistic example to get a sense of whether that's true in general.

Okay, so now let's look at this microcredit experiment. Remember, microcredit means these small loans to individuals in impoverished areas, and we want to know: are they effective? Rachael Meager has been doing some really interesting work on this; if you really care about microcredit, I encourage you to check out her papers. They're very thoughtful, they're very interesting, they really go into depth in the modeling. We're going to be looking at a very simplified version of the interesting models that she uses. Okay, so there were randomized controlled trials in seven different countries, ranging from Mexico to Ethiopia, and at each one of these sites (each site is in a different country) a randomized controlled trial took place among 900 to 17,000 businesses, depending on the site. A business was assigned microcredit or it was not, and we want to get at whether microcredit was effective. Perhaps a quantitative way to get at that is to ask how much it increased business profit. A priori we don't know: maybe it increased business profit, maybe it didn't, maybe it actually hurt; we want to find out. To that end, we imagine that we (well, in fact we did, and by we I mean other people) collected the actual data on the ground of what the business profits are. So y sub k,n is the profit of the nth business at the kth site, which we can model independently as normal around some base business profit, call it mu k; that's the base business profit at the kth site before microcredit happens. Then we have a term that has to do with microcredit: T k,n is an indicator function that says whether or not this business received microcredit, so it's one if the business received microcredit and zero if it did not.
In particular, if it's zero we just don't get the second term at all, and if it's one we do get the second term, and we get this microcredit effect tau: how much is microcredit increasing business profit at this site? This is really the parameter of interest for each site; this is what we want to know. Is tau positive, how positive is it, is it zero, is it negative, what is it? Okay, we also have some sigma squared here that we don't know for any particular site, and we want to learn that as well. So at this point we have a bunch of unknowns (the mu's, the tau's, the sigma squareds) and we want to learn them, and so to complete our model we're going to put some priors on these values. In particular, the first thing we're going to do is put a prior over the mu k's and the tau k's: they're going to have some shared mean mu, sort of like the global base business profit, and some shared global microcredit effect tau. If anything, this is really the parameter of interest globally: what can we say about the effect of microcredit, how much is it increasing business profit? Now, this is another reason that Bayesian inference is useful for a problem like this: we want to be able to share power across different countries, across different experiments, in a way that respects that these are all different but still allows us to learn something. We don't want to just analyze all the data smashed together, which doesn't respect that Mexico might be different from Ethiopia; we also don't want to analyze the data from Mexico and then throw it out, analyze the data from Ethiopia and then throw it out, and so on, because then we're not learning anything across the different countries. We want to perform some sort of partial pooling, and that's what this kind of prior lets us do. Okay, so we put on some other priors: a conjugate prior for sigma squared, the inverse gamma prior; a normal prior on mu and tau; and C is a covariance matrix, for which the conjugate prior would be inverse Wishart, but another choice that happens to have some nice properties is the separation strategy with an LKJ prior, so we can put that here. Okay, the point is, what we have here is a model. It's a way more complex model than the one we saw last time; it's a much more realistic model, the kind of thing we might deal with in practice, and it's got a lot of parameters: the mu k's and the tau k's, mu and tau without subscripts, the sigma squareds, C. We definitely can't solve this in closed form now; we're going to need some kind of approximation, and we have to ask ourselves what that is going to look like.
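Collecting the pieces, the simplified hierarchical model described above looks roughly like this (my transcription of the verbal description; the exact prior families and hyperparameters on the talk's slides may differ):

$$y_{kn} \;\sim\; \mathcal{N}\!\big(\mu_k + \tau_k\, T_{kn},\; \sigma_k^2\big),
\qquad
\begin{pmatrix} \mu_k \\ \tau_k \end{pmatrix} \;\sim\; \mathcal{N}\!\left( \begin{pmatrix} \mu \\ \tau \end{pmatrix},\; C \right),$$

with an inverse-gamma prior on each $\sigma_k^2$, normal priors on the global $\mu$ and $\tau$, and a prior on the covariance $C$ (e.g., a separation/LKJ-style prior rather than the conjugate inverse Wishart).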
In particular, we want to ask ourselves the question we asked earlier: how do we know that it's working, how do we know that our approximation is working? Well, here's another idea for testing that: maybe we think that we trust Markov chain Monte Carlo, and so we can test against that. Of course, eventually we'd like to not have to run Markov chain Monte Carlo; we'd like to have something very fast that we can run on its own. Even in this case, if we use a state-of-the-art Markov chain Monte Carlo probabilistic programming language and we're very careful about how we set up the problem, it takes on the order of 45 minutes to get 2,500 draws, whereas the mean-field variational Bayes optimization takes far less than a minute, and so we'd like to be able to just run mean-field variational Bayes. And in this case things look pretty good. What we're seeing here is that for each parameter of the model we have a dot: the estimate of the mean of that parameter from Markov chain Monte Carlo is the horizontal value, and the estimate from mean-field variational Bayes is the vertical value; the line is x equals y. This is agreeing very strongly; it looks very good.

Okay, another thing that we can look at is Criteo, a very large data set on online advertising; to the machine learning community it's obfuscated, but it's open. Something we might be interested in in this case is conversion prediction: after people click on an ad, are they going to buy something, are they going to join an email list, what are they going to do? We might also be interested in how predictive of conversion different features, different covariates of the model, different aspects of this user are, and how well we know that. This is the kind of problem that comes up in a lot of areas, not just online advertising. In this case we can use a logistic generalized linear mixed model. We're going to reduce the size of the data significantly in order to compare to Markov chain Monte Carlo (so again that's limiting us right now), taking a subset of the data for the comparison. A first point that I want to make before we go on: another really fast optimization approach, which I mentioned earlier, is MAP, the maximum a posteriori estimate. Why don't I just take the prior times the likelihood and maximize it? That's the same as maximizing the posterior, and that's super fast, so why don't we just do that to get our mean estimates? Well, let's suppose that we care about mean estimates, because that's the approach we're taking in this talk, and let's suppose we run MAP. Okay, well, that just takes 12 seconds; that's extremely fast, so why haven't we been running MAP this whole time? Okay, well, in this particular case we have a set of global parameters, and in the first plot we're plotting all the global parameters except for one that's called tau. Again we have this x equals y line: we plot the mean estimate from Markov chain Monte Carlo versus the mean estimate from MAP, and a couple of these are a little off. The reason that tau is plotted separately is that if we draw the x equals y line for tau and we plot tau, it's so many orders of magnitude off that you can't even tell that the x equals y line is anything but a horizontal line. That's pretty bad. And if we look at the local parameters in this model, use MAP to get those, and compare to the Markov chain Monte Carlo mean estimates, we again see that we're estimating them all to be the same value, even though they're not. So if we care about these means, MAP is just completely off. And I guess we shouldn't be surprised by that, right? If we're just taking a maximum of a distribution, all you have to do is change the distribution a little bit somewhere and that easily changes the maximum; it's not a very robust thing to do. By contrast, in this case mean-field variational Bayes takes just 57 seconds, which is pretty fast, and we get great estimates of the mean. Here we're looking at all the global parameters (now they're on the same plot, because they can be), we look at this x equals y line, we compare the Markov chain Monte Carlo estimate versus the mean-field variational Bayes estimate, and it's looking really great; same thing with the local parameters. And the reason we'd like to avoid running Markov chain Monte Carlo at all is that, even for this very small subset of the total problem, it takes around six hours.
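For reference, the MAP estimate compared above is just the maximizer of the unnormalized posterior, a single point with no accompanying uncertainty:

$$\hat{\theta}_{\mathrm{MAP}} \;=\; \operatorname*{arg\,max}_{\theta}\; \big[\, \log p(\theta) + \log p(y_{1:N} \mid \theta) \,\big].$$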
The reason we'd like to avoid running Markov chain Monte Carlo is that, even for this very small subset of the total problem, it takes around six hours. So that looks pretty good. It seems like we ran a bunch of experiments on mean-field variational Bayes to test whether the means and covariances we reported were any good, and we might say to ourselves: we ran three experiments, which is about the number of experiments that might be in a paper; should that be enough? Are we ready to use mean-field variational Bayes on every one of our problems going forward? Of course, something you might have noticed is that I just said we tested the means and covariances, and that's not true: we only showed the means. We also only did three experiments. Is that enough to know that things are working in general? Let's dig a little deeper into these questions. First of all, what about uncertainty? Remember, we're minimizing the KL divergence under this mean-field variational Bayes assumption. Suppose the exact posterior is very simple, a multivariate normal; we'll plot a bivariate normal because we can only see in two dimensions. That seems like a very simple situation, but it's actually pretty realistic. Suppose, for instance, that you have conjugate linear regression: a linear regression problem with Gaussian noise and a Gaussian prior to go with it. Then you can calculate the posterior exactly, and it's Gaussian. In that case you don't need any approximation at all, but if you changed either of those assumptions slightly, you might still get something that looks quite like this. Another big reason to think about Gaussians is the Bayesian central limit theorem, which tells us that as we get more data our posterior looks more and more like a Gaussian. So Gaussians are at least a somewhat realistic situation. And here, if we look at this exact posterior, a simple correlated bivariate Gaussian, and we look at the mean-field variational Bayes approximation, unfortunately it's pretty badly underestimating the variance. To see why, look at the KL divergence: where the exact posterior p(theta given x) is small, if q isn't also small, we get something very large, and we won't choose that q when we minimize the KL. So, in some cartoonish sense, we fit our approximating posterior inside our exact posterior, and that's what's happening here: when we fit an axis-aligned distribution inside this exact posterior, we underestimate the variance.
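Here's a small check of that cartoon you can run yourself: for a Gaussian target with precision matrix Lambda, the optimal fully factorized Gaussian matches the means but has marginal variances 1/Lambda_ii (a standard textbook derivation), which are never larger than the true marginal variances. This is my own illustration of the point, not code from the talk.

```python
# Mean-field VB fit to a correlated Gaussian: means are right, variances shrink.
import numpy as np

for rho in [0.0, 0.5, 0.9, 0.99]:
    Sigma = np.array([[1.0, rho], [rho, 1.0]])   # exact posterior covariance
    Lambda = np.linalg.inv(Sigma)                # precision matrix
    true_sd = np.sqrt(np.diag(Sigma))
    mf_sd = np.sqrt(1.0 / np.diag(Lambda))       # optimal mean-field marginal std devs
    print(f"rho={rho:4.2f}  true sd={true_sd[0]:.2f}  mean-field sd={mf_sd[0]:.2f}")
# As rho -> 1 the mean-field standard deviation sqrt(1 - rho^2) -> 0, so the
# underestimation gets arbitrarily bad, exactly as described in the talk.
```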
This really matters if we care about uncertainty. Think about how we might use uncertainties in practice. Say you need an ambulance to the hospital, and you could take route A, which takes about nine minutes, or route B, which takes about ten. On average that's nine versus ten, so you might think you should always take route A because it sounds faster. But what if you need a life-saving procedure in 13 minutes, and route A is nine plus or minus five while route B is ten plus or minus one? Now route B suddenly seems preferable: you're much more certain to get to the hospital in time. And what if your estimation procedure erroneously told you, without support from the data, that route A was nine plus or minus one? That's pretty scary. You want to really understand your uncertainty to make the right decisions about which route to take; you don't want to be overconfident in a way that might really hurt you down the road. This underestimation of the variance does exactly that: it makes us overconfident, and it might make us make wrong decisions. And it gets arbitrarily bad as we increase the correlation of the Gaussian. We also don't get covariance estimates at all, by construction; that's the mean-field variational Bayes assumption. This is something you can derive exactly, with no approximation needed, and it's a useful exercise to try. So we can go back to the examples we saw before and ask whether this is what was happening in those real-life analyses, and in fact it was. If we look at the mean-field variational Bayes estimate of the posterior standard deviation, this element of uncertainty, for each of these parameters, with each parameter again a point, we now see that we're systematically underestimating the standard deviation relative to the exact posterior standard deviation that we believe we're getting from Markov chain Monte Carlo. And again this could affect our decisions. Suppose we found that the microcredit effect, how much microcredit increases business profit, was about three US dollars purchasing power parity every two weeks, and suppose, maybe as some multinational organization, we had decided to invest in microcredit if the effect is sufficiently high and the mean is more than two posterior standard deviations from zero, so that it feels like a real effect that's really helping people. With the exact posterior standard deviation, the mean is 1.68 standard deviations from zero, so we would say this didn't meet our threshold: we're not going to go ahead and invest; instead we'll collect more data and maybe understand more about how covariates influence this, like whether someone has had a business before, various aspects of these individuals or the places they're from. But if we had used mean-field variational Bayes, we would have underestimated that standard deviation, to the point that we would have erroneously, without support from the data, reported that the mean was more than two standard deviations from zero, and we might have rushed overconfidently into this decision, not in the way we had actually planned. We see the same thing with the Criteo online advertising example: for the global parameters, the uncertainties are massively underestimated relative to ground truth. So is it just the uncertainties? It turns out it's not; sometimes the means are wrong too. Here are a couple of plots from Bailey Fosdick's PhD thesis.
Professor Fosdick is really interested in understanding relational data, in graphs: networks where people interact and there's an edge or there isn't, and we might want to model whether there's an edge or not. This is a case where she really wants to speed things up: for a thousand-plus nodes, Markov chain Monte Carlo takes over a day, and that's obviously not the scale of networks people are really interested in; we want to get beyond that. But if we run mean-field variational Bayes, in the same plot as before, with each dot a parameter and the x = y line in red, some of these dots are not on that line; they're not being well estimated. In fact we can go a bit further: for each parameter we can look not just at the mean but at a 95% credible interval, and some of these mean estimates fall outside the 95% credible interval. So these are particularly bad mean estimates, bad on the scale of what we would consider a good estimate of the mean. Something else we could say is: maybe we know a priori that there are certain types of models where this works and certain types where it doesn't, and we just avoid the models where it doesn't work and use it for the models where it does. To examine that assumption, suppose we're interested in doing something like predicting college GPA. We collect a bunch of standardized test scores x_n, which are our covariates, and maybe we collect regional test scores as well, which vary by region. Our model might be a linear model: some linear dependence of GPA on the standardized test scores, and some linear dependence on the regional test scores whose coefficients depend on your region k(n). We put priors on everything, because that's what we do; here is a bunch of basically conjugate priors you might choose. Now we have a full model, priors and likelihood, and we can ask how good the mean estimates are in this model. What's nice about this model is that we can simulate from it again and again, so we can test whether we're getting these mean estimates right. This is a little different from the results we saw before: now we look at just one parameter, rho squared, but across different data sets. Over the first three data sets we simulated, drawing parameter values and then data points from those parameters, the mean-field variational Bayes mean agreed with the Markov chain Monte Carlo mean. So we might ask: does this mean we can always trust this model? We can simulate ninety-seven more data sets, and it turns out that's not always the case; in some of them we get bad mean estimates. So we can't just take this model-by-model approach. So what can we do? We don't need to throw out the baby with the bathwater. Even though we don't know in advance whether something will work, maybe the solution is a good diagnostic that tells us, after the fact, whether it worked.
The hope would be that the diagnostic is fast relative to the full inference procedure, so it's something we can use very practically. I think this is a really interesting line of work, and it's worth noting that it's non-trivial. We said at the beginning that minimizing the KL is exactly the same as maximizing the ELBO, and that's true in the sense that the minimizer of the KL is the same as the maximizer of the ELBO. What differs is the values of those two quantities. If you look at the value of the KL at your minimum, that tells you something: there's an absolute scale for the KL, zero is best and lower is better, so being close to zero can tell you something about your problem. Unfortunately the ELBO does not: its value will increase as you run your method, but the number itself won't necessarily tell you anything about how well you've done, and it certainly doesn't have the absolute scale the KL does. So we need to do something like estimate the KL, or bound the KL, or consider some other kind of evaluation that doesn't have to be tied directly to it.
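To spell out why the ELBO's value at the optimum is uninformative on its own, here is the standard decomposition (a textbook identity, not anything specific to the talk), writing p(theta | x) for the exact posterior and q for the approximation:

```latex
\log p(x)
  \;=\;
  \underbrace{\mathbb{E}_{q}\!\left[\log \frac{p(x,\theta)}{q(\theta)}\right]}_{\mathrm{ELBO}(q)}
  \;+\;
  \underbrace{\mathrm{KL}\!\left(q(\theta)\,\big\|\,p(\theta \mid x)\right)}_{\ge\, 0}.
```

Since log p(x) is a fixed but unknown constant, maximizing the ELBO over q is the same as minimizing the KL, yet the numerical value of the ELBO at the optimum only determines the KL up to that unknown constant, which is why it has no absolute scale.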
There's been a lot of really cool work in this area; I'm sure my list here isn't exhaustive, but I want to point out at least some of it. There's also at least one talk here at ICML with, I think, a very relevant name for our discussion: "Yes, but Did It Work?: Evaluating Variational Inference." If anyone else is giving a talk at ICML on this subject, I'd love to hear about it; I hope I haven't missed anybody. Reliable diagnostics are a super interesting and really open area with a lot of potential for research: can we do something very fast, maybe quick and dirty, such that if we know it worked out in the end, that's fine, and if it didn't, we can go to something slower? But of course we'd like to have both reliable diagnostics, if only because our code might be wrong, and methods we can trust a priori, methods with theoretical guarantees, so it's worth thinking about how we could get that. One thing we could do is peel back this onion of assumptions we made along the way. We started by approximating the posterior with the closest nice distribution, then we decided to make the variational Bayes assumption, then the mean-field variational Bayes assumption. So we could ask: how about a richer nice set, not limiting ourselves to mean-field variational Bayes? An interesting observation in Turner and Sahani's 2011 paper is that a richer nice set doesn't always do better on means and covariances. Beyond that, you can make the nice set so rich that it covers everything, but in practice that can introduce more headaches: longer computing time (a nice thing about mean-field variational Bayes is that it's very fast), more user time, and a lot of the existing methods that enrich the nice set introduce additional assumptions and approximations whose quality we then need to understand, just as before. We could try alternative divergences, but that raises a lot of the same questions. So we really want theoretical guarantees on quality, and for the second part of this talk, which I'm about to get into, I'm going to talk about a very different way we might approach this. Instead of variational Bayes and mean-field variational Bayes and that particular type of optimization, maybe we can think of a pre-processing step: first we take our data and compress it, in a clever way, into a smaller weighted data set, and then we run our approximate Bayesian inference algorithm on that. Can we get guarantees by doing that? Before I go into it, let me point out some resources for variational Bayesian inference. There are a lot of really great overview articles and textbooks, and I particularly want to point out David MacKay's book: a lot of the classic examples in this talk and in other textbooks date back at least to that book, and it's been pretty much timeless, not just for Bayesian inference but for a lot of machine learning. I also want to point out Turner and Sahani's article "Two problems with variational expectation maximisation." One issue I didn't get into here is multi-modality; that's a really tough issue with a lot of subtlety, and I think their article deals with that subtlety in a very nice way. All the experiments that did not come from textbooks or from Bailey Fosdick's thesis you can find in our papers with my PhD student Ryan Giordano, listed down here. Okay, so with that, let's go to the second part of the talk: another direction for thinking about how we can get these theoretical guarantees, how we can trust the results a priori, going into the analysis, or at least have some theoretical way to think about them. To recap: we've talked about a lot of applications where people are interested in Bayesian analysis, in the sciences, engineering, health, and a lot of different sectors, and we want something we really do think is trustworthy, where we're not going to underestimate uncertainties but really get them correctly estimated. So we again imagine the case where we have this prior and this likelihood and we're trying to approximate the posterior. As a running example in this part of the talk, imagine a classification problem: we've observed a lot of websites, some of which are malicious, phishing for your personal information, and some of which are perfectly benign, just normal websites. For each website we observe a set of covariates, features x_n for the nth website, and a label y_n saying whether it's malicious or not, and we want to find a classifier. Because we're being Bayesian here, we don't just get a point estimate of the classifier, we get a whole distribution over it. In this simple example it's two-dimensional, so we can actually write out the exact posterior and draw it over here, and as we said before we want to approximate it, ideally in a way that comes with theoretical guarantees on quality.
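As a concrete version of this running example, here is a minimal sketch of the unnormalized log posterior for Bayesian logistic regression; the standard normal prior and the plus/minus one label encoding are my own illustrative choices, not taken from the talk.

```python
# Bayesian logistic regression for classifying websites as malicious or benign.
import numpy as np

def log_prior(theta, prior_scale=1.0):
    # Independent normal prior on the classifier weights (up to a constant).
    return -0.5 * np.sum((theta / prior_scale) ** 2)

def log_likelihood(theta, X, y):
    # y_n in {-1, +1}; p(y_n | x_n, theta) = 1 / (1 + exp(-y_n x_n . theta)).
    margins = y * (X @ theta)
    return -np.sum(np.logaddexp(0.0, -margins))   # sum_n log sigmoid(y_n x_n . theta)

def log_unnormalized_posterior(theta, X, y):
    # log p(theta | data) up to the unknown normalizing constant: this is the
    # quantity an MCMC sampler or a variational method would work with.
    return log_prior(theta) + log_likelihood(theta, X, y)
```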
So the proposal we're going to explore in this part of the talk is summarizing our data as a pre-processing step before we run Bayesian inference, ideally in a way that gets us theoretical guarantees on quality. If we then combine it with another method that has those guarantees, like Markov chain Monte Carlo, then in the end we should have guarantees overall. The hope is to do this in a way that's fast, that's easy to use, that doesn't require deriving a bunch of different equations, and that gives us error bounds, for the finite data we have, on the quality of the posterior mean and covariance estimates. So first I'm going to talk about getting at the core of the data set: how are we going to summarize this data? One thing we could use to summarize our data set with a smaller data set is uniform subsampling. That's a great way to deal with a very large data set, and probably something more people should be doing in general: it's very fast and very easy to use. But it will turn out not to be enough, and we'll see what its pitfalls are. Another thing we could try is more careful importance weights, in case the uniformity of uniform subsampling is the problem; we'll see there are some pitfalls there too. So finally we'll try an optimization approach, and I want to emphasize that this is a very different optimization from the ones we talked about earlier: it won't be minimizing closeness over a set of nice distributions, and it won't be anything of the form of MAP; it will be a very different type of optimization, used to choose this coreset. Then finally we'll talk about another way we could summarize our data: exponential families are a beautiful case where we don't have to do any really tough approximation for Bayesian inference, things are very easy because of sufficient statistics, but of course in general we don't have sufficient statistics, so are there ways to come up with approximate sufficient statistics? Okay, let's start thinking about the core of the data set and how we can get at it. An observation we can make is that a lot of the time there are redundancies in data, even if the data isn't traditionally "tall," not just low-dimensional with a ton of data about the same thing. As a cartoon, imagine I have a bunch of text data, maybe sports articles; really it's complex and high-dimensional and I can't represent that here, so please forgive the cartoon. The World Cup is going on right now, so I imagine there are a ton of articles about football, and because we're not so close to the Winter Olympics anymore, there are fewer articles about curling; it's maybe a less popular sport at the moment. I might think to myself: I want to run my machine learning algorithm on this data set, it's very large and that's going to be a pain; it would be nice if I could run it on a much smaller data set. And maybe I can observe that, among all of these football articles, there are a few that are pretty representative, so I could replace a large collection of similar football articles with one representative football article and give it a really high weight, because there were so many football articles to begin with.
Similarly, I could replace many similar curling articles with one representative curling article and give it a low weight, because there weren't so many curling articles to begin with. Then I would run my approximate Bayesian inference algorithm on this data set of size two, or of size just a handful, and of course it would be super fast: if you've run any of these algorithms, you know that with a very small data set everything is going to be fast. But we don't just want it to be fast, we want it to be right; we want it to give us something useful. This idea of a coreset, getting at the core of the data set by pre-processing your data into a smaller weighted data set, is not a new idea. It came from the computational geometry and theoretical computer science communities and has been around for a while, and what's awesome about the work in that community is that they're able to get theoretical guarantees on quality: if they run their algorithms on the smaller weighted data set, they can get close to what they would have gotten on the full data set. That's exactly our whole goal here. So why can't we use something that already exists? People have certainly thought about summarizing their data before, even in the Bayesian inference community: there's data squashing, there's related work for Bayesian logistic regression, and if you do anything with Gaussian processes you're familiar with inducing point methods, which are all very common. I'm sure there are more methods than the ones I'm putting up here, but you'll notice these tend to be heuristic methods; they don't come with theoretical guarantees on the quality of the means and variances we want to report at the end of the day. Another idea is uniform subsampling; we'll come back to it a bit more later, but let's start by thinking about what might go wrong. Suppose I'm working with a curling manufacturer who wants to advertise to the curling community, and they ask me to analyze these sports articles. I think, oh, there are too many sports articles, so I'll uniformly subsample them, and oops, I've lost all of the curling articles. In this case maybe it's not the end of the world if we miss out on a curling advertisement, but you can imagine cases where this would really matter. In cybersecurity, I might not actually know in advance which packets are malicious, and if I subsample the packets in a network I might lose all of the malicious packets, or at least some interesting and important subset of them. Or in a medical setting, if I subsample my data I might lose a set of people who respond to a medicine in a very particular way, and that would be really bad. So that seems like a problem. Now we want to ask whether we can apply the coreset idea to Bayesian inference, and it's not going to be as straightforward as using the methods that have been developed for things like k-means and minimum bounding box problems, because Bayesian inference has some very unique geometry. We want something that works for whatever prior and whatever likelihood we throw at it, something that works in general, and that's where the challenge comes in.
So, to think about how we could apply this to Bayesian inference, let's set up the problem: what is a Bayesian coreset going to look like? Remember there are two ingredients that go into the posterior, and we've talked about this a lot at this point: a prior and a likelihood. If we're going to summarize our data into a smaller weighted data set, that's pretty clearly going to affect one of these, the likelihood. In that case, let's establish a little more notation. Starting on this line I'm going to assume that our data are i.i.d., or rather independent conditional on the underlying parameter theta. That is an assumption; it's a very common assumption, and it's certainly true of the analyses we've looked at so far, but it is an assumption. Once we make it, we can write the log-likelihood of the nth data point as this script L_n, and the total log-likelihood, script L, is just the sum of the individual data point log-likelihoods. Now we can define a coreset log-likelihood. The idea of the coreset, remember, is that it should be a very small weighted subset of our total data. So we assign a weight to each data point: weight one to data point one, weight two to data point two, weight three to data point three, and so on, and we collect all of those weights in a vector w. The zero norm of w, the number of nonzero elements, should be much less than the total size of the data, because we want to run our expensive Bayesian posterior inference algorithm on a very small subset of the data. The coreset log-likelihood is then a sparse weighted sum of the individual log-likelihoods: most of the terms should be zero, so we shouldn't have to add them up. And we want the coreset log-likelihood to look very much like the original log-likelihood. For our theory, any Hilbert norm is fine for measuring the difference between the coreset log-likelihood function and the exact log-likelihood function; in our experiments we use a weighted Fisher information distance, which crucially puts a pretty tight bound on the Wasserstein distance, and the Wasserstein distance in turn puts a pretty tight bound on the error in our approximations to the posterior means and variances, which is the goal: we want good estimates of those things.
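In symbols, following the definitions just given (the notation is mine), the quantities involved are

```latex
\mathcal{L}_n(\theta) := \log p(y_n \mid \theta), \qquad
\mathcal{L}(\theta) := \sum_{n=1}^{N} \mathcal{L}_n(\theta), \qquad
\mathcal{L}(w,\theta) := \sum_{n=1}^{N} w_n\, \mathcal{L}_n(\theta),
```
```latex
\text{and the aim is}\quad
\min_{w}\ \big\lVert \mathcal{L}(w,\cdot) - \mathcal{L}(\cdot) \big\rVert
\quad \text{subject to} \quad w \ge 0, \qquad \lVert w \rVert_0 \ll N,
```

where the norm is a Hilbert norm on functions of theta (the weighted Fisher information distance in the experiments).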
So in some sense that slide is about what we want in life: we'd like to have these Bayesian coresets. But it doesn't tell us anything about how we can actually achieve them, so let's think a little bit about that now. Again, I think a very reasonable first idea for dealing with a large data set, one we should always have, is uniform subsampling, so let's think a bit more about what goes wrong there. One thing we said is that I might uniformly subsample away part of my data. In a classification problem that's not entirely realistic, in the sense that I would know not to subsample away one of my classes, but of course I might subsample away an important part of one of my classes, as we discussed earlier. This idea of missing important data is actually related to a more subtle problem with uniform subsampling, which we can see in the following way. Suppose I take my full data set, uniformly subsample it, and upweight the subsample so it looks like I have the full data set, and now I apply approximate Bayesian posterior inference, because that was the whole goal: I get an approximate Bayesian posterior and compare it to my exact Bayesian posterior. Of course I get a whole posterior, but here I'm just plotting the posterior mean because that's an easy visualization. I can do this a few times: take the whole data set, uniformly subsample it, run my approximate Bayesian posterior inference algorithm, plot my approximate posterior mean, and do it again. Here's where I want us to pause for a second. There is one exact posterior for the full data set, and our goal is to approximate it. These are meant to be three approximations of its mean, and we don't even need to know what the exact mean is to know that at least two out of three of these are bad approximations, because they don't even agree with each other. And this is not just a cartoon; we can see this issue of noisy estimates in practice. Here, in red and blue, are 10,000 simulated data points. We take a Bayesian logistic regression model, a classification model; we uniformly subsample ten data points from this data set, shown in black; we run our approximate Bayesian posterior inference algorithm, in this case Markov chain Monte Carlo; we plot our approximate Bayesian posterior mean; and we do this 50 times. The mean estimates we get are just all over the place; they can't possibly all be good, because they don't even agree with each other. Sure, I could increase the number of uniformly subsampled points, but that's not the point: I don't want to subsample almost as many data points as my full data set and then run Markov chain Monte Carlo on that. I want to run my expensive Bayesian posterior inference algorithm on a very, very small number of data points; the whole point was to speed things up.
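Here's a simplified stand-in for that experiment you can run in a few seconds: a conjugate normal-mean model, so no MCMC is needed and the posterior mean from an upweighted uniform subsample is available in closed form. The model and the numbers are my own illustration; only the phenomenon, the roughly one-over-root-M scatter of the subsample-based estimates, is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10_000, 10                        # full data size, subsample size
data = rng.normal(loc=2.0, scale=1.0, size=N)

def posterior_mean(y, weights, prior_var=100.0, noise_var=1.0):
    # Normal likelihood with known variance, normal prior on the mean:
    # posterior precision = sum_n w_n / noise_var + 1 / prior_var.
    precision = np.sum(weights) / noise_var + 1.0 / prior_var
    return (np.sum(weights * y) / noise_var) / precision

full = posterior_mean(data, np.ones(N))
subsample_means = []
for _ in range(50):
    idx = rng.choice(N, size=M, replace=False)
    w = np.full(M, N / M)                # upweight so it "looks like" N points
    subsample_means.append(posterior_mean(data[idx], w))

print("full-data posterior mean:", round(full, 3))
print("spread of 50 subsample means:", round(float(np.std(subsample_means)), 3))
# With M = 10 the subsample-based posterior means disagree noticeably with each
# other and with the full-data answer, mirroring the logistic regression picture.
```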
So it seems like uniform subsampling isn't cutting it, and we might ask whether there's something else we could do. One observation, at least in this particular cartoon, is that we know we care about some of the data more than other parts: the phishing data is very important to us, and we want to make sure we don't miss it, which was exactly the problem when we uniformly subsampled. So maybe there's a way to avoid that. Recall what we're trying to do: we want to calculate this posterior, that's too expensive, so we'd like to find a coreset that represents our data such that running approximate posterior inference on it looks like running it on the full data. What we're going to do now is put some big importance weights on the phishing data, and the hope is that when we subsample, we tend to pick up a phishing data point, because it has such a high weight, and we tend to pick up a benign website, because there were so many to begin with; then we invert these weights to get the coreset weights, and finally we run our approximate Bayesian inference algorithm on the result and hope it looks pretty good. That's the goal, and it turns out we can actually solve for an optimal set of importance weights in this case. Say the optimal weight for the nth data point is sigma_n; it turns out to be proportional to the Hilbert norm of that data point's log-likelihood, and we can write out the proportionality constant, sigma, which is just the sum of these individual Hilbert norms. So, to recap, the idea is: start with a collection of data, compute these importance weights, sample according to them, and finally invert the weights to get the coreset weights. That's the proposal: importance sampling instead of uniform subsampling. Is it a good idea? We can get some theory that tells us how well we'll do. Remember, we're interested in the error between the coreset log-likelihood and the exact log-likelihood, and because we used importance sampling to choose the coreset points, we get a high-probability bound on it. Here sigma is just that normalizing constant we saw before, eta bar has to do with how well aligned the different log-likelihoods are, and M is the number of points we subsample, so the bound scales like one over the square root of M. Sure, as we increase the number of points this goes to zero, but the thing to be concerned about is that one over root M is just the Monte Carlo rate: it's exactly the same error rate as uniform subsampling. And in fact this doesn't look too good in practice. We do the same thing as before: 10,000 data points, subsample 10, now with importance sampling so they have different weights, which is what we see in black; we run approximate Bayesian inference on those 10 points, plot the mean, and do this 50 times, shown in green, and the estimates don't agree with each other well at all, which seems pretty bad. We could increase the number of points we importance sample, but that's not the point; we want to run our expensive Bayesian inference algorithm on a very small number of points, not a very large number of points.
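Here's a sketch of that importance-sampling construction. Representing each log-likelihood function by its values at a few random parameter draws is a crude stand-in for the Hilbert-space geometry and is my own simplification; treat this as an illustration rather than the exact procedure from the talk.

```python
import numpy as np

def importance_sampling_coreset(loglik_vectors, M, rng):
    """loglik_vectors: (N, S) array, row n = L_n evaluated at S parameter draws."""
    norms = np.linalg.norm(loglik_vectors, axis=1)   # sigma_n, proportional to ||L_n||
    sigma = norms.sum()                              # normalizing constant
    probs = norms / sigma
    picks = rng.choice(len(norms), size=M, p=probs)  # sample M indices by importance
    counts = np.bincount(picks, minlength=len(norms))
    # Invert the importance weights so the weighted sum is unbiased for L:
    # w_n = (#times n was picked) * sigma / (M * sigma_n).
    w = np.zeros(len(norms))
    nz = counts > 0
    w[nz] = counts[nz] * sigma / (M * norms[nz])
    return w                                         # at most M nonzero weights
```

The estimator is unbiased, but its error shrinks only at the one-over-root-M Monte Carlo rate, which is exactly the limitation described above.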
So what's going on here, what's the problem? Let's take a step back and think about what we want: a good coreset, minimizing this error in such a way that the weights are positive, so it still looks like a log-likelihood, and very few of the weights are nonzero, so the coreset is very small. To get a sense of what's going on, imagine an even simpler problem where the data are just one-dimensional, so each y_n is a single number. In this case we can plot each data point's likelihood above it; those are the black curves, just Gaussians, and the total log-likelihood, scaled down (it's just a scaling anyway), is in blue. You'll notice there's a great coreset here, a great coreset of size one: just choose one of the points in the middle, and it gives an almost perfect approximation to the total log-likelihood. What importance sampling is doing is choosing points according to the weights, which are the thicknesses of the lines, and it's doing exactly the opposite: it's not choosing the point in the middle, it's choosing the points way out to the sides. Why is it doing that, and why aren't we doing better? Let's simplify this problem even further and stop thinking of these likelihoods and log-likelihoods as functions; think of them as vectors. Essentially what we have is a sparse vector-sum approximation problem: we have a vector for each log-likelihood, and they all sum up to the total log-likelihood in blue, which is just scaled down a bit here. There's a great coreset of size one: take one of those tiny vectors in the middle and give it a big weight; that would be a fantastic coreset. But importance sampling puts very little weight on it, and if we sampled again and again and again, we would never do better than that. What we really want to do is find the point most aligned with the total log-likelihood, then find the point most aligned with the residual, then the point most aligned with the new residual, and so on. The problem is that we're sampling independently when what we want to do is consider the residual error direction. So instead of sampling, we might say: this sounds like sparse optimization, where the sparsity is that we want to pick up a small number of these points. Nowadays when you hear sparse optimization over a polytope, you think Frank-Wolfe, so let's step away from sampling for a moment and review what exactly Frank-Wolfe is. In Frank-Wolfe we minimize a convex function f over a polytope D, so x lies in this polytope D. The first step is to choose a starting vertex, and then we repeat three steps at each iteration: at the current point x, compute the gradient of f, which gives this tangent hyperplane in brown; find the minimizing point of that linear approximation over the polytope, which has to be a vertex, because we're optimizing a hyperplane over a polytope; and then do a line search between our existing point and that new vertex. So we start at a vertex, then take a convex combination with a new vertex, then a convex combination with another new vertex, and so on, and after M minus one steps we've picked up at most M vertices. If we could apply this to our problem, and if the vertices were the data points, we could do pretty well; the goal is to pick up a small number of data points. So let's see whether we can apply it to our problem, which is to minimize the coreset error, the error between the coreset log-likelihood and the exact log-likelihood; we might as well minimize it squared. If you've ever tried to optimize over a likelihood you'll think, that's not a convex problem, but usually you're optimizing over the parameter, and here we're not: we're optimizing over the weights, and remember the coreset log-likelihood is linear in the weights, so this is in fact a convex optimization problem in the weights. Now, is the constraint set a polytope? Unfortunately, it is definitely not, because of that zero norm. But here's a way to get around it: if we can come up with a constraint set that is a polytope, such that each vertex corresponds to a data point, then when we run Frank-Wolfe we pick up at most M data points and automatically satisfy the zero-norm condition; it's not a relaxation, it's that running Frank-Wolfe satisfies the condition. And here is an example of such a polytope: each vertex corresponds to a single data point scaled using the sigma and sigma_n importance weights we saw before, and the all-ones weight vector, which is the original data set, lies in it as well.
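Here's a sketch of Frank-Wolfe coreset construction on the log-likelihoods-as-vectors picture, using the polytope of nonnegative weights w with sum over n of sigma_n w_n equal to sigma, as I understood it from the talk. The finite-dimensional representation of each log-likelihood (values at a handful of parameter draws) and the initialization at a single best-aligned vertex are my own simplifications, so treat this as an illustration rather than the authors' algorithm.

```python
import numpy as np

def frank_wolfe_coreset(L, M):
    """L: (N, S) array of per-point log-likelihood vectors; M: coreset size budget."""
    sigma_n = np.linalg.norm(L, axis=1)
    sigma = sigma_n.sum()
    total = L.sum(axis=0)                          # full-data log-likelihood vector
    # Initialize at the vertex (sigma / sigma_f) e_f most aligned with the total.
    f0 = int(np.argmax((L @ total) / sigma_n))
    w = np.zeros(L.shape[0])
    w[f0] = sigma / sigma_n[f0]
    approx = w[f0] * L[f0]                         # current coreset log-likelihood
    for _ in range(M - 1):
        residual = total - approx
        # Linear minimization over the polytope: the vertex whose direction is
        # most aligned with the residual minimizes the linearized objective.
        f = int(np.argmax((L @ residual) / sigma_n))
        vertex = (sigma / sigma_n[f]) * L[f]
        # Exact line search between the current point and that vertex.
        direction = vertex - approx
        gamma = float(np.clip(direction @ residual / (direction @ direction), 0.0, 1.0))
        w *= 1.0 - gamma
        w[f] += gamma * sigma / sigma_n[f]
        approx = (1.0 - gamma) * approx + gamma * vertex
    return w                                       # at most M nonzero weights
```

Each iteration adds at most one new data point, so after M minus one iterations the zero-norm constraint is satisfied by construction, which is exactly the property described above.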
And when we do this, we can actually prove a theorem telling us that we no longer decrease the error at a one-over-root-M rate; we decrease at a rate that is exponential in M, which seems much better. So does this actually come through in practice? Let's start by looking at a very simple problem where we can calculate posteriors in closed form, so we can isolate the effect of the coreset. Here we have 10,000 data points in gray and a conjugate normal-normal model. We can apply uniform subsampling, shown in black; the exact posterior predictive mean and a three-standard-deviation interval are in blue. We run with our uniform subsample, do it 50 times, and we see that the estimates in green are all over the place. We can increase the number of uniformly subsampled points, but that's not the point; we want something that works with a very small number of points, and we think there should be a great coreset of size one here: just pick that middle point. We can importance sample instead, and we get the same problem: very noisy estimates. So instead of sampling we try Frank-Wolfe coresets, and we're still doing 50 of these approximations; it's just that they all agree essentially perfectly with the blue line, because the method is doing the right thing: we see that for M equals 5 it picks out that point in the middle that makes a fantastic coreset and then just slightly adjusts it. Sure, this gets better as we increase the number of points, but that is the point: we want things to be really good for a very small number of points. Another nice thing about this example is that we can calculate the actual KL divergence between the approximating posterior and the exact posterior, because everything is in closed form, and we see that for any number of coreset points, any form of i.i.d. random sampling, uniform subsampling or importance sampling, is orders of magnitude worse in KL divergence (this is a log scale) than the Frank-Wolfe coresets. We can also look at logistic regression: the first two rows are what we saw earlier, nothing has changed there, but if we run Frank-Wolfe to get these coresets, we get much better performance for a very small number of points; for M equals 10 we're already getting less noise in our approximation than for M equals 1,000 with the subsampling methods. Another thing worth noting is that this is really automatically picking up the geometry of the problem: in the case of logistic regression, what seems to be optimal is picking a point roughly perpendicular to the dividing line.
We wouldn't want to use something like k-means here, or something like uniform sampling, which would just pick something in the middle; for this particular problem we want to use the geometry of logistic regression, and what's nice is that we get that automatically. We can see this again for Poisson regression, as another example where the observations are integer-valued: picking 10 points, 100 points, 1,000 points, we get much better, much less noisy estimates with the Frank-Wolfe coresets. One thing we should be concerned about, though, is that while we may get much better results for a small number of points, a Frank-Wolfe coreset is more expensive to construct, whereas a uniform subsample is super easy to construct, so is that time trade-off worth it? Here we plot the total time spent on construction plus approximate Bayesian inference against error, with a log scale on both axes, and we compare uniform subsampling in red to Frank-Wolfe coresets in gold, on a number of real data sets, including an actual phishing data set, chemical reactivity, bicycle trips, and airport delays, and over a few different models. We see orders-of-magnitude improvements for larger numbers of coreset points, though toward the beginning things look pretty similar. But it turns out we didn't need to have made that earlier choice of polytope; that was just a convenience so that we had a way to pick up a small number of coreset points. If we're more careful about how we construct this optimization algorithm, we can do many orders of magnitude better with GIGA coresets, which I'll mention again briefly later. Let me spend a little of our remaining time on an alternative way to summarize data before running some kind of approximate Bayesian inference algorithm, and that gets at sufficient statistics. It is not always the case that Bayesian inference is a hard problem to run; in many cases it's extremely fast, and those cases tend to be when we have an exponential family likelihood. If we had an exponential family likelihood we would be set, so let's remember how things work with exponential families. Suppose our likelihood is the following: we have our data y_1 through y_N, some covariates x_n for each y_n, and our parameter theta, and we assume the likelihood decomposes as a product over n of the exponential of a function of the data alone times a function of the parameter alone. The function of the data alone is our sufficient statistic. (You might be used to this also having some normalizing term, but notice that fits into this framework too.) What's nice about the exponential family likelihood, what makes it so convenient, is that we can rewrite it as the exponential of a sum over data points of these data-only functions, times the function of the parameter. Now, say we're going to do something like Markov chain Monte Carlo. The way it works is that we start at some parameter value and go through each data point to build up the whole likelihood, a linear-time operation with a potentially large constant, and then we go to the next step, we have a new parameter, and we have to calculate the whole likelihood again. With an exponential family likelihood we can instead pre-process: sum up the data-specific term once, and then at any particular parameter value we just do what is hopefully a low-dimensional multiplication, with no sum over the whole data set. This is trivially scalable: it requires only a single pass through the data, it's trivially streaming, since we can just keep adding terms, it's trivially distributed, since we can add across different cores, and it's very complementary to Markov chain Monte Carlo. I'm not assuming conjugacy here; we can have any type of prior we want, it's just that MCMC is going to be super simple when we only have to deal with this low-dimensional representation of the data.
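To make the sufficient-statistics speedup concrete, here is a sketch for a likelihood that genuinely has this form: linear regression with known noise variance. The data enter the log-likelihood only through three summaries computed in one pass; the function names are my own.

```python
import numpy as np

def precompute_sufficient_stats(X, y):
    # One pass over the data; chunks can be summed separately and combined,
    # which is what makes this trivially streaming and distributed.
    return {"Sxx": X.T @ X, "Sxy": X.T @ y, "Syy": float(y @ y), "N": len(y)}

def log_likelihood_from_stats(theta, stats, noise_var=1.0):
    # sum_n -0.5 * (y_n - x_n . theta)^2 / noise_var, up to an additive constant:
    # O(d^2) per evaluation instead of O(N d), with no pass over the raw data.
    quad = stats["Syy"] - 2.0 * theta @ stats["Sxy"] + theta @ stats["Sxx"] @ theta
    return -0.5 * quad / noise_var
```

An MCMC sampler would then call log_likelihood_from_stats at every proposed theta without ever touching the raw data again.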
Of course, the whole problem is that we don't always have sufficient statistics; we don't always have these nice exponential family likelihoods. Even in something as simple as Bayesian logistic regression, we don't have this particular form, the exponential of a function of the data times a function of the parameter, so we have to go through and calculate the whole likelihood at every step; the same is true of other generalized linear models and of so-called deeper models. So we might ask whether we can come up with some approximate notion of sufficient statistics, and in particular we pursue a polynomial approximation, because we know a lot about polynomials: we can say a lot about them theoretically, and we know how they behave. I'm not going to go into as much depth on this one, but I'll mention that we can now look at that Criteo data set I mentioned earlier, and at a lot more of it. We analyze six million data points in this case, with a thousand features; we inherit the trivially streaming, distributed, minimal-communication nature of the sufficient-statistics framework; we can run on 22 cores on this data set in 16 seconds; and we get finite-data Wasserstein guarantees, which is nice. So in this part of the talk I've really focused on this idea of data summarization. The hope is that this could be a way to get methods that are scalable and automated; certainly something we were interested in, especially in the coresets case, was having something that works for a wide variety of priors and likelihoods, where people can just say what their prior and likelihood are and we can give them the means and variances they're interested in, and in particular error bounds on those mean and variance estimates. We talked about a couple of methods. One was the coresets idea, which I think is pretty general; approximate sufficient statistics was more constrained, to roughly the generalized linear model setting and in a couple of other ways, but it was very fast, and we saw some empirical results. Something else that's really nice, especially about the coresets framework, is that there's a way to make a computational-statistical trade-off, to trade more computational investment for better performance, and that's by increasing the size of the coreset.
But I think this is just a start: there are a lot of other ways we could summarize data and a lot of ways to improve these ideas. Even just going from Frank-Wolfe coresets to GIGA turned out to be a huge improvement, and that was just a tweak to the optimization approach, so I think there's a lot of potential here. So let me just mention that if you're interested in this work, you can check out a few of our papers; by the way, all of the slides for this tutorial will be available at the tutorial website, with the URL up at the top, and you should also be able to find it from the ICML main web page. Let me highlight this paper in particular: if you're interested in the coresets idea, the paper on summarizing data with coresets is a great place to start. Our earlier paper, "Coresets for Scalable Bayesian Logistic Regression," was the first time we applied coresets to Bayesian inference, and the GIGA paper by Campbell and Broderick, on Bayesian coreset construction via greedy iterative geodesic ascent, will be appearing here at ICML on Friday; Trevor Campbell will be discussing the details of how that works, so if you're interested in learning more, check out that talk. If you're interested in the polynomial approximate sufficient statistics work, you can check out our NIPS 2017 paper. We've also been exploring this idea of summarization in a couple of other ways; one is for really highly combinatorial problems, where maybe we can reduce the space we're looking at and, by being careful, again get theoretical guarantees on posterior quality, and that's something Raj Agrawal will also be presenting on Friday at ICML. I just want to point out the members of my research group, who are the various first authors of the works I've presented today. And to conclude: there are all these cases where people are using Bayesian inference, in their day-to-day lives and on important problems, important data analysis problems, and we want to provide them with methods that are fast and easy to use, so they can focus on their important problems in biology and chemistry and economics and all of these different areas, and we want to do it in a way that's reliable, because these analyses are going to drive decisions that impact our lives and impact the world. I also want to point out that there are some really interesting fundamental questions here, a lot of open questions from what we discussed today. There's a lot that remains to be done, not just in improving Bayesian inference but in diagnosing Bayesian inference, in knowing how well we're doing. And I think there's a really fundamental question of how well we can do: not just can we come up with some new method to make things better, but can we understand what's possible, what can be accomplished. Thank you. [Applause]

[Moderator] Thank you, Tamara. If there are any questions, please come to the mics, either in this corridor or over there.

[Audience] I do not fully understand the posing of the problem of this Frank-Wolfe optimization. As far as I remember, you want to optimize something like a norm between the small-data likelihood and the full-data likelihood, but to do that you must impose a model, a statistical assumption, and you also have hyperparameters for those assumptions, so how do you set these parameters?

[Broderick] Just to be clear on the nature of
the problem that we imagine: we're given a prior and we're given a likelihood, so the likelihood is a full function over the parameters, and I totally agree with you on that. But what we're optimizing is the difference between the coreset log-likelihood, which is still a full function over the parameters, and that original full function over the parameters, the likelihood. So we're not choosing the parameters; we're saying we have these full functions over them, and we're optimizing the distance between those two functions of the parameters.

[Audience] Okay, I think I'll ask you in person.

[Broderick] Sure.

[Moderator] For the people leaving the room, it would be nice if you could be a bit quieter. Do we have any other questions in the audience?

[Audience] In the earlier part of the talk you talked about underestimating variances, and then you seemed to open a question about possible solutions, and if I understood right you suggested coresets with MCMC, which seems strange in a VB talk. Did I misunderstand?

[Broderick] So this is one possible solution; there are other possible solutions. We actually have some work on using linear response methods to correct variances in mean-field variational Bayes, but those depend on the means being correct. We'd like to have something where we can say, at the end of the day, here is some kind of theoretical guarantee on how close you are in your variance estimate, and that's what we're aiming for.

[Audience] One of the big problems with MCMC is initializing it and getting it to a point where, once you start, you eventually find a region that's good and you're generating lots of samples. Have you considered using variational Bayesian methods to propose an approximate distribution and then sampling from that with MCMC to refine the solution?

[Broderick] Yes, I think this idea of initializing MCMC with a faster approximation is one that has certainly been proposed and, to the best of my knowledge, used by other practitioners, and not only with variational Bayes: also with things like the Laplace approximation. I think it's a great idea. Even more heuristic methods, like k-means, basically any kind of meaningful initialization, can make a huge difference, so I totally agree with you.

[Audience] It seems that in the formulation of the Frank-Wolfe optimization problem, the cost function requires an integration over all of the parameter space.

[Broderick] Yes, this is a great observation. Because these are two functions we're comparing, we don't actually have access to this norm exactly; we have to approximate it somehow, and what we end up using in our experiments, and what I show for the timing and the results and everything, is a random features approximation. But that's one of many things where we have a bound on how well we're doing, so we can combine those bounds to get a total bound in the end on these means and variances. I suspect it's one of the things that could absolutely be improved; in some sense what has to be happening here is that we're bootstrapping our way to a better approximation, in a slightly different way than in the last question, but doing that kind of thing. So I think that's a great observation: there does have to be an approximation there, and while we have some bounds on how good it is, I bet they could be improved.

[Audience] Thanks. Hi Tamara, thanks very much; the tutorial was very lucid and well structured. I
just want to ask: stepping away from Bayesian modeling for a bit, have you tried exploring the idea of using coreset-based subsampling on, say, a batch of data for gradient descent optimization, or even for subsampling experience in an RL setting?

[Broderick] So we've definitely thought about coresets in other areas; the one where we've thought about it, I think at least somewhat successfully, is kernels, although we don't have anything out about that yet, hopefully soon. I certainly think there's nothing specific to Bayesian inference about the idea, and something that makes that really obvious is that it was around before this for other types of problems, although those tend to be things like k-means and minimum bounding box. One thing that's a little bit tough about using it for something like stochastic gradient methods is that, at least as it's formulated right now, you really do have to do a sweep through your data, so it's not clear the trade-off is worth it in those particular cases. That doesn't mean it's not possible; it's just that we haven't immediately found a way to make that trade-off work. But I think it's super interesting to think about these things beyond Bayesian inference.

[Audience] Thank you. I'm wondering if you have anything to add regarding expectation propagation.

[Broderick] So again, I think there are all kinds of interesting pieces of work for types of divergences that are not just this KL, and expectation propagation runs into a few of the same challenges we face with those divergences. When we do the KL in the other direction, first of all, we can't pull the trick we pulled here; we have to make further approximations, and that's why people typically aren't directly optimizing that KL when they do expectation propagation: they use these cavity methods and things like that. There are also tricky aspects to understanding, or not understanding, exactly what's going on with these approximations. One of the ways people tend to think about expectation propagation, and the KL in the other direction, is that it does things like overestimate the uncertainty rather than underestimate it, which at first glance sounds better: at least we're not being overly confident and rushing into decisions, maybe we're just being a bit more conservative. Unfortunately that's not always the case; there are cases, even with the KL in the other direction, where we can actually underestimate the uncertainty, for instance in multimodal cases. We have a short workshop paper on this, I think from 2016. So there is some sense in which we just don't fully understand what's going on in these cases, which isn't to say that we can't; I'd see this more as a challenge, that we should work to come up with good diagnostics and ways to understand fully what's going on: what do we know about means and variances, and when can we trust them?

[Audience] Thank you for the nice talk. I may be missing something trivial, but it seems the coreset picks up some small data set, so what do you think about the Bayesian central limit theorem? I mean, it requires the sample size to go to infinity.

[Broderick] Let's see if I'm understanding your question correctly. Certainly the Bayesian central limit theorem will eventually kick in, but one
Thank you for the nice talk. I may be missing something trivial, but it seems the coreset picks out some small data set; what do you think about the Bayesian central limit theorem? I mean, it requires the sample size to go to infinity. So let's see if I'm understanding your question correctly. Certainly the Bayesian central limit theorem will eventually kick in, but one thing that's a little bit challenging about it is that it does require the data set to be very large relative to the parameters, and in a lot of the cases where people really are using Bayesian inference that's not always true: at least for some parameters there's still a lot of uncertainty. It's also, by the way, worth making the distinction that there can be some parameters that are extremely well estimated while there is still uncertainty in other parts of the space. So I think we're really interested in cases that, even though they may have very large data, might be high dimensional, or might have so many individual problems, or might have some aspect of the problem that still has sufficient uncertainty that we don't expect the Bayesian central limit theorem to have kicked in, much less that we would just go to a point estimate.

You mentioned diagnostics that you could use to determine whether variational Bayes is working reasonably; can you elaborate a bit more on what's already out there? Yeah, so maybe I'll just mention a few things. There's been a really interesting line of work coming out on understanding whether samples are coming from something like our exact posterior, and that's something that can be, at least in theory, pretty easily adapted to variational inference: you just take samples from the variational posterior and then look at those. There's the paper that I mentioned, which is coming out today, that looks at a somewhat unusual variant of importance sampling to check whether variational inference is doing well. And there's a paper that uses a chi-square upper bound to go along with the ELBO; what's nice about that is, as I said earlier, you can't just infer much from the ELBO about what's going on, in particular you can't infer much about the KL value, but with this additional upper bound you can say more about the KL value; at least you can bound it. Now, there still remains the question of, once you know the KL bound, how much do you really know about means and variances, but I think it's a really cool start. Thank you very much. Thanks.
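Here is a small worked example (my own toy, not the paper referred to above) of why an upper bound on the log evidence is useful next to the ELBO: since ELBO <= log p(x) <= CUBO, where CUBO_2 = 0.5 * log E_q[(p(x,z)/q(z))^2] is a chi-square-style upper bound, the gap CUBO - ELBO upper-bounds KL(q || posterior). A conjugate Gaussian model is used so every quantity can also be checked exactly; the specific approximation q below is an assumption chosen to be deliberately imperfect.

```python
# Sandwiching log p(x) between the ELBO and a chi^2-style upper bound (CUBO_2)
# in a conjugate Gaussian model where everything can be checked in closed form.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Model: z ~ N(0, 1), x_i | z ~ N(z, 1) for i = 1..n (a toy assumption).
n = 10
x = 1.5 + rng.normal(size=n)

# Exact posterior N(post_mean, post_var) and exact log evidence.
post_var = 1.0 / (1.0 + n)
post_mean = x.sum() * post_var
log_evidence = multivariate_normal(np.zeros(n), np.eye(n) + np.ones((n, n))).logpdf(x)

# A deliberately imperfect Gaussian approximation q (shifted and too narrow).
q_mean = post_mean + 0.5 * np.sqrt(post_var)
q_var = 0.8 * post_var
q_sd = np.sqrt(q_var)

def log_joint(z):
    """log p(x, z) = log prior(z) + sum_i log N(x_i | z, 1), vectorized over z."""
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x[:, None], z[None, :], 1.0).sum(axis=0)

# Monte Carlo estimates of the lower bound (ELBO) and the upper bound (CUBO_2).
S = 200_000
z = rng.normal(q_mean, q_sd, size=S)
log_ratio = log_joint(z) - norm.logpdf(z, q_mean, q_sd)
elbo = log_ratio.mean()
cubo = 0.5 * (logsumexp(2.0 * log_ratio) - np.log(S))

# Exact KL(q || posterior) between the two Gaussians, for comparison.
kl_exact = 0.5 * (q_var / post_var + (q_mean - post_mean) ** 2 / post_var
                  - 1.0 + np.log(post_var / q_var))

print(f"ELBO = {elbo:.3f}  <=  log p(x) = {log_evidence:.3f}  <=  CUBO = {cubo:.3f}")
print(f"KL(q || posterior) = {kl_exact:.3f}  <=  CUBO - ELBO = {cubo - elbo:.3f}")
```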
Thanks for your tutorial. I'd like to ask: have you tried to use GPUs and that kind of tensor-computation machinery to accelerate the computation? Yeah, so I think there's a general, sort of meta, interesting question here about how Bayesian inference, and the way that we do Bayesian inference, interacts with hardware. In some sense, when you're doing a very similar set of operations again and again, there's a natural way to interact with hardware, to have hardware that's optimized for that. I feel like the way we're exploring Bayesian inference right now is much less concentrated on that: the fundamental operations in Markov chain Monte Carlo end up being very different from what you would do for MAP, versus what you would do for variational inference, or even what you would do for expectation propagation and things like that. So it's not as immediately clear to me that there's an obvious hardware connection in the way that there is for other algorithms, but I think it's a really interesting set of questions. Thank you.

Hi, thanks for your great talk. Is there a probabilistic interpretation of these up-weighted samples that we're picking up in Bayesian coresets? Yes, I think there are so many open questions here, but one of them is definitely what the interpretation of these coreset points is. It seemed like there were some really interesting things going on just from the visualizations that we saw. I think you could also ask whether there is some way to interpret these, whether there is some interpretability aspect; certainly people think about that with k-means. All of this is to say that I don't have an answer, but I think it's an interesting question worth pursuing.

And maybe a follow-up: do you know how much outliers influence which coreset points we pick up? Yeah, so I suspect that outliers aren't extremely influential, although I'm basically just talking from my own set of assumptions right now, not from any sort of serious study. I think they're influential to the extent that they allow you to capture certain aspects of the data: hey, we got this point that was sort of perpendicular, but if that point hadn't existed, I think we would have found another point that looked like it. I don't think it's the kind of thing where we need the outliers, or that the outliers are necessarily super important. But if you're trying to capture what's going on in your data, and there is an outlier, and it really is influential and significant to your posterior learning, then I think you have to pick it up in order to really understand what's going on in your data.

I was wondering if you could say more about the embedded geometry you get by taking the data points and mapping them into this functional space. The examples you showed, where just one point would capture everything relevant, were really compelling, but do you have any more texture there? Yeah, I think the way that we have found most useful to think about this really is as a vector-sum approximation: fundamentally these functions are like vectors, and we're adding them up, and so the intuition we have about vectors often carries through here. There's probably a lot more to say there, but that has been pretty useful for us, I think. Thank you.
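One way to get a feel for that functions-as-vectors picture (my own toy illustration, using an assumed Gaussian-location model) is to evaluate each datum's log-likelihood at a handful of parameter draws, treat those evaluations as vectors, and look at how aligned they are: when most of the vectors point in nearly the same direction, a few well-chosen, up-weighted points can stand in for the whole sum.

```python
# A toy look at the "log-likelihoods as vectors" picture above (my own example):
# for a Gaussian location model, evaluate each datum's log-likelihood at a few
# parameter draws and measure how aligned the resulting vectors are.
import numpy as np

rng = np.random.default_rng(2)

# Data from an assumed model: x_i ~ N(mu, 1), with mu unknown.
N = 300
data = 2.0 + rng.normal(size=N)

# Evaluate L_n(mu) = log N(x_n | mu, 1) at S draws of mu (here, near the data).
S = 50
mus = rng.normal(loc=data.mean(), scale=1.0, size=S)
V = -0.5 * (data[:, None] - mus[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)   # (N, S)

# Cosine similarity of each datum's vector with the direction of the full sum.
full = V.sum(axis=0)
cos = (V @ full) / (np.linalg.norm(V, axis=1) * np.linalg.norm(full))

print("median alignment with the full log-likelihood:", round(float(np.median(cos)), 3))
print("fraction of data with alignment > 0.95       :", round(float(np.mean(cos > 0.95)), 3))
# High alignment is the kind of redundancy a coreset exploits: a few up-weighted
# points can closely reproduce the sum of many nearly parallel vectors.
```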
Thank you for a great tutorial. Earlier you said that the model, the prior and the likelihood, would be built together with experts; what are the ways to capture uncertainty resulting from model mismatch? Yeah, so even though we were agnostic in this tutorial about the choice of the prior and likelihood, in general we often, in fact always, think that they're going to be misspecified at least a little bit, that they're not going to be a perfect match, and so the question becomes how important that misspecification is. I think this is a really tough problem. Fundamentally, at some level, what we'd like is some notion of global sensitivity: here are all the priors and likelihoods that we could have chosen; what is the range of outputs that we would have gotten over them? Hopefully it doesn't range too much, in which case we feel like we have a robust analysis, and if it does range a lot, then maybe we feel our analysis isn't robust and we need to think about it more. One thing that's really important in this area, and this gets back to the idea of diagnostics, is just being able to tell when you have something that's robust or not. Something that we've worked on separately is getting automatic notions of local sensitivity. Now, that's not quite what we want; what we really want is this global sensitivity over all the choices we could make, but I think that's a really hard problem. So we can at least get a notion of: if I perturb things a little bit, if I perturb my prior and my likelihood a little bit, how much does that change things? And a really important point in all of this is that we want methods that are really easy. If somebody has to rerun an entire analysis, or has to make derivations to get it, I think it's just not going to happen. If it's part of a probabilistic programming language, if it's part of the analysis that you get out for free, then it can be something that's actually used, and that's what we're aiming for with these local sensitivity analyses. But overall I think this is a really important set of questions: how do we deal with model misspecification, how do we diagnose it, and how do we make sure that we're avoiding it? Thank you.
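Here is a minimal sketch (my own, not the authors' method) of what a local prior-sensitivity check can look like in the simplest possible setting: a conjugate Gaussian model where the posterior mean is available in closed form, so we can see how much a reported quantity moves when a prior hyperparameter is perturbed slightly. The perturbation is done by finite differences here; the automated versions referred to above would differentiate through the inference instead. The model and hyperparameter values are assumptions for illustration.

```python
# A minimal sketch (my own) of a local prior-sensitivity check: perturb a prior
# hyperparameter slightly and see how much a reported posterior summary moves.
import numpy as np

rng = np.random.default_rng(3)

# Data from an assumed model: x_i ~ N(mu, sigma^2) with sigma = 1 known.
x = 1.0 + rng.normal(size=20)

def posterior_mean(prior_mean, prior_var, x, lik_var=1.0):
    """Closed-form posterior mean of mu under a N(prior_mean, prior_var) prior."""
    n = len(x)
    post_var = 1.0 / (1.0 / prior_var + n / lik_var)
    return post_var * (prior_mean / prior_var + x.sum() / lik_var)

# Report the posterior mean at the analyst's chosen prior...
prior_mean, prior_var = 0.0, 1.0
report = posterior_mean(prior_mean, prior_var, x)

# ...and its local sensitivities to the two prior hyperparameters (finite differences).
eps = 1e-4
d_dmean = (posterior_mean(prior_mean + eps, prior_var, x) - report) / eps
d_dvar = (posterior_mean(prior_mean, prior_var + eps, x) - report) / eps

print(f"posterior mean of mu            : {report:.3f}")
print(f"d(posterior mean)/d(prior mean) : {d_dmean:.4f}")
print(f"d(posterior mean)/d(prior var)  : {d_dvar:.4f}")
# Small derivatives suggest the reported summary is locally robust to these
# prior choices; large ones flag a place where the prior is doing a lot of work.
```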
Hey, I wanted to ask a question from a practitioner's point of view. I don't have a lot of experience with these Bayesian methods, but was the purpose of the first part of your talk mainly to stress that mean-field variational Bayes still has a lot of limitations that we should be aware of? I'm just asking about your intention with that. Yeah, so I guess my goal is for us to understand what we're using when we use our methods. I think we could have a similar discussion about any other method that you might use; this is a sort of meta point about machine learning methods: any method that you use is going to have places where it works and it's going to have pitfalls. That's certainly true of Markov chain Monte Carlo, it's true of mean-field variational Bayes, it's true of expectation propagation. So I think it's worth taking a method that's very popular, and I would say mean-field variational Bayes is very widely used, and examining what the assumptions in it are, when it works and when it might not work, what we as practitioners should be wary of, what we should be attuned to, what the things are that we should be careful about in our own work. Something that I think is particularly important in this case is to separate out variational Bayes from mean-field variational Bayes from your particular optimization algorithm. You could do a ton of different things for your optimization algorithm, but if there's fundamentally an issue with mean-field variational Bayes, or fundamentally an issue with variational Bayes, you're never going to be able to solve it with the optimization algorithm. So what I would hope people would get from this part of the talk is what they should be aware of when they're using these methods, but also what they can expect to improve by changing aspects of those methods. Thanks for that clarification.

Let me just go back here. Yeah, so it's a Bernoulli distribution with something that looks very familiar: an exponential of minus something involving the data times theta. But the problem is that you're not going to be able to get this simple sum over the per-data functions when you multiply those particular terms together, and that simple sum is what we're really looking for. I guess I'm not totally sure what the question is; maybe we can talk offline afterwards. Thanks.

Hello, thanks a lot for a great tutorial. Something you didn't mention in your talk: I'm wondering how easy it would be, or whether it would be possible, to apply the idea of Bayesian coresets to find the coreset for multiple learning problems, because that would be even more challenging, since we have the data and then Bayesian inference for multiple learning problems on top of that. Yeah, I think it would be really interesting to apply this to a lot of other areas. Another area we've started getting into, not with exactly coresets but with this sort of data summarization technique, is Gaussian processes, so spatio-temporal models, things like that. There's a whole host of really interesting directions to go, and I agree that that's an interesting one; we just haven't explored it. Okay.

I want to ask: how much does the Frank-Wolfe coreset depend on initialization? So in this case it shouldn't depend very strongly on initialization, because you start off by finding the best vertex, the best data point; there's going to be one best data point, whatever it is, and then you go from there, and so in some sense the initialization is essentially fixed.

Thank you for a great talk. I guess this is a very broad question, but I find that this coreset idea is actually pretty close to the active learning setting, where you're trying to get a few examples from an unlabeled data set, and it seems like the best methodologies right now are actually geometry based or topology based. Was there an inspiration from those works that led you to do this, or did you just stumble onto something that also seems topological? Yeah, so in our case the inspiration was entirely from the coresets literature in computational geometry and theoretical computer science. It turns out that a lot of the people who originated this literature are also at CSAIL at MIT, so it's easy for us to go around the corner and talk to them, and that was very convenient. But I'm not surprised that there are connections that look like this; I think there's something that seems very fundamental and very natural about this kind of idea, and I think there are other connections to other parts of statistics and machine learning that are worth understanding better. I think that's an interesting direction going forward, and I hope other people will get engaged in this as well. From your team, has nobody actually looked into active learning? Yeah, we have not looked into active learning, but it sounds like a cool connection.
Thank you for a great talk. I have a question about, for example, wanting to take a next step and go from the prediction of some distribution to the optimization of some parameters that affect it. For example, you have already identified that you can predict, say, an ambulance arrival time of around 10 minutes with a certain variance, but now you want to decrease the mean and decrease the variance. How can you approach this problem within the variational Bayes framework? So, one thing I would say is that here we were pretty focused on understanding the posterior given a prior and a likelihood, but certainly you could do something more active, where we actually try to decrease these things by changing something, picking up data points, things like that. This direction of looking at Gaussian processes is certainly related to things like Bayesian optimization, where we pick up points and then learn more about our functions. There might be more connections beyond that; I'm not sure if that answers your question. Thank you.

Hey, how's it going? I just had a question about whether you had any opinions or thoughts on this: there's been some recent work in MCMC where you do mini-batch MCMC, and MCMC has asymptotic guarantees, which are nice, but with variational inference you know when you've converged, so there's a trade-off. I was wondering if you have any thoughts on that. Yeah, so my understanding is that Michael Betancourt had a paper saying that if we do mini-batches with MCMC, at least under, I assume, certain conditions, the mini-batches have to be something like a constant times the size of the whole data set, which seems limiting for the purposes of scalability. Now, there may be some nuance there that I'm missing, but that's at least my understanding of the state of mini-batch MCMC.
So it's my understanding that the MAP and mean parameters and covariance matrices of parameters all depend on the parameterization that you pick, which kind of comes from on high, and it seems like you evaluated a lot of your work based on your ability to estimate these means and covariance parameters. Is that a meaningful way to evaluate it, or do you think you should be looking directly at the distributions that you're estimating? Yeah, so I think there are a couple of interesting questions here. One is: what is the right way to look at a Bayesian posterior when we have that posterior at the end of the day? I think there are challenges in this. I agree with you that we'd really like to look at the full Bayesian posterior, and I think there are fundamental challenges to that when we get into high-dimensional posteriors; honestly, even anything over three dimensions is hard to look at and hard to query. At least one way that people deal with this is that they look at means and covariances. Why the posterior mean? Well, it has a lot of nice properties: for instance, it's the Bayes-risk-minimizing estimator, and we know that it has nice statistical properties, it's admissible from a frequentist perspective, so at least it seems like there are reasons we might be interested in a posterior mean. I certainly think that's not the only thing we're interested in. Something that people often report, certainly on the uncertainty side, and we even saw this in Bailey Fosdick's work, is some form of quantiles, something like a credible interval. And I think we could equally start from a talk where we ask ourselves: suppose that instead of caring about L2 error, which brings us to the mean, we care about L1 error; then maybe we think about posterior medians, and maybe posterior quantiles, and then we can think about the extent to which we can guarantee that we're getting good estimates of those. I think that's a tough problem, especially once you start getting into tail probabilities; that's something we as a community don't fully grasp and understand and know how to work with. More than that, can we get beyond these summaries? Is there a really good way to look at a Bayesian posterior distribution? I suspect part of the answer is application dependent: what are we trying to do at the end of the day? Something that motivates us a lot is that we imagine we're making some kind of decision, and in a decision-theoretic framework means are nice, again from a risk perspective, but also means and uncertainties help us make decisions: if we have a mean and an uncertainty within some range, then maybe we make a decision based on that. We saw that earlier in something like the ambulance example or the microcredit example. But for whatever purpose we want to use the posterior, I think that's what should really inform what we're trying to report and what we're trying to do, and then we should ask ourselves, from a theoretical perspective, whether we are reporting something that we feel confident in.

So it sounds like you're talking about means in your output space, and I think that all makes sense, but I was talking about means in the parameter space. And I understood that; that is what I meant: I meant posterior means, so in the posterior we have a distribution over theta and we report a posterior mean over that. I'm thinking about it from a supervised learning perspective. Yes, okay. So everything is just these latent variables that you're trying to predict; some of them, if they're something like how fast an ambulance will arrive, certainly have an intuitive interpretation, but if they're just arbitrary parameters in a Bayesian model, some of them can't really be interpreted in any kind of literal way. Yeah, in that case, though, I don't suspect that we care too much about their values, period. Then we have to ask what we're even doing with them; if we're not doing anything with them, then maybe we don't have to report the posterior marginal in that direction. Thanks.
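To ground those summaries (my own toy example, not from the talk): given posterior draws for an arrival time, the mean, median, and a credible interval are one-liners, and a decision rule can be written directly as an expected-loss minimization over the same draws. The Gamma posterior, the loss values, and the lateness threshold below are assumptions made purely for illustration.

```python
# Toy illustration (my own) of reporting posterior summaries and using them in a
# simple decision, as discussed above. Pretend these are posterior draws of an
# ambulance arrival time in minutes (the Gamma posterior is an assumed stand-in).
import numpy as np

rng = np.random.default_rng(4)
draws = rng.gamma(shape=25.0, scale=0.4, size=50_000)   # mean around 10 minutes

# Common posterior summaries: mean (L2-optimal point estimate), median
# (L1-optimal), and a 95% credible interval.
lo, hi = np.quantile(draws, [0.025, 0.975])
print(f"posterior mean   : {draws.mean():.2f} min")
print(f"posterior median : {np.median(draws):.2f} min")
print(f"95% interval     : ({lo:.2f}, {hi:.2f}) min")

# A decision-theoretic use of the whole posterior: choose the action with the
# lowest expected loss under these draws (the losses are made-up numbers).
def expected_loss(action, t):
    if action == "dispatch_extra_unit":
        return float(np.mean(5.0 + 0.0 * t))             # fixed cost of sending backup
    return float(np.mean(np.where(t > 12.0, 20.0, 0.0)))  # penalty if arrival is late

actions = ["dispatch_extra_unit", "wait"]
print("expected losses  :", {a: round(expected_loss(a, draws), 2) for a in actions})
print("chosen action    :", min(actions, key=lambda a: expected_loss(a, draws)))
```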
Okay, so any other questions? Otherwise I have a couple of questions myself. The first one: we have been talking about data summarization, and if you manage to reduce your data set size to a level where you could actually run MCMC instead of variational inference, I was wondering whether you have, or there is, any work studying how using the data summarization instead of the whole data set would affect the convergence properties of the MCMC, whether you can still assume that asymptotically you are converging to the true posterior, which I believe you won't be, and whether you could quantify how far you are. Yeah, so for the second part I feel more confident answering. I think the types of guarantees that we're getting are exactly with respect to the exact posterior: we're saying how close we are to the exact posterior mean and how close we are to the exact posterior covariance. So we're not going to converge to the exact posterior, but we're going to converge to something where we have guarantees on how close we are in these summary statistics. From that perspective, if that's the thing we care about, then we're good, and if there's something else that we care about, relevant to that last discussion we were just having, then we should check how close we are in that. This is also the approach we're taking in that other paper I mentioned that will be appearing at ICML; let me go back to this idea that maybe we don't have to converge to the exact posterior as long as we're close in the thing we care about, the thing that we're reporting at the end. We take that approach in the minimal IMAP paper as well: we say, hey, let's do MCMC on this thing that isn't the exact posterior, but for the things that we're going to report, things like edge probabilities at the end, as long as we can bound how close we are, then we feel like we're okay, if that's the thing we care about.

Now I'm trying to remember what the first part of the question was. Oh yes, what do we know about the properties of MCMC when we run it on this? That's a really interesting question, and there's a question baked in here that, now that you say it, I'd like to reflect on more. When we have smaller data sets, certainly a trivial aspect of why MCMC runs faster is just that any individual step in MCMC is faster; the iteration is faster because we don't have to go over the whole data set. But there are two aspects of MCMC: there's the cost of the iteration and there's the mixing time. So does this affect the mixing time as well? I wouldn't be surprised if it might, because maybe by having this smaller, simpler data set you have a different sort of geometry that you're exploring, but I couldn't say anything hard and fast about that right now. I think it would be interesting to understand.
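Here is a tiny sketch (my own) of the kind of check described above: compare the summaries you actually report under a weighted-data posterior against the exact posterior, in a conjugate model where both are available in closed form. The uniform subsample with weight N/m below is a naive stand-in for illustration, not the coreset construction from the talk, which chooses points and weights far more carefully.

```python
# Compare reported summaries (posterior mean and sd) under a weighted-data
# posterior against the exact posterior in a conjugate Gaussian model.
import numpy as np

rng = np.random.default_rng(5)

# Model: mu ~ N(0, 1), x_i | mu ~ N(mu, 1). Exact posterior is Gaussian.
N = 10_000
x = 0.7 + rng.normal(size=N)

def weighted_posterior(xs, ws):
    """Posterior mean and sd of mu when datum i enters with weight ws[i]."""
    var = 1.0 / (1.0 + ws.sum())
    return var * np.dot(ws, xs), np.sqrt(var)

exact_mean, exact_sd = weighted_posterior(x, np.ones(N))

# Naive "summary": m uniformly chosen points, each up-weighted by N/m.
m = 100
idx = rng.choice(N, size=m, replace=False)
approx_mean, approx_sd = weighted_posterior(x[idx], np.full(m, N / m))

print(f"exact posterior : mean = {exact_mean:.4f}, sd = {exact_sd:.4f}")
print(f"weighted subset : mean = {approx_mean:.4f}, sd = {approx_sd:.4f}")
print(f"errors          : {abs(approx_mean - exact_mean):.4f}, {abs(approx_sd - exact_sd):.4f}")
```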
Related to the type of data: for now these data summarization approaches seem to work for iid data, where you can factorize the likelihood. Do you think, or do you have any idea of, how this approach could be extended to non-iid data that has, for example, some temporal component going on? Yeah, so let me first say that I'm very sanguine that this kind of thing should be able to be extended; that's the whole idea of inducing-point methods, right, that you can have these summaries of your data even in cases where you have highly complex dependencies in your parameter spaces, these spatio-temporal models, things like that. I don't think that what we're doing is specific to iid data. Again, something that makes me sanguine about this is that, as long as there are cases where you think there's fundamentally redundancy in your data, you should be able to exploit it in this kind of way. The thing that I am less clear about, and this relates to my final question at the end of the talk, is: sure, we can do these things, we can summarize our data, and if there really is data to summarize I think we can make strides, but how well can we really expect to do? If I really have a Gaussian process with all of this data that's very spread out and really doing different things, maybe I shouldn't be able to do better than the fully correlated covariance matrix and everything. So I think it's worth understanding, and I'd like to understand better, what the fundamental limitations are that we're facing in these problems, and I think that's a very related question to this type of problem. Okay, thank you very much.

So I believe that, if there is no other question, it's a perfect time to close this. Okay, thanks for taking the last one. I understand that computational effort is one of the main drivers for the methods that you were kind enough to share with us today, and that some of the hardware aspects from your previous answers were also relevant to discuss. My quick question is: when can I adopt your methods on Spark, and is there an open-source toolkit that I could already pick up? Yeah, so in terms of open source, and thanks for asking that, now I can plug us a little bit: the code is all online, as mentioned here. We're working with TensorFlow right now to get the coresets into, basically, BayesFlow and TensorFlow, and we've started talks with Stan about doing something there. Beyond that, if somebody's interested in talking with us, we're interested to talk more about integrating this somehow. Let's thank Tamara again. Thank you.
Info
Channel: Steven Van Vaerenbergh
Views: 20,626
Keywords: icml 2018, machine learning, icml2018
Id: Moo4-KR5qNg
Length: 137min 50sec (8270 seconds)
Published: Tue Sep 04 2018