MLSS 2019 David Blei: Variational Inference: Foundations and Innovations (Part 1)

Captions
Okay, I want to start by telling you a little bit about the area of machine learning that I work in, broadly. It goes without saying that we have complicated data that we want to make sense of, and while you're sitting here, new datasets are probably appearing on your hard drive. Complicated data can come in many forms. We might be scientists, like physicists: we build big telescopes and measurement devices, we collect terabytes and terabytes of data about the universe every second, and we want to make sense of it. Maybe we invented a social network, and much of the world joined it and started communicating, sharing things, and connecting with friends, and in the data of that social network we have a snapshot of communities unfolding and interacting that we want to make sense of. We'll talk a little during these lectures about text data: we have a library's worth of books and articles, all in electronically readable form, and we're a historian, say, and we want to make sense of all of the books and articles we've collected. Or maybe we're an economist and we have data about people shopping at big supermarkets: millions and millions of people going on millions of shopping trips, and we want to understand something about how people purchase items, how they're sensitive to price, what kinds of items go together, or just something about the economics of choice. These are just four examples; I imagine in your field you have, on the tip of your tongue, some kind of complicated data that you have access to and want to make sense of.

So let's dig into this almost vacuous statement. First of all, what does it mean for data to be complicated? Naively, data is more complicated these days because it's just bigger: we have more data points (terabytes of data from the astronomers), and there are many dimensions, so we might measure many different things about each data point. That makes data more complicated than, say, 30 years ago. But data can be complicated in other ways too. It might be unstructured. When we have text data, for example, I think of that as unstructured data: even though one can represent it as a matrix, it's not naturally represented as a matrix. It's naturally a bunch of documents, and each document is a collection of words, and your grandparents' matrix of data, of rows and columns, doesn't really apply to something like text. Another way that data is more complicated, and again different from your grandparents' datasets, is that it's multimodal and interconnected. Thinking about the social network: the data is millions of people, and they are connected with each other in different ways. There's a graph on these data points, and each one is communicating with the others through messages that are sometimes sent to subsets of them, sometimes to all of them; they're sharing links and text and images and so on. The multi-modality, the fact that there are many different types of information in these data, and the interconnectedness also make them complicated. So the point is that data is complicated in many ways, not just that it's large.

Now, what does it mean to make sense of data? Well, the classical machine learning
answer to that question is that making sense of data means making predictions about the future. I ingest a bunch of data and I want to know what the next piece of data is going to look like: I get a bunch of communication on Facebook and I want to predict what the next communication is going to be. That's one way, and Shakir mentioned finance a bit; in finance, of course, making predictions about the future is probably a big part of what it means to make sense of data. But there's more to making sense of data than prediction. We might also want to identify interpretable patterns. If I'm the historian and I have this big collection of text documents, I want to know what patterns of language permeate through history, I want to form my historical theories based on those patterns, and I need those patterns to be interpretable. I want to ingest all this text and find interpretable patterns of words that recur in the text (can you hear me okay?) and be able to relate them to things in the real world. That's different from prediction: I might be able to make a prediction without being able to interpret why or how that prediction is made. And then, most ambitiously, which we heard about this morning from Bernhard, is to do science. I want to ingest this data and not just interpret the results of my machine learning algorithm, not just make predictions of the future, but understand the world as it really exists through the data. I want to understand how the universe works from the data I've collected about the stars, and that is causal inference: I want to confirm, elaborate, maybe hypothesize and form causal theories about the world from my data. At Facebook, for example, I might be a sociologist who wants to understand how people work and how communities are formed; these are causal inference problems, different from prediction problems. So these are all different ways that we might want to make sense of data, and the field of machine learning that I work in, probabilistic machine learning, is about answering all of these kinds of questions by developing machine learning methods that connect our domain knowledge, assumptions, and theories to the data that we're trying to analyze. Probabilistic machine learning looks a lot like applied Bayesian statistics, if you're familiar with that, and what it provides is a methodology for scalably modeling large, interesting, interconnected, complicated datasets in a way that both acknowledges our assumptions about the data and lets the data speak, lets the data inform what we then learn. You'll see concrete examples of this through these talks. Our goal, broadly, in probabilistic machine learning is to develop this methodology so that it's expressive, meaning that we can encode many different types of domain assumptions, say about the universe or about people, into our methods; scalable, so that we can use these methods on very large and complicated datasets; and easy to develop, which means that we can quickly build models and compute with those models without having to do a lot of deep mathematical work. And of course this builds directly on what Shakir has told you about already for three hours
today. Okay, so I want to show you some examples of the results of probabilistic machine learning. All of these examples come from research in my lab, just because I have those pictures handy, and in my lab we've worked a lot on the "finding interpretable patterns in big, unstructured datasets" aspect of this work. For example, we might ingest a big network — in this case it's patents citing each other — and we want to understand what communities exist in that network. Here's a picture where in comes just the network, what is connected to what, and out pops this color coding of the network in terms of communities: densely interconnected collections of nodes. We'll talk about that application later on. These are topics found in 1.8 million articles from the New York Times, so this is the historian analyzing the library. Here what comes into the probabilistic machine learning algorithm is a bunch of text documents, and what pops out are called topics. We'll talk about them more formally, but loosely, a topic — you can maybe see one here: "team, second, race, round, cup, open, game, play, win" — is a set of words that seem to go together under a single theme, discovered by the algorithm. In come all the documents from the New York Times, 1.8 million of them, and out pops this idea that a bunch of them somehow involve sports, and here are words that go together. None of the documents are labeled as being about sports or not; it's an unsupervised learning method. Here's a related method that analyzes genes. This is a population analysis of two billion genetic measurements, and it's really just a pretty picture because it's got lots of colors, but the idea is: we all spit into a test tube and send our test tubes to 23andMe or whatever, they sequence our genes, and then this is an analysis of a big collection of genomes, with the idea that each person somehow mixes different ancestral populations. Millions of years ago, ancestral populations roamed the earth, reproducing, and it all bottomed out in us in this room, and each of us exhibits those populations to different degrees. This kind of algorithm tries to uncover what those populations are and model each person as mixing several of them. Here is a neuroscience analysis of a big fMRI dataset. Totally different: we are ingesting fMRI data. People get put into fMRI machines and are shown pictures of things like shoes and chairs, which is pretty fun, and the scientists measure what's going on in their brains while they're in the machine. What the model does is identify the locations in the brain that are most active under different types of stimuli. I actually don't remember what this picture even means, but it's something like: if you're looking at chairs, then the areas of the brain marked in blue are on, and if you're looking at shoes, then the areas marked in red are on. The algorithm just takes raw data from these little tiny boxes of brain activity and figures out which areas seem to be most active and when they are active. Here's the important problem of analyzing taxi trajectories. Here's an analysis of
declassified cables, using things like topic modeling to identify world events; we'll get to this later too. And here's the economic model, the model of people shopping, where we're learning latent features of items in the supermarket. This is work we did with an economist, and this picture tells you that crackers and cheese go together well, so this was a really big result in economics. So that was a whirlwind tour of lots of different types of results. The reason I showed you all those pictures of the types of patterns we learn with probabilistic modeling is to be able to show you this picture, which I call the probabilistic pipeline. This is the basic set of activities that we do when we develop probabilistic machine learning algorithms. The idea is that we start off with our knowledge of the world and some question we want to answer about the world. We have knowledge and a question — or the knowledge could be assumptions — and we use that knowledge, those assumptions, and the question to build a model. I'll get to graphical models in a little bit, but the model is basically a probabilistic model of hidden and observed variables: there are some hidden random variables and some observed random variables, and we build a joint distribution of hidden and observed random variables that somehow reflects both the question we're trying to answer and the knowledge we have of the world. That's the model. Then, given our dataset — say the gene sequences — and our model of how these gene sequences mix unobserved ancestral populations, we uncover the patterns in the data. We take our data and our model and we somehow square them off; more formally, we infer the hidden variables given the observed variables, and that's discovering the patterns. Then we use those discovered patterns to form predictions and explore the world through our analysis. I like this picture because it separates these activities, and we're going to separate them today as well. One activity is model building: what is the joint distribution of hidden and observed random variables that I want to use? Another activity is algorithmic: given that model and my big dataset, how do I infer the hidden structure that lives in the model? If I'm given a big social network, how do I infer those communities that I showed you? If I'm given a big collection of New York Times articles, how do I infer those topics? That's somehow squaring the model and the data, as I said. The last activity is to use the results of that inference; that's where things like causal inference, exploring data, forming predictions, and building real machine learning artifacts come in — that uses the results of these inferences to do something useful. I like this picture because it separates those activities, and it makes it easy to collaborate with domain experts on machine learning problems, on developing new machine learning methods. But what we're mainly going to talk about for three hours — or forever, because the battery will never run out — is posterior inference. The key algorithmic problem lives in this box, "discover patterns". That answers the question: given my model of how communities form and how a social network reflects communities, what does it say about this particular social network that I'm analyzing? What are the communities in that social network? Or it answers the question: what are the topics in this dataset? What does this model
that I develop, based on my assumptions, say about this data that I care about? That's posterior inference. As I mentioned before, our goal is to make posterior inference scalable, but also to make it general: we want algorithms that we can use with many, many models, and we want algorithms that we can use on very large datasets. Why do we want that? Well, I drew this pipeline as though it has an end: it ends with predicting and exploring. And when you have to work really hard to develop algorithms, then you are committed to writing a paper about whatever it is you just did. But if we had easy ways to do posterior inference — easy ways to solve this problem of conditioning on the observations and estimating the hidden variables — then what we really want is to think about this loop, which I call Box's loop, named after George Box, a great statistician. We make assumptions about the world, we discover patterns based on our data and those assumptions, we use those discovered patterns to do something — form predictions, explore, do causal inference, whatever — and then we look critically at what just happened and ask: how did things work? Did they work well? Did they not? And then we use what we learned in that criticism phase to revise our assumptions. I call this Box's loop: to go around and around, continually building models, analyzing data, and checking the results in order to revise the model. If the bottleneck is the mathematical work needed to develop posterior inference algorithms, then scalable and general inference algorithms will help lubricate this pipeline. (I don't know what that just was.) We're about to start with the introduction; by the way, please interrupt me if there are any questions at any time. I'm definitely only going to speak as long as they let me, but I'm happy not to get through all my slides if it means that what we do get through is clear, so please feel free to interrupt me.

Okay, so that's our goal: to develop posterior inference algorithms. Let's get a little more formal. You've already seen this, but a probabilistic model, as I mentioned, is a joint distribution of hidden variables z and observed variables x. We're sometimes going to have beta as a hidden variable as well, but for now just z and x. We have a joint distribution of z and x; that's a probability model. Inference about the unknowns, the hidden variables, happens through the posterior distribution, which is the conditional distribution of the hidden variables given the observations. That's this equation here, p(z | x), which, as you know from the definition of conditional probability, is the joint p(z, x) divided by the marginal distribution of the observations, p(x). Now, for most interesting models, for most models that we care about, that denominator is not tractable to compute; we can't efficiently calculate it, or can't calculate it at all, and so we have to appeal to approximate posterior inference. We have to play the game of trying to approximate this conditional distribution: given my model and my data, what is p(z | x)? Shakir, Rajesh Ranganath, and I came up with this next picture when we were developing the tutorial that Shakir mentioned, and it really says everything we need to say about variational inference. Variational inference is an approach to approximating that conditional distribution.
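To have the target on the page in symbols — this is just a restatement of the equation described above, nothing new:

    p(z \mid x) = \frac{p(z, x)}{p(x)}, \qquad p(x) = \int p(z, x) \, dz,

and it is the integral (or sum) over z in the denominator that is typically intractable.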
I want to pause here for a little while and meditate on this picture. Here is the basic idea behind variational inference. Variational inference solves that inference problem — when I say inference I mean calculating p(z | x) — with optimization. For those of you familiar with MCMC: if I had a slide about MCMC, it would say "MCMC solves inference with sampling". Here we solve inference with optimization, and the idea is as follows. Imagine that every point in the space on the slide is some distribution over z, some distribution over the hidden variables. My model is fixed, p(z, x) is fixed, I'm interested in p(z | x), my data is fixed too, and every point on the slide is a distribution over z. The idea behind variational inference is the following. First, I'm going to set up a family of distributions over z, and I've outlined that with an ellipse here. There's this big space of distributions, but there's a subset of those distributions, a family, inside that ellipse. Each point in that ellipse is a distribution over z, and it's parameterized by variational parameters, which we're denoting nu. So every point in this ellipse — let me just move this chair — every point in this ellipse is a distribution over z and corresponds to a particular value of the variational parameters nu. Is that clear? Now, our target distribution is p(z | x). It's a distribution over z, so it's somewhere on our slide, and there it is; it lives outside this family, but that's what we're trying to approximate. The idea behind variational inference is to find, within this family, the member — which I'm denoting nu-star here — that is closest to the conditional distribution we care about, p(z | x), the posterior. I want to find the nu-star such that the corresponding distribution over z is closest to p(z | x), with closeness measured by the KL divergence, an information-theoretic measure of closeness between two distributions. So the idea is to start at some initial set of variational parameters, nu-init, and then to optimize — to minimize that KL divergence — to find the point closest to p(z | x). Now, I'm omitting all kinds of details here, such as the fact that this isn't directly possible, but this is the idea. That's it. I want to say more, but I just want you to stare at that picture longer and burn it onto your eyeballs. I have a family of distributions over z, it's parameterized by nu, I start with an initial nu, and I run an optimization procedure to get to the nu that is closest to the exact posterior that I care about. Question? Correct: we know p(z, x), but we don't know p(z | x). (What's your name, by the way? Say it again?) Yes, that's one of the details I'm skirting over, but I will get to it. Right, so he makes a good point: if I don't know p(z | x), how on earth do I optimize something that depends on it? And the answer is, well, luck. Other questions? (What's your name? Alan?) I'll get to that. Another question — it relates to the earlier question — the answer is that we can write down this optimization even though we can't calculate the KL itself, and there is current research on using other ways of measuring distances between distributions. The reason I wanted you to stare at this picture for so long is that when you start looking at the arXiv — which is sort of a mixed blessing — you can take lots and lots of the papers about variational inference that are coming out and situate them inside this picture.
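Restating the picture as an optimization problem, with q(z; nu) denoting a member of the variational family:

    \nu^{*} = \arg\min_{\nu} \; \mathrm{KL}\big( q(z; \nu) \,\|\, p(z \mid x) \big), \qquad q(z; \nu^{*}) \approx p(z \mid x).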
A brief answer to the question about why it's called "variational": loosely, "variational" means any time I'm turning some calculation that I want to make, and that is hard, into an optimization, and that's what we're doing here. And note that if I could enclose the posterior within the family, then I wouldn't really have a problem — I could calculate the posterior — but then the family would be too hard to work with. Okay, so here's an example just to make things concrete. Let's say we're trying to fit a mixture of Gaussians. I have a clustering model of Gaussians, my hidden variable is the cluster assignment of each data point, and I'm trying to find the clusters. Here on the left is my original data. The data here have colors, but they don't really have colors; the algorithm just sees the raw data, and our goal is to cluster this data. Let's say I magically know that there are five clusters, or something like that. What we're plotting here is the posterior predictive distribution of the next data point: if I'm going to use my posterior to form a prediction, this is the prediction of where the next data point is going to lie. When I run variational inference, what you can see is that at first I have no good sense of where the next data point is going to come from, and as variational inference proceeds, my estimate of the posterior predictive distribution — the place where the next data point is going to be — gets better and better and better, and here I have indeed, through the posterior predictive, captured the distribution of the data. In these pictures, just look at this one: the evidence lower bound. This is like the KL divergence, where higher means lower KL divergence; the higher this is, the closer my approximate posterior is to the exact posterior. What you can see is that we start out with nu-init, then we get better and better and better, and then we converge somewhere. We don't know how far away from the exact posterior we are, but we've converged somewhere closer to the exact posterior than where we started. That's the idea, and the main takeaway from this slide is: we have turned inference into optimization.

Okay, so what are we going to talk about today and tomorrow? Again, variational inference solves the posterior inference problem with optimization, and what I want to talk about are the basics of variational inference; how to scale variational inference up to massive datasets, which is an algorithm called stochastic variational inference; and then how to develop variational inference so it is easily applied to a wide class of difficult models. When I talk about the basics, we're going to constrain the model class, and later we're going to expand the model class, and that relates to what Shakir was discussing with things like normalizing flows and reparameterization gradients and so on, but we'll get to that tomorrow. There are no prerequisites, of course, but what I'm hoping you are familiar with is a little probability — terms like the joint distribution and the conditional distribution I hope are familiar to you, as are terms like expectation and conditional expectation, which Shakir also went over — and a little optimization, by which I just mean knowing what optimization is; we're going to talk about gradient-based optimization and coordinate ascent optimization.
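Going back to the mixture-of-Gaussians demo for a moment: here is a minimal sketch of the same kind of experiment using scikit-learn's BayesianGaussianMixture, which fits a Gaussian mixture with mean-field variational inference and exposes the final value of the evidence lower bound. This is my own illustration on synthetic data, not the demo from the lecture, and the attribute names assume a recent scikit-learn version.

    # Fit a Gaussian mixture with variational inference and inspect the ELBO.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.mixture import BayesianGaussianMixture

    # Synthetic 2-D data drawn from five well-separated clusters.
    X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)

    vi_gmm = BayesianGaussianMixture(
        n_components=5,          # we "magically know" there are five clusters
        covariance_type="full",
        max_iter=200,
        random_state=0,
    )
    vi_gmm.fit(X)

    print("converged:", vi_gmm.converged_)
    print("final ELBO (lower bound):", vi_gmm.lower_bound_)
    print("posterior mixture weights:", np.round(vi_gmm.weights_, 3))
    print("cluster assignment of first point:", vi_gmm.predict(X[:1])[0])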
It would also be good if you knew a little bit of Bayesian statistics, but you don't have to be a Bayesian. Are any of you Bayesian? Just one or two? Yeah, I'm kind of a half Bayesian. What we will learn about is, like I said, the basics of variational inference, which include mean-field variational inference and coordinate ascent optimization for variational inference; then we'll talk about stochastic variational inference, to scale it up, and black box variational inference, which generalizes it. And in order to make our conversation more concrete, I want to talk about real models along the way. Again, I like to divide modeling and inference, so sometimes we will discuss some models, then we'll discuss inference, and then I'll show you how the inference applies to those models. In particular, we'll talk about latent Dirichlet allocation and topic models, deep exponential families, embedding models of consumer behavior — that's the model that tells us that crackers and cheese go well together — and then deep generative models, which are a way to do Bayesian deep learning.

Now, to the question about why it's called "variational": I gave one answer, but there's a historical reason too. The history of this set of ideas is that machine learning researchers, and physicists who got interested in machine learning, started adapting ideas from statistical physics to probabilistic inference. Two physicists, Peterson and Anderson, in 1987, fit a neural network with mean-field methods; this was back when neural networks were popular. The idea was then picked up by several machine learning researchers in Mike Jordan's lab in the late '90s — Mike Jordan, Tommi Jaakkola, Lawrence Saul, Zoubin Ghahramani — who read that paper and started generalizing variational inference algorithms to other probabilistic models, and a really good review paper, where I learned about this stuff, is Jordan and colleagues' review paper from 1999. At the same time, Geoff Hinton, David MacKay, and others — Chris Bishop as well — started developing variational methods for neural networks again and for other models. So all of this was happening in the 1990s. Why are we talking about it today, well past the 1990s? Because there has lately been a new flurry of work on variational inference: making it scalable, easier to derive, faster, and more accurate. Variational inference nowadays touches many areas of machine learning, not just classical Bayesian modeling: probabilistic programming, reinforcement learning, neural networks, convex optimization, and Bayesian statistics all involve variational inference. So what I want to do today is talk about some of these early ideas, but then quickly get to the newer ideas and try to tell you a little bit about why they might be important. I want to mention that, to the degree that I'm going to talk about work from my lab, this is work with some of my collaborators from the last 10 years or so: Matt Hoffman, Rajesh Ranganath, and Alp Kucukelbir.

Okay, so let's start with the basics of variational inference, but, again, I want to do this in the context of models, so I want to first talk a little bit with you about topic modeling. What's the problem of topic modeling? Topic modeling is about the following: I have a big collection of documents and I want to use posterior inference to discover the hidden themes or topics that pervade those documents,
because I just want to use those topics for some prediction, or because I want to know what's going on in this big collection — I'm a historian and I don't have time to read them all. The model I want to use as an example model for variational inference is called latent Dirichlet allocation (LDA), and the idea behind this model is the following. When we build probability models, we have an intuition about what kinds of assumptions we might want to make, and here the intuition is that documents exhibit multiple topics. So here's a document — this is from the journal Science, it's called "Seeking Life's Bare (Genetic) Necessities" — and what this article is about, at a high level, is that scientists used computational methods to understand the number of genes an organism needs to survive, in an evolutionary sense, over millions of years. What I've done by hand is highlight different words in this article, just to solidify this intuition behind LDA for you. Words like "organism", "life", "survive" — words about evolutionary biology — are highlighted in pink. Words like "genes", "genomes", "sequenced" — words about genetics — are highlighted in yellow. Words like "predictions", "computer analysis", "computational", "numbers" — words about data analysis — are highlighted in blue. Now, imagine I took the time to highlight every word in this article with its appropriate color, threw out words like "of" and "the" and "but" that have no semantic meaning, and then squinted at the article — you can do it here if you like. What you'd see is a big blend of blue and yellow and pink, and so while you wouldn't know exactly what the article is about, you would see that it somehow blends genetics, evolutionary biology, and data analysis. That's what LDA is trying to do: it's trying to construct a representation of these articles so that you can squint at them and say loosely what they are about. So in building the model, we're going to embed that intuition into a formal generative process of data. One way to articulate the assumptions that a probability model makes is to describe the generative process that it assumes the data came from — every probability model comes with a generative process that it could use to generate data — and so we want a generative process that captures the assumption that documents exhibit multiple topics. Here is what it is for LDA. First, let's define a topic more formally as a distribution over the terms in a vocabulary. Here on the left are a bunch of topics: here's a topic that has words like "data", "number", "computer" with high probability (do you hear it cut out? but you can hear me anyway — okay); here's a topic that has words like "brain", "neuron", and "nerve" with high probability; here's a topic that has words like "life", "evolved", "organisms" with high probability; and so on. These are topics. To be clear, each topic is like a die with the entire vocabulary on its faces — each face has a vocabulary term — so the word "nerve" has some low probability under this topic. Now we have our topics, and the generative process is: for each document, I choose a distribution over the topics.
So here — oh, I got one; I just needed a little exercise — here I drew a distribution over topics that has pink, yellow, and blue. Then, for each observed word in the document, I choose a colored button from this distribution. Here I chose the blue button, I look up the topic associated with the blue button, and I choose a word from that topic: the word "analysis". Here I chose the yellow button and then the word "genes"; here I chose the pink button and the word "life". Each time, I'm drawing the word from the corresponding distribution over words. And so — check, check, check — you write your article that way. Then you turn the page, you're going to write the next article in Science, and you choose a new distribution over topics. What's important is that the same set of topics is at play; it's just that the new article mixes them with different proportions. Clear? Okay, so that's the probability model. What's the posterior inference problem? The problem is that this is our fantasy of how the documents arose, but really we only get to observe the documents. All of this other stuff is hidden structure, stuff we can't observe, and so the posterior inference problem is to fill in all of the blank latent variables given the observations, the documents themselves. That is the posterior: the probability of the topics, the topic proportions, and the topic assignments, given the documents. I went over that a little slowly, but notice that I turned the problem of discovering topics from a document collection, which is kind of a vague, amorphous problem, into the calculation of a posterior distribution under a formal probability model. Also notice that we're going to want to do this at large scale: there are millions of documents, and that means there are billions of latent variables.

In machine learning we like to use graphical models to represent probability models. A graphical model basically represents a factorization of the joint distribution of the hidden and observed random variables at play — in this case the topic proportions and so on. Graphical models are a whole field; we could have a lecturer come here to talk about graphical models and all of their implications, but the idea is that a graphical model encodes assumptions about data, connects those assumptions to algorithms for computing with the data, and essentially defines the posterior we care about through the joint: the graphical model is a picture of the joint distribution, and so it defines the posterior. You can read the generative process off of this graphical model for LDA. Nodes are random variables; edges mean that there's some kind of dependence between random variables in the generative process; and these rectangles are called plates, and they denote replication. In this case, the generative process for LDA is: first choose a bunch of topics, beta_k — each beta is a distribution over terms, these are the topics, and there are K of them; then, for each document, choose the topic proportions theta from some distribution (it's a Dirichlet distribution); and then, for each word in each document, choose a colored button from theta and then choose the word from the corresponding topic that the colored button points to. So this is the same picture of the generative process, but as a graph.
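Here is a small simulation of the generative process just described. It is only an illustration: the vocabulary size, number of topics, document lengths, and Dirichlet hyperparameters below are made up, and real LDA inference of course goes in the other direction, from documents back to topics.

    # Toy simulation of the LDA generative process (illustrative values only).
    import numpy as np

    rng = np.random.default_rng(0)

    vocab_size, n_topics, n_docs, doc_len = 1000, 4, 3, 50
    eta, alpha = 0.1, 0.5        # assumed Dirichlet hyperparameters

    # Topics: K distributions over the vocabulary, shared by all documents.
    beta = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

    documents = []
    for d in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))  # per-document topic proportions
        words = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)            # choose a "colored button"
            w = rng.choice(vocab_size, p=beta[z])        # choose a word from that topic
            words.append(w)
        documents.append(words)

    print(documents[0][:10])     # first ten word ids of the first synthetic document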
This graph connects the model to algorithms. Notice also that everything is hidden except for the words: all I get to see is the words, and everything else is hidden. I can't remember if I mentioned it, but shaded nodes are observed and unshaded nodes are hidden. So again, with LDA, the posterior of the latent variables given the documents is here: it's the joint divided by the marginal, and the problem is that we can't compute that marginal distribution, so we're going to use variational inference to approximate this posterior. I haven't talked about the algorithm yet, but suppose we've got the inference algorithm and we can approximate the posterior — what is this algorithm going to do? Here is a picture of a real analysis: we took 17,000 articles from Science and we fit a 100-topic topic model, so we found one hundred topics — one hundred distributions over terms — for the 17,000 articles. That's basically calculating that posterior. Now, here is that article I showed you, and here on the right is the posterior expectation of the topic proportions for that article. Just to be clear: we ingest 17,000 articles, we find 100 topics from them, and now we ask, what is theta, the distribution over topics, for this article? You can see that, although there are a hundred topics at play, the posterior is only activating a handful of them to describe the words of this article. This is p(theta | the words and the topics). Moreover, each topic is a distribution over terms, so we can ask for the most frequent terms from the most frequent topics, and you can see things like "human, genome, DNA, genetic", "evolution, evolutionary, species, organism", "disease, bacterial, resistance", "computer, models, information, data", and so on. These topics are interpretable, and they somehow represent the threads of ideas that this article seems to be about. I can do the same thing, as I showed you already, with the New York Times, and here what we see are the most frequent words from each topic. You can see that without any kind of supervision, just by ingesting millions of articles, the algorithm can uncover the various things that pervade them. So that's the idea. What is the algorithm that can take that model specification and the data and form the posterior that gives us these interpretable topics? (Do you think I'm doing something wrong with the microphone?)

The roadmap for the next ideas is to first define a general class of probability models that LDA is a member of, then derive the classical variational inference algorithm for that class of models, and then derive the stochastic variational inference algorithm, which scales to massive data, so we can do things like analyze millions of articles. The class is called conditionally conjugate models, and it's very generic. The observations we're going to denote x; there are n of them, x_1 through x_n. We're also going to have what are called local hidden variables, z; there's one per observation, so z_1 through z_n, and these can be vectors or collections themselves — that's okay. The global hidden variables are called beta, and the difference between a local and a global hidden variable is that the i-th data point x_i depends only on the global hidden variables beta and its own local hidden variable z_i; in other words, in the generative process, the data point does not depend on the j-th local variable.
Here is the factorized joint distribution that respects those assumptions:

    p(\beta, z, x) = p(\beta) \prod_{i=1}^{n} p(z_i, x_i \mid \beta),

and here's the graphical model for that factorized joint, where I have beta, and I have z and x in a plate. Again, if you study graphical models, which I recommend you do, then you can read things like the independence assumption — x_i not depending on z_j — off of this graph. Our goal, as always, is to calculate p(beta, z | x); that's the posterior in this generic class of models. Now I want to define for you what's called the complete conditional: the conditional distribution of a latent variable given all the observations and all the other latent variables. We're going to assume that each complete conditional is in the exponential family. I'll tell you about the exponential family in a second, but let me just show you the math here first. First look at this second equation, p(beta | z, x). It's in an exponential family, and what's important about this equation is that the parameter of this distribution is eta_g(z, x): it's a function of what I'm conditioning on. This is the complete conditional — the latent variable given the observations and the other latent variables — so its parameter is of course a function of what it's conditioned on, and it has this nice exponential form. The complete conditional for z_i, the local hidden variable, depends only on its own observation x_i and on beta; that's a consequence of the independence assumptions we're making. It has the same nice exponential-family form, and its parameter is also a function of what it's conditioned on. These are the complete conditionals. If you're familiar with MCMC: all of Gibbs sampling is to iteratively sample each latent variable from its complete conditional. And the exponential family is a general way to write down many of the distributions that we use — Gaussian, Bernoulli, Poisson, Dirichlet, gamma, and so on are all in the exponential family; some distributions are not.

As an aside — and it might be the most useful thing you learn today — let me tell you a little about the exponential family. How many of you are familiar with the exponential family? Raise your hand high. Okay, it's less than half, so let me explain it, since it's a useful idea to know about. Put everything else on hold. Here's a distribution of x, and if it's in the exponential family it can be written in the following form: p(x) equals h(x), times e to the eta-transpose-t(x) — t is some function of x — minus a(eta), where a is a function of eta. These different characters have names. Eta is called the natural parameter. t(x) is called the sufficient statistic; it's a function just of x. a(eta), a function just of the parameter, is called the log normalizer. And h(x) is called the base density, or the carrier density — it's actually got many names — and I used to say you should just ignore h(x), but it's important, unfortunately; it's just not important for us today. So those are the names of the characters in the exponential family. Now, the log normalizer is special: as you know, a distribution has to integrate to one, and the log normalizer ensures that this one integrates to one. a(eta) is the log of the integral of h(x) times e to the eta-transpose-t(x), dx.
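Written out in one display (with the base density h(x) appearing both in the density and inside the integral that defines the log normalizer):

    p(x) = h(x) \exp\big\{ \eta^{\top} t(x) - a(\eta) \big\}, \qquad a(\eta) = \log \int h(x) \exp\big\{ \eta^{\top} t(x) \big\} \, dx.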
If you plug that in up there, it ensures that when you integrate p(x) over x, you get one. The log normalizer has a special property that we like, which is that its gradients calculate the expected sufficient statistics. E[t(X)] is some property of this distribution: if I hand you a distribution over x, it has some expectation of t(X), and that expectation is a property of the distribution. In particular, E[t(X)] is the gradient of the log normalizer with respect to eta. If you like probabilistic modeling, I encourage you to go home and prove that on a piece of paper; it's easy — start with the definition of a(eta) and prove this fact. As I said, many common distributions are in the exponential family: Bernoulli, categorical, Gaussian, Poisson, beta, Dirichlet, gamma, and so on. The exponential family underlies the beautiful theory around conjugate priors and corresponding posteriors, and it connects closely to variational inference. After my lectures today and tomorrow, if you want a challenge, download the 300-page paper by Martin Wainwright and Mike Jordan and read it; it's about the connection between exponential families and variational inference. All good on the exponential family? That's the exponential family.

So what we're assuming in this class of models is that, whatever model I have — I might be doing physics or social network analysis, or I'm doing finance and predicting stock prices — whatever model I made up, I need the complete conditional of each latent variable given the others to be in the exponential family. It might be Gaussian, gamma, Poisson, whatever, but it's got to be in the exponential family. So each complete conditional is in the exponential family, and it turns out that the parameter of the complete conditional for the global hidden variable, eta_g(z, x), has a special form: it looks like some kind of hyperparameter alpha plus a sum, over all the data points, of a function of the local hidden variable and the data point — eta_g(z, x) = alpha + sum_i t(z_i, x_i) — and I've kind of overloaded t there; it's just going to be some function. This is going to become important, so let me come back to it: the complete conditional of beta given z and x has a natural parameter eta_g(z, x), and that natural parameter has this special form of some vector plus a sum, over data points, of a function of the local hidden variable and the data point. And notice that when we're trafficking in complete conditionals, we can refer to other hidden variables, as long as they're not the one whose complete conditional we are deriving; we can use the local hidden variables in this complete conditional because we're conditioning on them. Any questions?
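On that "go home and prove it" exercise about the log normalizer a few lines up, here is a quick numerical sanity check for the Bernoulli case — my own illustration, not something from the lecture. In exponential-family form, the Bernoulli has t(x) = x, h(x) = 1, and a(eta) = log(1 + e^eta), so the gradient of a should equal E[X].

    # Check that d a(eta) / d eta equals E[t(X)] for the Bernoulli.
    import math

    eta = 0.7                                   # an arbitrary natural parameter

    def a(eta):                                 # log normalizer of the Bernoulli
        return math.log(1.0 + math.exp(eta))

    # Finite-difference approximation of the gradient of a at eta.
    eps = 1e-6
    grad_a = (a(eta + eps) - a(eta - eps)) / (2 * eps)

    # Expected sufficient statistic E[t(X)] = E[X], by direct enumeration over {0, 1}.
    probs = [math.exp(eta * x - a(eta)) for x in (0, 1)]
    expected_t = sum(x * p for x, p in zip((0, 1), probs))

    print(grad_a, expected_t)                   # the two numbers agree (about 0.668)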
All right. That class of models describes many of the models that were developed in the '90s and 2000s in machine learning: Bayesian mixture models; time series models; factorial mixture models and factorial time series models; matrix factorization, like factor analysis, probabilistic PCA, and probabilistic CCA; Bayesian nonparametric models, like Dirichlet process mixtures and the hierarchical Dirichlet process; certain multilevel regression models, like linear regression, probit regression, and Poisson regression; stochastic block models, which are models for finding communities in social networks; and mixed membership models, like LDA and its variants — those are called mixed membership models because they model each document as a mixture but share the mixture components across documents. These can all be written in this form, and they all have the property that their complete conditionals are in exponential families. So you can probably anticipate what we're going to do: we're going to derive a variational inference algorithm — one second — for this class of models, and that then gives us variational inference for all of these models and more. Yes? Mm-hmm. Yeah, so, two great questions. The first question was: is the number of topics a hyperparameter? The answer is yes: you are in charge of setting the number of topics, and there are lots of ways to do that, with things like cross-validation, or fancier things like Bayesian nonparametrics — hierarchical Dirichlet processes can be interpreted as a form of LDA where the number of topics is determined by the data. Those lists of words I showed were the top most frequent words from each topic, and you asked: if I went further down that list, would that make them less interpretable? I see that as a problem in that last phase of Box's loop: once I have my posterior, what do I then do with it to be useful? If you run an internet company and you're showing topics to your customers, maybe your customers only want to see three words, so you have to choose the right three words — maybe the top most frequent aren't the right way to choose them and you want to use something else. If you're working with a different population, maybe you want to show 30 words, or maybe you want a word cloud, with the size of each word given by its frequency in the topic. The point is, all of those things you're dreaming up right now are functions of the posterior. The idea is that I've got my model, my model and my data give me my posterior, and whatever I do downstream with the model is a function of the posterior; so we need to approximate the posterior in order to implement that function. Again, this is the division of these activities. Good question.

Okay, so I'm going to show you this Venn diagram a few times. Imagine the space of the slide is models — all possible probabilistic models — and this isn't going to be to scale, but here are the conditionally conjugate models. They're a subset of all models; many models are not conditionally conjugate, but, as I showed you, a lot of models are. We're now going to do inference for this class of models, the conditionally conjugate models. So we've set up our model, p(beta, z, x), and remember our goal is to minimize, within a family of distributions over the hidden variables beta and z, the KL divergence to the exact posterior — that should say p(z, beta | x); actually, this all should have beta in it. So this is our goal. Now, as was mentioned earlier, how do you minimize something that depends on the intractable quantity that you care about? The KL divergence depends on p(beta, z | x). And so this brings us to an important idea in variational inference, called the evidence lower bound. Actually, the term "evidence lower bound" was coined in a paper in 2008 by John McAuliffe and Michael Braun;
before that, it was basically called the variational objective. The idea is that the KL is intractable, and so in variational inference we optimize instead the evidence lower bound, or the ELBO. This is the ELBO — I'll explain it in a bit — it's a lower bound on log p(x), a lower bound on the log of the quantity that makes inference hard, and, this is what's important, maximizing the ELBO is equivalent to minimizing the KL. The difference between the ELBO and the KL divergence that we care about is an additive constant; the ELBO is the negative KL plus an additive constant, so maximizing the ELBO is the same as minimizing the KL. If I have found the maximum of the ELBO, I have also found the KL minimizer. Now let's look at the ELBO. First of all, remember the characters at play: I have q(beta, z) with variational parameters nu, I have the posterior p(beta, z | x), and I have the joint p(beta, z, x). The ELBO — here it is — has two terms. First of all, the ELBO, calligraphic L, is a function of the variational parameters alone; I'm going to optimize it with respect to the variational parameters, so that's good. It's a function with two terms. The first term is the expected complete log-likelihood. The complete log-likelihood is the log probability of the hidden variables and the observations; put a different way, this is the model: log p(beta, z, x) is how you specify the model. What we're taking in this first term is the expectation of the log joint with respect to q. Here x is fixed — that's the data — and z and beta are both random variables, but they are drawn from q, so this is an expectation of a function of random variables drawn from q, and therefore a function of the parameters of q, in other words the variational parameters nu. Is that clear? So that's the first term. The second term is the negative entropy, E[log q(beta, z; nu)] — minus this is the entropy. The second term only depends on q, only on the variational family, and it's still an expectation with respect to q of these random variables. Now, these terms provide a trade-off. Stare at the first term — the answer is on this slide, unfortunately — and suppose I just wanted to optimize q with respect to the first term alone; what's the best q? Yeah — the one that maximizes the likelihood; well, there's no plain likelihood here, it's the one that maximizes the regularized likelihood, where you're regularizing by the prior. Log p(beta, z, x) is equal to log p(beta) plus log p(z | beta) plus log p(x | beta, z) — in other words, the log joint — and if only that first term is at play, then q is going to want to place all of its mass on whichever values of beta and z maximize that quantity, because if q spreads its mass to anything else, it loses in proportion to how much mass it spreads to configurations that are not as good as the maximum — the MAP estimate, the maximum a posteriori estimate, which is what that maximizer is. So the first term prefers q to place all its mass on the MAP estimate; if I were doing MAP estimation, I could just optimize that first term.
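In symbols, the ELBO and its relationship to the quantities above — the additive constant separating it from the negative KL is log p(x), which does not depend on nu:

    \mathcal{L}(\nu) = \mathbb{E}_{q}\big[ \log p(\beta, z, x) \big] - \mathbb{E}_{q}\big[ \log q(\beta, z; \nu) \big],

    \log p(x) = \mathcal{L}(\nu) + \mathrm{KL}\big( q(\beta, z; \nu) \,\|\, p(\beta, z \mid x) \big) \;\ge\; \mathcal{L}(\nu).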
The second term, however, is the entropy. (I keep having to tie my shoe — think about the entropy while I tie my shoe.) Okay, so what does the second term want us to do? The second term is the entropy — the answer is on the slide if you want to just read it — if I only had the second term and I wanted a q with high entropy, what would that push q toward? Yeah: flatten it. It would want q to be diffuse. If all I want is high entropy, I want q to spread its probability around many, many different configurations of the latent variables. So these terms are at odds: the first term wants all the mass on the MAP estimate, the second term wants q to be diffuse, and the trade-off between them is what the ELBO characterizes. One caveat is that the ELBO is not convex; if you look at those mountains over there, that's what the ELBO looks like — there are lots of different hills. Good question: well, how does it know? It doesn't know anything; it's just an equation — but don't tell the AI folks. The question is interesting and I don't have a good answer. There is a trade-off; that's all we know. Making q more diffuse is going to hurt the first term, and making the first term higher, by putting more mass on the MAP estimate, is going to hurt the second term, and the trade-off exists in the particular way that you've parameterized log p and log q. Thinking about that precise trade-off, I think, is a great research question. Is there another question? I'm sorry, I didn't hear you — can you yell it? Ah, yes — well, except this is about the KL divergence; this objective function has this trade-off, and for other objective functions you'll want to stare at them and meditate on those. These are all good questions.

Okay. Now, if you go to the arXiv again — I said it was a mixed blessing; it's more of a blessing than a mixed blessing, I love the arXiv — and read new papers about variational inference, you'll sometimes see another decomposition of the ELBO. In fact, there's a great paper by Matt Hoffman called "ELBO surgery", about different ways of decomposing the ELBO — gruesome if you think about it. Here's the original ELBO; this is from the 1990s work. Now, more recently, some people write the ELBO down this other way, which is the same quantity — there's nothing different about it — but it shows us a different trade-off. Here the first term is the expected log-likelihood of the data, so if I only had this term, I would ignore the regularizers and place all my mass on the MLE, the maximum likelihood estimate of z and beta. The second term is now the KL divergence between the variational distribution q and the prior p(beta, z). What this shows you is the classical Bayesian trade-off: it trades off q being close to the prior against q placing its mass on configurations of hidden variables that explain the data. Again, these are the same quantity, but this is another way of writing it that's illuminating. Okay, so that's the objective function. Just to be clear where we are: we have our model, LDA; we have a posterior that we care about, because we've got millions of documents to calculate the posterior for; and we have an objective function where, if I maximize that objective function, I find the q that minimizes the KL divergence.
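The second decomposition mentioned above, written out — the same quantity, just rearranged:

    \mathcal{L}(\nu) = \mathbb{E}_{q}\big[ \log p(x \mid \beta, z) \big] - \mathrm{KL}\big( q(\beta, z; \nu) \,\|\, p(\beta, z) \big).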
But we still haven't totally set up the problem, because we still need to specify the form of q(β, Z) — the form of that ellipse, the variational family. Throughout the talk today and tomorrow we're going to use the mean-field variational family, which is a family where each latent variable is independent of the others and is governed by its own parameter. For this generic conditionally conjugate model, the mean-field family for q(β, Z) is q(β) with parameter λ times, independently, q(z_i) with parameter φ_i for each i. Let's stare at this for a little while too. First, notice that the data do not appear in q: q is a family of distributions over the latent variables, and the data, by definition, are not latent variables. (The data will appear later in a funny way, but for now they don't.) Second, if you're a statistician you might think this is really crazy and vacuous — even if you're not a statistician you might think that. What this says is that each variable is independent and has its own parameters; it's just a bunch of wildly independent random variables, each governed by its own parameter. Each topic assignment of each word of my bazillion-word corpus has its own parameter that tells me which topic that particular instance of the word "peanut" in that article was attached to. The key idea is that it is only when we optimize the ELBO with respect to these variational parameters that we connect these wildly independent random variables to the data: when we optimize the ELBO we're minimizing the KL divergence to the posterior, the posterior is conditional on the data, and so when I optimize the ELBO I connect these independent variables on the right with the data via the posterior on the left. Is that clear? It is in that objective, in that optimization, that I connect the data and the model to the particular realization of the variational family that I get. Third, if you're a statistician you might also think it's crazy that everything is independent — that's really bad, right? The first thing you learn in statistician school is that it's really bad to do that. Here I want to emphasize that while everything is independent, this is actually a very flexible family of distributions, because every single random variable in the model has its own distribution. It's not the same as i.i.d.: it's not that everything is independent and distributed in the same way. Everything is independent — wildly independent, like I said — and each factor can capture any distribution it wants. The word "peanut" in one article can be assigned to topic 9 through its variational parameters, and the word "walnut" in another article can be assigned to topic 11 through its variational parameters; the two topic assignments don't have to come from the same distribution, each is different. So this mean-field family can capture any set of factorized marginals: any marginal distribution of any of the hidden variables in my model can be captured by this variational family. That's the mean-field family — when you hear "mean-field family", it means every latent variable is independent and governed by its own distribution.
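Written out, the mean-field family for this generic conditionally conjugate model — a global variable β with its own parameter λ and local variables z_i each with its own parameter φ_i — is:

q(\beta, z; \lambda, \phi) \;=\; q(\beta; \lambda) \prod_{i=1}^{n} q(z_i; \phi_i)

Note that, as just emphasized, no x appears on the right-hand side.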
Now we're going to restrict this further and say that, furthermore, each factor is in the same family as the model's complete conditional. Take q(β) with variational parameter λ: if the complete conditional p(β | Z, X) is in some exponential family — let's say it's a Gaussian — then q(β) needs to be a Gaussian too. To be clear, it has free parameters, so it's a Gaussian with a free mean and variance λ; the point is that it's in the same family as the complete conditional. So imagine I have a model whose complete conditionals are each in some exponential family, say one of them is a Gaussian and one of them is a Poisson. I set up my mean-field family with two independent random variables, one with free Gaussian parameters and the other with free Poisson parameters. What variational inference does is turn the knobs on those Gaussian parameters and those Poisson parameters so that the particular realization of that family that I have in my hand minimizes the KL divergence to the true posterior of those two variables given the data. Is that clear? That's the game. So now we have completely specified our problem: the ELBO is a function of all the local variational parameters φ and the global variational parameter λ, and here is the ELBO. For every setting of these parameters I can calculate this objective function, and so I can optimize it with respect to those variational parameters. Are there any questions? Hopefully we've now made concrete, in black and white, why I drew that picture in the beginning: every point in that ellipse is a particular setting of λ and φ, and my job is to find the setting of λ and φ that is closest to the exact posterior. That connects this crazy, everything's-wildly-independent family of distributions to the distribution I care about, which is the posterior. So that's the variational inference problem, mean-field variational inference. Now, traditional variational inference uses coordinate ascent to optimize λ and φ. What that means is: say, along my arm are all my variational parameters — I have my λ and I have all the φ's for the local variables — and in coordinate ascent I iteratively optimize each one holding the others fixed. I march through all those variational parameters: optimize λ, optimize φ_1, φ_2, φ_3, and so on, then go back to the beginning, optimize λ again, optimize φ_1, φ_2, over and over again. That's coordinate ascent. And it turns out — this is due to Zoubin Ghahramani and Matt Beal — that the coordinate updates for each variational parameter have a very nice form. If I'm optimizing λ, its optimal value is the expectation of the natural parameter of its complete conditional. Stare at this for a second: the natural parameter of the complete conditional is a function of all the other hidden variables and the observations, and taking its expectation, because of the mean-field assumption, only involves the variational parameters that are not λ — by definition, in the mean field, everything is independent of everything else. So this expectation only depends on φ, which makes us happy, because if we're doing coordinate ascent we want to optimize this parameter holding everything else fixed, and if the optimum only depends on everything else, then we have a valid coordinate ascent algorithm.
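As equations, writing η_g(Z, X) for the natural parameter of β's complete conditional and η_ℓ(β, x_i) for the natural parameter of z_i's complete conditional (the subscripts are just labels for "global" and "local"), the update just described and its local counterpart, which comes next, are:

\lambda^{*} \;=\; \mathbb{E}_{\phi}\big[\eta_{g}(Z, X)\big]
\phi_i^{*} \;=\; \mathbb{E}_{\lambda}\big[\eta_{\ell}(\beta, x_i)\big]

Each expectation is taken under the other factors of the mean-field family, which is exactly why each update only needs the other parameters held fixed.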
This expectation with respect to φ is an expectation over Z, but it only involves φ, and since we restricted the variational factor to be of the same form as the complete conditional, we know that λ and η sit in the same space, so this is a sensible coordinate update. The coordinate update for φ_i is analogous: it's the expectation of the natural parameter of its complete conditional, which depends on x_i and β, the global hidden variables, and that expectation of course only depends on the variational factor for the global hidden variables, which has parameter λ. Clear? Okay, so that's coordinate ascent variational inference: march through and update each parameter according to this schedule. Notice the relationship to Gibbs sampling, if you're familiar with it: like I mentioned earlier, in Gibbs sampling we iteratively sample from these complete conditionals; here we iteratively set the variational factors using the expected values of the parameters of those complete conditionals. They're very similar in spirit. And that's how I got this picture: here I've got variational parameters for the color of each point and for the locations of each mixture component, and I'm plotting an approximate posterior predictive using my fitted q. I'm marching through all of those variational parameters, iteratively updating them, getting close to the exact posterior, and here is a picture of the corresponding posterior predictive. Let's make this concrete for the example model we were using, LDA. Here's LDA as a graphical model, and here I've embellished it with variational factors: there's a variational parameter for each document's topic proportions, like I said there's a variational parameter φ for each word, and then for the topics there are variational parameters λ — each topic has its own variational parameter. The mean-field distribution is just a big factorized distribution over these hidden variables. And what do we do? Coordinate ascent inference. First we hold the topic parameters fixed and we update the local hidden variables, θ and Z; there are nice closed-form updates for those, and that's how I got this picture — the topics are fixed, I fit variational distributions to θ and Z, and this is a picture of the variational distribution for θ. Then, in the global step, I hold all the local variational parameters fixed and I update the global variational parameter for each topic, and like I said, that update has a special form: a sum over the data of some function of those local variational parameters. That's the global step, and that's how I got this picture: this is a picture of the fitted local variational parameters, and this is a picture of the fitted global variational parameters. Each λ is a distribution over distributions of words; I looked at the posterior expectation of each one and then looked at the top words under that expectation. In general, this is the first variational inference algorithm we'll see, coordinate ascent variational inference, and if you look at the NIPS proceedings from, whatever, 1990 to 2005 and look for papers about variational inference, chances are somebody is re-deriving a coordinate ascent variational inference algorithm like this.
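To make the local/global structure concrete, here is a minimal sketch of this coordinate ascent scheme for LDA in Python with NumPy/SciPy. It follows the standard closed-form LDA updates (per-document topic proportions γ, per-word assignments φ, topics λ); the toy corpus, hyperparameter values, and function names are made up for illustration, not taken from the lecture.

import numpy as np
from scipy.special import digamma

def cavi_lda(docs, V, K, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Coordinate ascent variational inference for LDA.
    docs: list of word-id arrays; V: vocabulary size; K: number of topics."""
    rng = np.random.default_rng(seed)
    lam = rng.gamma(100.0, 0.01, size=(K, V))          # global variational parameters (topics)
    for _ in range(iters):
        lam_new = np.full((K, V), eta)                  # accumulates the sum over documents
        Elog_beta = digamma(lam) - digamma(lam.sum(1, keepdims=True))
        for w in docs:                                  # local step: one document at a time
            gamma = np.ones(K)                          # topic proportions for this document
            for _ in range(20):                         # alternate the phi and gamma updates
                Elog_theta = digamma(gamma) - digamma(gamma.sum())
                log_phi = Elog_theta[:, None] + Elog_beta[:, w]   # K x N_d
                phi = np.exp(log_phi - log_phi.max(0))
                phi /= phi.sum(0)
                gamma = alpha + phi.sum(1)
            np.add.at(lam_new.T, w, phi.T)              # add each word's phi to its topic-word count
        lam = lam_new                                   # global step: set lambda to its optimum
    return lam

# toy usage: 100 random documents over a 50-word vocabulary
docs = [np.random.randint(0, 50, size=40) for _ in range(100)]
topics = cavi_lda(docs, V=50, K=5)

Each pass is exactly the two steps in the pictures: a local loop that fits γ and φ for every document with the topics held fixed, then a global update that sums those local statistics into λ.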
The input to the algorithm is the data and a model. We initialize the variational parameters randomly and then we march: for each data point we set its local variational parameter to its optimum, then we set the global variational parameter, aggregating all of those updates, to its optimum, and repeat. That's LDA inference, and, just to emphasize, we've now derived this for a big class of models. Next I want to talk about stochastic variational inference, which builds on this. We've just covered, whatever, fifteen years of variational inference research — but this is inefficient. Why? Classical variational inference does some local computation for each data point, then aggregates these computations to re-estimate the global structure, and repeats. With topic modeling you can see the problem immediately: when we started working on topic modeling we were analyzing 10,000 or 15,000 documents, no problem, but as the years wore on we wanted to analyze millions, tens of millions, of documents, and this doesn't work in that setting. Why? Because you initialize your variational parameters to garbage topics — uninterpretable topics, total garbage — and then variational inference tells you to march through all of your documents with those garbage topics and painstakingly analyze how they reflect them. I have to go through all ten million articles analyzing how they exhibit these garbage topics before I can update the garbage topics to something more meaningful, and you might have too many documents to even get through that once. So this is inefficient, and stochastic variational inference scales variational inference to massive data. Oh, do you have a question? Okay, that's fine — I'll pause for questions in a while. Yes — good question, I'll repeat it: does it always converge? The answer is yes. And the second question: I mentioned earlier that the ELBO is not convex, is that a problem? No — I'm just kidding, it is a problem. I'm right about the first thing, though: it's always going to converge. Five minutes? Okay, we'll take a break in five minutes — I'm dodging the question about the non-convexity of the ELBO right now, so this is really good timing. Right, so this is guaranteed to converge to a local optimum of the ELBO, and in practice, it's just like EM:
you're very sensitive to the initialization, and one open problem in variational inference is how to initialize it well. Sometimes in the middle of the night I wake up with an idea for how to initialize variational inference well, and then after about three weeks of implementing it I learn that it doesn't work — and I hope that you will all now do the same thing. To see that it converges to a local optimum, I refer you to the proof that EM converges; it's really identical. Now, the ELBO is non-convex, so in practice there's all kinds of stuff about randomly restarting the optimization and then choosing the local optimum that is best. That's not guaranteed to be the best local optimum over all local optima, but it's the best one that you got. This is another interesting open research problem: there's a lot of really good recent research about characterizing the non-convex deep learning objective, and taking inspiration from that literature and thinking about the variational objective is a worthwhile open research problem too. There's also a lot of research about how stochastic optimization, which we're going to get to in a second, finds better local optima in these kinds of non-convex settings, and looking at that research with the variational objective in mind is also a good open problem. In practice — you saw it with that mixture model, you see it with the topics, and with the whole list of pictures I showed you at the very beginning of the talk, which all use variational inference — as a practical, empirical matter we often do well just being satisfied with the best local optimum we can find. But if you talk to my students and postdocs, they will tell you that they do spend some time thinking about smart ways to initialize. Good. Okay, so we want to scale this up: it's inefficient, and what I'm going to do now is derive for you stochastic variational inference. At the end of the derivation you're going to see an algorithm with a very intuitive structure. I've got my model p(β, Z, X), I've got a massive data set, and I want to approximate the posterior. The algorithm looks like this: at any point in the algorithm I have my current idea of the global hidden structure, my current global variational parameters, which could be initialized randomly. I subsample a little bit of data, I ask how that data exhibits that global structure, I use those inferences to somehow update the global structure, and I repeat. Again using LDA as a concrete example: I start out with garbage topics as usual, but rather than having to march through all ten million documents, I grab a handful of documents and ask how this handful exhibits these garbage topics, and then I use the results of that little local inference to update the garbage topics, to make them a little less garbagey. Then I grab another handful, look at them, make the topics less garbagey still, another handful, and so on, until the topics eventually look good. Just looking at this computational flow, you can see it's going to be much more efficient than taking my garbage topics and painstakingly analyzing all the documents with them. That's the key idea.
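Here is a sketch of that subsample-and-update loop for LDA, reusing the local step from the batch sketch above. The decreasing step size rho and the rescaling of the minibatch statistics to stand in for the whole corpus are the stochastic optimization ingredients explained next; the particular constants are placeholders, not values from the lecture.

import numpy as np
from scipy.special import digamma

def svi_lda(docs, V, K, alpha=0.1, eta=0.01, batch_size=16, n_steps=500, seed=0):
    """Stochastic variational inference for LDA: noisy global updates from minibatches."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    lam = rng.gamma(100.0, 0.01, size=(K, V))               # global topics, start as "garbage"
    for t in range(n_steps):
        rho = (t + 10.0) ** -0.7                             # decreasing step size (placeholder schedule)
        Elog_beta = digamma(lam) - digamma(lam.sum(1, keepdims=True))
        batch = rng.choice(D, size=batch_size, replace=False)
        lam_hat = np.full((K, V), eta)                       # intermediate global parameter
        for d in batch:                                      # local step on the minibatch only
            w = docs[d]
            gamma = np.ones(K)
            for _ in range(20):
                Elog_theta = digamma(gamma) - digamma(gamma.sum())
                log_phi = Elog_theta[:, None] + Elog_beta[:, w]
                phi = np.exp(log_phi - log_phi.max(0))
                phi /= phi.sum(0)
                gamma = alpha + phi.sum(1)
            np.add.at(lam_hat.T, w, (D / batch_size) * phi.T)   # rescale as if the batch were the corpus
        lam = (1.0 - rho) * lam + rho * lam_hat              # noisy global step
    return lam

Each iteration touches only a handful of documents, which is exactly the efficiency argument just made: the topics get a little less garbagey after every small batch instead of after a full pass.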
And I guess maybe we'll end after talking about the key idea, which is to use stochastic optimization. First, let's not talk about variational inference or anything else for a moment and just talk about stochastic optimization. It's an amazing idea, invented by Herb Robbins in 1951 — that's, like, 67 years ago or something. The idea is that when I'm doing optimization, I can replace the gradient, which might be expensive to calculate, with cheaper, noisy estimates of the gradient. What Robbins and Monro showed is that when you do this, it's guaranteed to converge to the optimum of a convex function, and Léon Bottou, who popularized stochastic optimization in machine learning, showed that it's guaranteed to converge to a local optimum as well. This is the idea that has enabled machine learning: all these companies are interested in machine learning because of it. The reason we're here, and are well funded, and have t-shirts and pieces of gum in our bags, is stochastic optimization. And the idea is very simple and intuitive — when I teach stochastic optimization to my students, this is how I do it. I don't usually use South Africa as an example, but imagine we're here in Stellenbosch and we're trying to get to Johannesburg, and everybody's drunk. It's New Year's Eve, or whatever the holiday is here where this happens — St. Patrick's Day at home. Everybody's drunk, and we're trying to walk from Stellenbosch to Johannesburg. Seems crazy, but that's what we're trying to do. You're not drunk, you're sober. You walk out and you see McKellarie, and you say, hey, I need to get to Johannesburg. McKellarie's drunk, and you ask, how do I get there? He just points in some direction before collapsing on the sidewalk. Now, what Herb Robbins wants you to do is walk 1,000 kilometers in that direction — it doesn't matter where he pointed, just walk in that direction (assume there are no oceans). So now you're in the middle of nowhere, and you see Ben, and Ben's drunk too, and you say, hey, how do I get to Johannesburg? Ben points in some random direction; you walk 500 kilometers in that direction. You then see Shakir — who's telling me to stop talking, and he's drunk as usual — and you say, Shakir, how do I get to Johannesburg, I'm walking. Shakir points in some direction before falling over; walk 250 kilometers in that direction. You see Bernhard, he's drunk, you ask him how to get to Johannesburg, you walk 125 kilometers in that direction. And here's the deal — this is stochastic optimization — as long as, one, you shorten how far you walk each time you ask a drunk person how to get to Johannesburg, and, two, if you could magically revive Shakir and ask him over and over again, the average of where he pointed would point directly at Johannesburg — in other words, Shakir provides an unbiased view of where Johannesburg is, even if he's pointing in the wrong direction because he's wasted — then, reducing that distance each time you walk, you are eventually going to get to Johannesburg. You might be less than efficient, in that you walk in lots of wrong directions as you ask drunk people how to get there, but you'll eventually land right at Johannesburg. And that's exactly stochastic optimization.
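Here is the drunk-walk story as a few lines of Python: a noisy but unbiased gradient of a simple convex objective, followed with a decreasing step-size schedule. The quadratic objective, the noise level, and the particular schedule are illustrative choices, not anything from the lecture.

import numpy as np

# objective: f(x) = 0.5 * ||x - target||^2, whose exact gradient is (x - target)
target = np.array([26.2, -28.0])           # "Johannesburg" (made-up coordinates)
rng = np.random.default_rng(0)

x = np.zeros(2)                            # start in "Stellenbosch"
for t in range(1, 20001):
    noisy_grad = (x - target) + rng.normal(scale=50.0, size=2)   # unbiased but very noisy direction
    rho = 1.0 / (t + 10.0)                 # shrinking steps: sum rho_t = inf, sum rho_t^2 < inf
    x = x - rho * noisy_grad               # walk a (shorter and shorter) distance in that direction

print(x, target)                           # x ends up near the target despite the noise

Both conditions from the story are here: each direction is unbiased on average, and the step sizes shrink according to a schedule whose sum diverges but whose squared sum converges — the schedule that comes back tomorrow for stochastic variational inference.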
If I can find a cheap, noisy gradient of my objective function, I can follow it, with a special step-size schedule that we'll get to tomorrow, and get to the optimum. That's the idea that is driving all of machine learning today. Also, by the way, Herb Robbins invented bandits, which are the foundation of reinforcement learning, and he invented empirical Bayes, which is part of the foundation of probabilistic machine learning — so whatever he put in his pipe is what you are interested in putting in your pipe as well. Good. We'll get back to more technical things tomorrow, and we'll discuss how to apply this idea of stochastic optimization to variational inference. [Applause]
Info
Channel: MLSS Africa
Views: 9,073
Rating: 4.9455781 out of 5
Keywords: MLSS 2019 STELLENBOSCH SOUTH AFRICA
Id: DaqNNLidswA
Length: 86min 57sec (5217 seconds)
Published: Fri Feb 01 2019