Topic Models

Captions
Okay, welcome. It's September first; my name is David Blei and I'm going to talk about topic models. I'm from Princeton University in New Jersey, where it's four o'clock in the morning, so this is what it looks like to teach at 4:00 in the morning. Please interrupt me a lot during this talk; it will help keep me awake.

What topic modeling is about is this: as more information becomes available to us, it becomes more difficult to quickly access it, search it, understand it, and get something useful out of it. So we need new algorithmic tools to help us organize, search, and understand these vast amounts of information: text archives, image archives, and all the different kinds of data that we are constantly creating and accessing. Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing these large electronic archives, and the basic idea is three steps. First, take a big corpus, say Wikipedia, and uncover the hidden topical patterns that pervade it. Wikipedia has some millions of articles; they are about things, those things overlap, and we want to uncover what those things are, the different topics that pervade the collection. Second, annotate the documents according to those topics: once I understand what the, say, 250 topics in Wikipedia are, if I pluck an article out of Wikipedia, what topics is that article about, and how can I mark it up according to the topics I discovered in step one? Third, use those annotations to organize, summarize, and do whatever else it is we want to do. Doing this in a vacuum is only so interesting; we really want to use the marked-up collection. Think of it this way: if I had ten million people to go through Wikipedia and carefully organize it, what would I do with their organization? Since I don't have those ten million people, I build algorithms to do it for me.

So what can we do with a topic model? We can discover topics from a corpus. A topic is going to be a distribution over the terms in a vocabulary, and here are the most probable terms for four topics that were uncovered by analyzing the text of the journal Science: one with human, genome, DNA, genetic; one with evolution, evolutionary, species, organisms; one with disease, host, bacterial, diseases; and one with computer, models, information, data, computers. These are words that seem to go together in some thematically coherent way. Later in this talk, probably on Thursday, we'll discuss the pitfalls of over-interpreting these probability distributions over words, but for now, since we're learning about the model, we can pretend those pitfalls don't exist and interpret them. So these are four topics discovered from the journal Science.

Someone asked about stemming. There's no good reason not to stem; when I fit a topic model, sometimes I use stemming. If you're building a system for someone to quickly browse and navigate documents, some stemmers can be overly, what's the word, aggressive, and you can't really recognize the roots they give back. There's a more conservative stemmer that my student likes (I can't remember what it's called) that just collapses things like "computer" and "computers". But no, stemming is a good idea. Other questions?

We can also discover topics and see how the words within a topic change through time. Here is time from 1880 until 2000, here are two different topics that I've named (again ignoring the pitfalls of naming topics), and here are some words in those topics; you can see how their probabilities change over time. We can uncover these time-changing topics with topic models. We can also model connections between topics: here again are lists of words that we interpret as topics, again from the Science articles, and here is how they tend to co-occur with other topics. This topic, with ancient, found, impact, million years ago, Africa, connects to this topic with fossil record, birds, fossils, dinosaurs; those two tend to co-occur. Whereas the dinosaur topic (everyone loves dinosaurs) doesn't connect as much with, say, this topic about mice, antigens, T cells, and immune response. So we can find connections between topics. And finally, we can use these tools not just on text but on lots of different kinds of data, for example on images; topic modeling has helped make some progress in computer vision, and we can also combine different types of data. Here we're using a topic model to automatically annotate images: the input to the algorithm is images and words that describe them, and by using this same set of ideas, after fitting the model you can take a raw image and automatically annotate it. Here we have an image of a fish, and the annotation is fish, water, ocean, tree, coral. It's not perfect (I don't think there's a tree in there), but it's another application of topic modeling. The idea is that the same way we think of a document as a collection of words, you can think of an image as a collection of image features, either by running algorithms on it in advance to segment it and computing features of each segment, or by doing something silly like gridding it and computing features of each square in the grid, and then you can treat the image like a document.

So that's one perspective, or motivation, for topic modeling. Since this is a machine learning summer school: on one hand topic modeling is about taking our intuitions, and maybe other scholarship, about the structure of language and images and encoding them into good machine learning algorithms. On the other hand, from the machine learning perspective, topic modeling can be seen as a case study in applying hierarchical Bayesian models to grouped data like documents or images, and thought of that way it touches on a lot of different pieces of the applied statistical machine learning world: directed graphical models, conjugate and non-conjugate priors, time series modeling, modeling with graphs, hierarchical Bayesian methods, fast approximate posterior inference like MCMC or variational methods, exploratory data analysis, model selection, nonparametric Bayesian methods, and mixed membership models. In this talk we're going to touch on all of these. I know you'll go into these topics in more detail in other lectures in the school, but if you're interested in any of them, I would encourage you to think of topic modeling as a possible application for your ideas; it's a nice way, for example, to test out new methods of approximate posterior inference.

The way I've planned this: we'll start with latent Dirichlet allocation, which is, I guess, the simplest topic model, and then discuss approximate posterior inference in some detail (Gibbs sampling and variational inference), then talk a little about comparing the different approximate posterior inference algorithms that are available to you and give some advice. Apparently; I think I wrote that before I made the slides, because I don't think there's much advice in there. In the second part I want to take the assumptions of LDA and relax them in various ways. One direction is to build topic models for prediction: relational and supervised topic models. The other is to relax the Dirichlet assumption, and hence the conjugacy assumption, and look at the logistic normal as a tool for building different kinds of topic models; there we'll discuss dynamic and correlated topic models. Finally, I'll briefly talk about infinite topic models, otherwise known as an application of the hierarchical Dirichlet process, which I know will be discussed in much more detail next week. Then I want to discuss interpreting and evaluating topic models, a thorn in the side of topic models, to be frank, but we'll talk about it.

Question: do we make the assumption that documents are bags of words all the time? Well, I haven't talked about anything yet, but yes, it does seem like I'm going to make that assumption the whole time, and yes, I will, and I'll be totally honest with you about it. To answer the question seriously: yes, we're going to assume that documents are bags of words everywhere, but I'll try to point you to some excellent work out there that takes the same ideas and relaxes that assumption, basically allowing documents to have Markovian structure. Here we will be explicit: we assume documents are bags of words; these are exchangeable models. Good question; sorry I answered it with a joke, I often do that, no offense. Other questions, which will be politely answered? Okay.

All right, now we'll start this talk in earnest. Latent Dirichlet allocation is a probabilistic model, and I'm sure you've heard this already in this series, but the idea behind generative probabilistic modeling is: one, treat your data as observations that arise from some kind of generative probabilistic process, one that includes hidden variables, structure that we want to find in the data. For documents, the hidden variables reflect the thematic structure of the collection, which we don't have access to. Two, infer that hidden structure using posterior inference: basically, contemplate and compute the conditional distribution of the hidden variables given the observations, which are the documents themselves. Three, situate new data into the estimated model. Sometimes you might just want to analyze your documents, look at the hidden variables as they are, and never see another piece of data again, but often you'll have new data coming in that you want to do something with, so you need to be able to situate that new data into the estimated model; in other words, how does this query or new document fit into the topic structure I learned in the first part?

With that general recipe for probabilistic modeling, the intuition behind latent Dirichlet allocation, or LDA, is simply that documents exhibit multiple topics. Here's an example: an article from the journal Science called "Seeking Life's Bare (Genetic) Necessities", which is basically about computing the number of genes that an organism needs to survive, evolutionarily. I've highlighted different words in this article with different colors. Words like predictions, computer analysis, computational, numbers I manually highlighted in blue; these are words about data analysis. Words like genes, genomes, sequenced are highlighted in yellow; these are words about genomics. And words like life, organisms, survive, words about evolutionary biology, I've highlighted in pink. The intuition I'm trying to convey is that this document somehow combines words about computer analysis, words about evolutionary biology, and words about genomics, in contrast to the assumption made by a mixture model, which says that all of these words come from a single component, a single topic.

With that intuition in mind, the idea is to express it as a generative probabilistic process, and the way that works is this. First, we posit that there are some number of topics that live outside of the document collection; here I have four topics listed, and there might be 96 more underneath. Each topic is a distribution over the terms in the vocabulary. There's a fixed vocabulary (we're going to assume that), and every topic is a distribution over that fixed vocabulary, but different topics give different words different probabilities. So, with the words ordered by probability, at the top I have a topic with gene at probability 0.04, dna at 0.02, genetic at 0.01, and so on; I'm calling that the genetics topic. Underneath it, the pink topic has life at probability 0.02, evolve at 0.01, and organism at 0.01. Then I have a green topic about neuroscience, with words like brain, neuron, and nerve at high probability, and finally a topic with data, number, and computer at high probability. So when I say "topic", what I mean is a distribution over a fixed vocabulary, but that's cumbersome, so we say "topic".

Yes, that's an important point: every topic contains a probability for every word. So even though it doesn't have high probability, the word "data" has some probability in the yellow topic. And a word can have high probability in two topics; the example you're probably sick of is "bank", which could have high probability in a topic about financial instruments and also high probability in a topic about bodies of water, for river banks. Other questions? Good; we'll get there, that's the next thing.

So let's assume for now that those probabilities are all there, we've got our topics, and there are a hundred of them. The generative process for each document then works like this. First, we choose a distribution over our topics: while a topic is a distribution over terms, a distribution over topics is a distribution over these 100 elements. This distribution has 100 possible values, each one color-coded by its topic, and here I've chosen one that gives some probability to pink, some to yellow, and some to blue. We draw it from a Dirichlet distribution. Have you seen the Dirichlet yet in this series? No? Okay, good, we'll talk about that; it's drawn from a distribution over distributions called a Dirichlet. That's the first step in generating a document. The second step is to repeatedly draw a coin from this distribution: here I drew a blue coin, I look up the topic that the blue coin refers to (that's why the coins are color-coded), and then I choose a word from that distribution. So here I rolled this 100-sided die, I got blue, and I drew the word "analysis" from the blue distribution. Here I chose yellow, got the yellow coin, looked up the yellow distribution, and got the word "genome". Somewhere else I chose the pink coin and got the words "organism" and "life". Here I chose the yellow coin and got "genetic"; here the pink coin and "organism"; and so on and so forth.

And this gets back to the earlier question: I'm now implicitly assuming that the order of the words doesn't matter, because I'm choosing these coins independently of each other. Of course documents aren't really made this way; if they were, they would be totally unreadable (I should say, when they are, they are totally unreadable). But if you had a document whose words were all shuffled and you looked at those words, you might still get a sense that, oh, this is a document about genetics, computation, and evolutionary biology, by looking at the various types of words that occur in it. So it's important that the generative process doesn't have to be a precise or even plausible description of how the document arose; in fact it doesn't even have to generate documents that look realistic. It just has to be a process that makes sense for your goal, which is, in looking at the posterior, finding these thematically coherent terms. A little later we'll try to think about how this generative process leads to these types of probabilities in the posterior; we will get there.

But is this generative process clear? This is important; the whole rest of the talk depends on it. First choose a distribution over topics; then, for each word, draw a colored coin from that distribution, look up the distribution over terms associated with that coin, and draw the word from it. That's the generative process for a single document, and for another document we repeat the same process. Notice that documents are going to have different distributions over topics: while this document is about the pink, yellow, and blue topics, another document might be about the green and blue topics, a document about computational neuroscience, for example. We repeat this process for every document. Clear? Any questions? Yes, we're going to get to that: this is just the imaginary generative probabilistic process that we're assuming our data came from, and what we're assuming in particular is that we've got these topics and a process for the topic proportions.
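To make the generative process concrete, here is a minimal sketch of it in Python. This is not code from the talk; the sizes, Dirichlet parameters, and randomly drawn topics are made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes (made up): K topics, V vocabulary terms, N words per document.
    K, V, N = 4, 1000, 50
    alpha = np.full(K, 0.5)    # Dirichlet parameter for the per-document topic proportions
    eta = np.full(V, 0.01)     # Dirichlet parameter for the topics

    # Topics: each beta[k] is a distribution over the V vocabulary terms.
    beta = rng.dirichlet(eta, size=K)              # shape (K, V), rows sum to one

    def generate_document():
        theta = rng.dirichlet(alpha)               # 1. draw topic proportions for this document
        z = rng.choice(K, size=N, p=theta)         # 2a. draw a "colored coin" (topic index) per word
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # 2b. draw each word from its topic
        return theta, z, w

    theta, z, w = generate_document()

Running generate_document() repeatedly gives documents with different topic proportions, which is exactly the point made above.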
So the problem, of course, is that we don't get to see any of this. This is what we imagine is there and what we'd like to exploit and use, but in reality all we have is a big stack of ten million documents. Our goal is to infer the underlying topic structure given all these documents. First, and possibly most interesting, what are the topics that generated them under these assumptions; what are the distributions over terms? For each document, what is the distribution over topics associated with that document? And maybe we care about, for each word, which topic generated it. The algorithmic problem here is to compute this posterior distribution: the conditional distribution of all of these latent variables given the observations, which are the words of the documents. Clear? Not clear? Good. So that was sort of the fun part, and it's over; I'm serious.

We're going to encode this model as a directed graphical model. I know you've seen these already, but I want to quickly review them so that we're all on the same page about the semantics of directed graphical models. In a directed graphical model, a node is a random variable; here I have nodes Y and X_1 through X_N. Edges denote possible dependence between random variables, so the edge between Y and X_1 means that X_1 might depend on Y, and X_N might depend on Y as well. Observed variables are shaded: here I've observed X_1 through X_N, and I have not observed Y; Y is a hidden variable, or a latent variable (I use those interchangeably). Plates denote replicated structure: instead of wasting ink on the big figure, you can shorthand it with Y pointing to X_n, draw a box around X_n (that's called a plate), and put a big N in the lower right-hand corner to denote that we have N of these X_n's, all possibly dependent on Y. Clear?

The structure of this graph defines the pattern of conditional independencies that are encoded in the joint distribution of these random variables; from the graph we can read off all of the conditional independencies among them. The structure of the graph also defines a factorization of the joint distribution. In particular, the joint distribution of Y and X_1 through X_N can be written as p(y) times the product from n equals 1 to N of p(x_n | y), and from that you can see that the X's are all conditionally independent given Y. So the graph, the independencies, and the factorization all mean exactly the same thing. Questions? Okay, good.

With this simple little language in our hands, we can write down the latent Dirichlet allocation model that I just described for you, and you can see why we need those plates. Each piece of this structure is now a random variable, and the easiest way to understand it is to sandwich in from the outside to the middle. First, ignore eta and alpha for now. Theta_d (d is the document replication) is the topic proportions: that cartoon histogram I drew for you with the pink, blue, and yellow bars. We have one of these for every document; it's in the D plate. Actually, I should have started over here: beta_k are the topics themselves. Each beta_k is a distribution over terms, and we have K of them; K might be a hundred, so beta_96 is some distribution over words. Beta lives on what we call the vocabulary simplex, the space of all possible distributions over the vocabulary, and I'm assuming here that each beta comes from a Dirichlet distribution, which is a distribution over that type of space. So these are our 100 topics, and that's the K plate.

Now we have the document plate; this is the corpus. First we have theta_d, the topic proportions, the cartoon histogram, and it is of dimension K because there are K topics (that's not really illustrated here). Then, for each word (that's the N plate inside the D plate; if you want to be really picky you might put a little d on the N), we have z_{d,n}, the topic assignment. That's the colored coin from the picture. You can see that it depends on theta, because it's drawn from a distribution with parameter theta_d: if theta_d has probabilities for blue, yellow, and pink, then z_{d,n} might be pink, drawn from that particular theta. And there is a z for every word; remember, there's a colored coin for every word. Finally, w_{d,n} depends on z_{d,n} and on all the betas. w_{d,n} is the nth word in the dth document, and notice that w_{d,n} (which is difficult to say) is the only observed random variable in this whole model: all we observe are a bunch of words, organized by document.

Why does w depend on z and on all the betas? That's not rhetorical; it could have been. Right: the word is selected from the topic, and it depends on all the betas because, until we know z, any of the topics could be responsible for this word. So if this is going to encode the distribution of the word, what is the probability of this word given z and the betas? Close; what it is, I think, is what you said: z_{d,n} is a number from 1 to K, because it comes from theta, so you index into the topics beta_1 through beta_K with z_{d,n} and look up the probability of w_{d,n} under that topic. It's beta_{z_{d,n}, w_{d,n}}. Basically, we've got our K topics, each a distribution over the V words; this is sometimes called the topic matrix, K by V, and each row sums to one. In the generative process, once we've selected z, we know which topic this word is coming from, and then we look up the particular cell for this word: in the z_{d,n}-th topic, the w_{d,n}-th entry. That's why the observed w_{d,n} depends both on z_{d,n} and on all the betas: without observing them, we don't know which topic it is, just that the word depends on both of those things. So the probability of the word, given the topic assignment and all the topics, is the w_{d,n} entry in the topic indexed by the assignment z_{d,n}. Did that make things less clear? Possibly; if you have a question, go ahead.

The question is: in this formula, is z_{d,n} a distribution, or am I assuming it's a single value? I'm saying it's a topic assignment, a number from 1 to K, when really we don't observe it, so shouldn't it be a distribution? I'm very glad you asked that question. Here we're still living in the fantastical, magical world of treating our hidden variables as observed. I'm asking for the distribution of w_{d,n} given z_{d,n} and beta_1 through beta_K. You and I both secretly know they're not observed, and we're going to have a posterior over these things, but for formulating the model, the conditional distribution that this graphical model asks us for is the probability of w given that I've observed the topic assignment and all of the topics, and in that fantastical world z_{d,n} is just a number from 1 to K, an index. That's an important distinction: theta_d is a distribution over topics even in the fantastical world, because each z comes from theta and there can be multiple z's for different words in the same document, but each z_{d,n} is an index from 1 to K. Is that clear, everyone? And where is that functional form encoded? Exactly: the graphical model doesn't tell you that. It really just tells you what kinds of conditional independencies you can assume; the particular functional form of w given z and beta is what I'm specifying at the board. Good point.
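In symbols, the conditional just worked out at the board is a table lookup into the K-by-V topic matrix:

    p(w_{d,n} = v \mid z_{d,n} = k, \beta_{1:K}) = \beta_{k, v},
    \qquad \text{i.e.} \qquad
    p(w_{d,n} \mid z_{d,n}, \beta_{1:K}) = \beta_{z_{d,n},\, w_{d,n}} .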
Another question: are we assuming some kind of aggregate function of the w_{d,n}'s, like taking a maximum probability? No; let me write down the joint distribution, and that will clear up everything. The answer is no, we're not assuming any maximums or aggregate functions are at play. I'll turn the light on; I did exactly what Zoubin told me to do earlier, so it's got to work. All right, good, let's put everything in bright lights.

Again, in the fantastical world where everything is observed, what is the joint distribution of all the hidden and observed variables according to this model? First, we have each topic coming from some distribution that's appropriate for topics; that's going to be the Dirichlet. Those are independent of anything else, because the betas depend only on eta. That's the first part of the joint distribution. Now we have our D documents, and the first random variable we generate for each document is the topic proportions theta_d, which depend on alpha; again that's a Dirichlet, the distribution that's appropriate for distributions over distributions. This is easier to understand when it's written down. Then, within each document, we have the words of the document. Just for simplicity, assume that all documents have exactly N words. We first draw the topic assignment z_{d,n} from theta_d (that's akin to rolling the 100-sided die), and finally we draw the word w_{d,n} conditioned on z_{d,n} and beta_1 through beta_K. That's the joint distribution of the observed and hidden random variables.

I said that the topics come from a Dirichlet and the topic proportions come from a Dirichlet, which we're going to talk about in a second. Now, what is the probability of z_{d,n} given theta_d? Say theta_d puts its probability on topics 13, 26, 39, and 42 (total coincidence), and say z_{d,n} is 42. Then p(z_{d,n} | theta_d) is whatever theta_d assigns to topic 42, say 0.33. In general, p(z_{d,n} | theta_d) equals theta_{d, z_{d,n}}: z_{d,n} indexes into theta_d, the same trick as before. (In the nitpicky mathematics of this, it's not exactly how you represent these random variables, but don't worry about it for now; we'll get there, or maybe we won't, but you'll get there.) And p(w_{d,n} | z_{d,n}, beta_{1:K}), as we discussed, is the slightly more complicated thing: beta_{z_{d,n}, w_{d,n}}. With those two facts and this joint distribution, we have fully specified the model, subject to telling you about the Dirichlet distribution. Clear? Good. Other questions? Okay; I've possibly de-motivated you from asking questions by now, but please know I like those questions, they're important.
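Collecting the pieces just described, the joint distribution of the hidden and observed variables (assuming N words per document) is:

    p(\beta_{1:K}, \theta_{1:D}, z, w)
    = \prod_{k=1}^{K} p(\beta_k \mid \eta)\;
      \prod_{d=1}^{D} \Big[ p(\theta_d \mid \alpha)
      \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta_{1:K}) \Big],

    \text{with}\quad
    p(z_{d,n} \mid \theta_d) = \theta_{d,\, z_{d,n}},
    \qquad
    p(w_{d,n} \mid z_{d,n}, \beta_{1:K}) = \beta_{z_{d,n},\, w_{d,n}} .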
Okay, so this is the process described as a graphical model and fully specified, and now I just want to tell you about the Dirichlet distribution, which is convenient. The Dirichlet distribution is an exponential family distribution over the simplex. Have you seen the exponential family yet? Yes, okay. It's a distribution over the simplex, which is the set of positive vectors that sum to one. So theta here is a point on what's really called the (K-1)-simplex: the space of distributions over K elements. The Dirichlet is parametrized by a K-vector alpha, which is just K positive values, and its density is a product of each component raised to the power alpha_i minus 1, times a normalizing constant. That constant involves the gamma function, which is one of the special functions, essentially a real-valued extension of the factorial; you can think of it as the factorial function. The constant is just a function of alpha, so it's constant with respect to the random variable theta, and it's there to make sure the whole thing integrates to one, which, as you know, densities must do.
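Written out, the density just described, for theta on the (K-1)-simplex and positive parameters alpha_1, ..., alpha_K, is:

    p(\theta \mid \alpha)
    = \frac{\Gamma\!\big(\sum_{i=1}^{K} \alpha_i\big)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
      \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},
    \qquad \theta_i \ge 0,\quad \sum_{i=1}^{K} \theta_i = 1 .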
So that's the Dirichlet distribution: a distribution over the simplex, over positive vectors that sum to one, which are basically parameters to discrete, multinomial distributions. The Dirichlet is special because it's conjugate to the multinomial. Have you learned about conjugacy? I think I know you did, because I was here for that. What it means is that given a multinomial observation, or multinomial observations, the posterior distribution of theta given those observations is still a Dirichlet; I think it's worth giving you the details of that in a second. The parameter alpha controls the mean, the shape, and the sparsity of theta. By sparsity of theta I mean how many atoms in this distribution over atoms tend to have high probability; they'll all have some positive probability, but only some may have high probability. And we're using the Dirichlet twice in this model: the topic proportions are a K-dimensional Dirichlet, and the topics are V-dimensional Dirichlets. Do you want to hear more details about the Dirichlet? Yes? Good, Zoubin does, and I feel like he speaks forever for you, so I think you wanted to; it's really fascinating.

Okay, so the Dirichlet. I used to have a really lame figure showing different examples of Dirichlets, and then I looked on Wikipedia, found a much cooler figure, and took it, so I'll explain that figure. It shows a couple of (let's forget the K-1 nonsense for a second) three-dimensional Dirichlets; really they're two-dimensional, but I'm calling them three-dimensional. Remember the Dirichlet represents a distribution over distributions over some number of elements; say theta comes from a Dirichlet with parameters (1, 1, 1). As I draw random variables from this Dirichlet I get distributions over three elements, a, b, and c. One such distribution places all of its mass on a; that's this corner of the triangle. Another places all of its mass on b; that's that corner. I won't insult your intelligence by drawing it again: the third corner places all of its mass on c. The way this triangle works is that every point on the edge between a and b represents c having probability zero, a having some positive probability, and b having one minus that probability. In other words, if I clamp the probability of c at zero, put some probability on a and the rest on b (say b has a little more than a), then that point is right here on the edge of the simplex. Analogously for the points between a and c, and between b and c: we can clamp one of them at zero, contemplate all the essentially binary distributions between the other two, and that's the continuum along the edges of the triangle. The points inside the triangle are the points where a, b, and c all have some probability, so that might be this point here. Every point of the triangle is a point in the space of distributions over three items, and the Dirichlet places a distribution over that space.

The Dirichlet(1, 1, 1) is very special: if all the alphas are equal to one, it is the uniform distribution on this space. There's a nuance here, which is that the actual edges, the points where something has exactly zero probability, are not possible under a Dirichlet draw, but you can get arbitrarily close. So (1, 1, 1) puts the same density everywhere on the interior of the triangle; it's called the uniform distribution. To the question of what (1, 1, 1) means: remember the Dirichlet is parametrized by alpha, a vector of K positive values, and they can be anything; in the three-dimensional Dirichlet, alpha has three components, so (1, 1, 1) refers to alpha_1 = alpha_2 = alpha_3 = 1, and setting the parameters exactly to one gives a distribution where any point on the simplex has the same density.

Good. Now, where is the point where a, b, and c all have probability 1/3? It's in the middle, right here. And if you have a Dirichlet where alpha_1 = alpha_2 = alpha_3 = 5, what that does is put a bump in the middle and spread the contours around that bump. In general, a property of the Dirichlet worth knowing is that the expectation of the ith component, E[theta_i | alpha], equals alpha_i over the sum of the alphas. So with (5, 5, 5) the expectation of each component is 1/3, because 5 over 15 is 1/3, and that's always true. If you have an asymmetric Dirichlet, say alpha_1 = 10 and the others smaller, you get a distribution centered somewhere not in the middle of the simplex. So the expectation determines the location of the hump. And to the two questions from before: yes, the size of the alphas controls how peaked the distribution is, and yes, you can also have alphas less than one; I'm going to explain both in a bit.

So: peakedness, then alpha less than one, and then the posterior. For the peakedness of the hump, let's assume for now that the alphas are greater than one; we'll deal with the less-than-one case separately. Can you all see the board over there? That way I only have to work with one set of lights. Good. So one important piece of the Dirichlet parameterization is the expectation; the other important piece is the sum of the alphas. The sum of the alphas determines the peakedness of the Dirichlet: when the sum of the alphas is small, the Dirichlet is very spread out, and the greater the sum of the alphas, the more peaked the Dirichlet becomes at its expectation. Sometimes you see the sum called s and the expectation called m; this is like the mean and the scaling. So an alternative parameterization of the Dirichlet is a point on the simplex, the mean, and a scaling parameter s that determines how peaked it is around that mean. I'm saying this to foreshadow the lectures on nonparametric Bayesian methods coming up next week, where you parameterize the infinite-dimensional Dirichlet in precisely the same way; just put that away in your brains somewhere safe.
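In symbols, the two facts just mentioned, the expectation and the mean/scaling parameterization, are:

    \mathbb{E}[\theta_i \mid \alpha] = \frac{\alpha_i}{\sum_{j=1}^{K} \alpha_j},
    \qquad
    s = \sum_{i=1}^{K} \alpha_i, \quad
    m_i = \frac{\alpha_i}{s}, \quad
    \alpha_i = s\, m_i .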
All right, what happens when alpha is less than one? Equivalently, when the alphas are less than one, what you get is sparsity. On the three-dimensional simplex, our friendly triangle, I've been drawing Dirichlets as a hump somewhere in the middle, but when the alphas are less than one you end up with a different shape, one that places increased probability at the corners of the simplex, with the contours pushed out toward the corners, however you're supposed to draw that. In two dimensions it's really easy to see: there you just have the probability of one thing and one minus that for the other, between zero and one, and the Dirichlets I've been drawing look like a bump around some mean, whereas when the alphas are less than one you get a shape that piles up near zero and one. However that looks in three dimensions, I'm not sure; let's move on.

Really, though, when the alphas are less than one, don't waste your time with two- and three-dimensional Dirichlets; think about 50- and 100-dimensional Dirichlets. You can't draw that space, but say we have a 50-dimensional Dirichlet, so there are 50 components. I encourage you, in R, to sample a bunch of Dirichlets and plot them, so that you can see what happens with different parameterizations of the Dirichlet; and in saying that I'm encouraging you to do two things: one is to do that, and two is to use R. I could really go on and on about R. Say alpha is greater than one and looks like some profile over the 50 components: then when we draw from this Dirichlet we get shapes that look roughly like that profile but vary around it, subject to how big alpha is. If you plot, say, a hundred draws from this distribution, you'll see that sometimes this component is a little bigger, sometimes that one is a little smaller, and so on. I'm not going to do it here because it would be too time-consuming. When the alphas are much smaller than one, however, or some of them are smaller than one, let's keep it simple and talk about what's called the exchangeable Dirichlet, Dirichlet(alpha, alpha, ..., alpha): just one scalar parameter, where the mean m is right at the center of the simplex. When alpha is smaller than one in the exchangeable Dirichlet, what you get are sparse distributions: draws where only some of the atoms have appreciable probability and many of them have probability very, very close to zero. Again, file this away in your brains, because this has an important connection to nonparametric Bayesian methods. Which components get the positive mass? In the exchangeable Dirichlet, that's totally at random; however, as alpha gets smaller and smaller, fewer components will have appreciable mass. So this is the sparse exchangeable Dirichlet, and that answers the question about alpha less than one.
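The suggestion above is to do this in R; here is the same experiment as a small Python sketch instead (the dimension and the alpha values are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    K = 50  # think 50- or 100-dimensional, not 2 or 3

    for alpha in [10.0, 1.0, 0.1, 0.01]:
        # Exchangeable Dirichlet(alpha, ..., alpha): one scalar, mean at the center of the simplex.
        draws = rng.dirichlet(np.full(K, alpha), size=5)
        # With alpha > 1 the draws hover near the uniform distribution 1/K;
        # with alpha < 1 most of the mass piles up on a few randomly chosen components.
        print(alpha, np.round(np.sort(draws[0])[::-1][:5], 3))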
Now, suppose theta comes from a Dirichlet and z comes from theta; say z_n comes from a multinomial with parameter theta. That's a little piece of our LDA model. (This is also called the discrete distribution; technically it's not a multinomial, since a multinomial has another number attached, the number of draws, but don't worry about it.) So z_n comes from a multinomial with parameter theta, and theta comes from a Dirichlet, and we can ask the question: what is p(theta | z_1, ..., z_N)? In other words, if I observe N topic assignments, what is the conditional distribution of the topic proportions given those topic assignments? What conjugacy means is this: if we let n_k(z_{1:N}) be the count of each atom (in other words, if I saw topic 13 six times, then n_13(z_{1:N}) = 6; it's just a little counting function, clear?), then theta given z_1, ..., z_N is a Dirichlet with parameters alpha_k + n_k(z_{1:N}). We just add the number of times we saw topic 13 to the alpha parameter for topic 13, and that's the new Dirichlet parameter: the posterior Dirichlet. And notice that your instinct from earlier, that as the sum of the alphas gets larger you get a peakier and peakier distribution, is mirrored here: as we see more and more observations, our posterior gets peakier and peakier and peakier, and that's very intuitive. As I roll the die more and more, my idea of what the distribution over the faces is becomes more and more confident. It's the whole idea of the prior speaking loudly and the data speaking as loudly as is proportional to how much data you saw: as you see more data, you become more and more confident in the estimate that the data gives you. Okay, we've totally diverged from topic modeling, so: any questions about this, and then we can get back to it for fifteen minutes?
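In symbols, the conjugate update just described:

    n_k(z_{1:N}) \;=\; \#\{\, n : z_n = k \,\},
    \qquad
    \theta \mid z_{1:N} \;\sim\; \mathrm{Dirichlet}\big(\alpha_1 + n_1(z_{1:N}),\, \ldots,\, \alpha_K + n_K(z_{1:N})\big).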
One more question: in a sparse draw, what do you believe the other components are? They're close to zero; they'll be something like 0.0001, and you can see that in R. And when you sample from a Dirichlet, in order to do it well... I'll tell you how to do it later; let's keep going now. I'll turn the lights on in one second.

Okay, very good, we're back, and here again we have the LDA model. LDA is what's called a mixed membership model in statistics, and it really builds on the seminal latent semantic analysis work of Deerwester et al. and on probabilistic latent semantic analysis from Thomas Hofmann; let's not go into details about those. The reason it's called a mixed membership model is this: a model where each document comes from a single cluster, each document associated with a single z, is a mixture model; here, since each document is associated with theta, a distribution over components, each document can be associated with multiple components. That's what the mixed membership idea is. So when you're reading Bayesian Analysis or the Annals of Applied Statistics and you hear about mixed membership models of rank data and this data and that data, you can think about LDA as an instance of a mixed membership model; that's what this is called in statistics. And for document collections and other grouped data (I should have emphasized "grouped data": rather than thinking of a document as a single data point, it's really a group of data points; the data are the words, and the document represents the group of words), the mixed membership assumption is often more appropriate than a simple finite mixture, which is the natural alternative. I should also mention that in statistics essentially this same model was invented for population genetics analysis, where it has had a lot of impact, by Pritchard and Stephens. There they don't care about documents; what they care about is real science, and they model people as being mixtures of their various ancestry. My own case is a bad example, because everybody comes from the same small town in Hungary, but my wife's mom is from Denmark and her dad (he's a character) is from Canada, basically, and so her genes are a mixture of the different populations that she comes from. It's a lovely combination, I should say, for the video. In that model you can think of the populations as the topics and the people as mixing over the topics. So let's move on before I make some kind of marriage error.

All right, good. Again, this is all a nice story, but we don't get to observe any of this structure: not the topics, not the topic proportions, not the topic assignments. That's really the central algorithmic problem of working with a model like this; you want to infer all of this nice structure so we can use it. So the idea is to infer the per-word topic assignments z_{d,n}, the per-document topic proportions theta_d, and the per-corpus topic distributions beta_k, and then to use posterior expectations (basically the expectations of all those hidden variables given the words) to perform whatever task it is we care about, such as information retrieval, document similarity, classification, whatever it is you're doing.

In the second part of this talk we're going to see that we can't actually compute the exact posterior here, the way we could in that nice conjugate example, and so a lot of approximate posterior inference algorithms have been developed for this model, including mean-field variational methods (which is what we're going to describe later on), expectation propagation (which I know you learned about a bit today), collapsed Gibbs sampling (which I'll talk about later on), and collapsed variational inference (which is very exciting, but I will only allude to it later on). There has also been some work on comparing these different types of inference algorithms and how well they do; there's a great paper from ICML this year comparing them, and also some theoretical work on collapsed variational inference versus mean-field variational inference that I worked on with a student. We're going to get into the details of approximate posterior inference later on, I guess on Thursday, but for now let's assume we have an approximate posterior inference algorithm that we like, and let's look at some real data and what this model does with it. Any questions about this? So this is the model, the only things that are observed are the words, and we want to fill in the rest with approximate posterior inference.

For these first pieces I want to look at the OCR'd collection of Science magazine from 1990 to 2000. To give you an idea of what this data is: have you all seen jstor.org? Okay, good. JSTOR is an online archive of scientific articles, and what they do, it's an amazing project, is take the original bound articles, scan them, run them through optical character recognition software, and then index the original scans via the output of the OCR algorithm. It's cool because the OCR will screw stuff up: maybe this is two columns and their OCR software can't handle two columns, maybe it doesn't get all the words right, maybe the punctuation is totally bogus. But it's pretty accurate, and for search that's okay: if you want to search for a word like "phenotype", then as long as it got "phenotype" a bunch of times in the article, you're going to get the right scan back, and since you never interact directly with the noisy OCR, it's very effective for searching archived articles online. And they have Science all the way back to 1880; we will in fact see that later on. But of course JSTOR has a problem, which is that they have millions of articles and they're only organized by journal and by date. What they want is a system where people can go through, browse, and examine these articles in a topic-oriented way, but since they've scanned so many articles it's out of the question to annotate these scans with keywords by hand. Moreover, they want this to be automatic; they don't want to have to read every article and assign keywords. The whole point is that you don't have to do that: you can just put them through a machine and then let scholars use them effectively. So they're interested in looking at the kind of topic decomposition of their archive.

Here we take their collection of Science magazine for ten years: 17,000 documents, 11 million words, and we used a vocabulary of 20,000 terms. As a practical point, you typically remove stop words (words like "the", "of", "but", "or", "and") and very rare words, which don't affect the inference much, or, as somebody recently told me, might affect the inference badly; we can talk about that later. So the documents were analyzed, and we fit a 100-topic LDA model using variational inference, but it could have been anything.

So here's that same article that I showed you: how many genes does an organism need to survive, in an evolutionary sense? I should say that this article was actually in a test set; we didn't fit the model on it. Using posterior inference, here are the w_{d,n}'s for this article, and here is the approximate expected value of theta_d given those w's and the topics that we estimated; that's what this is an approximation of. This is the real version of the cartoon histogram that I drew for you before. It's saying: look, I've got these hundred topics (we haven't looked at them yet), and given this new article, here are the topic proportions associated with it. Clear? And you can see that it's not using every topic for this particular article: it's saying this topic has high probability, this topic has high probability, this topic has high probability, and so on. Even though it has a hundred topics to choose from, it's not as though it's using all of them to describe this article, so we're getting a description of this article in terms of these topics. And this was all done through posterior inference. Remember, these things are all latent: the topics are latent, the topic assignments are latent, and the topic proportions are latent; we never observe them. With approximate posterior inference we inferred them just from the collection of articles, and we're going to look at them in a second, unless you peek at the secret stuff first.
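The JSTOR analysis above was done with custom variational inference code, but the same workflow can be sketched with off-the-shelf tools. Here is one possible version using scikit-learn's LatentDirichletAllocation, where docs is a hypothetical list of raw document strings standing in for the corpus:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["..."]  # hypothetical: one raw text string per document

    # Bag-of-words counts; drop stop words and terms appearing in more than 90% of documents.
    vectorizer = CountVectorizer(stop_words="english", max_df=0.9)
    X = vectorizer.fit_transform(docs)

    # 100-topic LDA fit with (online) variational inference.
    lda = LatentDirichletAllocation(n_components=100, learning_method="online", random_state=0)
    doc_topic = lda.fit_transform(X)   # approximate topic proportions, one row per document

    # Most probable words in a few topics (rows of the fitted topic matrix).
    vocab = np.array(vectorizer.get_feature_names_out())
    for k, topic in enumerate(lda.components_[:4]):
        print(k, vocab[np.argsort(topic)[::-1][:8]])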
Those latent variables, the topics, the assignments, and the proportions, we never observed; with approximate posterior inference we inferred them just from the collection of articles, and we're going to look at them in a second, unless you looked ahead at the secret stuff. Okay, so here are the topic proportions, and now let's finally look at these topics. Let's look at the top words from the top topics in this article; these are the topics I showed you before. What I've done here, instead of showing you all 20,000 words with their probabilities and making you squint at the numbers, is order the words by probability. So the most probable words in this topic, which is the most probable topic in that document, are words like human, genome, DNA, genetic, gene, sequence, molecular, and so on. Another top topic has words like evolution, evolutionary, species, organisms, life, origin, biology, and so on. Here is disease, host, bacterial, diseases, resistance, and here are the computer modeling words. And I want to emphasize that we didn't have to set any probabilities here. All we did was arrange, painstakingly, with JSTOR and their lawyers, to get their data; then we ran our algorithm on that data, and this topical decomposition of the article comes out of running that algorithm. No probabilities need to be set anywhere. This is what's called unsupervised learning, which I should have mentioned sixteen times by now and didn't. You do still need to set the number of topics, and we can talk about how to do that in a later part of the series. So again, you could stem, and if this were done in some kind of industrial-grade way, then we would probably stem to make the best possible interpretable topics for the JSTOR users. Other questions? Okay. So you can see that this is capturing what seem to be thematically related words, which I'm going to want us to contemplate in a little bit. Here's another example. This is about chaotic beetles, oh, Michael Hassell, and it's about mathematical models of insects and beetles. Again we can play the same game: we do posterior inference, we get back topic proportions, and we look at the most probable words, from the same model, for this document. You see words about mathematical problems, words about models and statistics, words about selection and, I don't know what the word is for this, population theory, and words about ecology and so on, going with the beetles. These topics are all in the same model, and the approximate posterior inference algorithm is deciding to use them in different ways depending on the different observed words. And so, once you've marked up your entire corpus with topics and topic proportions and so on, you can use these tools to browse the collection. Here, for example, is another article from Science; here are some top topics from this article, and here I'm showing which words were assigned to which topics. This usually makes a little bit less sense, because the model has to assign every word to a topic, even words that have low probability. But you can see things like statistics, evaluating, data, statistical, means, comparisons in one topic; acid, proteins, spacing, amino in another; and words like general, text, provides, recent, which seems like a bogus topic, or not a bogus topic, but it seems like the model had no better place to put those words. You can see that it has decomposed this article into multiple overlapping, recurring patterns of words.
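Continuing the earlier sketch (it reuses the vectorizer, lda, and theta_test objects from that snippet), this is one way to produce the kind of display described here: order each topic's words by probability and print the top words of the most probable topics for a single article.

```python
# Continuing the previous sketch: show the most probable words of the most
# probable topics for one inferred article.
import numpy as np

vocab = vectorizer.get_feature_names_out()

def top_words(topic_weights, n=10):
    # lda.components_ holds unnormalized topic-word weights; normalizing each
    # row gives the topic distribution beta_k over the vocabulary.
    beta_k = topic_weights / topic_weights.sum()
    return [vocab[i] for i in np.argsort(beta_k)[::-1][:n]]

doc_theta = theta_test[0]                    # topic proportions of one article
for k in np.argsort(doc_theta)[::-1][:4]:    # its four most probable topics
    print(f"topic {k} ({doc_theta[k]:.2f}):", " ".join(top_words(lda.components_[k])))
```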
Then you can do things like ask for similar documents according to these topic proportions. Rather than using words to find document similarity, we can use the topics to find document similarity, and here are the titles of the documents that are most similar to this document by topic. That can be a form of what's called query expansion: if I write an article about my cat and I use the word cat, and you write an article all about your cat and you use the word feline, then a traditional document similarity metric won't say that our articles are similar, even though our cats might be extremely similar, whereas a topic model, knowing that cat and feline are somehow in the same group, will say that our articles are similar, and that indeed our cats have very similar personalities (a small code sketch of this idea appears at the end of this passage). So this is a way you can use these types of models, and in fact if you look at Rexa, I don't know if you've seen it, this is a project that came out of UMass Amherst, from Andrew McCallum's group, where they use topic models to help you browse a large collection of computer science academic articles, so it's worth checking out. Questions? A question from the audience: aren't there other words like that, which carry little meaning? Good question. So I said that we remove stop words, words like "a," "but," "and," "or," "the," but really what that means is that we remove words that occur in essentially all of the articles. In fact, if a word occurs in more than 90 percent of the articles, then we remove it, because your intuition is correct: the model is going to want to place those words with high probability in all the topics, because they occur so frequently, and then they don't help decompose the collection. That's typically why we do that. There are also ways of handling this post hoc, after you get the topics: instead of ordering each topic's words by probability, there are other scores you can use to order the words. I can talk about that on Thursday if you like. That's a good question, and it leads me to a question for you, so actually I won't show the answer yet. Another question: is there an intuition for why it doesn't happen that, rather than putting those frequent words into all of the topics, the model discovers one topic that only has those words, a topic it expects to appear in all the documents? There is an intuition for that, but I'm not going to answer it, because it relates to my next question, which is for you, you plural: why on earth does this work? That's the question. You know, I told you, don't look at that, that's secret. I showed you this generative model, and, let's look at this picture, it's better to look at this picture, here's the generative process, and we seem to have just made it up, yet consistently we get back these distributions over words that look like thematically related terms. When you're doing applied Bayesian modeling it's important to think both about the prior, which is essentially what we're specifying here, and about the posterior. So why could we expect to find these kinds of combinations of words as highly probable words under the posterior, and if so, why? It's time to go, unfortunately, but I want you to contemplate that, and on Thursday we'll begin by hearing your answers to that question. I have an answer, but who knows if it's right, and I'm interested in you thinking about what the posterior distribution means here and why it might do something like this.
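To make the document-similarity idea from this passage concrete, here is a small sketch that ranks articles by the similarity of their inferred topic proportions, reusing theta_test from the earlier sketch. Hellinger distance is an assumed choice of metric, not one specified in the lecture.

```python
# Sketch of topic-based document similarity: compare articles by their topic
# proportions theta rather than by raw word overlap.
import numpy as np

def hellinger(p, q):
    # Distance between two probability vectors; 0 means identical mixtures.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def most_similar(theta, query_index, n=5):
    query = theta[query_index]
    dists = np.array([hellinger(query, row) for row in theta])
    ranked = np.argsort(dists)
    return [int(i) for i in ranked if i != query_index][:n]

# e.g. indices of the five articles whose topic mixtures are closest to article 0
print(most_similar(theta_test, query_index=0))
```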
What is this actually capturing, somehow, in plain English? I mean, I want the description to be in plain English; you don't have to analyze English documents, but what is this capturing? I want to know that answer in plain English. Okay, thank you.
Info
Channel: VideoLecturesChannel
Views: 55,262
Rating: 4.9491096 out of 5
Keywords: Machine Learning, Computer Science
Id: DDq3OVp9dNA
Length: 82min 23sec (4943 seconds)
Published: Thu Dec 13 2012