Matti Lyra - Evaluating Topic Models

Video Statistics and Information

Captions
Okay, hello everyone. I hope you're enjoying the conference so far; if you're not, you can blame me. My name is Matti Lyra, and the talk was originally titled "Evaluating Unsupervised Models", but as I was making the slides I figured I'd only have time to talk about topic models, so that's all you'll hear about today. There's my Twitter account if you want to tweet at me. There is a companion notebook that goes with the talk, with quite a lot more detail on the things I'm going to say; you can find it on my GitHub page, and I'll try to get it posted on the PyData Berlin blog as well.

This talk is really an overview of what's available for evaluating topic models. There are quite a lot of links to relevant literature as well as software, which is why the companion notebook exists: you don't have to write everything down now, you can just go to the notebook and find all the references there. I'm going to cover quite a lot of material, and the aim is to give you an intuition for why the metrics exist and what it is they're trying to measure, so I won't go into detail on almost anything. I'll focus on tools that are open source and on research papers that are open access, so this is not an exhaustive list of everything, just a summary.

The talk is in three parts. I'll start off with just eyeballing the models, evaluating for yourself whether you got something reasonable or not; then some intrinsic evaluation metrics; and then some metrics based on human judgements, which essentially try to capture in a metric what the eyeballing in the first part does. There's an obvious omission here, which is extrinsic evaluation. If you're doing, say, document classification with a topic model, then the classification task itself is your evaluation metric: you don't need to know whether the topic model has captured some semantic information, you just optimise whatever classification metric you want, and as long as the topic model performs well on the task, that's all you care about.

There are quite a lot of different kinds of topic models, and I expect you to have some understanding of what topic models are and how they work; I'm not going to cover any of that. The idea is that they work on a corpus of documents, so the context for doing all of the inference is a single document. It's somewhat poorly defined what a document means: it can be anything from a sentence, like a tweet, to a paragraph in a book, or a chapter, and so on. Once you have a model, you run new documents through it and you get a decomposition of the topics contained in that new document. As I said, there are lots of different models you can use, but the kind of information you want to get out is pretty much the same regardless of the inference algorithm, so for this talk it doesn't really matter whether you're using LDA, pLSA, LSI, HDP or whatever.

This is usually what you would want to have: each dot here is a single document, and they're all neatly clustered into discrete groups, so you can point to a cluster and say, right, that's machine learning, that's sport, that's this and that's that. You can also see that the model has captured some structure in the space: tennis and sport are close together, so they're in the same part of the semantic space, whatever that means.
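As a concrete reference point for what follows, here is a minimal sketch of fitting a topic model and decomposing a new document with gensim. The corpus, tokens and parameter values are made up purely for illustration (the talk's own model was trained on the full Kaggle fake news dataset), and as noted above the same kind of output could come from any of the other algorithms.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical pre-tokenised corpus: a list of token lists, one per document.
docs = [
    ["pipeline", "protest", "dakota", "police", "access"],
    ["water", "flint", "lead", "crisis", "michigan"],
    ["nato", "russia", "weapons", "missile", "defense"],
]

dictionary = Dictionary(docs)                       # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

# Tiny illustrative model; the model shown later in the talk used 35 topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10)

# Decompose an unseen document into a mixture of topics.
new_doc = dictionary.doc2bow(["flint", "water", "lead"])
print(lda.get_document_topics(new_doc))             # e.g. [(2, 0.83), (0, 0.05), ...]
print(lda.show_topics(num_topics=5, num_words=6))   # top words per topic
```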
This, however, is usually what you actually get. This is a topic model I trained on the fake news dataset available on Kaggle (the link is in the notebook), and you can see that we've captured some structure. Topic 9, for instance, is clearly related to the Dakota Access Pipeline protests; topic 11 is the Flint water crisis; there's something related to Russia, weapons and NATO, and so on. So there is some structure there, but evaluating whether the model has actually captured all the things you want it to is cumbersome at best. How do you know if the model missed something that was in the corpus and you're just not seeing it? This also takes a lot of time.

So why evaluate? You want to find out whether the model has really captured the internal semantic structure of the corpus; that's usually the end goal. You want to know if the topics are understandable, and maybe quantify that somehow; whether they are coherent; whether they overlap with each other a lot; whether topics are missing; and so on. Ultimately the evaluation that matters is the utility the topic model has for you. If you're using the topic model just to understand the semantic structure of a corpus, then the ultimate evaluation metric is how much the model furthered your understanding of that structure. That is of course difficult to put a fixed number on, but it's good to keep in mind that that subjective measure is the thing we actually want to measure.

Looking at lists of top words from a model is a little bit cumbersome, so what are some tools that can help with that kind of exploration? Here's some research from a Stanford-based group. The first paper focuses on re-ranking the top words from the model. In the previous table you saw a sample of six top words from each topic; on the left here you see words, the columns are topics labelled with a topic number, and the terms are ranked by their overall corpus frequency, so you can't really pick out any immediate structure. On the right you see one of the methods they develop in the paper (they actually develop a few different metrics); this particular one uses pairwise conditional probabilities of how likely the words are to occur close together within a window. Hopefully you can see that the method preserves word order, so things like "aspect ratio" and "social networks" maintain their order within the ranking, but the ranking also reveals clusters that are semantically coherent: this blue cluster is about social networks and network analysis, and so on. That gives you a much better and much faster understanding of what the model has captured.
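The paper's own re-ranking is more involved than I can show here, but even a simple lift-style re-ranking, sketched below purely as an illustration (this is not the method from the paper), already pushes generic high-frequency terms down and topic-specific terms up.

```python
import numpy as np

def rerank_terms(topic_word, corpus_word_prob, top_n=10):
    """Rank terms per topic by lift p(w|t) / p(w) instead of raw corpus frequency.

    topic_word: (num_topics, vocab_size) array of p(w|t)
    corpus_word_prob: (vocab_size,) array of overall corpus probabilities p(w)
    """
    lift = topic_word / (corpus_word_prob[np.newaxis, :] + 1e-12)
    # argsort descending; keep the top_n term ids for each topic
    return np.argsort(-lift, axis=1)[:, :top_n]

# With a gensim model the topic-word matrix is available via lda.get_topics(),
# and p(w) can be estimated from the corpus term counts.
```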
The second paper moves on to the document level and does essentially the same thing, but for documents. Here you see a labelled reference corpus: the authors gathered information from domain experts, asking them to define their domain through keywords, and also to provide some example documents, representative research papers from their field. That allowed them to build a reference corpus; then they built the topic model, and you can see that some topics map onto several reference concepts, and some reference concepts map onto several topics, so one concept is split into several topics, or one topic is trying to model two different reference concepts. There are also missing reference concepts: information that exists in the corpus but that the topic model has not captured. This Termite tooling is actually available on GitHub for you to play around with.

Then I'll move on to LDAvis, which was originally an R package and now has a Python port, pyLDAvis. This is another interactive tool: you get an HTML output that you can open in your browser. The way it works is you have your topic space on the left and your term space on the right. Each circle is a single topic, and the proximity of the circles is supposed to reflect the semantic proximity of those topics; it's trying to capture what the first, ideal picture showed, with one discrete set of concepts here and another set of different concepts there. The way you use it is you can highlight a word, and you see the topics where that word is very prevalent, where it's used a lot or is important. You can see that when you highlight "clinton", or indeed "trump", the same area of the topic space gets highlighted; this is the same fake news dataset from Kaggle, so we're clearly talking about US politics. You can also flip this around and select a topic; once you do that, the list of terms changes, so you get a different set of terms that are specifically related to that topic, ranked by how important they are to it. This gives you another way to interactively poke the model and relatively quickly figure out whether it captured something interesting, and what that might be.
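For reference, producing this kind of interactive view from a gensim model is only a few lines with pyLDAvis. This is a minimal sketch; the module path has changed between releases (older versions use `pyLDAvis.gensim`, newer ones `pyLDAvis.gensim_models`), so treat the exact names as approximate.

```python
import pyLDAvis
import pyLDAvis.gensim  # in newer releases: import pyLDAvis.gensim_models

# lda, corpus and dictionary as in the earlier training sketch.
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary, mds="pcoa")
pyLDAvis.save_html(vis, "fake_news_topics.html")  # open the file in a browser

# The mds argument selects the 2-D projection discussed below:
# "pcoa" (principal coordinate analysis, the default), "mmds" or "tsne".
```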
There are a few problems, however. First of all, I forgot to say, the sizes of the circles represent the importance of a topic in the model, or in the corpus: how much of that topic the corpus contains overall. So you get a general sense that some topics are more important than others. The problem is that it's really difficult to judge the relative sizes of the circles. This is a different model, from the pyLDAvis example notebook, but it illustrates the point: topic 1 here is clearly the most important, but can you tell what the relative importance of topics 10 and 15 is? The reason this is bad has nothing to do with pyLDAvis or LDAvis as such; it's just genuinely difficult to determine the relative sizes of circles. Can you tell what the area of these circles is in relation to each other? Is this ten times the area of that one, or two times, or five times? The actual answer is that the blue circle is twice the area of the orange one and five times the area of the green one. It's just difficult for humans to judge that, mainly because the area grows as the square of the radius, so it's really hard to make those distinctions visually.

The other problem is that the proximity in this topic space does capture something, but it's also somewhat arbitrary. Yes, your question is: what are the axes on the graph? I'll explain that now. The plot is supposed to capture the semantic similarity of topics, so some topics go here and some topics go there. The problem is that the plot is obviously two-dimensional, but the topics aren't, so this is projected down from the inter-topic similarity matrix into two dimensions, and there are several different algorithms for doing that projection. On the left you see principal coordinate analysis, and here you see metric multi-dimensional scaling. Again the question: are topics 10 and 15 close together or far apart, semantically similar or not? The axes are the principal components of the projection, but that's also somewhat meaningless in this context; they might as well not be there, they don't give you any information. You also lose all sense of scale, because the sizes of the circles and the locations of the circles are not related to each other in any way. So to answer your question: these axes should not really be there, they're essentially meaningless. It's a projection from a high-dimensional space down to two dimensions, and how high-dimensional depends on how many topics you have: it's a pairwise topic-to-topic similarity matrix projected down onto two dimensions, but what those two dimensions are, you don't know. You would hope they mean something, but it depends on the algorithm you use and what it actually gives you as output; to really understand what the axes mean you need to look into the details of how the projection algorithm works.

Okay, let's move on to intrinsic evaluation, which will hopefully give us a better understanding of what the model has done. If you look online you will often be pointed to perplexity. Don't use perplexity. It comes from the language modelling community, and it doesn't capture the kind of semantic information that we want. Ignore everything to the left here; it's just the log-likelihood of some held-out data, where D′ is data your model hasn't seen before. If you have a model that always predicts "cat", and a held-out corpus that only contains the word "cat", your likelihood is going to look great, but that's not a particularly good topic model. The measure doesn't tell you anything about how well the semantics of the corpus have been captured.
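For completeness, the quantity being discussed is essentially the standard held-out perplexity; this is my own notation for it rather than a reproduction of the slide:

```latex
\mathrm{Perplexity}(D') \;=\; \exp\!\left( - \frac{\sum_{d \in D'} \log p(\mathbf{w}_d \mid \text{model})}{\sum_{d \in D'} N_d} \right)
```

Here $\mathbf{w}_d$ are the tokens of held-out document $d$ and $N_d$ is its length; lower is better, and nothing in the formula looks at whether words that belong together end up in the same topic.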
So what would a measure look like that does capture that? Forget for a moment that we have a model at all, and think about how we would describe concepts, and how we could capture the semantic coherence of those concepts if we wanted to. Here are two documents that talk about ice hockey: the first is about one Finnish ice hockey player and the second is about Pekka Rinne, another Finnish ice hockey player. I now assert to you that these are the words that capture the semantics of the concept of ice hockey; this is, roughly speaking, what it means for something to be related to ice hockey. We have words that are clearly related to ice hockey, the names of leagues, teams and players, but there's also all sorts of general stuff like "shots", which is definitely related to ice hockey, but seen out of context, without these documents, you might associate it with something else entirely. Let's take this further and remove the unrelated words, so we have two sets of words, and now we say that this is the meaning of ice hockey. You can see where this is going: this is the output that you get from a topic model.

So, if we were given this description of the category of ice hockey, how would we capture its semantic coherence from some data? There's quite a bit of research on this, and here is one of the measures. What you see is essentially the conditional likelihood of the words occurring with each other. Notice that the conditional likelihood captures information about two terms, so there is some context there; it is not context-free like the perplexity measure was. So now we can take the words we defined, go and look at some reference corpus, measure the probabilities, and then use this measure. Sorry, is this iterating over words or word pairs? I'll get to that in just a moment.

This research has encapsulated the various different ways of measuring that confirmation metric, the one I just showed. It's partially implemented in Gensim; not all of the measures are available there, but most of them are. What it does is: first you do segmentation, which is exactly the putting-together of words that I just showed; then you have some way of calculating the probabilities, which I'll get to later; then you compute the confirmation measure, the thing I just showed you; and then you do aggregation, which basically just means you take the mean (they describe a number of different aggregation measures, but essentially it's the mean).

Let's see what this looks like in practice. I'm going to take this pipeline topic, topic 9, from the same model you saw before, and do segmentation; this is what you compute the measure over. There are a lot of different ways of doing segmentation: you take the top words from the model and then you form pairs of words, or pairs of pairs, or pairs of triples, and so on; you decide how you do that.
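For reference, two of the standard confirmation measures that can be plugged into the next step look like this; these are the usual textbook forms, not necessarily the exact notation on the slide. Both are computed over pairs $(w_i, w_j)$ of top words produced by the segmentation step, with probabilities estimated from a reference corpus:

```latex
C_{\mathrm{UMass}} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1}
    \log \frac{P(w_i, w_j) + \varepsilon}{P(w_j)}
\qquad
\mathrm{NPMI}(w_i, w_j) = \frac{\log \dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}
```

NPMI itself is bounded between -1 and 1.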
Then you compute your confirmation measure, which in this case uses probabilities; not all of them do, some are vector-similarity measures. You need to have some way of computing the probabilities, so maybe you just use the document-level probability of the words occurring, estimated from a reference corpus that the model ideally hasn't seen, and then you get a number out. So: you take a model, here with 35 topics, you do some segmentation, you pick some way of calculating probabilities, you compute some confirmation measure, and then you do aggregation, which you should just read as "take the mean", and you get a number out. This first measure is the one they develop in the paper; it's a combination of vector similarity over the words and NPMI. I'm not exactly sure what the scale of it is, but I think it's between minus one and one. You can take the same model and compute a different confirmation measure, and you get a different number on a different scale, which is not exactly helpful. Then you can feed in a model that has more topics than this one, and you get a number that is again different from the second one, but now look at the scale: is this model three times as bad as that one? I don't know, because the scale is just undefined. You can see the problem.

I think this is a step in the right direction, but there are a few issues. First, you're computing the measure over pairs of top words from the model and then taking the mean over all of those pairs; there are going to be a lot of pairs, so you're averaging over a lot of things, and you have no idea what the variance is, because that isn't included in the pipeline. So you have no way of knowing whether the value you get out is actually representative of the overall similarities you've just inferred. The different values are not really comparable; I'm pretty sure you could do some normalisation, but that would be extra research. And you need to be really, really careful about what your reference corpus is, because that's where the probabilities fed into the confirmation measure come from. If you're working with Wikipedia, you're probably fine: you have enough data to do an 80/20 split and use the 20 percent as your reference corpus. But if you're doing something domain-specific, say topic modelling on legal documents, there tend not to be that many legal documents openly available, and the datasets you have may not be big enough to split. The problem is that legal text doesn't look anything like the text you get from Wikipedia, so if you use Wikipedia as the reference corpus you're going to get really weird probabilities or similarities out of it, not because your model is wrong, but because the two corpora don't match in terms of the words or the statistics you see in them.
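In code, the whole pipeline is wrapped up in Gensim's CoherenceModel. A minimal sketch, reusing the model, corpus and dictionary from the earlier training snippet and assuming `reference_texts` is a held-out, tokenised reference corpus:

```python
from gensim.models.coherencemodel import CoherenceModel

# reference_texts: list of token lists, ideally from data the model hasn't seen.
cm_cv = CoherenceModel(model=lda, texts=reference_texts,
                       dictionary=dictionary, coherence="c_v")
cm_umass = CoherenceModel(model=lda, corpus=corpus,
                          dictionary=dictionary, coherence="u_mass")

print(cm_cv.get_coherence())     # C_v: NPMI-based context vectors plus cosine similarity
print(cm_umass.get_coherence())  # UMass: document co-occurrence, a different scale entirely

# Per-topic values expose some of the spread that the single aggregated mean hides.
print(cm_cv.get_coherence_per_topic())
```

Comparing the two numbers against each other, or across models with different numbers of topics, runs into exactly the scale problems described above.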
The other problem is that it could theoretically be the case that this 100-topic model, which scores clearly worse than the 35-topic model, actually contains all 35 topics that the smaller model captured, plus, say, an extra ten good topics that it didn't, while all the rest are junk. So you have 45 topics that are good and 55 that are junk, and those junk topics can drag the score down. But if you look at the model you can tell what the junk topics are, you see them right away and can just ignore them, and the 45 good topics you got are actually better than the 35 you got here; that is not reflected in this measure.

Finally, human-judgement-based methods. Again, this research comes off the back of people noticing that perplexity is a bad thing to use and trying to figure out what would be nicer. In the "Reading Tea Leaves" paper they develop two different tasks to measure how well the predictions of the model correspond to our intuitions of what topics are, or should be. The first is word intrusion: you have a set of words, taken from the top words of a topic, and you insert a word that has low probability for that topic and high probability for some other topic, so it intrudes on the semantics of this one topic. Here most people will say the intruder is "pig": the humans' predictions of the intruding word converge on a single term. Whereas if you have a topic that is semantically incoherent, the guesses will be all over the place, and you can tell from the spread of the predictions that this particular topic is not semantically consistent. The other task is topic intrusion: you take the title of a document and a snippet, and when you run that document through the topic model you get a probability distribution over topics. You take a few of the top ones, and then one or two of the really bottom ones, topics the document is really not about. So the top ones are supposed to belong to the document, the bottom one is one of the low-ranked ones, and otherwise the setup is exactly the same as for word intrusion. The problem here is that you very rarely get really clean-cut topics, so it's often quite difficult to determine which the intruding topic is.
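A minimal sketch of how a word-intrusion item might be assembled from a fitted gensim model; the selection heuristics (top five words, intruder drawn from another topic's top words but rare in this one) follow the spirit of the paper rather than its exact protocol:

```python
import random

def word_intrusion_item(lda, topic_id, other_topic_id, num_words=5):
    """Build one word-intrusion question: top words of a topic plus one intruder."""
    top = [w for w, _ in lda.show_topic(topic_id, topn=num_words)]
    # Candidate intruders: prominent in the other topic, absent from this one's top 50.
    this_topic_top50 = {w for w, _ in lda.show_topic(topic_id, topn=50)}
    candidates = [w for w, _ in lda.show_topic(other_topic_id, topn=10)
                  if w not in this_topic_top50]
    intruder = random.choice(candidates)  # assumes at least one candidate survives
    words = top + [intruder]
    random.shuffle(words)  # annotators must not be able to tell which word was added
    return words, intruder
```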
So, in summary: topic models are difficult to evaluate. You can do qualitative evaluation, but it's labour-intensive and often very subjective. Quantitative analysis comes with all kinds of issues that you have to deal with, basically the same ones you get whenever you're dealing with humans. Open-source software is available, and there's also lots of research available as open access. But the real take-home point is that you cannot define a universally good model; it does not exist, because it depends on what you want the model to do, on what you're trying to achieve.

The final point, if I may run over time a little bit, is the question of why these things are so difficult to evaluate: it's because we don't have a proper definition of what coherence means, or of what we mean when we say "concept". Essentially the question is: what are words? I have a short video here; is the sound on? There is no sound, okay, forget it. It's a news clip, and it's funny: basically the presenter is talking about Trump, and he's asking, when he says words, does he mean the words, or does he mean something like the words? This is indeed a good question, because how reasonable is it to expect sets of words to consistently capture the concepts that we have in our heads of what exists in the world? How many words would you need to capture, I don't know, ice hockey, and would the person next to you pick those same words? Would they understand the same category? So "cat, dog, pet, rabbit, horse" are pets, animals. But if I change a few of the words, is this topic still about cats, or is it inconsistent? "Snowball" could refer to the cat in The Simpsons, but then "dog" is a kind of an outlier, so maybe that's an interesting word after all. And this problem isn't limited to topic models; it applies to all unsupervised NLP, including word vectors. "Cat" and "dog" will often get similar vectors, because they're similar kinds of things, but they can also be the opposites of each other: if you're talking about cat people and dog people, then they are semantically precisely the polar opposites of each other, and we don't really know how to capture that at the moment. Thank you very much.

[Applause]

I'm going to answer your questions really slowly; we also have an hour and a half for lunch. I can repeat the question, it's fine.

[Audience:] This slide is very confusing to me, because I've learned principal components analysis, but here we have principal coordinates?

It's principal coordinate analysis, not principal components analysis.

[Audience:] I see, but what's the difference between principal coordinates and metric multidimensional scaling?

You're going to have to read the papers yourself. The third algorithm for this, by the way, is t-SNE. They all do dimensionality reduction, but how exactly they do it, take a weekend off and go read the papers. If you look at pyLDAvis, for example: by default it uses principal coordinate analysis, PCoA, not PCA.

[Audience:] Okay, that might explain things better, cool. Then a side note about coherence: you mentioned that coherence is not useful?

No, that's not what I'm trying to say; I'm saying that it's not without its faults. If you run the coherence pipeline in Gensim and you get a big number out, it will give you some understanding.

[Audience:] But it's a more mundane, everyday problem that coherence solves in Gensim. The reason it was added is that when you train an LDA model over, say, two days and it needs to keep running...

No, no, I'm not presenting this as criticism of Gensim.

[Audience:] No, just that the reason it was added is a more mundane technical problem: when you train a model for a while, your perplexity...

Yeah, okay, that's fine. And I agree with the point you're trying to make, that it is useful. My motivation for this talk is to say that evaluating these things is really difficult: the fact that you get a number out doesn't necessarily mean that you have a good model, and it doesn't mean that you have the model you wanted to have.
[Audience:] On PCoA versus MDS: one thing you kind of hinted at but didn't make very explicit is that these projection methods are actually very complicated beasts in their own right. You're taking LDA or HDP or whatever, completely unsupervised, lots of moving parts, some with more free parameters than others, and you get something very different each time; and you can't see what it's doing, obviously, because it's a hundred-dimensional space if you have a hundred topics. So you say, fine, I'm just going to reduce that down to two dimensions, but the way you do that matters: all these methods throw away a different amount of information and make different sacrifices, and even worse, they have a lot of non-determinism and a lot of free parameters. t-SNE, for instance; there was a really good blog post recently about t-SNE with something like twenty knobs you can turn, and depending on how exactly you set those knobs you end up with completely different results. So bear in mind that this is a very harsh reduction of information down to 2D, just because we can't think in more than two dimensions.

Yes, that is a very valid point, thank you. I think I put that blog post in the notebook as well; it's a very good one. But it doesn't as such reduce the utility of the tool: you do actually get these groupings of documents or topics that are semantically related; it's just that determining exactly how related they are is, again, a little difficult. This is the exact same model with the dimensionality reduction done in two different ways: are topics 10 and 15 close semantically or not? In one picture yes; in the other they're on opposite sides of the coordinate system. What are you supposed to make of that information? Other questions?

[Audience:] In practice, which of these should I use to choose a model?

In practice, do all of the above. Again, it depends on what you want. Use the visualisation tools to get an understanding of whether roughly the sort of thing you're expecting to find in the model is there; look at the metrics, look at perplexity if you want, and compare the outputs of those; but don't just look at one thing and go, okay, that's fine, we're done.

[Audience:] A good practice I've learnt, though I don't always follow it, is to define what a good result would be before I actually start evaluating.

Yeah, that's one way of doing it. It's often difficult, because often the reason you pick up topic modelling in the first place is that you don't know what's in the corpus and you're hoping the topic model will tell you. But depending a little on your use case, maybe you can define what a good output would be; if you're just trying to understand the semantic composition of the corpus, then I don't really know how you would even define that.

[Audience:] When you spoke about the coherence evaluation measure: when it has a high value, does it mean that most of the words in a single topic are actually found close together in the reference text?

Yes; hold on, we're talking about this, right? All of the confirmation measures, the things that give you the actual metric, work in precisely that way: a lower number is worse and a higher number is better. But again, you can't really compare the scales of the different measures.
There was a question at the front here first, and then one at the back.

[Audience:] Thanks for the talk. First, a practical amateur question: you've got your model with a lot of topics, maybe a lot of dimensions for each topic, and you reduce it to two so you can plot it nicely, and as the gentleman over there said, you can do that in lots of ways and probably end up with almost any representation you want, like the two completely different projections we saw.

Yeah, absolutely.

[Audience:] So once you do that, and you look at it and say, okay, it looks good, Clinton and Trump next to each other, so this is probably a good model: can you then use the two dimensions you dropped down to for further modelling, or do you always go back and use the bigger space?

I don't know what the two dimensions mean.

[Audience:] But do you know what the hundred dimensions mean that went down to two?

Okay, so let's say you have a 100-topic model; then you have a 100-by-100 topic similarity matrix, and those dimensions you do know: you know exactly the similarity of, say, topic 10 to topic 12, because it's in the matrix. They're using the normalised Kullback-Leibler divergence to measure that similarity, so that gives you a precise understanding of how it's being measured and what that number represents. But that matrix is then cut down by one of three different algorithms, or a fourth one if you can propose one or have one ready, and what the interpretation of the remaining dimensions is depends on the algorithm.

[Audience:] Then my follow-up is more an idea than a question: if you interpret the compressed version in your plot and say your model is good based on that visualisation, which we've seen can be completely bogus, what's to argue against just deciding where you want Trump and Clinton to be and working out a projection that puts them there? Is there any point to it?

You normally don't insert the answer that you want into the method that is supposed to give you the answer. You could do that, and you would probably get the answer you wanted, but how would you then know whether the answer is valid?

[Audience:] I'm sorry, I'm not sure if you answered this in the first half of your talk, because I wasn't there. Do you think there are different kinds of metrics depending on what you're using the LDA model for? For example, if you use it for clustering or something, is one of these metrics better than the others?

What do you want to cluster?

[Audience:] I'm just saying, depending on what you want the topic model to do.

The obvious answer to that is yes: if you're using the model for an external task, that task is your metric. And if you're doing clustering, well, one thing I didn't actually mention at all, in relation to clustering: remember the picture of the ideal output from a topic model. There are a bunch of network analysis methods and clustering metrics that are completely unrelated to topic modelling, just generic clustering metrics, that let you measure how well separated these clusters are among themselves. You could use something like that if what you want to get out of the model is well-separated clusters, but that's not always the case.
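As a rough illustration of that last point, one could treat each document's dominant topic as a cluster label and apply an off-the-shelf separation measure, such as the silhouette score, to the document-topic vectors. This is just one generic clustering metric among many, not something prescribed in the talk, and it reuses the `lda` and `corpus` objects from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Document-topic matrix from the gensim model (one row per document).
doc_topics = np.array([
    [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in corpus
])

# Treat each document's dominant topic as its cluster label...
labels = doc_topics.argmax(axis=1)
# ...and measure how well separated those clusters are in topic space.
print(silhouette_score(doc_topics, labels, metric="cosine"))
```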
Thank you very much.

[Applause]
Info
Channel: PyData
Views: 17,582
Rating: 4.9475408 out of 5
Keywords:
Id: UkmIljRIG_M
Length: 45min 5sec (2705 seconds)
Published: Wed Jul 26 2017