Bhargav Srinivasa Desikan - Topic Modelling (and more) with NLP framework Gensim

Captions
So we're going to start by talking about topic modeling: what topic modeling is and why you might want to do it. Let's all picture for a second that we're working for a newspaper agency, and our unfortunate job is to organize all of the documents, both the internal documents and the newspaper articles themselves. You have sports articles, you have articles on politics, and none of this is labeled. So our motivation here is to try to get some structure out of completely unlabeled textual data, any form of textual data. One approach is to try and see if we can find out what topics are inside our data set. For example, if we're looking at newspapers, we could broadly say we have three topics: politics, sports, and the weather. Terribly boring, but this is our job now and we have to do it. So one way would be to say, okay, I have all these news articles, I'm going to guess around three topics, and I'm going to fire up my topic modeling algorithm, and what I'm going to get is a bunch of words which represent each topic. For example, if I ran this on 2017 newspaper articles, I'd get a topic, unfortunately, about American politics and Trump and so on and so forth; I'd get one about weather, and if you live where I live it's going to be depressing weather, cold and rainy all the time. Basically, depending on the kind of articles you're dealing with and the number of topics you ask for, you'll get different kinds of topics. So you can see why this might be a nice way to sort all of your documents into different topics, and it's also interesting to see what kinds of words or ideas are in your data set. So there's a lot of motivation for using topic models, and what we're going to see throughout the tutorial is what exactly topic models are and why they might be useful. Now that we all have an extremely rough idea of what topic models are, that's completely enough to dive in.

Now, the second part about the tutorial: I probably should not have named it a topic modeling tutorial, because there's not too much topic modeling in it; it's actually more about text pre-processing. The reason it's more about text pre-processing is that when I was first putting this together as only a topic modeling tutorial, I realized I was getting really, really rubbish results when I wasn't pre-processing my text, and I figured, hey, this is actually more important than the actual topic models, because the actual topic models are one line of code which Gensim helps me with. So what we're going to do in this tutorial is figure out how to clean a data set and how to set it up to get it ready for topic modeling, but not just for topic modeling, for any kind of textual analysis. Our focus here is more about good practices in text analysis in general, which is going to be super helpful for the actual topic modeling, and you can of course also do all sorts of other text analysis with this, word2vec and so on and so forth, which Gensim also has.

Now, what I would want everyone to do, in an ideal world where everyone has the models downloaded, is to start the Jupyter notebook. Can everyone see? Yep, so this is the Jupyter notebook which I'm going to be using, which everyone should be using. There are two things you folks can do: you can either go to my repository and find the already-run Jupyter notebook, which has every single cell already run, so you can just
follow that notebook and look at that, or you can watch me while I press Enter on every single cell, which can also be fun, I guess. So really the only thing you're missing out on by not having the language models is pressing Enter on the Jupyter notebook yourself. The reason I spent so much time insisting on setting up the environment is so that you can go home and do what I'm doing right now, because the real fun of text modeling is trying it on your own textual data. So it's not a huge deal if you do not have the environment set up, but it would be worth your while to go home and set it up so you can run this notebook later.

So let's start. For those of you who do have everything set up, your first step would be to get your imports done. Let's talk about what data set we'll be using for this particular task: it's a terribly boring data set, a news data set; in particular, it's the Lee background corpus. You can click on the link and go check out the research paper which introduced the corpus. As far as we're concerned, we're going to use this data set because it's small and because it's available in Gensim; it's one of Gensim's default data sets and it's super easy to get running. Again, for those who have the Jupyter notebook running but are using Python 3, please comment out this piece of code; it's there for me because I'm using Python 2.7, and I figured some people might still need these helper functions, but on Python 3 it should work normally without them. So what we're doing over here is loading our data set. Let's actually see what text looks like. Yep, you can see that it's newline-separated, one line for each different article. You'll see that more clearly after we run the spaCy language pipeline, but so far the idea is just to load our textual data set. This is the data set we'll be playing with.
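For readers following along at home, a minimal sketch of what these opening cells might look like; the exact location of the Lee background corpus inside the gensim package is an assumption and can differ between gensim versions:

```python
# Rough reconstruction of the loading cells (not the speaker's exact code).
import os
import gensim

# The Lee background corpus ships with gensim's test data.
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')

with open(lee_train_file) as f:
    text = f.read()   # one newspaper article per line

print(text[:300])     # peek at the beginning of the corpus
```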
Now, this next part is super important, because it's actually the bulk of our problem. We're going to be solving every single pre-processing problem with just one small line of code, which is over here, and this is why I wanted everyone to download the spaCy English language model. I would also like to talk about NLTK, because it's also a popular choice for text pre-processing; in fact, most of the text processing tutorials I've seen do their stemming or lemmatizing or whatever cleaning using NLTK. I would really recommend not using NLTK. You can see in the article which is linked over here in the notebook, where Matthew Honnibal, the creator of spaCy, talks about how NLTK is research-oriented and not really fit for industry or production-level code. I would really recommend reading that article and feeling motivated to stop using NLTK; the article is actually pretty mean, it more or less says that dead code should be buried. So he really does not like NLTK, and I'm not a big fan either. I would really recommend using spaCy, and a huge part of this tutorial is actually about motivating people to check spaCy out and showing why it's super awesome for text processing. By the way, if I'm going too fast or if you have any questions, please put your hand up and I'll stop. But in any machine learning problem, if you give rubbish input, you're going to get rubbish output, so let's really focus on cleaning our data set here.

What we're doing when we write nlp = spacy.load('en') is creating a language pipeline; nlp is now our English language pipeline. Now I'm going to start walking you through the process of text pre-processing. In this particular case, let's talk about stop words. Stop words are words which we think add no information to our textual analysis. In my experience, and this I actually figured out after doing my topic modeling: I ran a couple of topic models on the news data set which you folks saw before, and I kept seeing words like 'say' and 'Mister' and different kinds of words built on the word 'say'. I figured that 'say' is not adding any actual information, so what I decided to do was add the word 'say' to my stop word list. Now, how do you decide whether a word is a stop word? spaCy, when you download the English language model, comes with a lot of words built in which spaCy thinks are stop words for your language. There's also a German language model, so if you're working on German-language data and you download the spaCy German language model, you'll have spaCy's German stop words. And if you want to add more words to the stop word list, you do it like this. How did I figure out that you have to do it like this? Stack Overflow; I had no clue. So basically what we're doing here is: my_stop_words is the list of stop words I want to add to spaCy's list of stop words. I loop over these, and for each word the lexeme is what I get by looking the word up in spaCy's vocabulary, and I mark it as a stop word by setting is_stop to True. It's that simple: if I want to add any more words, I just add them to that list and I have more stop words. About the motivation for stop words again: I knew these were words I needed to add because I saw how terrible my topic models were once I looked at them. Really, adding stop words and knowing which words are stop words is a trial-and-error game, where you look at your topic models, look at words that are not going to be relevant to them, and figure out which words to add. Are there any problems up to here? Yeah, cool.
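A minimal sketch of the stop-word cell being described, assuming a 2017-era spaCy where the English model is loaded as 'en' (newer releases use names like 'en_core_web_sm'); the extra stop words listed are just examples of the kind he mentions:

```python
import spacy

# Older spaCy loads the English model as 'en'; newer releases use
# spacy.load('en_core_web_sm') instead.
nlp = spacy.load('en')

# Words we decided add no information for this data set (found by trial and error).
my_stop_words = [u'say', u'Mr', u'be', u'said', u'says', u'saying']
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]   # look the word up in spaCy's vocabulary
    lexeme.is_stop = True          # mark it as a stop word
```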
So now, in this particular cell, it's going to take a little while, and those of you who are using Python 3 are going to have to remove the clean(text) part, because if you remember, in cell number 2 we commented that out; you would just need doc = nlp(text). Once this is done we'll actually be pretty much done with our pre-processing, which is why I really, really like spaCy: you don't need to do it step by step. The moment you put your text through spaCy's pipeline, we have a lot of interesting information, and I'm going to show you what kind of information we have. Has anyone run this particular cell? Yeah, cool, awesome, a couple of people.

OK, so now we're going to see what this looks like. I think this is the most important cell in the notebook. If we look at our output, you can see 'hundreds of people have been forced to vacate their homes' and so on and so forth; that's the beginning of the text. What's happened over here is that it's looping through every single word in the document and assigning certain properties to each word. So for example, you can find out whether a particular word in this document is a stop word, whether it's a number, whether it's a noun, whether it's a name, place, animal or thing; you find out different properties of every single word by looking at doc. So by passing your text through spaCy's language model, you're assigning properties to every single word.

Let's see how this works. I really want to emphasize the magic of spaCy over here, which you can see in this cell where we're constructing our corpus. Now, a corpus here, at least from Gensim's point of view, and you can see the link which takes you to the page describing what a corpus looks like, but from our point of view a corpus is a list of lists, where each individual list inside is one of your documents. It's actually easier if I show you: if I look at texts, you can see the list of lists where every single word is separate and broken up again. So for example, we saw 'hundreds of people have been forced to vacate their homes', and voilà, this is what we've got after processing. I'm going to walk you through every single line in cell number eight to show you exactly what's going on. We start here: we have two lists, texts and article. Again, a Gensim corpus is a list of lists where every single list inside is your document, cleaned up and separated. So what we're doing out here is: for every single word in my document, if the word is not a newline, which would mean the start of a new document, if it's not a stop word (remember, we don't want stop words in our documents because stop words add no interesting information), if it's not a punctuation mark, and if it's not a number, I add the lemmatized form of this word to my article. Now, what does the lemmatized form mean? It's basically the root of the word; for example, 'saying' and 'said' both have 'say' as their root. We don't really care about the inflected form of the word, we only care about the context, so we only add the lemmatized form. So here, using spaCy, I'm figuring out whether it's a punctuation mark, whether it's a stop word, whether it's a number, and if it's not any of these, I'm adding just the root to my document. Can you see how easy spaCy makes our job? We could also, for example, say we want to keep only certain words: you could check what part of speech each particular word is, and you could build a topic model with only nouns, or only verbs, or only numbers. What I would like to emphasize over here is that you should make sure you understand every single line, because depending on the kind of data set you're dealing with, you're going to have to pre-process your text in a different way; you might want different things in your final topic model, and you might not. What we're doing over here is extremely simple: we're just checking whether it's a number, whether it's a stop word, and making sure to add the lemmatized form to my final corpus.
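A minimal sketch of cell 8 as described, using the spaCy token attributes named in the talk (is_stop, is_punct, like_num, lemma_); the details are reconstructed rather than copied from the notebook:

```python
# Run the whole raw text through the spaCy pipeline (this is the slow step).
doc = nlp(text)

# Build a gensim-style corpus: a list of documents, each a list of clean tokens.
texts, article = [], []
for w in doc:
    # Skip stop words, punctuation and numbers; keep only the lemmatized root.
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num:
        article.append(w.lemma_)
    # A newline marks the end of one article and the start of the next.
    if w.text == '\n':
        texts.append(article)
        article = []
```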
Let's take some time here to try and figure out what's going on. Are there any questions? Is anyone slightly confused here? Yes? So, it's not just the part of speech, whether it's a noun or a verb and so on and so forth; it also does NER, which is named entity recognition. For example, your language model is trained to know whether something is a place, a location, a geopolitical entity, a work of art. I actually talk about this at the end of the notebook, where I go over what else spaCy has to offer apart from text pre-processing. In our particular case, spaCy is really useful for the job, because once we do doc = nlp(text), all our text is being processed by spaCy's pipeline. The only information we're using from this is whether something is a stop word or a number, pretty basic things, but you can get really funky with this and use all the properties spaCy gives you to figure out what kind of data set you want to build. There's also a lot of language support: there's French, German, English, and I think they have things like Hungarian in alpha, so I would really recommend checking spaCy out and seeing all the cool things it can offer. In our case, we're only going to use it to clean our data set, which again is this particular cell, where you can see exactly what's going on: if it's not a newline, and so on and so forth, but you can really go crazy with this. I'm going to leave this next cell, since it has nothing to do with topic modeling I won't go into it, but it just shows some of the power which spaCy has.

Oh, and by the way, to answer the question: yeah, it's super cool, it's trained using a neural network. Before, they used more traditional linguistic rules to figure this out; now it's all done with neural networks. Of course the accuracy is not going to be perfect every single time, but you can keep training it and making it better, and like I mentioned, they have language models in many languages, French, German, and I think Spanish is being worked on, and I think someone was doing Norwegian. So yeah, check spaCy out, it's totally cool.

As for creating our corpus, this is what we're doing. I would recommend looking at texts; let's look at just your first document, so you can see how it's been cleaned up: from 'hundreds of people have been forced to vacate their homes' we get 'hundred people force vacate home'. You can see that what we're doing out here is keeping only the key information from the original text, and this is extremely important in text processing, because I don't need words like 'of' and 'a'; they're not going to add any information to my topic model. So by using spaCy and not NLTK, we have super fast text processing, and you can also use spaCy's other features; I've linked to different notebooks and different projects built only with spaCy at the end of this Jupyter notebook. So totally check spaCy out, which is why I was insisting on getting a language model downloaded over here. Even if you don't have it, now you know what kind of problems to expect when downloading the language model, so you can go home and do it; I would really recommend that. Are there any questions about why a Gensim corpus looks the way it does, whether you've understood what a Gensim corpus is supposed to look like, and also what spaCy is doing? Yeah?

Yep, so it's a good question. The question is: why did I add all the different kinds of words which stem from 'say', and not just 'say'? This is because the stop word check does not look at the lemmatized form; it looks at the original word. What I could have done is also check whether the lemma, w.lemma_, is a stop word, but I just decided to add all the different forms, to give more of an example of how adding stop words works. But I hope that answers your question; I could have done that too.
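A minimal sketch of the alternative he mentions, filtering on the lemma's stop-word status as well as the surface form; this is an illustration, not code from the notebook:

```python
# Variant of cell 8: once 'say' is a stop word, 'said' and 'says' are also
# dropped, because we additionally check the lemma against the stop-word list.
texts, article = [], []
for w in doc:
    if (w.text != '\n' and not w.is_stop and not w.is_punct
            and not w.like_num and not nlp.vocab[w.lemma_].is_stop):
        article.append(w.lemma_)
    if w.text == '\n':
        texts.append(article)
        article = []
```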
Is there anything else? We can really spend a lot of time over here, because after this it becomes rather trivial; we're just going to be using Gensim's topic models to find our topics. This is really the most important part: cell number eight is what you really need to understand and appreciate, because in a few lines of code we're creating a corpus, which is really cool, and the best part is you can use this corpus for Gensim, for scikit-learn, for whatever kind of text processing or machine learning algorithm you have in mind.

Yeah? Oh yes, it does, so you can do that. Again, since this is not relevant to generating our topic models I'm not going to get into it, but there's, for example, something called, I think, lda2vec, where they used sentiment analysis and LDA to analyze Reddit, and there's a lot of cool projects on the spaCy page. They're not paying me, by the way, I just really like spaCy. But yeah, you can do that; you can find out sentence structure trees and a lot of cool things. So again, just to emphasize: by running spaCy's language model on our text file, we're also learning about things like nouns and parts of speech, but we're not using them; the only parts we're using are stop words and numbers, and also lemmatization. So it's extremely useful for us, but we're only really scratching the surface of spaCy's power here. So, is this part clear? Have you understood what a Gensim corpus is and why it looks like that? Yeah, cool, awesome.

Yep, so like I've been saying the whole time, we're only using this much. Now we're going to use our first bit of Gensim, which is something called bigrams. Again, bigrams are something I figured out I should be using after I got really rubbish results in my first topic model, the same way I figured out which stop words to use. What bigrams are about is that some words tend to appear around each other more commonly: 'New York' makes more sense as 'New York' rather than 'new' and 'York' separately, so we decide to combine them. I've given a motivating example over here, and if any of you are following my already-run Jupyter notebook, you can go down to the first topic model and you'll find words like 'bin Laden' instead of 'bin' and 'Laden' as two separate words, so you can understand the motivation for why you would want to use a bigram: sometimes the word 'bin Laden' makes more sense to me than 'bin' and 'Laden', and 'New York' makes more sense to me than 'new' and 'York'. Another important thing here is that the textual data set we're using is actually from the Australian Broadcasting Corporation, so 'New York' is not a phrase which is going to come up much; if 'New York' existed in only one particular sentence in this data set, it would actually not be found, because what the bigram model does, for that particular data set, is find things like 'New South Wales' and 'bin Laden', because those are bigrams which come up quite often in that data set, but not in other ones. You can actually see what goes on over here, though I'm not sure if you can see it immediately, but yeah, can you see 'new_south' over here? And if you look at texts from before, it was 'new' and 'south' as separate words. So we're creating bigrams, and it's super easy. Why did I use bigrams? Because, again, like I said, I looked at my topic model and I found I was getting more information when I used bigrams. It's really something you figure out you might or might not need, but I think it's an important concept to know when you're working with NLP in general. Is that clear? Yeah? All right.
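A minimal sketch of the bigram cell, assuming gensim's Phrases model is the implementation being used, with its parameters left at their defaults:

```python
from gensim.models import Phrases

# Learn which pairs of words co-occur often enough to be merged into bigrams
# (e.g. 'new', 'south' -> 'new_south' on this ABC news data set).
bigram = Phrases(texts)

# Re-tokenize every document, merging the detected bigrams.
texts = [bigram[line] for line in texts]
```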
Yes? Ah, over here in Phrases: Lev, would you happen to know if by default it picks up bigrams or trigrams? Yep; sorry, I keep forgetting I'm supposed to be repeating questions. The question was about whether I had to specify bigrams or trigrams: no, the default is bigrams. So, is that part clear, can we move on? Yeah? I'm sorry, could you repeat that? It's just the most probable bigrams; so, for example, 'a couple of hundred people' has not come up so many times, so it wouldn't pick up 'hundred people', but it's seeing 'New South Wales' a bunch of times. For example, if I was running it on the Harry Potter books, then I might find 'Harry Potter' but not 'Ron Weasley', because no one really says 'Ron Weasley', they just say 'Ron'.

Okay, so cell number 13 is also quite important, because what we're going to be doing is converting our data set, which looked like texts over here, into a bag-of-words format. Bag of words is a representation where we count the number of occurrences of each word in each particular document, and I'm going to show you what exactly it looks like. So basically what this output means is that, again, it's just a list of lists, but here document number one says that word ID number zero occurs one time in this particular document. So what we're doing is mapping every single word to an ID, and then making a bag of words out of it. Is that clear, or do I need to explain it again? Yeah? Oh, cool.

Okay, so we're pretty much done with what I think is the important part of this, which is creating a clean data set which we can actually work on; this is sort of the key takeaway. Right now we're going to start talking about what topic models are and how we're going to use this corpus which we created. Yeah, a question? Sorry, I'm going to repeat the question: the question was about what's going to happen if you run it again, and the answer is that if you run the bigram model again, it's going to create trigrams, and so on and so forth. Can we use this for semantic modeling? Well, what did we create apart from the corpus? We just created an easy way for us to do our text analysis; it depends on what kind of input your semantic modeling algorithm accepts, but if it accepts a bag-of-words format, then definitely.

So the main idea here, again, is to introduce everyone to what spaCy can do and how to include spaCy in your natural language processing pipeline. Now that we're done with spaCy, let's actually move to topic modeling. I briefly talked about what topic modeling is and why we might want to use it before we started, so let's see what's going on. One of my favorite parts about topic modeling in general, or in this particular case about Gensim, is how easy it is: you don't need to worry about any of the mathematics, it's a black box and you can use it really easily. For example, out here, it's absurd: you just need to give the number of topics and pass a dictionary and a corpus, and if you don't have a corpus and a dictionary, then you're not doing any topic modeling anyway. So literally the only thing we needed to tell Gensim here is the number of topics. That's it.
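A minimal sketch of the two ingredients he keeps referring to, the dictionary and the bag-of-words corpus from cell 13, using gensim's Dictionary class:

```python
from gensim.corpora import Dictionary

# Map every word in the cleaned corpus to an integer ID ...
dictionary = Dictionary(texts)

# ... and convert each document into (word_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus[0][:5])   # e.g. [(0, 1), (1, 1), ...] - word 0 appears once, etc.
```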
The algorithm we're going to be using first is LSI, or latent semantic indexing. Yeah? So the question is about bigrams: why did it not pick up 'South Wales' and only pick up 'new_south'? My intuition, and it's only an intuition, is that in this data set 'new' and 'south' simply co-occur more often, and 'south' and 'Wales' could have been picked up as something else as well; or no, okay, that's probably the right answer. Sorry, I've not actually worked on the Phrases part of the module, so I'm really not sure; Lev would be much better at answering these questions. Oh, and yeah, it'll be faster.

Okay, so what we're doing over here, in cell number 16, is running our first topic model on the corpus with ten topics, and then we're going to look at the top five topics. All of you who are running the Jupyter notebook yourselves might be getting slightly different topics; that's completely normal. So we can see exactly what a topic model looks like. I previously talked about how each topic is a collection of words, and you can see over here that topic 0 is about the Israeli-Palestinian conflict, and topic 4, which is ID number 3, is a little more interesting, or less interesting depending on your taste: it's about firefighters. We only used ten topics over here, but this is what a topic model looks like.
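A minimal sketch of that LSI cell, assuming gensim's LsiModel together with the dictionary and corpus built above:

```python
from gensim.models import LsiModel

# The only things gensim really needs: the corpus, the id->word mapping,
# and how many topics to look for.
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)

# Show the top words of the first five topics.
lsimodel.show_topics(num_topics=5)
```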
Yeah, that's also because of the minus and plus signs; I'll move on to HDP and LDA, which I think are clearer. Over here the numbers can be deceiving, because in LSI really only the magnitude counts, so this is just to give you an idea of what a topic model looks like. Basically, we can interpret each topic as a distribution over words, and we can think of each document as a distribution over topics. This means that topic zero has the following words, and each document is made of a distribution of topics: if document number 73 had 45 percent topic zero, 20 percent topic one and maybe the remaining 25 percent in topic three, you can understand what each document is made of. So a document is no longer just a bunch of words put together; it's a bunch of topics.

Yeah, so I actually get asked this question every single time I do this, so I always Google it, and it actually does not matter; I'm going to need Lev to confirm, but in this particular case I'd ask you to just ignore the signs; in LDA and HDP it's different. This is just to give an example: normally I put LSI first because I don't want to go into what it is, and also because I just want to give an example of what a topic model is supposed to look like. When we move on to LDA and HDP, which are right below this, I can actually explain, and you're going to figure it out.

Well, that's actually because of the show_topics method over there: I put 10, because if I show 20 or 30 or even more words, it's going to stop making as much sense, because then all of the topics are going to start looking like the most probable words overall. Hold on, to answer the question: basically, each word in your vocabulary has a number associated with it for each topic, and we only look at the highest values, the most relevant ones. You can see this a lot more clearly in the HDP model, where, unlike LSI, you can actually interpret the topics. For example, you can see here in topic 0 that it's 0.007 for 'Israeli', which means that 'Israeli' has that probability of representing topic 0. So here this makes a lot more sense: topic 0 has that distribution of words, and topic 1 has a different distribution of words, and so on and so forth. You can also see there are more words; you asked why I only show the top 10 words, and it's because I chose the top 10 as the default number of words I'd like to see. You could see the top 20 or top 30, but they're not going to make as much sense, because every topic is going to start looking like every other after that.

Yes, oh yes, you would, because it's randomized; we're not going to get the exact same topics. If we fix a random seed, then we'll get the exact same topic distributions, but because we're all running different random seeds, our topics are going to look slightly different.

Yeah, so, of course, definitely: I would call it my 'Israeli' topic, but the proper way to refer to it would be topic ID number zero, topic ID one, topic ID two, and so on and so forth. There are a bunch of ways you can go about it; one would be to actually look at the topic, but I'm also going to be talking about a really handy kit called pyLDAvis, which allows you to visualize these topic models so you can interpret the topics in a nicer way. The way I would do it is manual: you'd actually go through the topic and figure out what you'd want to call it, because different people can interpret a combination of words in different ways. So it's just that I would call this the 'Israeli' topic and that one the 'firefighter' topic, and there's one about Hare Krishna and the Ganges and things like that further down, but each topic can be interpreted by each person in a different way.

Yeah, no, so what happens is, you would also notice that, for example, 'Australia' is in a bunch of topics. What it means is that topic ID number 3 has an extremely high probability of having the word 'Australia' in it, and other topics would not; every single topic has every single word, just with different probabilities. The reason I would not show the top hundred words is that it's highly likely the last 90 words have no relevance to the topic. You can think of it like that.

Yes, so, is it clear what exactly a topic model is? Because this is a topic model, so please, we can take some time here to answer questions. That's a good question: each topic model performs, frankly, differently, and we'll also be briefly talking about how to compare different topic models at the end of this notebook, but yes, each topic model is slightly different, and one might not be the top model for you; it really depends on the person, on the task at hand, and on why you want to use it. So we'll be talking about comparing models at the end of this, and there's also going to be a talk tomorrow entirely about comparing topic models, I think at 3:00 p.m.
or something, so please go to that; it would be a really nice way to make sure you actually understood what's going on here. Yeah, anything else? Cool.

So now, HDP: HDP is the hierarchical Dirichlet process. This is super awesome because, if you notice, we only put in the corpus and the dictionary; we could not be lazier, it requires almost no parameters, and it's really neat. Now let's look at LDA. You're going to see that these topic models behave in pretty much the same way in terms of what the output looks like; we're treating them like black boxes at the moment, we don't really need to know what they're doing internally, apart from understanding why the result looks the way it does. If you're running this on your own machines, you can look at your LDA topic model and your HDP topic model, see how they're different, find their quirks and play around with them. Any questions? You'll also probably be getting slightly different topic models compared to the ones I have.

So now, we had a question about how you can interpret these topics, and pyLDAvis is one of my favorite ways to do this. For those who managed to get everything set up, you should be able to see something like this, which is pretty neat: you can actually see each topic and its most relevant words, so do spend some time exploring this, click around once you have it set up. You can also see how much the topics overlap. You talked about some words being in multiple topics: for example, I once did a really, really bad topic analysis on the Harry Potter data set, I had all the books, and every single topic was just Harry, Hermione, Ron over and over again, ten topics all stacked on top of each other, and it was absolutely terrible. So this is actually a really neat way to figure out whether a topic model works or not; after looking at this, I went back and added stop words and bigrams and so on and so forth. Well, that depends: less overlap is what I would like here, but sometimes it makes sense for certain topics to be similar; it means they share the same words. This one here is for topic 0 of the LDA model, that's right.

So we can talk about pyLDAvis, and right after this we're going to start wrapping up; I just want to round up what we've discussed so far. Oh yeah, sorry, you're looking at pyLDAvis again: so, like I was mentioning, since it's doing a PCA to break things up, the axes are components one and two. My honest answer is that I'm not really sure, because pyLDAvis is not something I've worked on before; the way I use it is as a tool to find out whether my topic model is rubbish or not. For example, they talk about saliency and relevance down here at the bottom, and those are two terms I've not actually come across anywhere else, so I really don't know anything about them; I'm not sure about the internal workings of pyLDAvis, it's a third-party tool which I just use to visualize my topic models, so unfortunately I cannot answer your question, sorry. You can also use different topic models with pyLDAvis; there are some tutorial notebooks which I've written which allow you to plug in your own topic model and break it up.
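A minimal sketch of the HDP, LDA and pyLDAvis cells being discussed; the pyLDAvis import path shown (pyLDAvis.gensim) is the 2017-era one, and newer releases rename it to pyLDAvis.gensim_models:

```python
from gensim.models import HdpModel, LdaModel

# HDP needs almost nothing: just the corpus and the id->word mapping
# (it infers the number of topics itself).
hdpmodel = HdpModel(corpus=corpus, id2word=dictionary)

# LDA, like LSI, is told how many topics to look for.
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

hdpmodel.show_topics()
ldamodel.show_topics()

# Interactive topic visualization inside the notebook.
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
```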
Basically, you need to pass the document-topic and topic-word matrices, which is what allows pyLDAvis to visualize it; what the gensim helper in pyLDAvis does is, when you pass an LDA model, work all of this out under the hood. So if you're interested in visualizing other kinds of topic models, you can use the link at the end of this tutorial to do that.

We only have one part left after this, which is topic coherence, so since we're near the end of the tutorial, I just want to go over everything we discussed today. The main takeaway is really just spaCy and how to pre-process your text, because, like I said, topic modeling is something you can do really easily: topic modeling is just importing Gensim and running your topic model of choice, and then, depending on your use case, figuring things out from there. What I wanted was to use topic modeling as a motivating example of why you need to properly pre-process your text and why you should probably be using spaCy to do so. After you finish using the spaCy pipeline to clean your text, you can also go on to use spaCy for other things: someone asked about the roots of your sentences and how you can visualize them, and they have tools for visualizing your sentence structure; you can build neural networks for sentiment analysis using spaCy. So there are a lot of other uses for spaCy apart from cleaning text for Gensim topic models, but in general it's really important to know how to clean your data set depending on what you want to do afterwards. The topic models I got before I used spaCy were quite terrible, and it's really important to clean your data set before you go on to analyze it. So whether you're using topic models or word2vec or whatever you want to do, just make sure you're being really careful, and remember that each data set is going to look different. The one we used over here is actually quite simple to clean up; I work with things like Facebook Messenger texts and WhatsApp texts, which are really chaotic, in which case spaCy is really good for cleaning things up. You can also use NLTK if you want to, but the real focus here is, yes, spaCy, and I've added links all over the tutorial about other things you can do with spaCy and Gensim as well. So please do go back, make sure you have your language model set up, check these links out, and use your own fun data sets and play around with them; you'll get really cool results.

Before I finish up, we're going to talk about deciding which topic model is better. I had a question earlier about whether HDP looked better than LSI; now we can actually see if it is. So, topic coherence: in the next couple of cells over here, we're going to run the last four cells, and what we're going to do is get a coherence value. The whole idea of this is, again, that some people ask how you decide whether a topic is a good topic or not just by looking at a bunch of words, because each person interprets a topic differently. Topic coherence is a way for us to find out which topic model is better from a human readability point of view. This was implemented last summer, with the u_mass measure and different kinds of coherence; we basically have several kinds of topic coherence measures set up right now, so you can actually get numbers. Now you don't have to think anymore: we get a 0.56, which is greater than 0.50, so we know that topic model works better.
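A minimal sketch of those last cells, using gensim's CoherenceModel; the way the top words are extracted and the window_size value are reconstructed from the talk and may not match the notebook exactly:

```python
from gensim.models import CoherenceModel

# Pull out the top words of each topic as plain lists of strings.
lsitopics = [[word for word, prob in topic]
             for _, topic in lsimodel.show_topics(formatted=False)]
hdptopics = [[word for word, prob in topic]
             for _, topic in hdpmodel.show_topics(formatted=False)]
ldatopics = [[word for word, prob in topic]
             for _, topic in ldamodel.show_topics(formatted=False)]

# Score each set of topics against the same cleaned texts (default 'c_v' measure).
lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=texts,
                               dictionary=dictionary, window_size=10).get_coherence()
hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=texts,
                               dictionary=dictionary, window_size=10).get_coherence()
lda_coherence = CoherenceModel(topics=ldatopics, texts=texts,
                               dictionary=dictionary, window_size=10).get_coherence()

print(lsi_coherence, hdp_coherence, lda_coherence)
```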
Some important points to remember: this number is only meaningful for that particular data set. You cannot compare topic coherence values for topic models trained on different data sets; you would need both topic models to be trained on the same data set before you use coherence to compare them. So what we've done over here is create our topics for LSI, HDP and LDA, then create a coherence model for each, get our coherence values, and, just like that, we can plot them. Someone said HDP looked better than LSI, and yeah, you were right: according to the topic coherence measure being used, HDP gives the best value. Again, these are just numbers and you should take them with a pinch of salt; you might be absolutely convinced that your LDA topic model is better, and that could be correct. Topic coherence is just one way to compare topic models from a quantitative point of view. You're actually going to find out a lot more about comparing topic models tomorrow; there's a talk just about comparing topic models, so you'll probably get more into the mathematical theory of topic models in general there.

Yeah, so I saw a hand up. Okay, so I don't actually know the complete answer, I just know the intuition, and the intuition, I think, is human readability, which is what they talk about in the research paper; I'm not sure about the mathematical details, I don't know if Lev might know exactly. So the question was what the mathematical basis is, what the intuition behind it is, and as Lev explained, and I'm going to be quite brief, basically they used a grid search to find out which measure was the best way to identify the most human-readable topics. I have not read the research paper myself, so you should read the paper if you're really interested in the mathematical basis, but the intuition is: what is the most human-readable topic? I don't use topic coherence too much; it's usually nice to just have a rough idea, but normally you look at the topics and use pyLDAvis to get an intuition. Again, I use pyLDAvis just to visualize what my topics look like; I don't really read into the mathematical part behind it, which I probably should.

So now you know how to clean your extremely dirty data sets and how to get them ready for whatever text analysis you want to do. I'd like to thank Lev, who was my Google Summer of Code mentor last year, for being here and helping me out, especially with the math bits, which I'm not so good at. If you have any questions, I'm around tomorrow and the day after as well; I'd love to talk about topic models and text analysis. I hope you're all going to use spaCy and install the language models. I'm sorry that the first half an hour wasn't as smooth as I would have wanted, but I hope everyone got something from this, and thank you for coming. [Applause]
Info
Channel: PyData
Views: 27,771
Rating: 4.8290596 out of 5
Keywords: jupyter
Id: ZkAFJwi-G98
Length: 48min 25sec (2905 seconds)
Published: Wed Jul 26 2017