Topic modeling with R and tidy data principles

Video Statistics and Information

Captions
Hi, my name is Julia Silge. I'm a data scientist at Stack Overflow and the author of Text Mining with R: A Tidy Approach. In this video I'm going to demonstrate how to use R and tidy data principles to analyze text: how to take some raw text, do some initial exploratory data analysis, and then train a structural topic model using the stm package. We're going to use a collection of Sherlock Holmes short stories as our text. In this tutorial I'm working in IBM's cloud environment for data scientists, the Data Science Experience, so I'll demonstrate how to implement these kinds of tasks in a browser, with my code and packages and everything else I need installed on container-based infrastructure on IBM Cloud.

All right, let's get started. I've logged into my IBM Cloud account and I'm on the Data Science Experience part of the platform. I went up and clicked on Tools, chose RStudio, and here I am in RStudio Server, which is running in a container on my account. I'm in my familiar environment and ready to get started with some text mining; that's what we're here to do, right?

I'm going to open a new R Markdown file; I'll be working in R Markdown today. I'll delete most of the template, and since we're working with Sherlock Holmes stories, let's give this a title of "The game is afoot" and save the file in the project I made.

The first thing we need to do is download our data and prep it a little bit. I'm going to use a dataset of Sherlock Holmes short stories; they're public domain and available through the Project Gutenberg website. In R you can access Project Gutenberg works through the gutenbergr package, developed by David Robinson. It has a function called gutenberg_download(); you can pass it various things, and one of them is a Gutenberg ID. Ahead of time, I went to the Project Gutenberg website and found the ID for the works we're looking for, the Sherlock Holmes short stories. So let's download that and assign it to something. What happened is that we went to one of Project Gutenberg's servers and downloaded this text, so now it is stored in sherlock_raw. We've got two columns: gutenberg_id, which is that same ID from before, and text, which is the text we're going to analyze today.

Now let's prep this text for analysis. I looked at it on the website and noticed there are twelve short stories in this collection, and I would like to analyze differences between these twelve stories, so I need to make a new column that annotates the text and tells me where each story starts. The stories all have titles before them that say things like ADVENTURE I, ADVENTURE II, ADVENTURE III. Oh, you know what, I need to load a few more packages first: dplyr, tidyr, and stringr. I'm not loading the tidyverse package altogether because some issues with the IBM Cloud environment make that a little challenging, so I'm going to load them separately instead.
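Here is a minimal sketch of the setup just described. The specific Gutenberg ID is not spoken in the video; I'm assuming 1661, the Project Gutenberg ID for The Adventures of Sherlock Holmes.

    library(gutenbergr)
    library(dplyr)
    library(tidyr)
    library(stringr)

    # assuming ID 1661 (The Adventures of Sherlock Holmes);
    # the video does not say the number out loud
    sherlock_raw <- gutenberg_download(1661)

    sherlock_raw
    # a tibble with two columns: gutenberg_id and text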
All right, so we're good. (You can also connect to data via sparklyr here, but I have that turned off right now so that I can use some different packages.) Let's use str_detect() and say that we want to detect, in that text column, when a line says ADVENTURE in all caps; in those lines we keep the text, and in the other lines we just put NA. Let's count story, and we can see all the places where it says ADVENTURE: there are the names of the twelve short stories. That is great.

Next we want to fill this column down so it tells us which story each line of text belongs to. We'll use fill() from the tidyr package to fill the story column down until it gets to a new value, then fill that one down, and so forth. And you know what, we don't need that first bit of preliminary information, where it says "The Adventures of Sherlock Holmes" at the top; that's not really part of one of the stories, so we're not interested in it.

The last thing: story is a string right now, but let's make it a factor, because when we go to plot these things they'll be plotted in alphabetical order, and we want them plotted in the order they appear in the collection: ADVENTURE I, ADVENTURE II, ADVENTURE III. We can do that by setting the factor levels to the order in which we find them. Let's call the result sherlock. So we've taken our raw text and done some annotating: we have our text, and now we have a story column. Notice that the stories are in order from 1 through 12, and we can see how many lines each story has.

The next thing we want to do is take this text dataset and transform it into a tidy data structure, a tidy text data frame. Let's take sherlock and first annotate it with another new column, line, which keeps track of which line every word comes from: this is line 1, this is line 2, and so forth throughout the whole collection of short stories. After that, let's use unnest_tokens() so that we have one word per row. Let's see how that works. Remember that before we had a text column holding a whole line of text; now we have a word column, so in line one it said ADVENTURE I. A Scandal in Bohemia, and then "To Sherlock Holmes...", which is how A Scandal in Bohemia starts, and so forth.

So we've transformed this into a tidy text dataset. Since we're going to do topic modeling, we need to remove stop words; that's an important step when you set out to do topic modeling. I'll do that with an anti_join(), using the stop_words dataset that comes in the tidytext package. Let's do this and assign it to tidy_sherlock.
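A sketch of those annotation and tidying steps. The exact title string filtered out is my assumption about how the preliminary front matter gets dropped:

    library(tidytext)

    sherlock <- sherlock_raw %>%
      # keep lines that say ADVENTURE in all caps, NA otherwise
      mutate(story = ifelse(str_detect(text, "ADVENTURE"), text, NA)) %>%
      # fill each story title down until the next title appears
      fill(story) %>%
      # drop the front matter; this exact title string is an assumption
      filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") %>%
      # set factor levels in order of appearance so plots keep story order
      mutate(story = factor(story, levels = unique(story)))

    tidy_sherlock <- sherlock %>%
      mutate(line = row_number()) %>%   # track which line each word came from
      unnest_tokens(word, text) %>%     # one word per row
      anti_join(stop_words)             # remove stop words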
Now we have a tidy dataset of the Sherlock Holmes stories, with the stop words gone. Let's count the words and see what the most common ones are. This is after removing a fairly liberal set of stop words, which removes a lot. What we have next is time, door, matter, house, and night, but look at "holmes": it's about three times more common than the next word, and that's going to have a big effect on our topic modeling. It's not going to be useful, so let's remove it as well. Looking one more time, the word "holmes" is now gone too. So there we go: we've taken our text, annotated it with what we wanted to know, and tidied it.

Next, let's explore tf-idf, because what we have here is basically twelve documents in this set of short stories, so we can use tf-idf to see which words are important in which stories. This is a good thing to explore before we go ahead with the topic modeling, and we can implement it without too much hassle using tidy data principles. Take tidy_sherlock; the first thing we need to do is count how many times each word was used in each story, so let's count with two arguments, story and word, and say sort = TRUE. Looking at the result: in The Adventure of the Noble Bachelor, the words "St. Simon" and "lord" were used this many times; then we've got The Adventure of the Copper Beeches with "Miss Rucastle," and so forth. These are all the counts.

To calculate tf-idf we'll use the bind_tf_idf() function from tidytext. The first argument is the data we're piping in; after that comes the term column, word; next is the document column, which in this case is story; and last is the counts column, which here is called n. So we call it with word, story, n, and run it. We can now see tf, idf, and tf-idf; this statistic tells us how important a word is to a document compared to the other documents in the collection.

Let's group by story and take the top ten highest tf-idf words in each story. Now let's make a little plot so we can visualize what this looks like. If we want the words to show up in a reasonable order, we need to make word into a factor: we don't want alphabetical order, we want tf-idf order. So let's make word a factor ordered by tf-idf and pass this to ggplot; I need to load the ggplot2 package for that. We want a bar plot with word on the x-axis and tf-idf on the y-axis, in different fill colors so it's a little easier to parse. Since it's a bar plot, we use geom_col(), and we don't need to see the legend, because that's not the kind of plot we're making.
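A sketch of those counting and tf-idf steps (the object names are my own):

    # "holmes" dominates the counts, so drop it before going further
    tidy_sherlock <- tidy_sherlock %>%
      filter(word != "holmes")

    sherlock_tf_idf <- tidy_sherlock %>%
      count(story, word, sort = TRUE) %>%
      bind_tf_idf(word, story, n) %>%   # term = word, document = story, counts = n
      group_by(story) %>%
      top_n(10, tf_idf) %>%             # ten highest tf-idf words per story
      ungroup()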
Now, we don't want all of these stories plotted on top of each other; we want each one in its own little facet, a subplot, so let's facet by story with scales = "free". The last thing is to flip the plot on its side so that we can read the words. Let's look at the plot and see how it turns out.

Okay, this is great. For example, A Scandal in Bohemia is the first story, and we see that the most important words in it compared to the others involve Irene Adler; the story is about a photograph, it's about the king; these are all the important things from that story. If we look at a different one, like The Adventure of the Blue Carbuncle, that's a mystery involving geese: "goose" and "geese," birds, are important in this story but not in any of the others, and that's why they have high tf-idf. We see lots of proper names, people and places, among these words; "opium" is over in The Man with the Twisted Lip, and in The Adventure of the Engineer's Thumb we see "hydraulic." These are the words that are characteristic of each document in this collection of documents. Looking at tf-idf is a great exploratory tool before you get started with topic modeling.

Speaking of which, I think it's time for us to get started, so let's implement some topic modeling. We'll use the stm package for the modeling, and we'll also load the quanteda package, another great text mining package, which we'll use for the data structure that is the input to the stm topic model. Take tidy_sherlock; what we need to make is a quanteda document-feature matrix. This is kind of like a document-term matrix, but a specific implementation of that idea. First we do our counting like before, and then we cast to a dfm, giving it the document, the term, and n. Hmm, that didn't work; did I not load the package? Okay, now that I've loaded quanteda it works, and we have a document-feature matrix with 12 documents, which is what we were looking for, and this many features, which means words in this case. Let's save it as sherlock_dfm, which, like I said, is kind of like a document-term matrix.

Now we can train our topic model with stm(), which comes from the stm package. We pass it the data, the dfm; I experimented a little bit in preparing for this video, and I'm going to fit a six-topic model, with this kind of initialization. Let me start it running; this is the more computational part of the process, so while it runs... oh, I mistyped the name, sadly, so it's going to be topic_mood, I guess, for the rest of this video. I'm a big fan of these stm topic models because the package is easy to install (it doesn't have anything like an rJava dependency), it's quite fast compared to other implementations of topic models in R because it's written in C++ under the hood, and the results I've been getting as I experiment have been really great.
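A sketch of the finished tf-idf plot plus the casting and fitting steps. I've used the name topic_model rather than the mistyped one, and init.type = "Spectral" is an assumption; the video only says a particular initialization was chosen.

    library(ggplot2)
    library(quanteda)
    library(stm)

    # tf-idf bar plot: one facet per story, flipped so words are readable
    sherlock_tf_idf %>%
      mutate(word = reorder(word, tf_idf)) %>%
      ggplot(aes(word, tf_idf, fill = story)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ story, scales = "free") +
      coord_flip()

    # cast the word counts to a quanteda document-feature matrix
    sherlock_dfm <- tidy_sherlock %>%
      count(story, word, sort = TRUE) %>%
      cast_dfm(story, word, n)

    # fit a six-topic structural topic model
    # (init.type is an assumption; it isn't named in the video)
    topic_model <- stm(sherlock_dfm, K = 6, init.type = "Spectral")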
The model goes through an initialization and then a number of iterations as it tries to fit the topic model. What is a topic model doing? You tell it how many topics you think there are in your text, and then it does what is basically unsupervised machine learning to work out which words contribute to which topics, and which topics contribute to which documents. There's freedom in this model for different balances: words can contribute in different proportions to different topics, and topics can contribute in different proportions to different documents.

Let's see what results we got. The fitting procedure has finished, so we can look at a summary of the topic model. (Oh man, I'm so sad I mistyped that name, but we're going to live with it because I don't want to refit the model on video.) The summary is a print method that comes with the package, but what I really want to do is tidy the model, because I like to deal with things in a tidy data structure so I can plot them easily with ggplot2 and handle them with dplyr, and tidytext has a tidying method for this kind of topic model. We can look at the beta matrix, which says which words contribute to each topic. Let's look at that and make a plot.

Here's what we'll do: group by topic (remember we said six in this model) and again take just the top ten words, the ones contributing most to each topic. Then we'll make almost the same plot as before, so I'll go up, copy that code, and put it down here. What's different is that we now have term as the thing we're looking at, beta as the thing we're comparing, and the faceting is now by topic: we're asking which words contribute the most to each topic. Whoops, I still have story in there somewhere; fill = topic, there we go, let's try again and wait for that plot to come up. Excellent. Let's look at it a little bigger.

Here we can see which words contribute the most to each topic. Interestingly, we see some proper names again; some names, like St. Simon, are so dominant that the topic modeling procedure gave them a lot of weight, but we also see words like "father" and "time." In topic 5 we see words like "house," "matter," and "night," which seem like spooky Sherlock Holmes mystery kinds of words. In topic 2 we're back to the goose. In topic 3 we have words like "street," "woman," and "photograph"; topic 3 actually has quite a few words that are related to women.
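A sketch of that beta exploration; treating topic as a factor for the fill is my choice, to get discrete colors per facet:

    # tidytext supplies the tidy() method for stm models;
    # the default matrix is beta (word-topic probabilities)
    td_beta <- tidy(topic_model)

    td_beta %>%
      group_by(topic) %>%
      top_n(10, beta) %>%                  # top ten words per topic
      ungroup() %>%
      mutate(term = reorder(term, beta)) %>%
      ggplot(aes(term, beta, fill = as.factor(topic))) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ topic, scales = "free") +
      coord_flip()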
So this is interesting: we're looking at which words contribute the most to which topics. That's one kind of probability we get out of this modeling procedure. The other kind is called gamma, so we tidy the same topic model but ask for the gamma matrix instead of beta, and we can supply the document names because we have them in sherlock_dfm. This td_gamma is the other matrix: it says how much each topic contributed to each document; that's what these probabilities are measuring. So let's make a plot to show these probabilities (a code sketch follows at the end of the captions). Because we only have six topics this time, we don't need to take a top n or anything like that; we'll make a histogram of gamma, with fill mapped to topic so it's a little easier to see, and without the legend in this case. We don't want the histograms on top of each other, we want one histogram per topic, and since there are six topics, let's facet with three columns and look at what this gives us.

All right, pretty interesting. What this plot is telling us is that our topic modeling procedure did a bang-up job of taking the stories and putting each one into one topic. The gamma probability is the probability that a document belongs to a topic. If we look at topic 1, there are two stories with probabilities close to one of belonging to that topic, and ten that do not; in topic 2 there is one story that belongs to that topic and eleven that do not. Across the board, each topic is associated very strongly with one to three stories and not associated with the others. Topic modeling doesn't always work like this; this is probably because we're dealing with such a small number of documents, and the number of topics is actually half the number of documents. If you model a much larger dataset, with a much larger number of documents, you'd end up with different kinds of results, but this is how you interpret them: gamma tells you how likely a document is to belong to a topic.

So that's that for now. We did it: we trained a topic model on this dataset of Sherlock Holmes short stories, and we were able to understand which stories are more similar and which stories are focused on which topics. You can read my blog post for more details on this code, and check out the Shiny app I made to explore the results of our statistical modeling of these stories.
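For completeness, a sketch of the gamma steps walked through above; document_names is the argument tidytext's tidy() method uses to attach document labels:

    td_gamma <- tidy(topic_model, matrix = "gamma",
                     document_names = rownames(sherlock_dfm))

    # one histogram of document-topic probabilities per topic
    ggplot(td_gamma, aes(gamma, fill = as.factor(topic))) +
      geom_histogram(show.legend = FALSE) +
      facet_wrap(~ topic, ncol = 3)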
Info
Channel: Julia Silge
Views: 42,569
Rating: 4.9564691 out of 5
Keywords: datascience, nlp, topicmodeling, textmining
Id: evTuL-RcRpc
Length: 26min 21sec (1581 seconds)
Published: Mon Dec 18 2017