Python Lecture : Natural Language Processing by Alice Zhao

Okay, thank you all for coming today. This is Natural Language Processing in Python by our speaker, Alice Zhao. She's a senior data scientist at Metis, and I believe she has a bit of a tutorial that we can follow along with, so please join me in welcoming her.

Thanks, everyone. It's so great to see a full house here today. I'm very excited to speak at the PyOhio conference. One of my colleagues spoke here last year and he loved it, so he recommended that I come as well. Today I'll be talking about natural language processing in Python, and the first thing I want to mention is that this is a two-hour-long tutorial, so I just want to make sure that everyone signed up for sitting here for the next two hours.

Also, I'm going to be walking through a couple of Jupyter notebooks. Jupyter notebooks are a nice way of writing Python code, and we use them a lot when we teach our data science bootcamps. If you go to github.com/adashofdata, there's a README file up there that has some setup instructions. That includes downloading Anaconda, which is how we recommend our students use Python, because it comes with a lot of data science packages. I recommend you download that, and once you have, try to see if you can open up some of the notebooks. We won't be touching the notebooks for about 20 to 30 minutes, so over that time try to get set up, and I'll check in with you again to see if you have any issues.

Also, today we're going to be walking through an end-to-end project. I see a lot of tutorials out there that are algorithm-specific: they'll say, okay, we're going to teach you this algorithm or that algorithm, but it's hard to see how it all comes together. So my goal for you today is to show you how to start with a question, go through all the steps of cleaning the data and doing some exploratory analysis, and then apply the algorithms, so you can see how an end-to-end project works.
Here's the schedule for today. First I'm going to give you an introduction to NLP, and then also to data science, because I'm a data scientist, so I'm going to show you how I approach a problem. Then we'll go into the tutorial, which is going to be in the Jupyter notebooks, and I'll end with the conclusion.

Let's start with natural language processing. All of you signed up for this tutorial, so do you know what NLP is? Any suggestions? I see some nods. Yeah, I love that: finding meaningful information from random text.

When I think of natural language processing, I think of it in two parts. The first part is natural language. What is a natural language? When you think of languages, you see all these languages up here. Which one of these is not like the others? Python. Python here is not a natural language. Python was written for coding, whereas natural languages are languages that have naturally evolved over time and that humans actually use to communicate. And then processing: I think of a processor on a computer, so it's how a computer carries out instructions. So natural language processing is how a computer is able to process these natural languages. More simply put, just like you said earlier, I think of natural language processing as simply how to deal with text data, and today we're going to be working with text data.

The other thing I want to point out is that NLP falls under the greater branch of artificial intelligence. I want to mention this because AI is such a hot term these days. AI is all about a computer performing tasks that a human can do. That includes things like image processing: humans can see, so how do you make a computer see? In this case, since we're dealing with language: humans can interpret language, so how can we make the machine interpret language?

Okay, so now I want to go through a couple of examples of NLP. Let's say that you're the manager of a customer service center, so you are
working for a company, and you sell hats and you also sell shirts. You get a bunch of calls from people who talk about loving your hats or hating your shirts, and you want to get an idea of how people are feeling about your hats and shirts overall. Let's say you have a thousand calls. As the manager, you could go and listen to all thousand calls, but that would take a very long time. You might have your customer care reps take notes on who's happy about what. But another thing you can do is use NLP techniques to actually read that text and figure out: is this a positive emotion or a negative emotion? By applying NLP, you can see that people are generally pretty happy about your hats, but they're not really happy about your shirts. This idea is called sentiment analysis, and it's one of the things we'll be going through today.

Another example: say you're working at a legal firm, and you're working on a case where someone at a company may have embezzled some money. You need to figure out who is doing all the bad things at the company, so you go through all the emails at the company, but then you realize there are something like 10,000 emails to go through. How do you prioritize which emails are likely related to embezzling money versus something random? What you can do is look at all your emails and label them. Not you manually labeling them; you can have a machine use an NLP technique to label them and say, okay, these emails up here are about project work, these are about money, this one's about someone's honeymoon. That's very personal, so it's probably a low-priority email that you don't need to read, but maybe the ones about money are the ones you should make higher priority and read first out of the 10,000-email set. This idea is a very powerful NLP technique called topic modeling. I find it very, very interesting, and this
is one of the ones we'll be talking about today as well.

Finally, this is kind of a fun example. Say that you're a writer, and you work for an inspirational quote company. You've been tasked with writing inspirational quotes for 365 days of the year, and you've written them for 11 months, but you're just completely out of ideas. So you're thinking: I have all these inspirational quotes, and I'm a Python developer, so why don't I just code something up that will write inspirational quotes for me based on my past quotes? You can do just that with something called text generation, which is the final thing we'll be talking about today.

Just to recap, the three NLP examples we'll be going through today are sentiment analysis, topic modeling, and text generation. These are just some of the NLP techniques out there, but they're ones that I find very interesting, and they're also a good introduction to data science. Before we go through these techniques, we're actually going to go through a lot of data cleaning. A lot of people think, okay, these are such cool techniques, let's use them right away, but there are a lot of steps we have to do beforehand, which I'll be walking you through.

So that's natural language processing. What is it again? Yeah, meaning out of language, out of text data. Okay, so that's NLP.

All right, now I'm going to talk about data science. I'm very passionate about data science because I'm a data scientist, and I like to give a little intro about it because this is how I see these problems. Has anyone here heard of data science? Okay, what is data science exactly? I love that: getting meaning out of data. What I wrote here is using data to make decisions. It falls under analytics, which is generally using data to make decisions or getting meaning out of data. Under analytics there are also other things, like business intelligence, which is about creating dashboards, and then there's data science. When I think of data science, I always think of the data science Venn diagram.
This was created by Drew Conway, a data scientist, about ten years ago, and I always have it at the beginning of my lectures. What he says is that every data scientist should have three types of skills: programming skills, math skills, and communication skills. If you have all three, you're a data scientist.

Let's say you have two of the three. Say you have programming and math skills but no communication skills. Then you're more of a machine learning engineer, because you can code and you know all the algorithms, but you don't necessarily have to create presentations. On this side, if you have math skills and communication skills, you're probably more of a researcher: you don't have to deal with a lot of data, but you know the algorithms, you have the domain expertise, and you can do great research. Over here is the danger zone, and I want to mention it because I've seen a lot of people fall into it. These are people who know how to code, like everyone in this room, and also have communication skills, so they know how to communicate ideas. If you have these two skills without the math skills, you fall into the danger zone, because what ends up happening is you'll just pull packages or libraries that are out there, use them, and not know how to interpret the results, or interpret them incorrectly.

Because of that, today we're going to be going over all three of these types of skills. First we're going to be talking about a couple of different Python libraries. For data analysis my favorite is pandas, so I'll be talking about pandas. I'll also be talking a little bit about regular expressions. Has anyone used regular expressions before? So powerful, right? They're great for text data. Then there's also scikit-learn, which is used a lot in data science. It's a great way to have a lot of machine learning tools available, but we're going to be using it to format some of our data. We're also going to be using a couple of NLP libraries. The most popular one by far is
NLTK, the Natural Language Toolkit. TextBlob is built on top of NLTK, and gensim is used specifically for topic modeling. Don't worry about knowing all this right away; we're going to walk through all of it today. So that's the Python side.

On the math and stats side, this is where I really want you to try to understand the concepts, because that's so important in interpreting the results. We're going to be cleaning the data and putting it into a couple of formats. First is just a general corpus format, which we'll go over in detail, and then also a document-term matrix. Then we're going to do some exploratory data analysis, which I'll be calling EDA from now on, and that's going to be a lot of word counts. Finally, we're going to go through the three techniques I mentioned earlier: sentiment analysis, topic modeling, and text generation.

And then finally there's the communication piece. Usually all the soft skills get lumped into this area, but I think of it as two parts. There's a design piece, which is all about how you design a project. This is so important; whenever I advise my students, I find that this is where they struggle the most. Just figuring out the question, where to start, how to scope out your project, what insights you can draw from the data, what visualizations you can make to communicate things more effectively: that's all about design, and we're going to be doing that throughout the tutorial. Then, finally, having some type of domain expertise is really important. I'll be sharing with you the project we're going to be working on and the question we're going to be answering, and hopefully you'll know a little bit about the subject area so that you can help me interpret the insights.

Okay, so before we move into the tutorial, the last thing I want to talk about is the data science workflow. The data science workflow is the order of steps that we usually take to solve a problem,
and this is the order that we're going to be following today. The first step is to start with the question. A lot of people think, well, data science has "data" in the title, so you have to start with the data, but that's not the case. You should always start with the question in mind, because that will influence what data you're actually going to collect. So the first step is to start with the question. Then you get the data, perform EDA, which stands for exploratory data analysis, then apply some NLP techniques, and finally share all those insights.

Let's walk through a simple example. Say I had the question: if I study more, will I get a higher grade? What are your initial thoughts? Yeah, right, so that's your guess. But within the data science workflow, the next thing you do is get some data. What type of data might you get? Number of hours studied, grades. Perfect, let's start there. Let's say I go and talk to a teacher and I get this information: for all these students, this is the number of hours they studied and this is the grade they got.

Take a look at this data. What do you think of it? What's not consistent about it? Yeah, there are definitely bad things happening here, right? We see a dash here, and we see someone might have had a typo here. At this point I need to clean the data. This is what we mean by cleaning the data: getting it all into a consistent format. I'm going to have to make a few assumptions here. I want to assume that that 2 means a 2 and that 98 was actually 98. I might not always be correct, but I think it's a pretty good assumption to make at this point.

Okay, great, so I've cleaned the data. The next thing I want to do is perform EDA. What this means is I want to quickly understand my data and see if it makes sense. This is typically done through visualization techniques. Knowing that, what would you do at this point with this data? Yeah, a scatter plot, perfect. I'd create a scatter
plot. Just by looking at the scatter plot, what are some things you can learn from this data? Yeah, great: you see a positive correlation. You see that the more hours you study, the higher the grade you're going to get. Anything else? Yeah, they did pretty decently. The other thing I found here was this person up here. That was Charlie, and he did pretty well without studying that much. So those are my two findings, and that's what you'll typically look for with numerical data: first you want to see if there's any correlation, and then you want to see if there are any outliers. In this case, that's an outlier. So that's my EDA. My data makes sense: the more I study, the higher the grade I get.

Only after you've done your EDA is it time to apply some techniques. The most basic data science technique is linear regression, so fitting a line to our data. You can see here that with linear regression I can get a specific equation that describes that relationship.

At the very end, I can share all the insights, basically summarizing everything I've done. My question was: if I study more, will I get a higher grade? The answer is yes, there's a positive correlation. Specifically, with linear regression we saw what the relationship is. And finally, we know that Charlie was really smart and was kind of an outlier, so at the end of the day, even if linear regression predicts you're going to get an 80, you're probably going to get slightly less because of that. That right there was our whole data science workflow. Make sense? Great.

So, for this section, I talked about data science being part of analytics and being all about using data to make decisions, about the Venn diagram and not falling into the danger zone, because you need to know the math, and then finally about the workflow.

All right, let's get into the tutorial, which is what you've all come for. Was everyone able to open up a Jupyter
notebook? Does anyone have issues? That's pretty amazing. Nice job, guys.

Okay, great. At this point we're going to go through all of these steps. The first thing I want to note is that for the get-and-clean-data step, the only difference with NLP analysis is that this step is going to be using text data. This is going to be the format for today: I'm going to have a presentation at the beginning of every section, and then, as you see here, there are five notebooks for you to follow along with, so I'll be doing a notebook after each of these middle three steps.

The first step is defining a question, and this is the question we're going to be focusing on today. Does anyone recognize this woman? She's a comedian. Her name is Ali Wong. About two years ago I was not into stand-up comedy at all, and I saw her special on Netflix and I really liked it. I was kind of surprised, because I'm not into stand-up comedy. At the time I was thinking: do I like her because she's also female and Asian, and I was also pregnant at the time? Maybe, but maybe there's something more to it. Maybe there's something in the language she's using that makes her different from other comedians. So that's the question we're going to focus on today. Our goal is to see what makes Ali Wong's comedy routine stand out. Sound good? Okay, great.

All right, now we've got our question, so the next step is to get our data. There are two parts to this: we have to get our data somehow, and then we also have to clean that data. The input to this step is: how is Ali Wong different from other comedians? The output of this step is clean, organized data that looks something like this, because this is what a machine can work with.

For this first step, data gathering, my question to you is: how are you going to get this data? You have it on GitHub, but how did I get it?
How did I get this data? Transcripts. Luckily, when I was thinking of doing this project, I just googled "Ali Wong Baby Cobra transcript" and it came up, so I was very, very lucky there.

The second question is: how much data are you going to get? I said I wanted to compare Ali Wong with other comedians, so how would you decide how many comedians to compare her to? Who are the top ten comedians, and how would you define that? I actually did exactly that: I googled "who are the top ten comedians", and every list was different. Yeah, that's actually a great idea, you could do that. I tried to make this as much of a data-centered approach as possible. The way I did it was: first, this is where I got the transcripts from, and second, I went to IMDb, because IMDb has an advanced search option. I tried to adjust the filters so that I would get about ten comedians and the list would include Ali Wong. I think those were my two criteria, so I ended up tweaking the filters to make this make sense.

At this point I want to pause, because while the questions I've asked you seem really simple, they're very hard, and this is always where you have to start your analysis: thinking about how you want to scope your project. This is where your domain expertise matters a lot, because I'm going to get this list of top ten comedians, but I also have to make sure that my list makes sense. I don't know too much about stand-up comedy, but my husband knows a lot about it, so I asked him, and he said, okay, this list makes sense, you can start here.

The way I decided to limit my scope was to look at comedy specials from the last five years that had at least a 7.5 rating out of 10 on IMDb with over two thousand votes. I ended up with about twenty comedians, and some of them showed up multiple times, so I had to
figure out how to deal with that. I decided I'm just going to take every comedian's top-rated special, and that's a good place to start. So these are the comedians that I ended up with. Is anyone here into stand-up comedy? Okay, so looking at this set, is it a pretty representative set? Of the ones I recognize here, I know Dave Chappelle is popular and I know Louis C.K. is famous. I don't know many of the others, but having a few in there makes me feel good about my dataset. So great, I've locked in my scope; let's move on to data cleaning.

Well, actually, before we move into data cleaning, I just want to mention a few of the Python packages that I've used here. We're going to walk through this more in the notebook, but throughout my presentation, anything that you see in teal is a Python package. For web scraping (has anyone done any web scraping before? Okay, so then you'll recognize these) I used requests. Requests basically allows you to enter a URL and then get all the data from that URL. Beautiful Soup is great too, because it actually looks at that HTML page and lets you pick out certain sections from it. Just think of a web page that has a transcript on it but also has a ton of other stuff. With requests and Beautiful Soup you can pull out just the transcript text. That's the web scraping side.

I also did a lot of pickling. You'll see I've created five different notebooks here, and at the end of every notebook I'll pickle some objects. What that means is I can save the object for later. Think of an object like a list: you can pickle a list so that you save it, and then in the next notebook you can load it up, read it, and use it again. So you'll see a lot of pickling throughout.

Okay, so we've gathered our data and we have 12 comedians. Now we have to clean that data. Our goal for this step is to get the data into a clean, standard format.
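
A minimal sketch of the pickling idea described here, assuming a plain Python list stands in for the transcript data. The variable names and the file name are hypothetical and are not taken from the talk's notebooks.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for an object built in one notebook,
# for example a list of transcript paragraphs.
paragraphs = ["first paragraph of a transcript", "second paragraph"]

# "Pickle" (save) the list to a file at the end of one notebook.
path = os.path.join(tempfile.mkdtemp(), "transcript.pkl")
with open(path, "wb") as f:
    pickle.dump(paragraphs, f)

# Load it back in the next notebook and use it again.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == paragraphs)
```

The same two calls, `pickle.dump` and `pickle.load`, work for dictionaries and most other Python objects, which is what makes this a convenient way to hand results from one notebook to the next.
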
We can then use that format for further analysis, and we're going to get our data into two types of format. The first is just a general corpus, which I'll talk about in a bit, and the second, which involves a little more work, is a document-term matrix.

The first is a corpus, and a corpus is, very simply, just a collection of text. The goal for this step is just to get the data into this nice table, and the way we're going to do that is using pandas. Again, pandas is the Python library for data analysis, and specifically in pandas there's an object called a DataFrame. A DataFrame is essentially just a table, and it looks like this: every row of a DataFrame has an ID, and each column holds a single data type. So we're going to be creating a pandas DataFrame to hold this corpus. That's the first format, and it's very easy.

The second thing we're going to create is a document-term matrix, and this is a little more complex. First we have to clean our text, then we have to tokenize it, and finally we put it into matrix form. Let's walk through this step by step.

This is the first line of John Mulaney's stand-up routine, and if you look at it, it looks pretty messy. If you were a computer and you wanted to understand only the most important parts of this line, what would you do to make this data cleaner? Great idea: get rid of punctuation. Any other ideas? Yeah, these are all great ideas, and they're all things that you can do to this text. There are many different ways to clean your text, but there are a couple of standard ones that we're going to go through today: first removing punctuation, also making everything lowercase, and, a very common thing to do, removing numbers and any words that have numbers in them. All the things you said can be done, but this is just a very quick first pass.
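
A rough sketch of that first cleaning pass (lowercase, strip punctuation, drop numbers and words containing numbers) using Python's `re` and `string` modules. The function name `clean_text` and the sample sentence are hypothetical; the sentence is not the John Mulaney line shown on the slide.

```python
import re
import string


def clean_text(text):
    """First-pass cleaning: lowercase, strip punctuation, drop words with digits."""
    text = text.lower()
    # Remove every punctuation character listed in string.punctuation.
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    # Remove numbers and any word that contains a digit.
    text = re.sub(r"\w*\d\w*", "", text)
    # Collapse the leftover runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = "Hello, World! I bought 2 hats in 2018, and they were GR8."
print(clean_text(sample))
```

Note that stripping punctuation also turns contractions such as "don't" into "dont", which is usually acceptable for a quick first pass.
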
Your text would then look something like this. At this point we've done a first round of cleaning the text, so everything's lowercase and we have it in a mostly standard format. The way we're going to do this is using regular expressions in Python. Regular expressions are very powerful. Think of Ctrl+F on your computer, where you can search for one word; with regular expressions you can search for patterns. So you can say: any time a word starts with a capital letter, do something with it. That's the power of regular expressions, and I'll walk you through that later as well.

Now that we've cleaned the data, the next step is to tokenize it. Tokenization is a very standard term in NLP. To tokenize something is to break it down into smaller parts. You can tokenize by sentence, or by two-word combinations, which are called bigrams, but the most common way to tokenize is by word. If we tokenize this sentence up here by word, it looks something like this, so now every word is its own item.

Okay, so at this point we've done tokenization, and every single item is its own word, so now we can remove things called stop words. Stop words are words in a language that carry very little meaning. Every language has its own set of stop words, so you'll see that English has its own set, including things like "the" and words like that. It's very common at this point, once you have all these words, to remove the stop words, because they're not going to add much meaning for the machine to process. If we remove the stop words, these are the words we end up with. So we went from that really messy-looking data to just six words here, and it's much easier for the computer to process.

Within NLP this kind of format is called the bag-of-words format, because if you think about it, it's saying that this
document here is essentially just a bag of words, and the order of the words doesn't really matter. It's just a group of words thrown into a bag. It seems like an oversimplified way to represent text data, but it's actually really powerful: just using bag of words you can do a lot of analysis, which I'll show you later as well.

Finally, we're going to put this all into a matrix. This text here was just for John Mulaney's skit, but what if I want to include data for John Mulaney and Ali Wong and Dave Chappelle within one table? I'd have to put it all into a matrix. The reason we use a matrix is that we want to store these terms for multiple documents, and at the end of the day you get this document-term matrix. So this was our goal: we wanted to create a matrix where every row is a different comedian, or a different transcript, or a different document, every column is a different term, and all the values inside are the word counts. If you think about it, we started with a big transcript of really messy data for every comedian, but using those steps of cleaning, tokenizing, and putting it into matrix form, we've been able to really simplify it into this document-term matrix that we can now easily use for analysis. It's almost like having numerical data at this point, so it's very easy to process.

So at the end of this step we've created the document-term matrix. The way we do this is within scikit-learn, which again is Python's machine learning library. There's a class called CountVectorizer that helps us create this document-term matrix. You can do a little bit of data cleaning ahead of time, and then you put the text into CountVectorizer and it creates this matrix for you, and you're even able to remove stop words along the way, so CountVectorizer is a great tool for this.

Again, our goal at the beginning of this was to get our data into a clean, standard format for analysis.
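
To make the document-term matrix concrete, here is a small standard-library sketch of tokenizing, dropping stop words, and counting words per document. It is a stand-in for illustration only: the talk uses scikit-learn's CountVectorizer for this step, and the stop-word list, function names, and two toy documents below are assumptions.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "i", "my"}


def tokenize(text):
    """Split already-cleaned text into word tokens and drop stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (terms, rows): one row of word counts per document."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    # The columns are the sorted vocabulary across all documents.
    terms = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in terms] for name, c in counts.items()}
    return terms, rows


# Two toy "transcripts", already lowercased and stripped of punctuation.
docs = {
    "comedian_a": "my wife and i love the dog",
    "comedian_b": "the dog is a good dog",
}
terms, rows = document_term_matrix(docs)
print(terms)
for name, row in rows.items():
    print(name, row)
```

Each row is one document, each column one term, and each value a word count. Word order is discarded, which is exactly the bag-of-words simplification.
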
Now you can see that we've got it in two standard formats: a general corpus, with every comedian and their transcript, and also this document-term matrix, with all the documents, the terms, and the word counts. To summarize, the input to the data step down here was: how is Ali Wong's comedy different? We first gathered our data and scoped our project, then we put that data into a standard format using data cleaning techniques, and our output is a corpus and a document-term matrix. So let's get into the Jupyter notebook.

For those of you who have a Windows machine, what you should do at this point (I said it in the setup steps) is launch Anaconda Navigator. So open Anaconda Navigator and launch Jupyter Notebook, and then within the browser window that opens up, navigate to the first notebook, data cleaning. If you're on a Mac, you can just go to a terminal and type in jupyter notebook, and you'll see this window pop up. If you navigate to the folder, you can see all of the Jupyter notebooks here. I just want to point out this extension, .ipynb, which stands for interactive Python notebook. It's been rebranded as Jupyter Notebook, but whenever I talk about a notebook, it's anything that has that .ipynb extension. So we're going to go to this first one here, called data cleaning.

Has anyone here used Jupyter notebooks before? Okay, a lot of people, great. One of my favorite shortcuts in Jupyter Notebook is Shift+Enter, which runs a cell. At the very beginning we have our introduction here and our problem statement, and I'm doing Shift+Enter the whole time, so you're able to actually run the cells. I'm going to walk through the code here and what's actually happening.

For this first section, my goal is to get the data. Do you remember what the libraries were that I used for this section to get the data? Requests and Beautiful Soup, yep. Those are the two most
important ones for web scraping. What I'm doing here is first importing requests and Beautiful Soup, and then creating a function that will pull the transcript data specifically from the Scraps from the Loft website. If you take a look at that, I can go to Scraps from the Loft, and the way I scrape data is by inspecting an element in the browser, which is Command+Shift+C. You can't really see it right here, but behind every web page there's a bunch of code, and the code that I specifically want is this transcript data. If I hover over this, you can see that all the transcript data is in a div with a class called post-content. That's how I was able to identify where the transcript data was.

If I go back to this function here, you can see that first I'm using requests to get all the data from that web page, that URL. Next, in this Beautiful Soup step, I'm basically telling Python that this text is actually an HTML document and I want to read it as one. Then this part is where I find that specific class called post-content, which I knew from hovering over that section of the web page, and then I find all the paragraphs in there and pull out the text from those paragraphs. And that's it, that's my simple web scraper, just this little function here.

At this point I've listed all 12 URLs that I want to scrape, and then I scrape them. If you uncomment this cell here, you can actually run the scrape, but it takes a few minutes. So instead of all of you trying to hit this website, which is already a little bit slow to begin with, I've pickled the files, and you can just load the pickled files here. Once you've loaded them, you can see that I've created a dictionary where every key is a comedian and every value is the transcript.
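
The scraper described here might look roughly like the sketch below. The class name `post-content` is taken from the spoken description; the function names, the `html.parser` choice, and the inline sample page are assumptions, and the real page structure may differ.

```python
import requests
from bs4 import BeautifulSoup


def extract_paragraphs(html, css_class="post-content"):
    """Return the text of every <p> inside the first element with the given class."""
    soup = BeautifulSoup(html, "html.parser")  # read the text as an HTML document
    container = soup.find(class_=css_class)
    if container is None:
        return []
    return [p.get_text() for p in container.find_all("p")]


def url_to_transcript(url):
    """Download a page and pull out its transcript paragraphs (hypothetical helper)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return extract_paragraphs(response.text)


# Offline demonstration on an inline page, so no website is contacted here.
sample_html = """
<html><body>
  <div class="sidebar"><p>Unrelated text</p></div>
  <div class="post-content"><p>First line.</p><p>Second line.</p></div>
</body></html>
"""
print(extract_paragraphs(sample_html))
```

Keeping the parsing separate from the download makes it possible to try the extraction on a saved or inline page before hitting a slow website with a dozen requests.
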
questions so far? Okay, great. So that was the data gathering step. It looked really simple there, but that actually took me a whole day to figure out, because I was trying to figure out where on the website all that content was and how to write it in a short format. So it does take a while, I just want to put that out there. Okay, so the next step is to clean the data. What were some of the things you can do to clean data, again? I'll hide that there. Yeah: remove words with numbers, lowercase, punctuation, right. These are a lot of the common data cleaning steps up here, and then there are some more data cleaning steps for later as well that I'm not going to talk about today, things like lemmatization and stemming. What that means is you can take words like driving, drive, drives, and it knows that those are all the same word and can group them together. Parts of speech tagging, bigrams, and so on. There's a lot you can do, but you want to start simple, so we're going to start here with just these common data cleaning steps. I always like to keep looking at my data to make sure that it looks right, so again I'm just going to check that I have this dictionary called data, where all the keys are the comedians and all the values are the transcripts. The next thing I'm going to do is create a function that takes this transcript text and puts it into one large chunk of text. Right now, if you remember, when I scraped the data it was split into lots of chunks of paragraphs, and I put that all into a list. So I had a list of lots of text, which is kind of tricky to deal with. With text data, a lot of times you just put it into one giant string; it's a lot easier to deal with. So in this case I'm using this function to put all that data into just one giant string, and that's this part here. Now I've created this data set that has a key of a comedian, and the value, instead of a list of text, is just one giant
string of text. At this point you can keep your data in this dictionary format, or what I like to do is put it into a data frame, since I'm very used to working with pandas data frames. So what I'm going to do here is import pandas, and pandas has this nice function that allows you to take a dictionary and make it into a data frame. If I run this, you can see that I had that dictionary of comedians to transcripts, and now I've been able to put it all into this nice data frame. This was one of the formats that I wanted. Which format was this? I think I heard a corpus. So at this point this is not the document-term matrix just yet; this is just the raw corpus, because I have just all of the transcripts, so my column is just transcripts. So no, not the document-term matrix; this is just a corpus. Okay, so the easy thing to create was the corpus. The harder thing to create is that document-term matrix, where, instead of my one column being just a transcript, I want to create a different column for every single term. The way I'm going to do that is I first need to clean the data. Remember, we have to use regular expressions to clean that data. So again I just like to look at my data to make sure it looks good. It does, and at this point I'm going to start applying data cleaning techniques to create that nice document-term matrix. I'm importing re, which stands for regular expression, and then string, which allows me to get all the types of punctuation out there. Then I'm going to create a function that allows me to clean my data. You can see in this part I am making all the text lowercase. This next regular expression here is a bit trickier. If you look at this data up here, let's see, okay, if you look at this data up here, you'll see that there's data in brackets, and I wanted to get rid of the data in those brackets because those aren't
actually part of the routine; you can see that they're just sounds. That's what this regular expression here is doing. It's saying: anything that's within these square brackets, if you have square brackets and there are some characters in them, then get rid of that whole thing. That's what this line is saying. What this next regular expression is saying is: we have string.punctuation, which is just literally a list of punctuation marks, so if anything is any of those punctuation marks, then get rid of it. You said you want to get rid of this part, can you give an example? Oh yeah, I see what you're saying. Yes, so that would be great. I didn't do that, but yes, that would be great. You'll notice in these two sections it says announcer, and then it has that other one. Yes, that would be a great second round of cleaning to do. Okay, so going back here, though: for a first pass I'm getting rid of everything in square brackets, then the punctuation. Then does anyone know what's happening in this one right here? The numbers, yeah. What this is saying is backslash d, these are all the digits, and then backslash w is alphanumeric characters, so a to z, 0 to 9, and then the star is zero or more times. So it's saying if you have a number and there are any letters or numbers surrounding that number, then that's a word that contains a number in it, and you want to get rid of any words that contain numbers in them. These regular expressions are a bit harder to decipher, so I encourage you after this tutorial to go back and look at them to understand them a bit more. So right here, this is my first round of text cleaning. It's just those three basic things, and then the one extra thing I added was the square brackets. Let's run that, and then you can see here that our data looks a little bit cleaner. Do you notice anything here that hasn't been cleaned? Yeah, there are some single letters there that are kind of weird. The first thing I
noticed when I looked at this was that some of my quotes didn't disappear, and it's because string.punctuation didn't capture this particular type of quote. So I went back and copy-pasted these quotes and did a second round of cleaning where I added those specific quotes. I also noticed that these line breaks are in here, the backslash n's are in there, so I handled that in the second round too. And actually, those square brackets I mentioned, I didn't add those into my data cleaning step until about three days after I did this analysis, when I realized it helped my case more if I added that in there. So there are always more ways for you to clean the data, but I'd say two rounds is a good place to start. If you apply that second cleaning step, I would say this is looking cleaner. There are these weird CIA i/o things there, there are things that the announcer is saying, but it's pretty good to start, so we're going to move on from here. This is what I mentioned here: this could go on forever, so just stop at some point and move on with your analysis. At this point we've organized the data into a corpus and a document-term matrix, and we're going to pickle them because we're going to use them in later files. First we have this corpus, which is just every comedian and a transcript, and then you see down here I'm going to add the full names next to it as well, because I want to use that for my data visualizations later. Then I'm pickling it, so you can see corpus.pkl; I'll use that later. Then there's also this document-term matrix. This is where we're going to use CountVectorizer from scikit-learn. The way you use CountVectorizer is you first instantiate a CountVectorizer object, and what you can put in there is stop words. So if I say stop_words equals 'english', it automatically knows that I don't want those words as columns, which is really, really helpful. There's a ton more that you can do with
CountVectorizer. You can even say I want to include bigrams as well, so anything that's one word I want to have as a column, and any combination of two words I want as a column as well. So it's super powerful. Another trick that I like in Jupyter notebooks is if you do Shift-Tab, no, here, let me run this first. If you do Shift-Tab you can see all the documentation behind a function. If I expand this, you can see there's a ton of things you can do in here: you can strip the accents, it defaults to making everything lowercase, and there's the ngram_range I mentioned. There's just a lot you can do in there. After you create that CountVectorizer, you can fit it onto your data, and we're specifically fitting it onto our transcript data. These two lines are a bit trickier: you convert it into an array and then you label all the columns, but just know that you can copy and paste this code for future use. The main thing you have to know here is you have to specify the stop words and then you have to put in your clean text data here, and at the end of the day you get this document-term matrix. So remember, every document and all the terms. You can see that this is a pretty big matrix that has 7,000 columns, so if you include things like bigrams, those pairs of words, this will really blow up. But I've found that for a lot of analysis it's really useful to include those as a second pass, because then you'll find really meaningful things in your data. At this point you can pickle it. I'm going to pickle this for later use, and then I'm also going to pickle two other things. I'm pickling data_clean, and data_clean is this data frame from up here, so you can see this is before I put it into CountVectorizer form; it's just the clean data where I removed all the punctuation. I pickled this for later use because we're going to use it for another thing, and then at the very end I also pickled the CountVectorizer object, which we're going
to use for a later thing as well. So that's it. There are some additional exercises you can do on your own after this tutorial. Mm-hmm. So what's happening is it's just serializing it. Think of it as taking that object and saving it as a file on your computer, so that later you can just call that and open it up in the next document. Does that make sense? So if you see that I've pickled a file here, once I ran that pickle step you can see that it created this pickle file, and in the next notebook I'll open the pickle file and then you can see the object. A pickle file is just the easiest thing you can create. You could also maybe compress it in some way, but a pickle file is not compressed at all, so it's just a very simple way to dump that object into a file that you can use later. Yep, it just stays there. Okay, how's everyone feeling? Pretty good? Okay, so then let's move on. We've cleaned our data, so the next step is EDA. For EDA, the input into this step is your text in some standard format, so again, we created our corpus and our document-term matrix. The goal for EDA is to summarize the main characteristics of the data, preferably in a visual way. If you remember from our simple example, that's when we created the scatter plot and we were like, okay, there's definitely a positive correlation, and also we see an outlier. The output of this section is just to see if our data makes sense or not. You might have heard the phrase garbage in, garbage out. That means if you put garbage data into your model, even if you have the coolest model, your results are not going to be good. So you have to do this step to make sure your data is actually good. If you think about our data set with all our transcripts, what are some ways that you think you can explore that data? Perfect, frequency counts, like word counts. What else? Named persons, great idea, like
seeing the actual proper names of these. What's that? Sentiment. To me, I would say that's a little bit more advanced analysis, but yeah, these are great ideas. You just want to get an idea of what's going on in your data. So I thought of three things that I want to get out of this data. The first thing I'm going to do is just look for top words. That's the most obvious one: for every comedian, what are the top words that they use? The second thing I thought of is taking a look at the vocabulary. Maybe some comedians have a larger vocabulary than others; that might be interesting to look at. And the third thing is the amount of profanity. I didn't realize that this would really be a thing until I looked at the top words and there was a lot of swearing, so I thought maybe I should see how much comedians swear. Those are just some things I decided to look at. This is really up to you; whatever is interesting to you that would help you figure out whether the data makes sense or not would be good to do here. So how did I find the top words? Well, the first thing you want to do is figure out, okay, we have our data in two standard formats, the corpus and the document-term matrix. Which one do you think would be better for figuring out the top words? Yeah, the document-term matrix, because you already have every word as a column, so it's very easy to aggregate your data at this point. The whole reason we did all that data cleaning and standardization was so our life would be really easy at this point. Okay, so we've got this data. The next thing we have to do is aggregate. How do we aggregate this data to find the top words? Perfect: for every comedian you select the columns with the largest values. Basically, sort across the row and then find which words have the highest word counts. Okay, so we've got the data, we've aggregated the data, and now we have to visualize it in
some way. Let's say we found the top 30 words for every comedian. What's the best way to visualize that? Word clouds, okay. It's funny, because where I teach, half of us love word clouds and half of us hate word clouds, and I love word clouds. I never expect people to say word clouds when I'm teaching, so I'm so glad you brought them up, because I really like word clouds. It's a great way to visualize text. You can see here this is Ali Wong's word cloud and here's John Mulaney's word cloud, and what we'll do in the notebook is I'll show you how to create word clouds for all the comedians. Yep, what's that? Sure, you can do a bar plot, something more classic. Word clouds are specific to text, but yeah, you can create a bar plot; the bar plot is the easiest. If you're comparing two words you can do a scatter plot. Okay, so you visualize this, but what was the whole point of this? Why are we visualizing this? What's that? We want to find the top words. Do you remember why we want to find the top words? Yeah, we want to know why she's different from everyone else, and we want to see if these results are actually making sense. When I first did this analysis, I didn't do as much data cleaning as I showed you, and when I was getting these word clouds they did not make sense at all. They just had very common words in there; I hadn't removed stop words, so they didn't look very good. It took a lot of cleaning to get to this point where it made sense, and that's where EDA can help. You look at these word clouds and you ask, do they make sense? If they don't make sense, then you have to go back and do more data cleaning until this part makes sense. Yeah, that could be the case. I'm not sure of the exact logic behind the word cloud package, but it has something to do with that. I'm sure you can tweak the parameters. Okay, so I mentioned the last thing you want to do is take a look at the visualizations, see if your
data makes sense, and figure out if you have to clean your data somewhere. Then you can also get some initial findings from this. So how do the comedians differ from each other? Well, you saw that Ali Wong, let's see, oops, sorry guys, you see that she says get a lot, and then John Mulaney talks about Bill Clinton, the line is good. So those are some ways you can see that it's different. Mm-hmm, it is, so it is a bigram that I detected. Okay, so at this point, these are the EDA steps that we followed. We figured out what format we need the data in: since we're finding the top words, we need the data in the document-term matrix format. Then we had to aggregate that data, visualize it, and then see if it made sense. What we're going to do in the notebook next is do this for three things: top words, for which we'll create word clouds; the size of the vocabulary for every comedian; and then the amount of profanity. The way I'm going to do this is I'm going to use word clouds, specifically the wordcloud package, and then there's also matplotlib, which is the most popular data visualization package in Python. One thing I want to note is that I updated the readme file about two days ago, and there's an extra step you have to take to download wordcloud. If you look at that readme file, there's a conda install step for you to actually download wordcloud, because it's not included with the base Anaconda Navigator. Okay, to summarize what we did for this section: the input into this section was our standard document formats, what we're going to do here is a ton of EDA to see if our data makes sense, and the output is to find some big trends in the data and see if they make sense. So, to the notebooks. All right, I'm going to go to the second notebook here, and the first thing I'm going to do is read in the pickle files. Someone asked earlier about how this part works. Within pandas, what's really nice
is if you use pickling in pandas, like you can pickle a data frame, it's really easy to read it again. If I import pandas, you can see within pandas there's a read_pickle function, and I can specifically use that to read in a data frame. So here I've read in my document-term matrix, but one extra thing I've done is I've transposed it, so now it's a term-document matrix. The reason I did that is because now my aggregations will be a little bit easier. I talked about how I wanted to find the top words for every transcript. Well, if I didn't transpose it, it's harder to do things across rows than it is to do things over columns, so that's the reason I transposed it here. I just wanted to mention that. So my data here is now a term-document matrix, and let's say I wanted to find the top 30 words for every comedian. What I've done here is, for every comedian, I've looked in that data frame, and I see for Ali Wong these are her top words and how often they occur, and I'm going to do that for every comedian. So these are the top 15 words said by every comedian. If you take a look at this, what do you think about these top words? Yeah, people say "like" a lot. "Like" is not part of the standard English stop word list, but if everyone says "like" a lot, it's not going to be really meaningful in your data, right? So what you can do is actually add these common words to your stop word list. What that means is in my document-term matrix I'm going to remove this term. So what I want to do here, I'll show you: I'm going to first pull the top 30 words for every comedian, and then I'm going to use Counter, which is super helpful. It's going to look at my list of the top 30 words for every single comedian and count how many documents contain each word. At this point you can see all 12 comedians have "like" as one of their top 30 words, and all 12 comedians have "I'm" as one of their top 30 words.
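As a rough illustration of that counting step, here is a minimal sketch using only the standard library. The `top_words` dictionary and its contents are made-up placeholder data, not the real transcripts, and the "more than half of the comedians" cutoff is the one chosen in the talk.

```python
from collections import Counter

# Hypothetical top-word lists per comedian (placeholder data, not the real output).
top_words = {
    "ali": ["like", "im", "husband", "know"],
    "john": ["like", "im", "clinton", "know"],
    "bill": ["like", "dont", "know", "gun"],
    "mike": ["like", "im", "joke", "said"],
}

# For each word, count how many comedians have it among their top words.
doc_counts = Counter(word for words in top_words.values() for word in set(words))

# Flag a word as an extra stop word if more than half of the comedians share it.
threshold = len(top_words) / 2
add_stop_words = sorted(word for word, n in doc_counts.items() if n > threshold)

print(add_stop_words)
```

Words that show up in most comedians' top lists carry little information about any one comedian, which is why they are candidates for the stop word list.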
These words aren't really meaningful, so I'm just going to set a limit. I'm going to say if over half of the comedians have one of these words among their most common words, it's a stop word, and I'm going to add it to my stop word list. You'll see here that these words are in a lot of the comedians' top words, so what I'm going to do is add them to my stop word list. I'm going to walk you through a couple of steps here. First I'm importing CountVectorizer, which we used before. I'm also importing text, which is what actually contains the stop word list. Then I'm going to read in the clean data; if you remember, that was my corpus that had all the punctuation stripped. Then I'm going to add new stop words. There's already a list of English stop words out there, and I'm going to union it with my new stop word list. Now I'm going to recreate this document-term matrix with my new stop word list along with my clean data from my last notebook, and at the end of that I'm going to have a new CountVectorizer object that includes my new stop words and a new document-term matrix that excludes those stop words as well. So I've run all that, and at this point I have a new document-term matrix that doesn't have those stop words in it anymore. Now we can make some word clouds. This part will only run if you've done this additional install. If you're on a Mac you can do this in your terminal, if you're on a Windows machine you can do this in your Anaconda Prompt, and you just type in conda install -c conda-forge wordcloud. So I've created my word cloud object, and then to actually plot it you can see here, these are the word clouds for all the comedians. What can you tell from this? There's so much swearing. I was so surprised by the amount of swearing, which is why I added the profanity step. Yeah, there's a lot of swearing. And what I found from this, remember
the original question is how is Ali Wong different. I noticed that she says okay a lot, and I say okay a lot. She talks about her husband, and I talk about my husband. I guess I think this is all funny. So if I look at this, this makes sense to me, and I've come up with a couple of findings here. Okay, so the next thing I want to do is look at the number of words. My goal here is to see how big a vocabulary everyone has. The way I did this is, first, for every comedian I looked at the number of unique words that they used. The other thing I did, I was thinking maybe I can look at the speed of each comedian, so I went on IMDb and looked at how long everyone's routine was and added that here. So this table I created has every single comedian, the number of unique words they've used, the total words that they use throughout their routine, and how long that routine was, and so I could calculate words per minute. The two columns that I care about are the number of unique words that they use and the number of words per minute that they say. So now that I have this table, what's the way I... yes? So I just had the run time of the show, but you're absolutely right, if I wanted to be more specific I would need to know the specific time that the comedian comes on stage. I went on IMDb and just manually went to each comedian and looked at their run time. Yeah, I could add it to my scraper too. Okay, so at this point what can I do? This is kind of hard to read. Yeah, that's a great idea, I could take the average of the unique words and then see who's above or below that. Any other ideas? Oh, do some clustering too; we'll talk more about that later. For me, I like to visualize in very simple ways, so what I've done here is just create some bar plots. This is pretty simple. A great idea would be to add that average piece, like a line here or a color here for average, but you can see that
Ricky Gervais and Bill Burr have a big vocabulary, and Louis C.K. and Anthony Jeselnik don't, and here are some things about words per minute. I did this and I found that my analysis wasn't really interesting, but it's something to try, and not finding anything is something in itself. So I just did this analysis and didn't find it very interesting, and then I went to my last bit, which was looking at the amount of profanity. When I looked at all the word clouds, I found that the S word and the F word were said a lot, so I thought, okay, why don't I just create a scatter plot to see how often those things are said. That's what I'm going to do in this section: I'm just going to create a scatter plot of bad words. So this here is the number of F words and S words. What can you tell from this? Bill Burr used a lot of S words and just profanity in general. Yeah, I was really surprised. I don't know Joe Rogan, but he's using two F words per minute, which is crazy. Then on the other end you have Mike Birbiglia, who has zero swear words in his whole routine, which I thought was pretty interesting. So just from EDA, I haven't done any fancy analysis, and you can see I've come up with some interesting things here. I do think my profanity plot is the most interesting thing from my EDA section. Yeah, that's a great question. I didn't do any stemming. What I did was look at the most common words, and when I looked here I saw that there were variations of the F word, and I just put them all in one. You could do stemming, although I don't know if the stemmers in NLTK include profanity, so that's probably something I'd do manually. Okay, so the side note I wanted to mention here is that the whole goal of EDA was to take an initial look at our data and see if it made sense, and my conclusion is it does, for a first pass. Yes, there are always things that could be better, but it's good to get something done quickly. All my students know my
data science motto is let go of perfectionism, because I always have students that come in and want everything to be absolutely perfect before they move on to the next step, but I'm like, you've got to move quickly, so just do something. You know there are things wrong in it, but it's okay. So my data science and my life motto is let go of perfectionism. I just wanted to throw that out there. Okay, so we're at about the halfway point and we've been here for a while, so I want to give you all a break. Let's get back together in about five minutes, at, is it 11? We have now completed the EDA step, where we've explored and visualized the data and gotten a couple of interesting findings from it as well. This second-to-last part is the most interesting part, because now we're going to be looking at the NLP techniques. Our input into this is clean data that we've verified makes sense, and now we're going to do these more advanced analytics techniques, including sentiment analysis, topic modeling, and text generation. Our output is going to be additional insights to help us answer our original question. So these are the three techniques we're going to be going through, starting with sentiment analysis. Who here has heard of sentiment analysis? Okay. Who here has applied sentiment analysis? Okay, that works. So, the input into sentiment analysis. Remember, we have our corpus and we have our document-term matrix. For sentiment analysis we want to input our corpus, and the reason for that is because our document-term matrix was in a bag of words format. What is a bag of words format, again? Yeah, it's a collection of all the words, and order doesn't matter. But the thing is, with sentiment analysis order does matter, right? If we have the term great and we have the term not great, those are going to mean very different things. So we want to preserve that order.
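To make the point about word order concrete, here is a small sketch using only the standard library. The two sentences and the `bag_of_words` helper are invented for illustration; they are not from the notebooks.

```python
from collections import Counter

def bag_of_words(text):
    # Hypothetical helper: count each word and ignore the order it appeared in.
    return Counter(text.lower().split())

# Two made-up sentences that use exactly the same words with opposite meanings.
a = "the show was not great it was bad"
b = "the show was great it was not bad"

# The bags are identical, so the position of "not" is lost.
print(bag_of_words(a) == bag_of_words(b))  # True
```

Because both sentences collapse to the same counts, anything built only on the document-term matrix cannot tell them apart, which is why the corpus is the input for sentiment analysis.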
So our input here is a corpus, and what we're going to use for sentiment analysis is a library called TextBlob. Earlier I talked about NLTK, and NLTK is the library that everyone uses for natural language processing in Python. TextBlob was built on top of NLTK, and it makes it a lot easier to use, and it also includes some additional functionality such as sentiment analysis and fixing typos. It does all those things, but I would say it does them in a very, very basic way, so keep that in mind. For the output of this, for every comedian we're going to look at their entire transcript and give that transcript an overall sentiment score and an overall subjectivity score. The sentiment score is how positive or negative they are, and the subjectivity score is how opinionated they are. So let's take a look at what this code looks like. This is literally all the code for sentiment analysis: you import TextBlob, then you say TextBlob of some text, you do dot sentiment, and then you get this as an output. Polarity is a number between negative 1 and positive 1, so a polarity of 0.5 would mean "I love PyOhio" is generally a positive statement, which is right. Subjectivity is how opinionated you are about something, and that's a number between 0 and 1. The more subjective you are, the more opinionated you are. Love is an opinion, so you can see that that's a bit higher on that scale. So great, let's jump into the notebook. But wait, don't jump into the notebook. I've mentioned this a little bit before, but if you jumped into the notebook at that point, you would fall into that danger zone I talked about, because you only understand at a very high level what's going on. Before you import and use some module, I highly encourage you to understand what's going on behind the scenes. So I'm going to go into a little more detail about what's actually happening. The way that
TextBlob sentiment works is there's this great linguist, Tom De Smedt, and what he's done is gone through all these words in the English language and manually labeled them as positive sentiment, negative sentiment, and so on. This might seem kind of crazy, but there are a lot of linguists out there and this is what they do. They've created these amazing, incredible databases of the words of the English language. You might have heard of one called WordNet. WordNet was created by a group of researchers at Princeton, and what they've done is said these words are all very similar to each other, these are the definitions of these words, and so on, so they've actually kind of mapped out the English language manually. There are a lot of sentiment lexicons out there, built by researchers and linguists who have gone through all the words and labeled them. The one that TextBlob uses is this specific one that was labeled by Tom De Smedt, and the reason I know that is because I went to GitHub and read some of the documentation, and it tells you this. So what does this mean? Let's take the word great. If you look into that big dictionary from the last slide and just pull out some key features, you'll see that the word great actually shows up in that list four times. These are the WordNet IDs. Again, WordNet is this big dictionary that was created at Princeton, and a lot of people reference it in the NLP world, so they'll have different WordNet IDs, and everything in the WordNet dictionary is labeled with parts of speech. JJ stands for adjectives, and then there are different meanings of that word, and you can see here that this specific linguist has labeled these with a certain polarity and a certain subjectivity. So how do you think TextBlob aggregates all of that? Overall, if TextBlob saw the word great, how do you think it would decide what to do? You would think it does
something fancy like that. It just literally takes the average, yeah. This is why I was saying TextBlob's features sound really fancy, like fixing typos and sentiment analysis, but they're very, very basic. But it's okay, because it's a good first pass. These are all good things to know before you just use the module. So let's look at some examples. If I just did TextBlob of great, you would see that the polarity is just the average of that column and the subjectivity is just the average of that column. If I did not great, what do you see as a change here? Polarity. What's changed? It's negative. If you look at the documentation, what's actually happening is whenever it sees not before a word, it multiplies the polarity by negative 0.5. So this is a rules-based approach to sentiment analysis. If you see very before a word, what it does is multiply both those scores by 1.3, and then it caps polarity at one. So again, a rules-based approach. Then if you do I am great, you can see that that has the same score as that first line, because I and am don't have any sentiment. Yep? I'm not sure, but I would hope so. I'd have to look at the actual code to see what's happening, but that makes sense; an exclamation mark would add some meaning to it. So this kind of gives you an idea of what's happening behind the scenes, and when you do dig deep into what's happening in the modules, you don't just take the module to be magic. That's great, that's great. That answered your question; thanks for answering. Okay, so at the end of the day, what TextBlob does is first, if it sees a word, it will average all the subjectivity and polarity scores for that word, and then it will look at your entire text and average together all those scores. So not the most, okay, well, before I say that: the output will be that every comedian is going to be assigned one polarity and one subjectivity score. It's not the most sophisticated technique, like I said, but it's a
really good starting point because so this is a rules based approach and you can also do knowledge based techniques so this is more if you're into data science and you know about classification techniques that are out there you can use one of those classification techniques as well so how that would work is the most popular data set to do sentiment analysis on is this movie reviews data set so for every movie review it has a bunch of text and then it's labeled as this is a positive review this is a negative review and then you'll use some more advanced technique like naive Bayes or logistic regression to see what's the combination of words that gives a movie a positive review and so that's a more advanced technique that if you're interested in I highly recommend you look into after this yep yeah that's a great question so with naive Bayes the major assumption there is that all your features are independent so it's saying like if you're trying to predict how good you will do on a test all the things that could predict how well you're gonna do on tests like number of hours studied or how high you scored on your last test like those are not related to each other so typically if you use a standard algorithm that's not naive Bayes you don't have to assume that independence and sorry let me just summarize my thoughts so if you're using something like naive Bayes it assumes that each of those features is independent so that they're not related to each other at all and with other types of analysis you usually can't make that assumption but with text data it is okay to assume that those are independent and so it just ends up working very well with text like that any other questions all right okay so now that we've kind of understood how TextBlob works let's jump into the notebook all right so for this we are going to go into notebook three and for this you're also going to have to do an extra conda install so again instead of reading in the document term matrix 
we're going to read in the actual corpus that has all of the words in order so again it's very easy to do TextBlob all you have to do is do TextBlob sentiment and you end up with those scores so again that was very easy to do I just encourage you to figure out what's happening behind the scenes and then if I plot this this is what I see so on the x-axis I have positive and negative on the y-axis I see facts and opinions so what can you tell from this graph yeah he's pretty negative yes yes and so if I want to see comedians similar to Ali Wong who might I recommend yeah like whoever's close by right family and so on so there you go with sentiment analysis so that was pretty simple and so another thing I did was I wanted to see the sentiment of a routine over time because one of my friends actually did this analysis on Disney movies and saw that like there was like a common pattern throughout all Disney movies like it starts very sad and then it gets sad and then it's very happy so I wanted to see if there was anything like that with comedy routines so what I did was I took every transcript and then I split it into ten pieces of text and so I created this new list and it has a different element for every one of the 12 comedians and then you can see every comedian their text has been split into ten different parts and what I want to do is calculate the polarity and the subjectivity for every single piece in there and so if you plot that out for one person you can see this is their sentiment over time and then at the end I created all these sentiment plots for all the different comedians so anything you can tell from these plots immediately yeah maybe I mean this yeah there's like a peak here what's that yeah that's kind of what I took out of this it's like Ali Wong is pretty consistently positive and so I wanted to see if there are any other comedians that are pretty consistently positive family again Louis C.K. Mike Birbiglia yeah and so 
with this extra sentiment analysis I was able to find some extra interesting things about my data so there's some additional exercises for you as well alright so that was sentiment analysis so the next thing I want to talk about is topic modeling so this is probably the most complex thing we'll talk about today but I'll try my best to explain the idea so for topic modeling the input into topic modeling is a document term matrix because order doesn't matter here so again with topic modeling what we're trying to do is we have all these documents we want to see what are the topics that are being said in these documents and so we don't really care about the order of the words here we just want to know what are the words what's the bag of words that's in each topic and then the library we're going to use here is called Gensim so Gensim was built specifically for topic modeling and it actually has different ways of doing topic modeling but the most popular one is called latent Dirichlet allocation which is most commonly known as LDA and we're also gonna use NLTK for some additional parts of speech tagging which we'll go through in the notebook and then our goal at the end is to find the various topics or themes across the comedy routines so I'm talking about topics in a pretty vague way right now so I want to show you an example before I go through that so this technique is called latent Dirichlet allocation so latent just means hidden and Dirichlet is a type of probability distribution and those are the basic ideas that you have to understand for this so I'm going to talk about probability distributions coming up and then hidden means that what's happening is I'm looking at the text in terms of probability distributions I'm gonna find hidden topics from that so let's go through a concrete example cuz I think this will make it make more sense so let's say I have these five documents here if I applied LDA on this set of 
documents what would happen is I would get an output that looks something like this so it would say this document is a hundred percent topic A this is 100 percent topic B this is split between topic A and topic B okay so the way to think about this is earlier I was talking about latent Dirichlet latent means hidden so these are topics that were hidden in these documents before that I couldn't find but using LDA I'm able to find those topics that's the latent part Dirichlet is all about probability distributions and the way to think about this is you see that this document here is a mix of topics topic A and topic B or the fancier way to think of that is this document contains a probability distribution of topics so when I say probability distribution I just mean it's a mix of these words so again this document is a mix or probability distribution of these topics yep great question that's what I'll go through next okay so at this point what you would do is so I talked about this document ends up being a probability distribution or a mix of these topics in addition every topic ends up being a mix or probability distribution of these words so this is what you would get as an output of LDA you would see this topic contains this much about bananas and kale this topic contains a lot of words about kittens so what do you think topic A might be about food right and what would topic B be about animals so now you can go back to your documents and say that document's about food that's about animals this one's about food and animals so that's the whole idea of topic modeling I have these documents I don't know what the topics are but then I can figure out what the topics are and then see for every document what's the mix of topics and then for every topic what's the mix of words okay so with latent Dirichlet allocation it's all about these probability distributions or these mixes so again just to reiterate I said this a couple times but every document is a mix of topics and 
then every topic is a mix of words so you can see here this food topic it has a lot of things about bananas and kale and frogs and then animals has a lot about kittens and puppies so that's a general idea behind LDA and the way that it actually figures out what the topics are is it goes through these steps so first is again you want LDA to learn about the topic mix in every document and then the word mix in every topic so within each document what are the topics in there within each topic what are all the words that are in there so the first thing you do is you choose the number of topics that you think are in your corpus so people typically start with two and then you just build on from there and then what it does is it goes through every single word in every document so here's my example up here let's say we had this document it goes through every single word and it randomly assigns it a topic so it might say I is part of topic A like is part of topic B and so on just a random distribution of topics okay and then this is the most complex part so this is where it goes through every word and it updates those topic assignments so what it's going to do is it's going to look at a word so let's say bananas let's say it looks at bananas and right now banana is assigned to topic A sorry banana is assigned to topic B so banana assigned to animals does that sound right no right so you want it to update so this is how LDA works it looks at every word and the topic assignment so like bananas let's say it's assigned to topic B it's gonna look at that and say should I reassign that to topic A or should it stay topic B and the way it decides whether it should reassign it or not is it'll look at that word and it'll see how often does that topic that it's assigned to occur in that document so right now banana is assigned to animals how often does animals actually occur in that document that's the first thing it looks at and then it looks at that 
word and sees how often does that word occur in that full topic overall so how often does banana occur in the topic about animals and in those cases those probabilities are both pretty low and so it's saying okay well I should reassign that to the other topic then so we update that topic assignment to topic A which is about fruit or food so that's like the bulk of what's happening so what's gonna happen is it's gonna go through every single word every single document and then try to figure out which topic it should be part of it might end up in both great question it would depend on your corpus but it could absolutely be part of both so what happens at this point is it goes through multiple iterations of this so it does the random assignments it'll go through every single word in the whole corpus and that'll be iteration one and then all the topics will be slightly better and then we'll go through it again iteration two so you should try like probably a couple dozen iterations and then you'll see the topics evolve and eventually the topics will start making sense hopefully if they don't make sense then you're gonna have to do some more data cleaning so this all seems very complex but luckily Gensim does this whole part for you so all you have to do is you have to say how many topics you want how many iterations you want to go through and do the interpretation so our input is our document term matrix our number of topics and our number of iterations Gensim will find those probability distributions that are best and then the output is you'll see for every topic what are the top words in that topic and then you have to figure out do these make sense or not and so this here is probably one of the most famous implementations of topic modeling latent Dirichlet allocation it was introduced in a 2003 paper by David Blei and also Andrew Ng if you've heard of him from Coursera so they all created this so this is one way to do topic modeling which is 
specific to text data but you can also use general matrix factorization techniques so those of you who know a little bit more about linear algebra so that's basically you take a big matrix and you can decompose it into two or three matrices and that's another way of finding topics because you take let's say like a bunch of columns and when you decompose it you can squeeze that down into smaller columns and those end up being your topics so those techniques are called LSI which is essentially a singular value decomposition for text if you know what that is and then NMF which is non-negative matrix factorization those are also included in Gensim so you can use that as well but I think LDA is I like how it's specific to text so that's the one I like to teach okay so let's run through topic modeling in the Jupyter notebook okay so with topic modeling you'll see that this is absolutely an iterative process so I have attempt one two and three here because my topics didn't make sense and it takes a long time to try to make them make sense so what I'm gonna do here is first I'm gonna import the document term matrix because again order doesn't matter because it's a bag of words model and then all of this here like these three cells here these are specific to using Gensim so I'm gonna import some modules from Gensim I'm also going to import so I've commented this out for now you can also import logging so logging will help you debug so when you create a topic model it'll log what's happening at every step and you can actually look at the hyper parameters and tune them some more so that's for you to look at later and then at this point for these couple cells here what I'm doing is preparing for LDA so here the first thing I have to do is for LDA it actually requires a term document matrix so the document term matrix we've been working with I've transposed it to a term document matrix and specifically for 
LDA you have to convert it into a sparse matrix first and then into a specific Gensim corpus so a sparse matrix is basically so here we have this matrix lot of zeros it's not the most efficient way to store that matrix what a sparse matrix will do is saying for this row this column there's a value in there and here's the value so it stores it in that format and then that's just the format that Gensim reads it in and then Gensim also requires a dictionary of every single term and their location in that document term matrix and you can get that from the count vectorizer so if you remember our count vectorizer it was a big document term matrix remember it has that vocabulary items thing in it that has for every single column what's the ID for that column it's just something that LDA requires so at this point we've done all the setup this is actually where we're running LDA so LDA here you can see it requires four things as input so the first is our corpus which in this case is actually our term document matrix but they call it a corpus id2word which is our dictionary of for this term where is its location in that term document matrix our number of topics which will start at 2 and then our number of passes which will start at 10 and the reason I'm starting at 10 is 10 is actually not that many passes to make all those updates but just for this tutorial I want to make sure we have some time to run through all this so if you run those three it's gonna take a while because it's going through every single term in your entire corpus and making those updates so at the end of the day this is what so you see these are the two topics that emerged so for this first topic these are all the words there and the second topic you see here anything interesting from this yeah absolutely so you start to see those overlaps in those topics so when I look at this I don't find this super meaningful and so at this point I did a lot of work to try 
to figure out how to make this better the first thing I always do is I try to increase the number of topics but that didn't help me too much I saw so much overlap between the topics and I was like I wish I could do better so the second thing I did was I looked at nouns only and so this is a very popular trick that you can use it doesn't it doesn't specifically weight nouns yeah it doesn't do that automatically because it looks at all the words the same but yeah so it's very yeah yeah yeah which is why we filter by them so it's like the easy clean way to do exactly what you're saying like look at the topics by nouns so the way I'm doing that here is from NLTK I'm gonna import this parts of speech tagger and what I'm gonna do is for my text I'm only gonna look at nouns and the way I know that nouns is NN is because UPenn their linguists have created all these tags so you see there's a lot of manual work here like when I took my first text analytics class that's when I learned that it's not all magic it's a ton of manual work people have spent a lot of time labeling this stuff so let's do this just for nouns so at this point I'm gonna run through this because we're running a little bit low on time but you'll see here that once I filter it so it's just the nouns some of these words are a bit more meaningful and so if I scroll down here you'll see these topics a little bit more about nouns they're getting a little bit more meaningful there yeah I'll talk a little bit about spaCy at the end but you can absolutely use spaCy for that as well I could see that cuz spaCy is like the newer version of NLTK and could do a lot of the same things but better and faster so that's probably yeah you could remove the apostrophe I guess they tagged it yeah they tagged it as a noun no I don't I don't really see it clearing up either 
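The nouns-only filter described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the helper name `keep_tags` and the sample sentence are made up here, and the (word, tag) pairs are written by hand so the snippet runs without any downloads. In practice pairs of this shape would come from something like `nltk.pos_tag(nltk.word_tokenize(text))`, which needs NLTK's tokenizer and tagger resources to be downloaded first.

```python
# Hand-written (word, tag) pairs standing in for part-of-speech tagger output.
# The tags follow the Penn Treebank tag set: NN/NNS are nouns, JJ is an adjective.
tagged = [
    ("my", "PRP$"),
    ("husband", "NN"),
    ("likes", "VBZ"),
    ("funny", "JJ"),
    ("jokes", "NNS"),
]


def keep_tags(tagged_words, prefixes=("NN",)):
    """Keep only words whose tag starts with one of the given prefixes.

    Hypothetical helper for illustration; returns the kept words joined
    into one string, ready to be fed back into a count vectorizer.
    """
    prefixes = tuple(prefixes)
    return " ".join(word for word, tag in tagged_words if tag.startswith(prefixes))


nouns_only = keep_tags(tagged)
print(nouns_only)
```

Matching on the tag prefix rather than the exact tag means plural and proper nouns (NNS, NNP) are kept along with NN, and passing a different tuple of prefixes changes which parts of speech survive the filter.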
so I didn't see it really clearing up either so the next thing I did my attempt three was including nouns and adjectives it was like not just about nouns but like how do people feel about those nouns so at this point I've included NNs and JJs again I know that because it's from the UPenn tag set and then I ran through this a couple times and this was starting to make a little bit more sense to me but at this point I'd spent like a day on this I'm like I think this is the best I'm gonna get within a day so at this point I decided okay out of all my topic models I think nouns and adjectives worked pretty well and I settled on four topics because it was as good as I could get for that day and so my final LDA model what I did was I ran my topic model for four topics and I ran it for a lot of iterations and the reason for that is so that you've reached that steady-state over time and it can get more fine-tuned topics so what I like to do is try a lot of things for a few iterations and then once I zone in on the best one I'll do that one for a lot of iterations yeah these are the four topics and yeah so like the first one was like talking about mom and parents second one was talking about husbands and wives this was talking about guns and then this one is a lot of profanity and so those are the topics that I ended up with definitely could be better but it's a first pass yeah absolutely yeah you can absolutely like just tag things with topics as well so this you'd find useful when you have really large documents that you have like no idea what they're about and then you just can pull out so LDA you would always do as unsupervised or if you know what the topics are you can go the supervised route and say you already know what these topics are and so then you can just see like what words tend to be part of that topic not using LDA yeah so all the topic modeling techniques are unsupervised including matrix factorization it's all 
unsupervised yep oh I think that's oh within the error message it says you have to download something or import something just add that import to the line above it I think there were two that you had to go through so there's only two so you've imported NLTK but there's a lot more that you have to do and NLTK has a ton of stuff in it and there's more you had to download yeah you have to do that two times in a cell before and then it should work okay so yeah so yes Gensim was able to figure out this topic contains all these words and then you have to figure out what those topics are so it's not so much a compression as like a mix of words or a probability distribution so it's like figuring out what the most common words are in that probability distribution and then it's like doing a better job of curating that list so the food label that doesn't come from Gensim you have to label it as food yeah okay so at the end of the day the other thing you can do is now that you've come up with the topics for every comedian you can see what topic they talk about and so it just ended up happening that every comedian only had one topic they were assigned to but one document one comedian can actually contain multiple topics but in this case every comedian is assigned one topic and you see that Ali Wong is part of that second group that talked about husbands and wives along with Mike Birbiglia and John Mulaney all right yeah which is pretty amazing with just like a short amount of time with this analysis okay so we have about seven minutes left I'm not gonna go through this one in detail but I think it's pretty cool so text generation your input into this is a corpus so order matters right because you're trying to figure out what the next word is and there's nothing special in Python that we're gonna do here for text generation but the output would be a new comedy routine so this 
is purely for fun and the way you can do this is using something called Markov chains so Markov chains are basically a way of showing how states change over time so the main assumption of Markov chains is that the next state of a process only depends on its previous state so instead of going through this example I will just go through the text example so with the text example how it works is it takes every word as a state and then it looks at the next word and it says how likely is this going to be the next word based on this word and then for this word how likely is this gonna be the next word based on this word and then you can see it can even wrap around like if this is a word there's a 10% chance that this is gonna be the next word so the idea of Markov chains is the next state is only based on the current state so in Python you can make this a dictionary and you can just say the keys are the current words and then the values are a list of the next words so it's a very easy way for text generation it seems like a pretty simple model it is and you end up with some really funny comedy routines at the end so I'll let you discover that on your own in the notebook but if you want to do this in a more complex way then you would have to look at something like deep learning so with Markov chains you're only looking at the current state but with deep learning you're also looking at not just a prior word but words before that and then with deep learning it also predicts those words and so it includes those predictions as well so specifically for text generation you would look at long short-term memory okay so we don't have time to go to that notebook but I encourage you to look at it yourself and then I want to summarize all this by I won't ask you what you learned today sorry we don't have time for a discussion but this is what you did learn today you learned the data science workflow how to complete this end-to-end project all these things here and then 
my motto and you can think this is what we learned today here in this column and then this is what you can go into next so instead of just looking at a document term matrix look into tf-idf it's the next thing to look at so instead of word counts it weights some of the rare words higher and then these are other data visualization tools we talked about classification techniques like naive Bayes and logistic regression more topic modeling techniques and then deep learning I also want to talk about the libraries so we used a little bit of NLTK with the parts of speech tagging but NLTK has a ton of built-in functionality so parts of speech tagging lemmatization and so on TextBlob is built on top of that makes it easy to use Gensim is for topic modeling and you mentioned spaCy so spaCy is like the next new thing so a lot of my students are using spaCy now it's a lot faster than these other libraries and I feel like it's gonna replace NLTK one day I don't know okay so to summarize this was our question for today what makes her routine stand out and these are all the techniques that we used and so now we can answer that question what makes her routine stand out well we saw that she talks about her husband and I talk about my husband a lot in my lectures I mentioned him once and then but yeah I have a lot of lessons and then she has the highest S word to F word ratio which I don't like the F word at all so I don't mind the S word so and then she tends to be pretty positive and opinionated like me so who are some other comedians I might like so these people don't say the F word that often Mike Birbiglia doesn't at all he says no swear words these people all have that similar sentiment flow these people talk about similar topics and you can see Mike Birbiglia comes up a lot I just want to mention I finished this analysis on Thursday and then I got this text from my husband as I was looking at the comic I say which one and he says Mike 
Birbiglia and I was like according to my analysis he is most similar to Ali Wong so my analysis worked yes no but I just that would be so funny Mike Birbiglia just randomly I got that text two days ago amazing right so next steps so try the text generation notebook it's super cool apply this analysis to other comedians and then also try this on your own text data yep oh interesting probably because I don't think they're yeah I'm thinking their recommendation engine probably doesn't actually look at transcripts yeah absolutely so I just wanted to finally give a shout-out to my company Metis because they gave me time over the last two weeks to actually create all this we are a data science training company so we do live online trainings where you have a professor and a TA teaching you and then we do in-person boot camps which I'm teaching in the fall and that's it thanks so much for coming [Applause]
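The Markov chain text generator described in the talk (a dictionary whose keys are the current words and whose values are lists of the words that followed them) can be sketched roughly as below. This is a minimal illustration under stated assumptions rather than the notebook's code: the function names `build_chain` and `generate`, the `seed` parameter, and the sample sentence are all invented for this sketch.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that immediately follow it."""
    words = text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)


def generate(chain, start, length=10, seed=None):
    """Walk the chain from `start`, picking each next word at random.

    The next word depends only on the current word, which is the Markov
    assumption. Stops early if the current word has no recorded followers.
    """
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Made-up sample text; a real run would use a comedian's full transcript.
sample = "i like my husband and my husband likes my jokes"
chain = build_chain(sample)
generated = generate(chain, "my", length=8, seed=0)
print(generated)
```

Because repeated followers are stored as duplicates in each list, picking uniformly from the list reproduces the observed transition frequencies without computing probabilities explicitly.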
Info
Channel: Zohan Syah Fatomi
Views: 1,034
Keywords: zohan syah fatomi, python, nlp, python 3.8.2, natural language, natural language python, natural python, natural language processing, python alice zhao, alice zhao nlp, alice zhao mentis, alice zhao
Id: 8Fw1nh8lR54
Length: 111min 3sec (6663 seconds)
Published: Wed Apr 22 2020