Python for Data Analysis: Working With Text Data

Video Statistics and Information

Captions
Hello everyone. In this lesson we're going to learn more about how to work with text data in Python. Text data cleaning and preprocessing can be one of the more time-intensive tasks involved in preparing a data set for analysis, because text data can have a lot of irregularities in it that may need to be corrected before it is useful for analysis. So in this lesson we're going to go over a variety of different functions that you can use for working with text data, which in Python are known as strings.

We're going to start off by loading in a data set on Kaggle. This is actually going to be a data set of Reddit user comments related to the NBA basketball team the Minnesota Timberwolves, because that is the team from the state where I am from, and this provides an example of how messy some text data might be. So we just extracted eight comments from this subreddit, and we can see that there are a lot of different types of text in here. This one just looks kind of like a normal comment, this one looks like it is some kind of embedded web link, and this one here is some kind of emoji where there aren't any actual normal letters in it. So to use this sort of data for an analysis, there's a good chance you're going to want to do some text preprocessing steps to figure out what elements of these strings are even worth looking at.

So we'll go over a variety of functions available in the pandas package that you can use for doing operations on text data. First, there is a pandas function, .lower(), that will change all text to lowercase. So for instance, if we look at the first comment in our data set, so that's comments[0], and then we run .lower() on it, it will convert that entire comment to all lowercase. Similarly, we could use .upper() to make things uppercase. So this is going to take all of the comments, comments.str (for string) .upper(), and we'll just look at the first eight, but this is converting all of the comments to all uppercase.

You can also use .str.len() to check the length of all
the different strings. So if we run comments.str.len(), and we check the first eight of those with .head(8), we're getting the length of the first eight comments.

Pandas also has string splitting and stripping functions, so we'll show how to use some of those. The split function takes a string, splits it up into a bunch of substrings based on some separator, and puts them into a list. So in this case we're going to run comments.str.split(), and the argument in here is what we want to split on; in this case we're going to just split on blank spaces. This is a way you can create a list of all of the words in a string, provided those words are separated by spaces.

The .strip() method will take a given character off of the front or end of a string. So here we can, for instance, run comments.str.strip() and pass in square brackets, and that will strip off any opening or closing brackets at the beginning and end of a string. You can see this comment here, comment number one, used to have an opening bracket here, because here's the closing one, and that has been successfully stripped off. And if you pass in a blank space here like this, you can use strip to strip off whitespace. That is probably the most common reason to use strip, but you can pass in different characters like this to strip those off as well.

Now, if you have a pandas Series of several different strings, you can combine them all into one long string using .cat(), for concatenate. To show how to do that here, we can take our comments and say .str.cat(), and that will take them all and paste them essentially into one gigantic long string. So we'll run that, and we're only going to look at the first 500 characters here, because if we actually printed that whole thing to the screen it would be really long.

You can slice parts of strings in pandas in an element-wise fashion using .str.slice(), so we'll show how to do that. We'll take our comments, then say .str.slice(), and this is going to take
a slice from the zeroth index up to the tenth index for every single comment. So for every single comment in our series of comments here, we're going to get the first 10 characters only. Let's run that and see that we've sliced off the first 10 characters. Alternatively, we could use our normal indexing operations to take slices of the same sort. So we could have just said comments.str and then used our indexing slice, so use the square brackets to index into the strings and go from 0 to 10, and then do .head(), and that would have got us the same thing.

Now, if you want to take a slice of a string and replace it with something else, you can do that too, with the .str.slice_replace() method. We'll show how to do that. We'll take our comments.str and then say .slice_replace(), and for the arguments you just put in the slice as the first two arguments. So we're going to slice from index 5 to 10, and we'll cut that out and replace it with this "Wolves Rule". So all of these comments should now have this "Wolves Rule" added into them.

Now, if you want to replace something within a string, but you don't want it to be based on index positions like this, you want it to be based on the actual content, so you want to replace a given word with another word no matter where that word appears in the string, you can do that with the .str.replace() method. We'll show how to do that here. We're going to do comments.str and then .replace(). The first argument here is what you want to replace, so we're going to replace the term "Wolves", and then, comma, the second argument is what you want to replace it with, so we're going to replace "Wolves" with "Pups". When we run that, any instance of "Wolves" has been replaced by "Pups". You can see in this first comment, before it said "the T-Wolves..."; well, now it says "the T-Pups...".

Now, a common operation when working with strings is to test whether a bit of text contains a certain substring. You can do that using the .str.contains()
function. Here we're going to say comments.str.lower(), so we're making everything lowercase, then we're going to say .str.contains(), and now we're going to pass in the thing we're trying to detect in all the strings and see if each comment contains it. We're going to search for "wig", then this bar, which means "or", and then "drew". So basically we're trying to search for comments about Andrew Wiggins, who is a basketball player on the Timberwolves basketball team. "wig" is a part of his last name and "drew" is a part of his first name, so any comment that contains either of these constructions will return True for this. We're going to save the result of this function call as a logical index, where every entry in the index is True when the comment contains one of these and False when it doesn't. Then, after running that, we can use this index to grab the comments where it's True by using it as an index back into the data. So we can take the index we created and then say comments, use the logical index to grab the comments that are about Andrew Wiggins, and then we'll say .head(10) just to look at the first 10 of them, to confirm that they all do contain some information about Andrew Wiggins, or at least that they contain one of these substrings. I suppose you could have a comment that contains one of these substrings and is not actually about him, but we'll see what we get here.

Now, for interest's sake, we could use this data to calculate something we might be interested in, such as what proportion of all the comments about this NBA team are about this particular player. We could do that by taking the length of the comments that are about Andrew Wiggins, so the ones that conform to our logical index, divided by the length of all the comments, and we can see that approximately 6.6 percent of all the comments on this subreddit have something to do with this player.

Pandas has a few more useful string functions, but before we go any further learning them, we should touch on regular expressions.
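The pandas string methods covered up to this point can be sketched in a few lines. This is a minimal illustration rather than the notebook from the video: the `comments` Series is a made-up stand-in for the Reddit data, and all variable names are assumptions. Note that `comments[0].lower()` in the lesson is Python's built-in string method applied to one element, while `comments.str.lower()` is the element-wise pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the Reddit comments used in the video
comments = pd.Series([
    "The T-Wolves need Wiggins to step up",
    "[Highlights](https://example.com/clip) from last night",
    "  Andrew had a great game  ",
    "Wolves win! Go Wolves!",
])

lowered = comments.str.lower()          # element-wise lowercase
lengths = comments.str.len()            # number of characters per comment
words = comments.str.split(" ")         # list of space-separated pieces
trimmed = comments.str.strip(" ")       # drop leading/trailing spaces
first_ten = comments.str.slice(0, 10)   # same result as comments.str[0:10]
combined = comments.str.cat()           # one long concatenated string

# Literal replacement; regex=False keeps the pattern from being read as a regex
renamed = comments.str.replace("Wolves", "Pups", regex=False)

# Boolean ("logical") index: True where a comment contains either substring
about_wiggins = lowered.str.contains("wig|drew")
share = len(comments[about_wiggins]) / len(comments)

print(comments[about_wiggins])
print(f"{share:.1%} of the sample comments match")
```

On this toy sample two of the four comments match, so the printed share is 50.0%; the 6.6 percent figure in the lesson comes from the full data set.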
A regular expression is a sequence of special characters that lets you define a pattern that can match strings of different lengths that are perhaps made up of different characters. For instance, when we used the .str.contains() function above, we supplied it with what is actually a regular expression. When we wrote "wig", then bar, then "drew", that is a regular expression that will match any string that contains either "wig" or "drew". That bar is a special character in this case, telling the regular expression to match either of those things. But regular expressions have a lot of other special characters that allow you to match strings of a lot of different types, so they can be very useful for various string processing tasks.

For example, the period character in regular expressions is a metacharacter that will match anything other than a newline. So we could, for instance, use the period to match any word that ends in "ill". We will show an example of this. We're going to make a pandas Series of some different words, and we can see that all of these words end in "ill" except for "gull". So if we wanted to write a regular expression that could match all of the words that end in "ill", we could put a period first, so that's saying match any character, and then match exactly "ill". When we run .str.contains() here, it should match all of the ones that end in "ill" regardless of the starting character, because the period is matching any character that comes before the "ill", but "gull" will not be matched, because it doesn't have "ill" in it; it has "ull".

Now, square brackets in Python regular expressions let you specify a set of different characters to match. So for instance, if we used them here and enclosed capital T and lowercase t, this regular expression will then match a capital T or lowercase t followed by "ill". When we run that, we see we get two Trues: one for "till", because it starts with a t and then ends in "ill", and one for "still", because it does have a t before the "ill". It isn't capitalized, but
because we enclosed these two different forms of t, that was matched as well.

There are a lot of special characters in regular expressions that allow you to match different things. I have written down some of the more common and useful ones here, so I'll just quickly go through and show you what some of these do. If you include the range a through z in square brackets, that will match any lowercase letter. Capital A through Z will match uppercase letters. 0 through 9 in the brackets will match any digit. This construction here will match lowercase letters, uppercase letters, and numbers. And adding the caret symbol at the beginning within square brackets matches characters that are not in the set. So if we instead start this off with the caret and then a through z, well, that's matching any character that is not a lowercase letter.

Outside of square brackets, the caret symbol will search for matches at the beginning of a string. For instance, if we make a series with some new strings in it, "Where did he go", "He went to the mall", and "he is good", and we want to match strings only where "he" occurs at the very beginning of the string, we can do that using the caret. So we'll say .str.contains() and then use the caret to say we want to match at the beginning, and we'll say we want to match "he" with an uppercase or a lowercase h. When we run that, it should match the second one and the third one, because "he" occurs at the beginning of both of those, but it will not match the first one, even though "he" does appear within the string, because it's not occurring at the very beginning. So let's just run that and see that we do get False, True, True there. The opposite of the caret character in this case is the dollar sign character, which will search for matches at the end of a string.

Now, parentheses in regular expressions are used for grouping and enforcement of the proper order of operations. An asterisk will match zero or more copies of a preceding character, a question mark
matches zero or one copy of a preceding character, and a plus will match one or more copies of a preceding character. To give an example of how we could use these, let's make some new strings here that contain some a's and b's, and now we can run .str.contains() and use a regular expression to capture some of them. Let's walk through what this regular expression is doing. We're saying a star, which means match zero or more a's; then it's saying b, match a single b; and then period plus. Well, the period is a special character that matches anything other than newlines, so period plus means match one or more of any characters after that, other than newlines.

So let's walk through and see which of these strings this should match. Well, here we have zero or more a's, we have a single b, and then we have one or more of something after it, so this string should be matched. Here we have a single b. Well, a single b does conform to zero or more a's and then a single b, but it doesn't have anything after it, and here we need at least one of something after it, so this will not be matched. The "aa" will also not be matched, because our regular expression requires a single b. And both of the two remaining ones will be matched, because they contain a single b followed by at least something, and the optional a is there but it doesn't have to be there. So we run that, and we see that we get True, False, False, True, True; that matches up with what we thought it would be.

Now we'll go through a couple more special characters you can use in regular expressions. Curly braces will match a preceding character a specified number of repetitions. If you use curly braces with m, then the preceding element is matched m times; if you use curly braces with m and then a comma, the preceding element is matched m or more times; and if you use curly braces with m, comma, n, the preceding element is matched between m and n times. So these curly brace constructions just give you a little bit more control over how many times you're
matching a letter, as opposed to just using, say, period plus, which is saying match it an unlimited number of times.

Now, one final thing we will note about regular expressions is that, since they use a lot of special characters like period to denote things, what happens when we actually want to match a special character itself? For instance, what if we want to match a period? We can't just write period, because a period is a special character that matches anything. So if we want to actually match a period, we have to use backslashes to do that. In regular expressions, backslashes are an escape metacharacter, so if you want to match an actual metacharacter, you put a backslash before it, and that will escape the normal behavior of that metacharacter and just turn it back into the normal version of itself, where you can actually match it in a string.

We'll give an example of how we might do that here. We're going to provide some new strings that have some periods in them. If we wanted to match the periods themselves, we'd have to, instead of saying just match period, match backslash period, and the backslash says we actually want to match period the character and not use this period as a metacharacter. So when we run this, we should be matching things that contain a period and then a space after it, which means in this case we're going to be matching "Mr. Ed" and "Dr. Mario", because they both have period space.

And if you want to use a regular expression to detect backslashes, well, backslash itself is a special character, so to escape a backslash you need to write a backslash to escape itself, but then you also need an escape to escape that backslash. It's kind of hard to even think about, but if you want to do that, you need to write four backslashes. So if we wanted to match the backslash character here in "Miss\Granger", we could do that by writing four backslashes here, and when we run that, we see that we do return True, because it has detected that backslash.

Now, as an alternative to doing this four
backslash construction, which is a bit confusing, we could use a raw string. It's just a special type of string in Python that simplifies some of these oddities when working with regular expressions. To make a raw string, you just preface a string with a lowercase r like this, and if you do that, you actually only need to use two backslashes: one for the character itself, and the other one to escape it, because it is a special character. This will do the same thing. So if you're getting into some complicated regular expressions, it might be a good idea to use raw strings, because that can simplify some of the constructions you might have to write.

Now, thus far we've been using regular expressions within the context of pandas string functions. You can use regular expressions in built-in Python as well, but you have to import the regular expression library to do that. You just do import re, for regular expressions, and there's a host of functions within that built-in package that you can use to run a lot of regular expression operations.

Now, a lot of the pandas string functions we've been using so far accept regular expressions to match strings, instead of just full substrings. Actually, two of the functions we've looked at already accept regular expressions: the .str.contains() and .str.replace() functions both accept regular expressions. So you don't necessarily need to write out full substrings if you want to do more complicated string matching and replacement.

Now, to close out this section, we'll look at a couple more useful string functions in pandas and use a simple regular expression with them. The .str.count() method will count the number of occurrences of a given pattern within a string. So if we say comments.str.count() and we use this regular expression, "wolves" starting with either a capital or lowercase w and then the rest of the word, we can count how many times the word "wolves"
appears in each comment. If we run that, we can see that the first comment here, comment zero, said "wolves" twice. Some of these other comments don't contain it at all, but comment three does, so it's fairly common for Timberwolves-related text to have that term in it.

We can use the .str.findall() method to get all of the matches and return them as a list. So here, if we write comments.str.findall() and we use the same pattern as before, when we run that we see we have been returned a list for each comment with all of the instances of the matched strings. Since that zeroth comment contained two instances of "wolves", we see a list with "wolves" twice.

Now that we've learned a bit about different string functions that we can use in Python, and some about regular expressions, we're going to close out the lesson by writing a slightly more complicated regular expression that might be a little bit more useful and something you might actually want to do on real text data. What we're going to try to do here is write a regular expression that will allow us to identify posts that have web links in them. Web links begin with the characters http or https, so we could start by making a regular expression that matches those specific substrings. To do that, we could say comments.str.contains(), and for our regular expression here we'll just say http and then an optional s; the question mark says the preceding character is optional. Then after that we'll also include the colon character, because that is also in a web link. If we run this, it will create a logical index where it's True when it matches this substring and False when it doesn't match. We could then take that logical index to index back into the comments and filter down to only the comments that contain this construction that matches web links. So let's run that, and then we will both print the length of that, to see how many posts there are that have web links, and we'll also look at the first five comments with web links, just
to see that they do in fact contain what appear to be web links.

Now, it does appear that the comments we matched all contain web links of some kind, but this is a pretty simple way of trying to match web links. If we wanted to be a little bit more specific with what we are doing, we can make a more complicated regular expression. So we'll give an example of trying to find all the different web links in each comment using a new construction. Here we're going to take our posts with web links and say .str.findall(), so findall is actually going to find them and return them all in a list, and then we're going to find this regular expression.

Now, this probably looks a bit confusing, so let's walk through it. The beginning of it is actually the same as what we did before: we're going to match http, an optional s, and then a colon, so that's all the same. After that we have a bunch of stuff in square brackets, but it starts with the caret character. We learned that the caret says match things that are not this, so we're matching things that are not a space, because spaces tend to break up web links; we want to match things that are not a newline character (backslash n is the newline character); and we also want to match things that are not a closing parenthesis. We have a backslash in there to escape the closing parenthesis, because that is a special character for regular expressions, so we have to escape it to make sure that we're actually matching that character itself. Then after this whole square bracket we're saying plus, so that's saying we want to match at least one character that conforms to this, and any number of characters that conform to it, but then as soon as our string shows any of these, a space, a newline, or a closing parenthesis, it will stop and only match up to that point.

So let's go ahead and run this and see what the links we get actually look like. We can see it looks like we did manage to grab what appear to be links. We have some YouTube links here, some things from
Instagram, so this is probably doing a bit of a better job matching web links than the previous, simpler construction would have done, but it might not be perfect. Anytime you're trying to match something that is quite complicated and is something that lots of other people have probably wanted to match in the past, for instance web links or email addresses or somewhat common constructions of that sort, you can probably search the internet and find a fairly good regular expression that someone else has written to match it, and use that instead of trying to write something yourself. Writing something that does a really good job yourself can be a bit tricky if you're not very good at writing regular expressions, and there's no use reinventing the wheel on trying to match something that everybody else has probably tried to match in the past; you can just reuse code that they know works well for that.

Now, it is also worth noting that the regular expression patterns we've learned about in this lesson are specifically for Python. A lot of them will carry over into other languages, but that is not a guarantee; there might be some slight differences depending on what language you're trying to use them with. So these skills will transfer over to other languages to some extent, but just know that some of these constructions could be different if you're using a language other than Python.

So now that we know a bit more about how to work with text data in Python, in the next lesson we're going to turn our attention to numeric data and go through various operations and preprocessing steps you might want to run on numeric data to get it ready for analysis. If you found this video useful, drop a like, hit subscribe, and I will see you again next time.
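The regular expression patterns described in the lesson can also be tried with Python's built-in re module, which the lesson mentions as an alternative to the pandas string functions. The sketch below is an illustration under assumptions, not the code shown in the video: the a/b strings, the sample comment, and the example.com and example.org links are made up, and the patterns are reconstructed from the spoken description.

```python
import re

# "." matches any single character except a newline; "[Tt]" matches T or t
words = ["will", "pill", "till", "still", "gull"]
ends_in_ill = [bool(re.search(r".ill", w)) for w in words]
t_before_ill = [bool(re.search(r"[Tt]ill", w)) for w in words]

# "^" anchors the match to the very beginning of the string
sentences = ["Where did he go", "He went to the mall", "he is good"]
starts_with_he = [bool(re.search(r"^[Hh]e", s)) for s in sentences]

# a* = zero or more a's, b = one b, .+ = at least one more character
ab_strings = ["abc", "b", "aa", "aab1", "bz"]   # hypothetical strings
ab_matches = [bool(re.search(r"a*b.+", s)) for s in ab_strings]

# Escaping: "\." is a literal period; in a raw string "\\" is a literal backslash
titles = ["Mr. Ed", "Dr. Mario", "Miss\\Granger"]
has_period_space = [bool(re.search(r"\. ", t)) for t in titles]
has_backslash = [bool(re.search(r"\\", t)) for t in titles]

# Find every occurrence of "Wolves" or "wolves"
wolves = re.findall(r"[Ww]olves", "Wolves win! Go wolves!")

# Web links: http, optional s, colon, then one or more characters that are
# not a space, a newline, or a closing parenthesis
comment = "Watch this (https://example.com/video?id=1) and http://example.org/page now"
links = re.findall(r"https?:[^ \n\)]+", comment)

print(ab_matches)   # [True, False, False, True, True]
print(links)        # ['https://example.com/video?id=1', 'http://example.org/page']
```

With pandas, the same patterns can be passed to .str.contains(), .str.count(), and .str.findall(); in recent pandas versions .str.replace() treats its pattern as a regular expression only when regex=True is passed.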
Info
Channel: DataDaft
Views: 3,091
Rating: 5 out of 5
Keywords: python text processing, python string processing, pandas string functions, pandas string operations, text proprocessing, string processing, working with text data, text in python, strings in python, text data functions, regular expressions, regular expressions in python, python regular expressions, findall, regular expression patterns, regular expression basics, python data processing, python data munging, python data wrangling, python data manipulation, pandas, python, text
Id: w7zFa3RAdcs
Length: 25min 16sec (1516 seconds)
Published: Fri Jul 24 2020