NLP - Text Preprocessing and Text Classification (using Python)

Video Statistics and Information

Captions
Hi, my name is Andre, and this week we will focus on the text classification problem. Although the methods we will overview can be applied to text regression as well, it will be easier to keep the text classification problem in mind. As an example of such a problem we can take sentiment analysis: you have the text of a review as input, and as output you have to produce the class of its sentiment. It could be two classes, like positive and negative, or it could be more fine-grained, like positive, somewhat positive, neutral, somewhat negative, and negative, and so forth. An example of a positive review is the following: "The hotel is really beautiful. Very nice and helpful service at the front desk." We read that and understand that it is a positive review. As for a negative review: "We had problems getting the Wi-Fi working. The pool area was occupied with young party animals, so the area wasn't fun for us." It is easy for us to read this text and understand whether it carries positive or negative sentiment, but for a computer that is much more difficult.

We will start with text preprocessing, and the first thing we have to ask ourselves is: what is text? You can think of text as a sequence, and it can be a sequence of different things. It can be a sequence of characters, which is a very low-level representation of text. It can be a sequence of words, or of higher-level features like phrases ("I don't really like that" could be a phrase) or named entities like "the history museum" or "the museum of history". And it could be bigger chunks like sentences or paragraphs, and so forth.

Let's start with words and define what a word is. It seems natural to think of a text as a sequence of words, and you can think of a word as a meaningful sequence of characters. In English, for example, it is usually easy to find the boundaries of words, because we can split a sentence on spaces or punctuation and all that is left are words. Look at the example: "Friends, Romans, countrymen, lend me your ears;" It has commas, it has a semicolon, and it has spaces, and if we split on those we get words that are ready for further analysis: "friends", "Romans", "countrymen", and so forth. It is more difficult in German, because German has compound words that are written without spaces at all. The longest such word still in use (you can see it on the slide) stands for insurance companies which provide legal protection, and for the analysis of such texts it can be beneficial to split that compound word into its parts, because each of them makes sense on its own; they are simply written without spaces. Japanese is a different story: it doesn't have spaces at all, yet people can still read it. Even the example at the end of the slide, an English sentence written without spaces, is not a problem for a human being.

The process of splitting an input text into meaningful chunks is called tokenization, and each chunk is called a token. You can think of a token as a useful unit for further semantic processing: it can be a word, a sentence, a paragraph, or anything else. Let's look at a simple whitespace tokenizer: it splits the input sequence on whitespace, which could be a space or any other character that is not visible. You can find such a whitespace tokenizer in the Python library NLTK. Let's take the example text "This is Andrew's text, isn't it?" and split it on whitespace.
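As a minimal sketch of that step, here is NLTK's whitespace tokenizer applied to the example sentence (this assumes the nltk package is installed):

```python
from nltk.tokenize import WhitespaceTokenizer

text = "This is Andrew's text, isn't it?"

# Split only on whitespace; punctuation stays attached to the neighbouring words.
tokens = WhitespaceTokenizer().tokenize(text)
print(tokens)
# Typically: ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']
```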
What is the problem here? Look at the tokens that are left after this tokenization. The last token, "it?", has essentially the same meaning as the token "it" without the question mark, but if we compare them they are different tokens, and that might be an undesirable effect; we might want to merge these two tokens because they mean the same thing. The same goes for "text," versus plain "text". So let's also split on punctuation, and for that purpose there is a WordPunctTokenizer ready for you in the NLTK library. This time we get something like the following: the apostrophes become separate tokens, and we get "s", "isn", and "t" as separate tokens as well. The problem is that these tokens don't carry much meaning: it doesn't make sense to analyze the single letter "t" or "s"; they only make sense combined with the apostrophe or the previous word. So we can instead use a set of rules, or heuristics, which you can find in the TreebankWordTokenizer. It uses the grammar rules of the English language to produce a tokenization that actually makes sense for further analysis, and it is very close to the perfect tokenization we want for English. "Andrew" and "'s" are now different tokens, and "'s" is kept as a token of its own, which makes much more sense, as are "is" and "n't", because "n't" means "not": it negates the previous token.

Let's look at the Python example. You just import nltk, you have some text, you instantiate a tokenizer such as the WhitespaceTokenizer, call tokenize, and you get the list of tokens. You can use the TreebankWordTokenizer or the WordPunctTokenizer that we reviewed above in exactly the same way, so it is pretty easy to do tokenization in Python.
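A small sketch comparing the two punctuation-aware tokenizers just described, again with NLTK; the outputs shown in the comments are what these tokenizers typically produce for this sentence:

```python
from nltk.tokenize import WordPunctTokenizer, TreebankWordTokenizer

text = "This is Andrew's text, isn't it?"

# Splits on punctuation as well: "isn't" falls apart into 'isn', "'", 't'.
print(WordPunctTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

# Uses English-specific rules: "isn't" becomes 'is' + "n't",
# "Andrew's" becomes 'Andrew' + "'s".
print(TreebankWordTokenizer().tokenize(text))
# ['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']
```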
The next thing you might want to do is token normalization. We may want the same token for different forms of a word: "wolf" and "wolves" refer to the same thing, and we may want to merge them into a single token, "wolf". Another example is "talk", "talks", "talked": maybe it is all about the talk, and we don't really care what ending the word has. The process of normalizing words is called stemming or lemmatization. Stemming is the process of removing or replacing suffixes to get to the root form of the word, which is called the stem; it usually refers to heuristics that chop off suffixes or replace them. Lemmatization is a different story: when people talk about lemmatization they usually mean doing things properly, with the use of vocabularies and morphological analysis. This time we return the base or dictionary form of a word, which is known as the lemma.

Let's see examples of how this works. For stemming there is the well-known Porter stemmer, the oldest stemmer for the English language. It has five heuristic phases of word reductions, applied sequentially. The phase 1 rules are pretty simple; you can think of them as regular expressions. When you see the suffix SSES, you replace it with SS, so a word like "caresses" is successfully reduced to "caress". Another rule replaces IES with I; for "ponies" it works in a way, but what you get as a result, "poni", is not a valid word because it ends with "i". That is a problem, but the stemmer works well in practice, and you can find it in the NLTK library as well. Let's see how it behaves on a few examples: for "feet" it produces "feet", so it doesn't know anything about irregular forms; "wolves" produces "wolv", which is not a valid word but can still be useful for analysis; "cats" becomes "cat", and "talked" becomes "talk". The problems are obvious: it fails on irregular forms and it produces non-words, but that might not be much of a problem in practice.

Another option is lemmatization, and for that purpose you can use the WordNet lemmatizer, which uses the WordNet database to look up lemmas; it can also be found in the NLTK library. The examples are the following: this time the word "feet" is successfully reduced to the normalized form "foot", because it is in the database; we know the words of the English language and their irregular forms. "Wolves" becomes "wolf", "cats" becomes "cat", and "talked" stays "talked", so nothing changes there. The problem is that the lemmatizer doesn't handle all word forms: for nouns the lemma is the singular form of the noun, but verbs are a different story, and that might prevent you from merging tokens that have the same meaning. The takeaway is the following: we need to try both stemming and lemmatization and choose what works best for our task.

Let's look at the Python example. We import nltk, take some text, and the first thing we need to do is tokenize it; for that purpose let's use the TreebankWordTokenizer, which produces a list of tokens. Now we can instantiate a PorterStemmer or a WordNetLemmatizer and call stem or lemmatize on each token of our text, and we get the results we reviewed on the slides. So this is also pretty easy to do in Python with NLTK.
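A sketch of the stemming and lemmatization examples just described, using NLTK's PorterStemmer and WordNetLemmatizer (the lemmatizer needs the WordNet data, downloaded here on first use); the outputs in the comments are what these classes typically return:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # data required by the WordNet lemmatizer

words = ["feet", "wolves", "cats", "talked"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops or replaces suffixes heuristically and may produce non-words.
print([stemmer.stem(w) for w in words])
# ['feet', 'wolv', 'cat', 'talk']

# Lemmatization looks words up in WordNet; by default it treats tokens as nouns,
# so the verb form 'talked' is left unchanged.
print([lemmatizer.lemmatize(w) for w in words])
# ['foot', 'wolf', 'cat', 'talked']
```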
What can you do next? You can further normalize those tokens, and there are a bunch of different problems here; let's review some of them. The first problem is capital letters. You can have "us" and "US" written in different forms. If both occurrences mean the pronoun, it is safe to reduce them to the lowercase word "us". It is a different story when "US" in capital form is the noun, the country, and we need to distinguish the two somehow. And remember that we are doing text classification, say sentiment analysis: it is easy to imagine a review written entirely in caps lock, where "US" could actually be "us", the pronoun, and not the country, so this is a tricky part. We can use heuristics for the English language. We can lowercase the word at the beginning of a sentence, because every sentence starts with a capital letter, so it is very likely that it should be lowercased. We can also lowercase words that are seen in titles, because in English titles every word is usually capitalized, so we can strip that. And we can leave mid-sentence words as they are, because if a word is capitalized somewhere inside the sentence, that may mean it is a name or a named entity, and we should leave it as it is. Or we can go a much harder way and use machine learning to recover the true casing, but that is out of scope for this lecture, and it might be a harder problem than the original problem of sentiment analysis.

Another type of normalization that you can apply to your tokens is normalizing acronyms, like "eta", "e.t.a.", or "ETA" written in capital form: these are all the same acronym, ETA, which stands for estimated time of arrival, and people might frequently use it in their reviews or chats or anywhere else. For this we can write a bunch of regular expressions that capture the different representations of the same acronym and normalize them. That is a pretty hard thing to do, though, because you must think in advance about all the possible forms and all the acronyms you want to normalize. A small illustrative sketch of these casing and acronym heuristics follows at the end of this transcript.

Let's summarize. We can think of text as a sequence of tokens, and tokenization is the process of extracting those tokens. A token is a meaningful chunk of our text: it could be a word, a sentence, or something bigger. We can normalize those tokens using either stemming or lemmatization, and you actually have to try both to decide which works best. We can also normalize casing, acronyms, and a bunch of other things. In the next video we will transform the extracted tokens into features for our model.
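Below is the small illustrative sketch referenced above for the casing and acronym heuristics; the function names and the ETA pattern are hypothetical examples written for this summary, not NLTK APIs:

```python
import re

# Hypothetical helper pattern for illustration: one such regex is needed for
# every acronym you want to normalize ("eta", "E.T.A.", "ETA", ...).
ETA_PATTERN = re.compile(r"\be\.?t\.?a\b\.?", re.IGNORECASE)

def normalize_acronyms(text):
    # Map every spelling variant of the acronym to one canonical form.
    return ETA_PATTERN.sub("ETA", text)

def normalize_casing(sentence_tokens):
    # Heuristic 1: lowercase the sentence-initial word, since it is capitalized
    # only because it starts the sentence.
    # Heuristic 2: leave mid-sentence capitalization alone, because it may mark
    # a name or another named entity.
    return [tok.lower() if i == 0 else tok for i, tok in enumerate(sentence_tokens)]

print(normalize_acronyms("E.T.A. is 10 minutes, but the eta was wrong"))
# -> 'ETA is 10 minutes, but the ETA was wrong'
print(normalize_casing(["The", "hotel", "near", "Hyde", "Park", "was", "beautiful"]))
# -> ['the', 'hotel', 'near', 'Hyde', 'Park', 'was', 'beautiful']
```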
Info
Channel: Machine Learning TV
Views: 60,834
Rating: 4.9012346 out of 5
Keywords: NLP, Machine Learning, Text Preprocessing, Artificial Intelligence, Natural Language Processing, Python
Id: nxhCyeRR75Q
Length: 14min 31sec (871 seconds)
Published: Wed Jul 18 2018