Tutorial: Keith Galli - Natural Language Processing (NLP) in Python - From Zero to Hero

Captions
Hey, how's it going everyone, and welcome to my PyCon 2020 tutorial on natural language processing in Python. I'm bummed that I couldn't be there in person to share this with all of you, but this online presentation is the next best thing, so I'm excited to get started.

Before we jump into the contents of this tutorial, I thought it would be useful to set the stage. The past two years have been really exciting for natural language processing. Starting in late 2018, a lot of really influential work was being published; each new paper seemed more influential than the last, and performance benchmarks were being broken left and right. I think that period is captured pretty well by a news headline I remember stumbling upon in 2019: "New AI fake text generator may be too dangerous to release, say creators." This happened when OpenAI had built the OpenAI GPT-2 language model and were scared that people would generate all sorts of fake news stories and maliciously use the model, so they initially held back the full set of parameters. They ultimately did release those parameters, but I think that headline captures an important turning point in natural language processing: for the first time we were building models that captured language semantics in a way we never had before, and we were doing things that were crazy, scary, and powerful at the same time, and really impressive. For that reason I'm really excited to share this tutorial with all of you.

In this tutorial we'll start with the basics and some of the more traditional NLP models, then work our way through some miscellaneous NLP techniques, all the way up to state-of-the-art language models like OpenAI GPT-2. We'll start with the NLP fundamentals: how you can take text and convert it into a numerical vector. Probably the easiest way to do that is what's called bag of words, so that's the first model we'll look at; we'll spend a little time on a high-level overview and then jump into Python code. Next we'll look at a slightly different way to convert text to a numerical vector, the word vector approach, and we'll see how to implement models that use word vectors in Python. After the fundamentals we'll go through all sorts of miscellaneous NLP techniques: regexes (basically pattern matching in Python), stemming and lemmatization, some basic spell correction, and some basic part-of-speech tagging. We'll finish the tutorial with a high-level introduction to the state-of-the-art models I mentioned: the OpenAI GPT-2 model and, more generally, what are called transformer architectures; another big-name one is BERT. We'll see how those came about, and I'll show you how you can use Python to play around with them and build even more powerful models.

A little bit about me: my name is Keith Galli, and I currently work at a company called Posh Technologies. We're a small software startup that builds chatbots and other conversational AI solutions for a wide range of enterprise customers.
At Posh I worked for a while on our intent recognition models for chatbots: how a chatbot takes text and automatically processes it to understand what a customer is asking. In addition to that, one of my hobbies has always been teaching, and ultimately that's what leads me to give this presentation today. On the side I also run a YouTube channel, youtube.com/kgmit, where I post all sorts of Python content, and specifically a lot of data science tutorials. The last thing I wanted to say is that I'm bummed PyCon is not in person this year; I would love to connect with everyone, so feel free to reach out to me on LinkedIn, Instagram, or Twitter. If you search "Keith Galli" you'll probably find me.

The first model we're going to talk about is called the bag-of-words model. To give you a little intuition behind this approach: whenever we're doing any sort of data science task we like working with numerical vectors, and the obvious issue with text is that it's not a numerical vector. The bag-of-words approach is the easiest way to take sentences, or pieces of text, and convert them into a numerical representation. Imagine we have four utterances: "I love the book", "this is a great book", "the fit is great", and "I love the shoes". These come from two different categories at our retail store, the books department and the clothing department, and we're trying to build a model to tell them apart. The bag-of-words approach basically says: take all the words we see across all of our utterances and extract each unique word; that gives us "I", "love", "the", "book", "this", "is", "a", "great", "fit", and "shoes". Then, based on which of those unique words each sentence contains, we create a vector where a one means the sentence has the word and a zero means it doesn't. That's the basic idea of a binary bag-of-words model.

So what does that look like in code? Whenever we're building a bag-of-words model, the easiest library to use is scikit-learn, so we're going to import a couple of things from it (I'll make my text a little bigger): from sklearn.feature_extraction.text import CountVectorizer (there's also a TfidfVectorizer in there, but I'll just start with CountVectorizer). A CountVectorizer can be binary, the ones and zeros we just mentioned, or it can be a straight count: if one of these phrases were "I love the book, the book was great", a count vectorizer could record that "book" appeared twice. I usually just tend to use the binary approach. If you ever don't know where to find something like this, literally do a Google search for "sklearn bag of words", get to the scikit-learn documentation, and you should find examples of using it; the docs show the CountVectorizer and how to utilize it, and we might reference them a couple of times as we go.
Above the CountVectorizer I'm going to quickly define some training utterances. I'll call this train_x and use our examples: "I love the book" and "this is a great book" as the first two, and "the fit is great" and "I love the shoes" as the last two. Those are our four training utterances. The first thing we want to do is use the CountVectorizer to transform these into a vector representation. Once again I'm going to reference the documentation; I know what I'm trying to do, but I sometimes forget the syntax, which is why I always look at docs like this. It looks like we want vectorizer.fit_transform, so: vectorizer = CountVectorizer(), then vectors = vectorizer.fit_transform(train_x). "Fit transform" means it first fits a vocabulary around our training utterances, finding all the unique words so it knows how to build the vectors, and then transforms the utterances we pass in based on that fitted vocabulary. Now we have our vectors, so I can print vectors[0] and it will be the vector for "I love the book". (When I first ran this I got "train_x is not defined"; I'm running this in Google Colab, so I just had to rerun the earlier cell.) To see how the vectors actually look we can call .toarray(), and vectorizer.get_feature_names() looks helpful too, so I'll print both of those out.

Cool: the words in this vocabulary are book, fit, great, is, love, shoes, the, this. One thing to note is that it stripped away "a"; dropping single-character tokens is just part of how this vectorizer is implemented. Looking at the first vector, for "I love the book", we have a one for book, a one for love, and a one for the (it stripped away "I" as well, so we don't see that), and zeros everywhere else. Another thing to note is that by default CountVectorizer is not binary, so it actually matters how many times a word appears; if you want ones and zeros you can pass binary=True, and it switches back.
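Here is roughly what that vectorizer setup looks like gathered into one cell (a minimal sketch; variable names mirror the video, and note that newer scikit-learn renames get_feature_names() to get_feature_names_out()):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four toy training utterances: two about books, two about clothing
train_x = ["i love the book", "this is a great book",
           "the fit is great", "i love the shoes"]

# binary=True gives 1/0 vectors instead of raw word counts
vectorizer = CountVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x)

print(vectorizer.get_feature_names())   # unique words the vectorizer found
print(train_x_vectors.toarray())        # one row per utterance, one column per word
```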
Okay, so that's the basics of bag of words: seeing it and making a very toy example of it. Now let's build a quick model to actually classify the first two utterances as book related and the last two as clothing related. I'll define train_y with the labels, and whenever I have a repeated string like this I like to make it its own variable so I don't accidentally misspell it, so I'll make a little Category class with BOOKS as one category (the string "BOOKS") and CLOTHING as the other; you'll see why I'm doing this in a second. I give the four utterances their labels in order: Category.BOOKS for the first two and Category.CLOTHING for the last two, and rerun the cells.

Next we'll build a simple classifier to decide whether an utterance is in the book category or the clothing category, so we can use it on new utterances. Once again scikit-learn is very helpful here, and a good classic classification model for text is often a linear SVM, so that's what we're going to use: from sklearn import svm, then clf_svm = svm.SVC(kernel='linear'). I'm using a linear kernel because I know from background knowledge that it's often a good classification model for text. Then we fit that SVM on our training vectors (I'll rename them train_x_vectors), since we can't just pass in the raw text, along with our labels train_y. That fits a model to the four utterances we have. (I got an error at first because I hadn't rerun the earlier cell.)

Finally, the fun part: predicting new utterances. I'll build test_x = vectorizer.transform(...); before, we did fit_transform on the training utterances, but now that we already have a fitted vectorizer we just call transform. Let's try something like "I like the book"; because it contains the word "book", you'd expect it to classify as the books category. We run clf_svm.predict(test_x) and, nice, we get books. Let's try "I love the shoes", or something like "shoes are alright", and that says clothing. Obviously we only trained this on four utterances, so it's only going to be so powerful, but as we add more and more utterances the vocabulary inside the vectorizer grows and we get a more powerful model; the more training utterances you feed into a bag-of-words model, often the better it does. One caveat is that you might build such a big vocabulary that the model gets hard to process; scikit-learn gives you some ability to handle that, for example by only keeping the top 1,000 most frequent words.

Another thing to know about bag of words: right now we're using a unigram approach, taking each individual word by itself, but you can also use what's called a bigram approach, which would treat "I love", "love the", and "the book" as their own features. Let's try that: ngram_range is probably the parameter we want, so I'll set ngram_range=(1, 2), which should include both one-word and two-word features. If we print everything out again, yes, now we get two-word features as well, so we've captured bigrams too.
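Putting the classifier together, here is a sketch of the training and prediction code (variable names follow the video; the Category class is just a holder for the two label strings, and it reuses vectorizer and train_x_vectors from the previous block):

```python
from sklearn import svm

class Category:
    BOOKS = "BOOKS"
    CLOTHING = "CLOTHING"

train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

# Fit a linear SVM on the bag-of-words vectors from above
clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(train_x_vectors, train_y)

# New utterances are transformed (not fit_transformed) with the same vectorizer
test_x = vectorizer.transform(["i like the book"])
print(clf_svm.predict(test_x))   # hoping for BOOKS

# To experiment with bigrams, rebuild the vectorizer with an n-gram range:
# vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
```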
One reason you might want two words in a row: say we're deciding whether something is positive or negative. If I say something is "great", that's very positive, but if the word before "great" is "not", then "not great" is a completely different sentiment. That's one way n-grams are important: words depend on their context, so adding additional words sometimes helps. It can also be detrimental: with too large an n-gram range you might end up with lots of random three-word phrases that occur only once in the entire set and actually skew the classifier. So for now I'm going to stick with a single-word, unigram model.

Here's the final thing I'll say about bag of words before we get to the next model, and it's the limitation: if a word is not in our training utterances, we don't know how to handle it. If I said something like "I love the story", to us that's very similar to "book" and "books", but our model has never seen the word "story", so it might not know what to do with it. Let's see what happens when we try it, and here's the issue: it classified it as clothing, even though it's obvious to us that story and book are closely related. Honestly, even if we type in something like "I love the books" it doesn't know that book and books are the same, because it has seen "book" but never "books"; again it says clothing. So that's one downfall of bag of words: it's great on the stuff it's trained on, but if it hasn't seen a word it just fails miserably, and that's not great, given that in human language we can say the same thing in so many different ways.

This leads us to the topic of word vectors. Word vectors are another approach to turning text into a numerical vector, and the big goal is to convert text into a vector that captures some of the semantic meaning of the text in the vector space you're mapping it into. What I mean by that: imagine you have the words red, blue, and yellow; these are all colors, and with word vectors, if you imagine this big vector space, we want similar words to be mapped to a similar spot in it, so red, blue, and yellow should all land somewhere near each other. There are many different approaches to training vectors that do this, and one of the popular methods is called word2vec. Word2vec is usually trained in one of two ways: one is called continuous bag of words and the other is called skip-gram. The detailed differences between the two aren't super important here, but it is important to understand at a high level how they work. Going back to our example of classifying books versus clothing, imagine we have three phrases: "best book I've read in years", "great story and characters", and "no development of characters during the book". These are three phrases related to books, and we can tell that pretty easily by reading them.
But how do we train a model to see that? What the word2vec approaches do, both continuous bag of words and skip-gram, is look at a window of text. Often that window might be, say, five tokens long, so "best book I've read in" might be our context window. We selectively look at different tokens in that window and use the surrounding tokens to figure out the context of each one, gradually developing a meaning for each token. Translating that to this example: if we read enough text, we start to see relationships between words. Here, one relationship we might develop over time is that "book" often appears close to "read", so we can place book and read in a similar spot of the vector space; this is ultimately trained through some sort of neural network architecture, for the most part. In the next example, "great story and characters", we might start relating story and characters, so we learn those should sit in a similar part of the space. In the last example, "no development of characters during the book", maybe we relate development and characters, and also learn that characters and book are related. From there we can build out bigger relationships: we've seen characters and book together here, so those are probably related; up above, story and characters are often together, so story and book are probably related as well. This is a very toy version of what's happening, with hundreds of thousands of sentences being fed into these models, and we ultimately build up word vectors from all of that.

It's one thing to talk about it, so let's jump into some Python code to show what that looks like. I think the easiest place to utilize word vectors in Python is the spaCy library. Whenever I'm trying to remember how to do something like this, I do a Google search and find some information on how to use spaCy and the word vectors it offers. We'll have to download a word vector model, because we're not going to train this from scratch; we'll use something that's already been trained, and then we can follow along with the docs. So let's start implementing. Before we can use the spaCy library we need to download those trained vectors. I already did that, so I'm going to go to the top of my Google Colab file (this should also work in a Jupyter notebook or whatever you're using, though you might need a pip install) and insert a code cell above, to show the couple of installations we need; by default the Colab environment I'm using won't have the spaCy word-embedding model we need, so we'll have to install it.
In that cell I do pip install spacy (spaCy is probably already installed, but sometimes you need the most up-to-date version) and then python -m spacy download for the medium-sized embeddings. As you can see in the doc I'm walking through, there's also a large model; it takes a bit longer to download, but if you want more powerful word vectors, maybe try that one. I'll run this, and okay, it's downloaded. One quick note: if you're using Google Colab and you do this, you might have to restart the runtime before the changes take effect; if you had already imported spaCy, I don't think it would recognize that the model is there unless you restart, so I did that to be doubly sure.

So, import spacy is our first step, and it imports properly. Now we load the word-embeddings model I just downloaded, en_core_web_md, into a variable we'll call nlp. We only need to load this once and then it stays in memory. If you're not remembering how to do this, I always like to remind people: literally do a Google search for "word vectors spaCy documentation". The article I was looking at wasn't the actual docs, but the docs give some nice examples, and that makes things a lot easier than trying to figure it out from memory. It looks like I can do .vector on a processed text and get my vector, great. In the Doc container documentation, .vector is described as a real-valued meaning representation that defaults to an average of the token vectors. So if we do nlp(phrase) and then .vector, it takes all the individual word embeddings and averages them together, which is what we ultimately want if we're going to build a model around this. Also note that these word vectors usually have several hundred dimensions, so they're pretty big, but they kind of have to be to capture the information we need.

What we're going to do here is take text; we can reuse the same text from above, so I'll rerun that cell so we have the examples from the bag-of-words model, train_x. Then docs = [nlp(text) for text in train_x], and now each item in this docs list is a spaCy Doc whose .vector is the averaged embedding of the corresponding sentence: "I love the book", "this is a great book", "the fit is great", "I love the shoes". (I'll print out train_x too, just so I remember what we're converting.)
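Here is the setup as a runnable sketch (assuming a notebook environment where the commented-out shell commands can be run with a leading "!"; in a plain terminal drop the exclamation marks):

```python
# In a notebook cell (Colab/Jupyter); restart the runtime after downloading the model
# !pip install -U spacy
# !python -m spacy download en_core_web_md

import spacy

nlp = spacy.load("en_core_web_md")   # medium English model with word vectors

train_x = ["i love the book", "this is a great book",
           "the fit is great", "i love the shoes"]

docs = [nlp(text) for text in train_x]
print(docs[0].vector.shape)   # e.g. (300,): the averaged embedding for the first sentence
```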
Okay, so that's what we're ultimately converting. If I print out the first doc's vector, you can see the word-embedding representation: the average word embedding over the words in "I love the book". Cool; we did that pretty easily with spaCy and a couple of lines of code. Now let's build the same kind of model that we built for the bag-of-words approach, but on top of these vectors. We define a classifier again, with a slightly different name, clf_svm_wv (wv for word vectors) so we can tell them apart, and we fit it not to the bag-of-words train_x_vectors anymore but to word vectors: train_x_word_vectors = [x.vector for x in docs]. We pass that into fit, and the y labels stay the same as above: books, books, clothing, clothing. (Running it, I hit "svm is not defined" and had to re-import svm from scikit-learn; after that we're good.)

So we have our model; what happens if we try to predict some novel utterances? At first I tried doing it all in one line, which was trying to do too much in one spot, so instead: test_x is a list of phrases, say ["I love the book"]; test_docs = [nlp(text) for text in test_x]; and finally test_x_word_vectors = [x.vector for x in test_docs]. Now we can predict on test_x_word_vectors. "I love the book" works as expected; that's an exact phrase from the training set, so it recognizes books. Now let's see the real power of word vectors and type in something like "I love the story". We're hoping that story and book have similar word vector representations, so that when we average the embeddings together (which is what nlp(text).vector does) we get a books classification again. And there we go, "I love the story" gets books as well. Now let's test some clothing-related things: "I love the shoes" works, and if this is properly capturing the semantics you should be able to say something like "I love the hat" and have it come out as clothing; look at that, it does. "I love the hats" should also be clothing. "I love the books": even though we haven't seen "books" exactly, the model now knows book and books are related, because those words have appeared in similar contexts during the embedding training.
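As a runnable sketch, the word-vector classifier looks roughly like this (it reuses nlp, train_x, docs, and train_y from the blocks above):

```python
from sklearn import svm

# One averaged embedding per training sentence
train_x_word_vectors = [doc.vector for doc in docs]

clf_svm_wv = svm.SVC(kernel="linear")
clf_svm_wv.fit(train_x_word_vectors, train_y)

# Novel utterances: "story" and "hat" were never in the training data,
# but their embeddings sit near "book" and "shoes" in the vector space
test_x = ["i love the story", "i love the hat"]
test_docs = [nlp(text) for text in test_x]
test_x_word_vectors = [doc.vector for doc in test_docs]

print(clf_svm_wv.predict(test_x_word_vectors))   # hoping for BOOKS, then CLOTHING
```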
There's all sorts of cool stuff we can do with this ("these earrings hurt", that kind of thing), and already, with just four training examples, because there's so much power baked into the spaCy word embeddings (and word embeddings in general), even with this medium-sized model we can predict a lot of things correctly just from knowing that semantic space. Word vectors are so cool; the idea that we can do so much with language like this really gets me excited.

Before I end this section, though, there are some drawbacks to word vectors; they're not a cure for everything. One thing you'll notice is that this worked out well for us because we only had two categories, books and clothing. But if we were using word vectors by themselves (the exercise at the end of this tutorial may show this) with, say, ten different categories, and the phrases were fifty words long instead of four, then to get an embedding for the whole sentence we'd be averaging together fifty individual word vectors, and the meanings of the individual words can get lost in that averaging. So sometimes word vectors end up less precise than bag of words in that situation, because things get mixed together. Another drawback of standard word2vec-style embeddings is words with multiple senses. Imagine we're trying to get a meaning for the word "check": in "I went to the bank and wrote a check", check has one specific meaning, but in "let me check that out", check means something very different, yet the word vector is the same for both. So for words with multiple meanings you get a bit of messiness, because both meanings get blended during training and part of each is probably lost. Word vectors are very cool and pretty powerful, but they don't solve everything, and they left a lot of room for the improvements that, as I mentioned at the start, have been developed recently.

All right, in this next section of the tutorial we're going to do a rapid-fire overview of a bunch of different NLP techniques that are good to have in your NLP toolkit. The first technique is regexes. Regexes (regular expressions) are pattern matching for strings; they're not a Python-specific concept, but we can definitely use them effectively in Python. As a quick overview, think about phone numbers: 123 123 1234 would be a valid number; another set of digits in the same format, 555 555 5555, would also be valid; and maybe you write it a little differently, like +1 (123) 123-1234. Those are three different ways to write a phone number, and they're all technically valid, so one way we could use regexes is to pattern match on whether something is a telephone number or not. In this case, one pattern we can see is three digits, followed by some punctuation (or maybe none), followed by three more digits, followed by four digits, and we could define a regex for that; similarly, for that last format, we could say it can optionally start with a plus and a country code. Regexes let us capture these kinds of patterns and add them to the code we're writing.
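As an illustrative sketch of that phone-number idea (this exact pattern isn't from the tutorial, just one reasonable way to express "optional +1, then three digits, three digits, four digits, with optional separators"):

```python
import re

# Optional "+1" country code, then 3-3-4 digits separated by nothing, spaces, dots, or dashes
phone_regex = re.compile(r"^(\+1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

for candidate in ["123 123 1234", "555-555-5555", "+1 (123) 123-1234", "12345"]:
    print(candidate, bool(phone_regex.match(candidate)))
```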
Writing regexes for phone numbers is one use; another might be a password checker: when you're signing up on a site and they say you need one symbol, one uppercase character, and a number in your password, and it has to be ten characters or more, a regex can help the people implementing that site make sure your password meets those specs. Email formats are another example.

I think the easiest way to get into regexes is to start right away with an example. Say we want to match a string that starts with the letters "ab", can have any number of characters in the middle (it doesn't matter how many, as long as there's no whitespace), and then ends with the letters "cd". Whenever I'm working on a regex I usually start by looking at a regex cheat sheet to remind myself what's available; I'll link one on the GitHub page for this tutorial, and a page like that has all sorts of useful tidbits. Ultimately we want exactly "ab" to start, followed by any number of characters (there's this special dot, or period, that means any character except newline, so maybe we'll utilize that), and then there are quantifiers: zero or more, one or more, zero or one. There's other useful stuff too, like word boundaries. Let's write a regex for the case I just described. I'm going to regex101.com, which lets you pick the flavor to test against, so I clicked on Python. What I said is: it needs to start with "ab", followed by any character (the period), any number of times (the star, meaning zero or more), followed by "cd". That's the basic version: "abeecd" matches, while a string starting with "ba" and ending in "x" doesn't, because it neither starts with ab nor ends with cd. Continuing on, I said there couldn't be any whitespace, so let's try "ab cd", and that does match, so we need to fix our regex to disallow whitespace. Going back to the cheat sheet, there's the negated character class, [^abc] meaning "not a, b, or c", and the whitespace class \s, so instead of ".*" we can write "[^\s]" with a quantifier after it. Now the non-whitespace examples still match, but the one with whitespace in the middle doesn't; if I remove the whitespace it's a match again. Pretty cool.
One thing to note here is that I could type something like "xxxxabcdxxxx" and it still says it's a match, or at least that the line contains a match, because a matching piece is included. If we want to make sure the string exclusively starts and ends with our pattern, we can use a couple more special characters: $ means end of string, and ^ means start of string (used outside the brackets it means start of string; inside brackets it means "not"). We add those, and now the padded version doesn't match at all. That's the solution to that little exercise.

So what does this look like in code? Back in our Google Colab file we import the regular expression library, which is just re in Python, and define our regular expression; for the one I just described that's ^ab followed by [^\s]* followed by cd followed by the end-of-string anchor. We compile that so Python knows it's an actual regular expression (when defining regexes in Python we also put an r in front of the string to make it a raw string). Then I set up some test phrases, roughly "abcd", "xxx", "abxxxcd", and "ab cd", so only the first and third should match. To check for matches we loop over the phrases. There are two main functions we'll use when checking for matches; the first is re.match, where we pass in the regular expression and then the phrase, and if it matches I append the phrase to a list called matches (defined above) and finally print it. Running that, as expected, the first and third phrases match.

Now here's something interesting. Say we drop the requirement that the pattern be the very start and end of the line, and we add something like "aaa" in front and "ccc" after one of the phrases, and we just want to check whether the regex appears anywhere in the string. The pattern is in the string, but it's not at the start and it's not the entire thing, so if we rerun this code you'll see that re.match no longer reports a match on that one. The other main function we'll use when searching text for regular expressions is re.search, and with that, it matches those two again. Quickly applying this to the toy example we've been working on: imagine we want a regular expression that matches "read", "story", or "book"; maybe we're writing a hard-coded rule to decide whether something is in the books category. "I liked that story", "I like that book", "this hat is nice": as you can see, the regex (using the or sign, |) catches the right ones. One nuance, if we want to make the regex a bit smarter: I could try to trick it with a word like "history", or, instead of "read", something like "the car shredded up the hill"; "read" is inside "shredded", but it isn't actually the word read. Notice the regex still matches those.
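Here is a compact sketch of those two checks (the exact test phrases in the video are hard to make out, so these are representative stand-ins):

```python
import re

# Starts with "ab", ends with "cd", no whitespace in between
regexp = re.compile(r"^ab[^\s]*cd$")

phrases = ["abcd", "xxx", "abxxxcd", "ab cd"]
matches = [phrase for phrase in phrases if re.match(regexp, phrase)]
print(matches)   # ['abcd', 'abxxxcd']

# re.match anchors at the start of the string; re.search finds the pattern anywhere
keyword_regexp = re.compile(r"read|story|book")
print(bool(re.search(keyword_regexp, "I liked that story")))   # True
print(bool(re.search(keyword_regexp, "This hat is nice")))     # False
```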
One useful thing to know about is the word-boundary character in regex, \b, which means the match needs to sit between word boundaries. Watch what happens when I run the same check again: no matches, because it now knows that "story" has to appear as a word by itself. What's also nice about \b is that you can have a period or other punctuation right after the word and it still counts as a word boundary. That's one way we might apply regexes to the example we've been working on, but there are so many other use cases.

All right, next we're going to quickly look at stemming and lemmatization in Python. These are two techniques to normalize text. What I mean by that: in our first bag-of-words example we saw the problem where a model trained on the word "book" didn't know the word "books", even though to us that's obviously the same thing. One thing stemming and lemmatization can do is take "books" and reduce it down to a more canonical form, "book". They can do several things like this: they can help you turn "reading" into "read", "books" into "book", and "stories" into a base form, and here's where there's a small difference. Stemming follows an algorithm and is not guaranteed to give you an actual English word, so it might reduce "stories" down to "stori", whereas lemmatizing uses a dictionary (making sure everything it outputs is a real word), so it would output "story".

How do we use this in Python? The easiest library for stemming and lemmatization is probably NLTK (the Natural Language Toolkit), so that's what we'll use. First we import nltk, and we also need to download a couple of resources with nltk.download; this should work out of the box. We're grabbing stopwords, which is for my next example, plus the data we need for the stemming and lemmatization section. Now let's start with a stemmer. I recommend importing two things: the tokenize module, which can take a sentence and break it into its individual words, and the actual stemmer; we'll use the PorterStemmer for this example. So from nltk.tokenize we import word_tokenize, from nltk.stem we import PorterStemmer, and we initialize the stemmer. Now we type in a test phrase, say "reading the books". We first need to tokenize it, because if we just call stemmer.stem on the whole phrase (and that's all you have to do to stem a word) it doesn't know how to process it; the algorithm expects a single word. So words = word_tokenize(phrase), and then for each word in words we stem it with stemmer.stem(word) and append it to a stemmed_words list. Finally we join the stemmed words back together with spaces. We started with "reading the books" and the stemmer got us down to "read the book".
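As a runnable sketch (the nltk.download call only needs to run once per environment; "punkt" is the tokenizer data that word_tokenize relies on):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download("punkt")   # tokenizer models used by word_tokenize

stemmer = PorterStemmer()

phrase = "reading the books"
words = word_tokenize(phrase)

stemmed_words = [stemmer.stem(word) for word in words]
print(" ".join(stemmed_words))   # "read the book"
```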
You can immediately see how this could help the bag-of-words model we built earlier: if you don't have many training examples, stemming both the training examples you do have and any incoming phrase would probably improve accuracy. A couple more examples: "stories" is one that's a little weird; see how it stems it to "stori"? So you're not guaranteed to get a real word, and sometimes you'll get collisions between two words that aren't actually similar, so there are some drawbacks, but it's a nice quick trick to have. One other note on tokenization: you also might want to strip out punctuation, or handle it separately, if you're trying to return a phrase that reads sensibly, because when you tokenize a phrase and print out the words, the punctuation gets treated as its own token.

All right, moving on to lemmatization. From nltk.stem we now import the WordNetLemmatizer; this uses a corpus called WordNet to help reduce words to a simpler form. It's pretty similar to the last one: lemmatizer = WordNetLemmatizer(), and just like before we need to tokenize whatever phrase we use. For the same phrase, "reading the books", we build lemmatized_words by calling lemmatizer.lemmatize on each word, and then join them again. What do we get? "reading the book". Hmm, that's a little different from the example above, which turned it into "read the book". Here's the trickier thing about using the lemmatizer in NLTK: it expects a part of speech for each word, and by default it treats every token as a noun. That's why it converted "books" to "book" but left "reading" alone. If I instead say these are all verbs, we get "read the book"; interestingly it still maps "books" to "book", and I guess if you think of "books" as a verb, as in "he books it", the base form is still "book", so that makes sense. The one caveat is that to use this really effectively you sometimes have to do part-of-speech tagging first (which I believe I'll get to later in the tutorial) and feed that tag in here to truly reduce all the words; alternatively it can be useful just to reduce down all your noun phrases, or all your verbs. So that's lemmatization, and we'll move on from there.
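A minimal sketch of the lemmatizer, showing the effect of the pos argument (the "wordnet" download is the dictionary it relies on; depending on your NLTK version you may also need nltk.download("omw-1.4")):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # dictionary the lemmatizer uses

lemmatizer = WordNetLemmatizer()
words = word_tokenize("reading the books")

# Default part of speech is noun, so "reading" is left untouched
print(" ".join(lemmatizer.lemmatize(word) for word in words))            # "reading the book"

# Telling it every token is a verb changes the result
print(" ".join(lemmatizer.lemmatize(word, pos="v") for word in words))   # "read the book"
```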
Next, while we're still in NLTK, we'll quickly go over stop words. Stop words are the set of the most common English words, and sometimes we want to strip them out of our phrases because they don't carry much meaning: words like "this", "that", "he", "the", "them". NLTK makes this easy. We've already imported nltk at this point, but just in case people run these cells independently I'll copy the import in again, and we also import the stopwords corpus. Our stop words are going to be stopwords.words("english"); I don't want to name the variable stopwords or I'd overwrite the import, so I'll call it stop_words. I was curious whether you could skip passing "english", but it looks like you do have to specify the language. Printing them out: cool, "i", "me", "my", "myself", "we", and so on. I was also curious how many there are: len(stop_words) is 179, so there are 179 words we might want to strip out, and this gives us a nice easy interface to do it.

So what we can do is take a phrase, say "Here is an example sentence demonstrating the removal of stop words", tokenize it just like before with word_tokenize, and then build a stripped_phrase: for each word in words, if the word is not in stop_words, append it to stripped_phrase. (You could do this with a list comprehension too, whatever you prefer.) Then we can join the words back together, or just keep it as a list. The result: "Here example sentence demonstrating removal stop words", so "is", "an", "the", and "of" were all removed as stop words. How could this help us? Going back to the word vector model from earlier: if we said something like "I went to the bank and wrote a check", we might get bogged down by some of those stop words when we average the word vectors together, so removing them can give us a slightly more precise capture of the meaning. That's just one example; there are other situations where removing stop words helps too.
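A quick sketch of the stop-word removal (note the case-sensitivity detail: the NLTK list is lowercase, so a capitalized "Here" slips through unless you lowercase the tokens first; the count of 179 reflects the version used at the time and may differ in newer releases):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")

stop_words = stopwords.words("english")
print(len(stop_words))   # 179 in the version used here

phrase = "Here is an example sentence demonstrating the removal of stop words"
words = word_tokenize(phrase)

stripped_phrase = [word for word in words if word not in stop_words]
print(" ".join(stripped_phrase))
# -> "Here example sentence demonstrating removal stop words"
#    ("Here" survives because the list is lowercase; compare word.lower() to catch it)
```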
Next is kind of a rapid-fire round within the rapid-fire round: I want to quickly go through another library called TextBlob, which gives you access to all sorts of nice little language utilities quickly. It builds on the NLTK library and provides a really nice interface for doing a lot of things. I'll label this section "various other techniques", and we'll look at spell correction, sentiment, and part-of-speech tagging. When you want to use TextBlob, I recommend just looking at its API reference (I'll make sure it's linked in the resources file I mentioned on my GitHub); it's a very straightforward one-pager with all sorts of nice things you can do. "from textblob import TextBlob" is our first line, and as you can see, you literally just wrap a phrase and then, with a single dot-something, you can get .tags (part-of-speech tags), .noun_phrases, .words, .sentiment, and so on.

The first one I wanted to look at was spell correction. Whenever we do anything with TextBlob we start by wrapping our phrase in a TextBlob object, so phrase = "this is an example" (or whatever you want to say), and tb_phrase = TextBlob(phrase). Now tb_phrase.correct() is all you have to do to spell correct it. What if I misspell "example" with two e's, or with two i's? It still comes back as "this is an example"; look at that, it's correcting it very quickly. And remember, the original goal I mentioned for this kind of work was things like taking tweets and processing them; you can imagine a lot of tweets are misspelled, so being able to spell correct a phrase in two lines of code (or even one, if we really wanted to get fancy) is pretty helpful.

What else can we do with TextBlob? I mentioned part-of-speech tagging, so let's go back to our reading example, "I read the book". Obviously there are no spelling errors there, but we can get the parts of speech just by doing .tags. The first time I ran it I got a message that I was missing some required data for this feature, so, just like the downloads earlier, I added the suggested download to my Colab file and ran it. Now .tags works: "read" comes back as a verb, "the" as DT (whatever that stands for), and "book" as a noun. If you look up part-of-speech tagging you'll find information on the different tags; I found a helpful page listing them (I'll link it on my GitHub), so you can see that DT is a determiner, NN is a singular noun, NNS is a plural noun, and so on.

Another cool thing is .sentiment (going back to the reference, yes, it's just .sentiment, no parentheses): for "the book was great" we get a polarity around 0.8, and high positive numbers mean positive sentiment. If I say "the book sucked, it was so bad", we see a negative polarity, so negative sentiment; "the book was horrible" is another negative example. You can look into the details of the sentiment object to see more. This API reference for TextBlob has all sorts of really accessible, easy things, so it's another great resource to utilize. And I want to make the point that one of the main goals of this tutorial is to show you all these different tools, so you can build on this knowledge and apply it as you see fit to your own tasks.
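A small sketch of those three TextBlob features (the commented download command covers the data the tagger and noun-phrase tools ask for; the misspelled input is just an illustrative guess at what the corrector will fix):

```python
from textblob import TextBlob

# One-time setup if TextBlob complains about missing corpora:
#   python -m textblob.download_corpora

print(TextBlob("this is an exmaple").correct())  # attempts spell correction -> "this is an example"

tb_phrase = TextBlob("the book was horrible")
print(tb_phrase.tags)        # part-of-speech tags, e.g. [('the', 'DT'), ('book', 'NN'), ...]
print(tb_phrase.sentiment)   # Sentiment(polarity=..., subjectivity=...); negative polarity here
```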
All right, moving toward the state-of-the-art models in natural language processing, I wanted to start with recurrent neural networks. If you remember back to word vectors, one of the issues we had was that they were set in stone: once you've trained your word vectors, no matter how a word is used inside a bigger phrase, the vector for that word is the same. We don't get context-dependent word vectors when we're just using a pre-trained set out of the box. Recurrent neural networks can help solve that problem. Here I have a diagram of a recurrent neural network. Basically, the way an RNN works with language is that we feed words into the network one at a time. So the word "check" comes in, the network processes it and produces a hidden state, and that hidden state is fed along to the next step of the network; maybe the next word is "out", so now both "check" and "out" have fed into the same network. We can keep doing this as long as we want, always feeding the output from the previous step back in along with the next input, so for "check out the book" we end up, after "book", with some final hidden state, and we can use that hidden state in NLP applications. What's nice about this: if we had another phrase, "write me a check", the word "check" there appears in a different context, so you're going to get a different kind of embedding for it than in "check out the book". Based on the context "write me a...", the final hidden state will be more related to banking, whereas in "check out the book" it represents "check" as in "check it out". That's a toy example, but it's ultimately one of the goals of something like a recurrent neural network.

For a while RNNs were the go-to for building all sorts of state-of-the-art language models, but they have some drawbacks, which is ultimately why they got replaced by models like the OpenAI GPT-2 model I introduced at the start of the video, among others. The first drawback is that long dependencies don't always work well with a recurrent network. Imagine the sentence "I need to go to the bank today so that I can make a deposit; I hope it's not closed." We feed each of these words into the network, and the hidden state carries some influence from each of them, but by the time we get to "closed", a word like "bank" is pretty far away, so in the final embedding it might not be clear to the network that the thing that's closed is the bank. Another drawback is more on the performance side: because we feed tokens in one at a time, the sequential nature makes it tough to parallelize the training and use of these language models, so we can't utilize modern GPUs as effectively.

That leads us to attention. A big paper came out called "Attention Is All You Need", and it set off a whole new wave in natural language processing. What's different with attention? Basically, you can feed in a phrase and, as you iterate through the tokens in
So that leads us to attention. A big paper that came out was called "Attention Is All You Need", and it set off a whole new wave in natural language processing. What is the difference with attention? Basically, you can feed in a phrase, and as you iterate through the tokens in that phrase you can figure out what needs to be attended to. If you walk through the sentence "I need to go to the bank and write a check" — and the tokens are all fed in at the same time using positional encodings, so it doesn't have to be one word after another — then when we get to a word like "check", the network might flag "write" and "bank" as relevant to "check"; for "go", looking ahead, "bank" might be triggered as an important indicator, and so on. So based on the word we're looking at, the network learns to attend to other words and figure out what is most important given a certain token. At a high level, you can think of the network as learning to ask questions about the phrases it sees — not literal human questions, but baked into these attention networks is the ability to look at each token, ask all sorts of questions about it, and learn, based on that token, what else in the phrase is important. This is a super powerful technique, and attention also handles a longer range of dependencies than a traditional RNN, so it set off all sorts of really impressive work in the NLP world.

Earlier in the video I mentioned the OpenAI GPT-2 architecture. One thing to note about it is its forward nature: the task it's trained on is language modeling — given some words, what words come after them? That's why it was so good at autocompleting a story: when I typed in a phrase, it could lean on the fact that it was trained to predict the next word based on what it had already seen. So that's OpenAI GPT and GPT-2. Then there's the ELMo network that's also listed here. I don't know as much about ELMo, but one thing to note is that it uses LSTMs at its core, and an LSTM is a type of recurrent neural network that is supposed to handle longer dependencies a bit better — though still not as well as attention. And actually, I wrote the header as "transformer architectures", but ELMo is not technically a transformer architecture; it's just another powerful language model. Finally, what we're going to focus on right now is BERT, which is a bidirectional transformer architecture, and because it's bidirectional it captures a lot of very impressive things about language that even the OpenAI GPT model can't. So let's look at how we can use Python to interact with the BERT model real quick.

To do this we're going to use spaCy again — I'll link this article on my GitHub page — because spaCy provides a really easy way to interact with transformers, and if you follow the link it gives you all the lines you need to get up and running very quickly. So let's use spaCy. The first two things we need to do are a pip install of spacy-transformers and a python -m spacy download of this large model. (I also wanted to make this a proper title rather than an italicized one: "Transformer architectures".) The exclamation points let these commands run from the notebook, and this cell should only be run once within Google Colab so we don't accidentally download the model again.
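In a Colab cell the setup looks roughly like this — I'm assuming the BERT model name from the spacy-transformers docs of that era (en_trf_bertbaseuncased_lg), so double-check the linked article for the exact name:

```python
# Run this cell once in Google Colab; the leading "!" executes the line as a shell command
!pip install spacy-transformers
!python -m spacy download en_trf_bertbaseuncased_lg
```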
Okay, the download took a little while, but it ran successfully — cool. Now that we have it downloaded, we can go back to that article and see how to actually use it. It looks pretty straightforward; if you're using a GPU you could use these extra lines of code, and the article has some details about that. I'll copy this in real quick — I'm not going to use my GPU right now, and I don't think we need numpy, so I think you just need to import spacy and torch and you should be good. Then we load this large BERT model.

One thing to note about using transformers: we're not going to train the BERT model from scratch. The BERT model itself is super massive — it takes multiple days on some of the most powerful tensor processing unit machines to train — so it's not something you want to do at home. Luckily these models are open-sourced and we can use them effectively. And even though you're loading a model that's already been pre-trained, one of the nice features of models like this is that they can be fine-tuned: you can fine-tune them on your specific task and get some really impressive results, and that's a much easier step than retraining from scratch.

Okay, run this and see if it loads properly. If you're using Google Colab you might have to restart the runtime — sometimes it doesn't realize where the model is unless you do that. Okay, cool, we have it. Now, just like when we're using spaCy normally, with this BERT model loaded as our NLP engine we can do the exact same stuff we did in the word vectors section of the tutorial. So here we go: what I'm going to do is define another set of training utterances and test utterances, just like we did before. I'll copy it in real quick because I don't think it's necessary to write it out, and I want to change the example up a little bit. All right, we're about to see the power of BERT and transformers in general. This time I'm defining another Category class, but instead of the difference between books and clothing, our two categories are book-related items and banking-related items. We've defined seven different utterances, each with its associated category, and we can read through them real quick.

Okay, cool, let's build our model around this — and this is basically the exact same code we used before; the only thing that changes is that this model is our NLP engine instead of the word vectors we were using earlier. So I'm copying in some code here, and we can train it on the train_x here, so I'll run this cell real quick to train the model, and then we can predict new utterances. As a toy example I'll just type "book", and as a result it says "books" — that's what we expect.
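Here's a rough sketch of what that classification code amounts to — the utterance lists, the labels, and the model name below are my own stand-ins rather than the exact notebook contents, and it assumes the downloaded transformer model exposes a document vector through doc.vector:

```python
import spacy
from sklearn import svm

# Assumes the large BERT-based spacy-transformers model has been downloaded (name may differ)
nlp = spacy.load("en_trf_bertbaseuncased_lg")

# Stand-in training utterances for two categories: BOOKS vs BANK
train_x = ["check out the book", "I loved that novel", "what a great read",
           "need to make a deposit to the bank", "balance inquiry for my savings account"]
train_y = ["BOOKS", "BOOKS", "BOOKS", "BANK", "BANK"]

# Same recipe as the word-vector section: embed each utterance, then fit a linear SVM
clf = svm.SVC(kernel="linear")
clf.fit([nlp(text).vector for text in train_x], train_y)

# Predict a new utterance
print(clf.predict([nlp("book").vector]))   # hopefully -> ['BOOKS']
```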
One thing I want to note here, though, is that one of our training utterances is "check out the book", and the example I was walking through earlier is that if you're writing about more banking-related things you might say "I need to write a check". So I'm going to type in "I need to write a check". In our training utterances, the word "check" only appears in the books category; the three utterances about banking — making a deposit to the bank, a balance inquiry on savings, saving money — never mention the word "check". So a true test of the power of this BERT model is: does it know that when we say "I need to write a check" we're talking about banking, and not about a word it has already seen in the other category? Moment of truth... and look at that: "I need to write a check", using the power of the BERT model and just a small number of training examples, comes out as banking-related, which is crazy. And if you now say "check this story", once again it knows that "check" in this context is about books. So the capabilities are really impressive: because such a big model has already been trained, leveraging it is really, really powerful.

You can keep playing with this. One thing I recommend looking into, as I alluded to, is that one of the nice things you can do with a pre-trained model like BERT is fine-tune it to a specific task — here's some code from spaCy showing how you would fine-tune this NLP model on a specific classification task, which is really useful. The next thing I want to mention is that this interface spaCy provides to BERT is really nice, but it has its limits: if you wanted to build directly on top of a model like BERT, you can't really do that here. If you want to dive deeper into these types of models, I recommend checking out Hugging Face's Transformers GitHub repo. It has pretty much all of the most popular models that have come out recently in natural language processing, written in PyTorch so you can look around — for example, in this modeling_bert file you'll see that pretty much the entire PyTorch model for BERT is defined. So if you wanted to build off of BERT and fine-tune it for something specific, you could use the Hugging Face library.

All right, that's what we're going to cover for transformers and how the code looks to easily interface with them. One thing I want to note is that I can only teach you so much about the actual inner workings of BERT in this tutorial — I focused on quickly getting up to speed and using it in Python — but I'll list resources on the GitHub repo that you can go to to learn more about these things, ones I've found useful myself.
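If you do go the Hugging Face route, a minimal sketch of pulling contextual BERT embeddings with their transformers library looks something like this (exact return types vary a bit between library versions):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize a phrase and run it through BERT
inputs = tokenizer("I need to write a check", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```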
All right, that concludes most of what we're actually going to do in this tutorial, but I'll leave you with an exercise so you can tie everything together and keep building on your NLP skills. If you go to my GitHub page, github.com/keithgalli/pycon2020, I've curated some data for an NLP exercise you can work on. The data comes from Amazon reviews in several different categories, and there's also a test set of Amazon data in the same categories. The goal of the task is to use the training data to build the best classifier you can to properly categorize the reviews in the test set.

The bigger-picture context you could put this problem in is: imagine you're in charge of social media analytics for a large retail store, people are tweeting at you all the time about all of your different products, and you want a model that automatically routes those tweets to the appropriate parties who can handle questions about them — a model like this would be super useful for that. So there's the training and test data for that task, and I've also provided an NLP exercise notebook where I quickly walk through getting started with the data: loading it in as training data, loading it in as test data, and building a model on the training data. By default the simplest model I included is a bag-of-words model — by the time you see this video I'll probably have added several more models — and that bag-of-words model can help you get started (there's a rough sketch of that baseline at the end of this section). Then, based on that model, you evaluate the performance on the test set. As you can see, out of the box bag-of-words categorized 65% of the test reviews correctly, and if you want to break that down more closely, I calculated the F1 scores for each category: it did a really bad job on the automotive reviews but did well on beauty and books. You can use this data as you see fit and play around with building different types of classification models around it.

A couple of recommendations: I think it would be really interesting to see how well a fine-tuned BERT model would do on this task. I also think it would be interesting to take the bag-of-words model and layer on some of the other techniques we mentioned — lemmatizing words, stripping away stop words, maybe doing some part-of-speech tagging — and see if you can leverage those to make it perform better. All really interesting things to try for building more powerful models.

One more thing I want to say quickly: I didn't actually show you what the data looks like, but if I click on one of these files, maybe the beauty one, it's easier to see — and I'll also link the source where I got all this data; someone already did a lot of the grunt work for me and I want to give them credit, so I'll include that in the README of my GitHub page. Basically you have this review text, and it's always about the category of the file — so this is beauty, with reviews like "outstanding top organic shampoo", and we'd expect our model to learn that a word like "shampoo" usually goes with the beauty category. One thing to note is that this is not perfect data: you'll get some reviews that are just "really good" or "thanks" and don't give us much information, so we're never going to get a hundred percent on this task. Maybe another exercise you could work on is how to strip out these non-specific reviews and not use them as training or test information.
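As a rough sketch of that bag-of-words baseline — the file names, column names, and JSON-lines format here are placeholders, so check the exercise notebook in the repo for the real layout:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.metrics import accuracy_score, f1_score

# Placeholder paths and column names -- see the pycon2020 repo for the actual files
train = pd.read_json("train.json", lines=True)   # assumed columns: reviewText, category
test = pd.read_json("test.json", lines=True)

# Bag-of-words features
vectorizer = CountVectorizer(binary=True)
train_x = vectorizer.fit_transform(train["reviewText"])
test_x = vectorizer.transform(test["reviewText"])

# Linear SVM baseline
clf = svm.SVC(kernel="linear")
clf.fit(train_x, train["category"])
preds = clf.predict(test_x)

print("accuracy:", accuracy_score(test["category"], preds))
print("per-category F1:", f1_score(test["category"], preds, average=None))
```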
All right, we're going to end the tutorial here. Hopefully you had some fun learning about these different NLP techniques in Python. We covered a lot in this video, and my goal was for all of you to see these techniques and have them in mind, so that you can take them, build on your knowledge, and apply them in different areas — really it's about seeing what you can do and then taking it from there. Just a reminder before I go: if you have any questions or anything, feel free to connect with me on LinkedIn, Instagram, or Twitter — search "Keith Galli" and you'll probably find me — and if you want to find more tutorials I've posted, you can check out my YouTube channel. Thank you very much, PyCon, for having me. This was fun even though it wasn't in person — not what I expected, but it was exciting to go through this process, and hopefully I'll be there in person next year. Take care everyone, thanks for watching, peace out.
Info
Channel: PyCon US
Views: 33,724
Id: vyOgWhwUmec
Length: 99min 22sec (5962 seconds)
Published: Wed Apr 29 2020