Natural Language Processing with spaCy & Python - Course for Beginners

Video Statistics and Information

Captions
In this course you will learn all about natural language processing and how to apply it to real-world problems using the spaCy library. Dr. Mattingly is extremely knowledgeable in this area, and he's an excellent teacher.

Hi, and welcome to this video. My name is Dr. William Mattingly, and I specialize in multilingual natural language processing. I come to NLP from a humanities perspective: I have my PhD in medieval history, but I use spaCy on a regular basis for all of my NLP needs. What you're going to get out of this video over the next few hours is a basic understanding of what natural language processing, or NLP, is, and also how to apply it to domain-specific problems, or problems that exist within your own area of expertise. I happen to use this all the time to analyze historical documents, or financial documents for my own personal investments. Over the next few hours, you're going to learn a lot about NLP, language as a whole, and most importantly, the spaCy library. I like the spaCy library because it's easy to use, and it's easy to implement general solutions to general problems with the off-the-shelf models that are already available to you.

In part one of this video series, I'm going to walk you through how to get the most out of spaCy with these off-the-shelf features. In part two, we're going to start tackling some of the features that don't exist in off-the-shelf models, and I'm going to show you how to use rules-based pipes, or components, in spaCy to actually solve domain-specific problems in your own area, from the EntityRuler to the Matcher, to injecting robust, complex regular expression (regex) patterns, and a custom spaCy component that doesn't actually exist at the moment. I'm going to be showing you all of that in part two, so that in part three we can take the lessons learned in parts one and two and apply them to solve a very common problem in NLP: information extraction from financial documents. That means finding things that are of relevance, such as stocks, markets, indexes, and stock exchanges.

If you join me over the next few hours, you will leave this lesson with a good understanding of spaCy, a good understanding of the off-the-shelf components that are there, and a way to take those off-the-shelf components and apply them to your own domain. If you join me in this video and you like it, please let me know in the comments down below, because I am interested in making a second part to this video that will explore not only the rules-based aspects of spaCy but also the machine learning-based aspects: teaching you how to train your own models to do your own things, such as training a dependency parser or a named entity recognizer, which are not covered in this video. Nevertheless, if you join me for this one and you like it, you will find part two much easier to understand.

So sit back, relax, and let's jump into what NLP is, what kinds of things you can do with NLP, such as information extraction, and what the spaCy library is and how this course will be laid out. If you like this video, also consider subscribing to my channel, Python Tutorials for Digital Humanities, which is linked in the description down below.
Even if you're not a digital humanist like me, you will find these Python tutorials useful because they take Python and make it accessible to students of all levels, specifically beginners. I walk you through not only the basics of Python, but also, step by step, some of the more common libraries that you need. A lot of the channel deals with texts or text-based problems, but other content deals with things like machine learning, image classification, and OCR, all in Python.

Before we begin with spaCy, I think we should spend a little bit of time talking about what NLP, or natural language processing, actually is. Natural language processing is the process by which we try to get a computer system to understand, parse, and extract human language, oftentimes from raw text. There are a few different areas of natural language processing: named entity recognition, part-of-speech tagging, syntactic parsing, text categorization (also known as text classification), coreference resolution, and machine translation. Adjacent to NLP is another computational linguistics field called natural language understanding, or NLU. This is where we train computer systems to do things like relation extraction, semantic parsing, question answering (this is where bots really come into play), summarization, sentiment analysis, and paraphrasing. NLP and NLU are used by a wide array of industries, from finance all the way through to law and academia, with researchers trying to do information extraction from texts.

Within NLP, there are a couple of different applications. The first, and probably the most important, is information extraction. This is the process by which we try to get a computer system to extract information that we find relevant to our own research or needs. For example, as we're going to see in part three of this video when we apply spaCy to the financial sector, a person interested in finance might need NLP to go through and extract things like company names, stocks, and indexes that are referenced within news articles, from Reuters to the New York Times to the Wall Street Journal. This is an example of using NLP to extract information. A good way to think about NLP's application in this area is that it takes in some unstructured data, in this case raw text, and extracts structured data, or metadata, from it. It finds the things that you want it to find and extracts them for you. Now, while there are ways to do this with gazetteers and list matching, using an NLP framework like spaCy, which I'll talk about in just a second, has certain advantages, the main one being that you can leverage things that have been parsed syntactically or semantically: things like a word's part of speech, its dependencies, its coreference. These are things that the spaCy framework allows you to do off the shelf, and also train into machine learning models and work into pipelines with rules. So that's one aspect of NLP and one way it's used.

Another way it's used is to read in data and classify it. This is known as text categorization, and we see that on the left-hand side of this image: text categorization, or text classification.
And we can include sentiment analysis in this category, for the most part, as well. It is a way we take information into a computer system, again unstructured data or raw text, and classify it in some way. You've actually seen this at work for many decades now with spam detection. Spam detection is nearly perfect; it needs to be continually updated, but for the most part it is a solved problem. The reason you have emails that automatically go to your spam folder is that there's a machine learning model that sits on the back end of your email server. What it does is look at each email, see if it fits the pattern of what it has seen as spam before, and assign it a spam label. This is known as classification.

This is also used by researchers, especially in the legal industry. Lawyers oftentimes receive hundreds of thousands of documents, if not millions, and they don't necessarily have the human time to go through and analyze every single document verbatim. It is important to get a quick umbrella sense of the documents without actually having to read them page by page. So what lawyers will oftentimes do is use NLP to do classification and information extraction: they will find keywords that are relevant to their case, or they will find documents that are classified according to the relevant fields of their case. That way, they can take a million documents and reduce them down to maybe only a handful, maybe a thousand, that they have to read verbatim. This is a real-world application of NLP, or natural language processing. And both of these tasks can be achieved through the spaCy framework.

spaCy is a framework for doing NLP. Right now, as of 2021, it's only available, I believe, in Python; I think there is a community working on an implementation for R, but I don't know that for certain. spaCy is one of many NLP frameworks that Python has available. If you're interested in looking at all of them, you can explore things like NLTK, the Natural Language Toolkit, or Stanza, which I believe comes out of the same program at Stanford. There are many out there, but I find spaCy to be the best of all of them for a few different reasons. Reason one is that they provide off-the-shelf models that benchmark very well, meaning they perform very quickly and they also have very good accuracy metrics, such as precision, recall, and F-score. I'm not going to talk too much about the way we measure machine learning accuracy right now, but know that they are quite good. Second, spaCy has the ability to leverage current natural language processing methods, specifically transformer models, usually known collectively as BERT models, even though that's not entirely accurate; it allows you to use an off-the-shelf transformer model. Third, it provides a framework for doing custom training relatively easily compared to the other NLP frameworks out there. Finally, the fourth reason I picked spaCy over other NLP frameworks is that it scales well. spaCy was designed by Explosion AI, and the entire purpose of spaCy is to work at scale. By "at scale", we mean working with large quantities of documents efficiently, effectively, and accurately.
spaCy scales well because it can process hundreds of thousands of documents with relative ease in a relatively short period of time, especially if you stick with more rules-based pipes, which we're going to talk about in part two of this video. So those are the two things you really need to know about NLP and spaCy in general. We're going to talk about spaCy in depth as we explore it, both through this video and through the free textbook I provide to go along with it, which is located at spacy.pythonhumanities.com and should be linked in the description down below. This video and the textbook are meant to work in tandem. Some things I cover in the video might not necessarily be in the textbook because they don't lend themselves well to text representation, and the same goes for the opposite: some things I don't have the time to cover in depth in this video, I cover in a little bit more depth in the book. I think you should try to use both. What I would recommend is doing one pass through this whole video: watch it in its entirety and get an umbrella sense of everything that spaCy can do and everything we're going to cover. I would then go back and try to replicate each stage of this process in a separate window or on a separate screen, following along in code. And then I would go back through a third time, watch the first part where I talk about what we're going to be doing, and try to do it on your own without looking at the textbook or the video. If you can do that by your third pass, you'll be in very good shape to start using spaCy to solve your own domain-specific problems. NLP is a complex field, and applying NLP is really complex, but fortunately frameworks like spaCy make this process a lot easier. I encourage you to spend a few hours with this video getting to know spaCy, and I think you're going to find that you can do things you didn't think possible in relatively short order. So sit back, relax, and enjoy this video series on spaCy.

In order to use spaCy, you're first going to have to install spaCy. There are a few different ways to do this depending on your environment and your operating system. I recommend going to spacy.io/usage and entering the correct setup that you're working with. So whether you're using macOS, Windows, or Linux, you can go through this very handy user interface and select the different options that matter most to you. I'm working with Windows, I'm going to be using pip in this case, I'm going to be doing everything on the CPU, and I'm going to be working with English. Once I've set all of those parameters, it tells me exactly how to install spaCy using pip in the terminal. So I encourage you to pause the video right now and install spaCy however you want to. I'm going to be walking through how to install it within the Jupyter notebook that we're moving to in just a second. I want you to not work with the GPU at all. Working with spaCy on the GPU requires a lot more understanding of what the GPU is used for, specifically in training machine learning models. It requires you to have CUDA installed correctly.
It requires a couple of other things that I don't really have time to get into in this video, but we'll be addressing them in a more advanced spaCy tutorial video. So for right now, I recommend selecting your OS, selecting either pip or conda, and then selecting CPU. And since you're going to be working through this video with English texts, I encourage you to select English right now and go ahead and download the en_core_web_sm model. This is the small model; I'll talk about that in just a second.

The first thing we're going to do in our Jupyter notebook is use the exclamation mark to delineate in the cell that this is a terminal command, and we're going to say pip install spacy. Your output when you execute this cell is going to look a little different than mine. I already have spaCy installed in this environment, so mine looks like this; yours, instead of saying "requirement already satisfied", will actually print out the different things it's installing in order to set up spaCy and all of its dependencies. The next thing you're going to do, again following the instructions, is python -m spacy download, and then the model that you want to download. So let's go ahead and do that right now: python -m spacy download, which is a spaCy terminal command, and we're going to download en_core_web_sm. Again, I already have this model downloaded, so on my end the output is going to look a little different than it will on yours as it prints off in the Jupyter notebook. If we give it just a second, everything goes through: it says it's collected it, it's downloading it, and we are all very happy now.

Now that we've got spaCy installed correctly and the small model downloaded correctly, we can go ahead and start actually using spaCy and make sure everything's correct. The first thing we're going to do is import the spaCy library, as you would with any other Python library. If you're not familiar with this, a library is simply a set of classes and functions that you can import into a Python script so that you don't have to write a whole bunch of extra code. Libraries are massive collections of classes and functions that you can call. So when we import spacy, we're importing the whole spaCy library, and as long as you're not getting an error message, everything was imported fine. The next thing we need to do is make sure that our en_core_web_sm, our small English model, was downloaded correctly. To do that, we need to create an nlp object. I'm going to be talking a lot more about this as we move forward; right now, this is just troubleshooting to make sure that we've installed spaCy correctly and downloaded our model correctly. So we're going to use the spacy.load command. This takes one argument: a string that corresponds to the model you've installed, in this case en_core_web_sm.
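As a rough sketch, the notebook cells just described might look something like the following (assuming a Jupyter notebook, where the exclamation mark runs a shell command from a cell; the final print line is only an extra sanity check, not something shown in the video):

    # Cell 1: install spaCy (shell command, hence the "!")
    !pip install spacy

    # Cell 2: download the small English model
    !python -m spacy download en_core_web_sm

    # Cell 3: verify that the library imports and the model loads without errors
    import spacy
    nlp = spacy.load("en_core_web_sm")
    print(nlp.pipe_names)  # lists the loaded pipeline components if everything is set up correctly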
If you execute this cell and you have no errors, you have successfully installed spaCy and downloaded the en_core_web_sm model correctly. So go ahead, take some time, and get all of this set up. Pause the video if you need to, and then pop back and we'll start actually working through the basics of spaCy.

I'm now going to move into an overview of what's within spaCy, why it's useful, and some of the basic features you need to be familiar with. I'm going to be working from the Jupyter notebook I talked about in the introduction to this video. If we scroll down in chapter one, The Basics of spaCy, and get past the install section, we reach the section on containers. So what are containers? Containers within spaCy are objects that contain a large quantity of data about a text. There are several different containers you can work with in spaCy: the Doc, the DocBin, Example, Language, Lexeme, Span, SpanGroup, and Token. We're going to be dealing with the Lexeme a little bit in this video series, and with the Language container a little bit as well. But really, the three big things we're going to be talking about again and again are the Doc, the Span, and the Token. I think when you first come to spaCy there's a bit of a learning curve about what these things are, what they do, and how they are structured hierarchically, and for that reason I've created this, in my opinion, easy-to-understand image of what the different containers are.

Think about spaCy as a pyramid, a hierarchical system, with all these different containers structured around, really, the doc object. Your doc container, or doc object, contains a whole bunch of metadata about the text that you pass to the spaCy pipeline, which we're going to see in practice in just a few minutes. The doc object contains a bunch of different things. It contains attributes, and these attributes can be things like sentences: if you iterate over doc.sents, you can access all the different sentences found within that doc object. If you iterate over each individual item, or index, in your doc object, you get individual tokens. Tokens are things like words or punctuation marks, anything within your sentence or text that has a self-contained value, either syntactic or semantic. So this includes words, a comma, a period, a semicolon, a quotation mark; these are all going to be your tokens. And we're going to see how tokens are a little different from just splitting words up with traditional string methods in Python.

The next thing you should be familiar with is spans. Spans are important because they kind of exist both within and beyond the doc object. Unlike the token, which is a single index of the doc object, a span can be a single token, but it can also be a sequence of multiple tokens; we're going to see that at play. So imagine you had spans in categories: maybe span group one is places, so a single token might be a city like Berlin, but span group two could be something like full proper names, of people, for example.
So this could be, as we're going to see, Martin Luther King. This would be a sequence of tokens, a sequence of three different items in the sentence that make up one span, or one self-contained item. Martin Luther King would be a person, a sequence of individual tokens. If that doesn't make sense right now, this image will be reinforced as we go through and learn more about spaCy in practice. For right now, I want you to understand that the doc object is the thing around which all of spaCy sits. This is the object that you create, the object that contains all the metadata you need to access, and the object that you try to improve with different custom components, factories, and pipelines as you go on to do more advanced things with spaCy. We're going to see in just a few seconds how that doc object is kind of similar to the text itself, but also how it's very, very different and much more powerful.

We're now moving on to chapter two of this textbook, which deals with getting used to the in-depth features of spaCy. If you want, pause the video or keep this notebook, or this book, open separately from the video and follow along as we explore it in live coding. We're going to be talking about a few different things as we explore chapter two; it will be a lot longer than chapter one. We're going to be not only importing spaCy, but actually loading up a model, creating a doc object with that model so that we can work with the container in practice, and then we're going to see how that container stores a lot of different features, or metadata attributes, about the text, and how, while they look the same on the surface, they're actually quite different.

So let's work within our same Jupyter notebook, where we've imported spaCy and already created the nlp object. The first thing I want to do is open up a text to start working with. Within this repo, we've got a data folder. Within this data subfolder, I've got a couple of different Wikipedia openings: one on MLK, which we're going to use a little later in this video, and one on the United States, which is wiki_us. That's what we're going to work with right now. So let's use the with operator and open up data/wiki_us.txt. We're going to read that in as f, and then we're going to create this text object, which is going to be equal to f.read(). Now that we've got our text object created, let's see what it looks like. So let's print text. We see that it's a standard Wikipedia article: it follows that same introductory format, it's about four or five paragraphs long, and a lot of the features are left in, such as the brackets that delineate footnotes. We're not going to worry too much about cleaning this up right now, because we're not interested in cleaning our data so much as just starting to work with the doc object in spaCy. So the first thing you want to do is create a doc object.
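As a loose sketch, the steps described so far in this chapter, reading the file and creating the doc object, might look like the code below. The file path assumes the course repo's data folder, and the token, sentence, and span lines anticipate the container attributes discussed above rather than reproducing the video verbatim:

    import spacy

    # Load the small English model we downloaded earlier
    nlp = spacy.load("en_core_web_sm")

    # Read in the Wikipedia opening on the United States
    with open("data/wiki_us.txt", "r", encoding="utf-8") as f:
        text = f.read()

    print(text)

    # Create the doc object by passing the raw text through the spaCy pipeline
    doc = nlp(text)

    # Indexing the doc gives individual tokens (words, punctuation, etc.)
    for token in doc[:10]:
        print(token.text)

    # doc.sents yields the sentences found in the text
    for sent in doc.sents:
        print(sent.text)
        break

    # A slice of the doc is a Span: a sequence of one or more tokens
    span = doc[0:3]
    print(type(span), span.text)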
Info
Channel: freeCodeCamp.org
Views: 38,958
Rating: 4.9594998 out of 5
Id: dIUTsFT2MeQ
Length: 182min 32sec (10952 seconds)
Published: Mon Sep 27 2021