In this course you will learn all about natural language processing and how to apply it to real-world problems using the spaCy library. Dr. Mattingly is extremely knowledgeable in this area, and he's an excellent teacher. Hi, and welcome to this video. My name is Dr. William Mattingly, and I specialize in multilingual natural language processing. I come to NLP from a humanities perspective: I have my PhD in medieval history, but I use spaCy on a regular basis for all of my NLP needs. So what you're going to
get out of this video over the next few hours is a basic understanding of what natural language
processing is or NLP, and also how to apply it to domain specific problems, or problems that
exist within your own area of expertise. I happen to use this all the time to analyze historical
documents, or financial documents for my own personal investments. Over the next few hours,
you're going to learn a lot about NLP, about language as a whole, and most importantly, about the spaCy library.
I like the spaCy library because it's easy to use, and it's also easy to implement really kind of general solutions to general problems with the off-the-shelf models that are already available to you. In part one of this video series, I'm going to walk you through how to get the most out of spaCy with these off-the-shelf features. In part two, we're going to start tackling some of the features that don't exist in off-the-shelf models, and I'm going to show you how to use rules-based pipes, or components, in spaCy to actually solve domain-specific problems in your own area: from the EntityRuler to the Matcher to actually injecting robust, complex regular expression (regex) patterns, and a custom spaCy component that doesn't actually exist at the moment. I'm
going to be showing you all that in part two, so that in part three, we can take the lessons
that we learned in part one and part two, and actually apply them to solve a very kind of common
problem that exists in NLP and that is information extraction from financial documents. So finding
things that are of relevance, such as stocks, markets, indexes, and stock exchanges. If you join me over the next few hours, you will leave this lesson with a good understanding of spaCy, a good understanding of kind of the off-the-shelf components that are there, and a way to take those off-the-shelf components and apply them to your own domain. If you also
join me in this video and you like it, please let me know in the comments down below, because I am interested in making a second part to this video that will explore not only the rules-based aspects of spaCy, but the machine learning-based aspects of spaCy: teaching you how to train your own models to do your own things, such as training a dependency parser or a named entity recognizer, things which are not covered in this video. Nevertheless, if you join me for this one and you like it, you will find part two much easier to understand. So sit back, relax, and let's jump into what NLP is, what kind of things you can do with NLP (such as information extraction), what the spaCy library is, and how this course will be laid out. If you like this
video, also consider subscribing to my channel, Python Tutorials for Digital Humanities, which is linked in the description down below. Even if you're not a digital humanist like me, you will find these Python tutorials useful because they take Python and make it accessible to students of all levels, specifically those who are beginners. I walk you through not only the basics of Python, but also, step by step, some of the more common libraries that you need. A lot of the channel deals with texts or text-based problems, but other content deals with things like machine learning, image classification, and OCR, all in Python. So
before we begin with spaCy, I think we should spend a little bit of time talking about what NLP, or natural language processing, actually is. Natural language processing is the process by which we try to get a computer system to understand, parse, and extract human language, oftentimes from raw text. There are a couple of different areas of natural language processing: named entity recognition, part-of-speech tagging, syntactic parsing, text categorization (also known as text classification), coreference resolution, and machine translation. Adjacent to NLP is another kind of computational linguistics field called natural language understanding, or NLU. This is where we train computer systems to do things like relation extraction, semantic parsing, question answering (this is where bots really kind of come into play), summarization, sentiment analysis, and paraphrasing. NLP and NLU are used by a wide array of industries, from the finance industry all the way through to law and academia, with researchers trying to do information extraction from texts. Within NLP, there are a couple of different applications. The first
and probably the most important is information extraction. This is the process by which we try to
get a computer system to extract information that we find relevant to our own research or needs. So
for example, as we're going to see in part three of this video, when we need to apply spaCy to the financial sector, a person interested in finances might need NLP to go through and extract things like company names, stocks, and indexes: things that are referenced within news articles, from Reuters to the New York Times to the Wall Street Journal. This is an example of using NLP to extract information. A good way to think about NLP's application in this area is that it takes in some unstructured data, in this case raw text, and extracts structured data, or metadata, from it. So it finds the things that you want it to find and extracts them for you. Now, while there are ways to do this with gazetteers and list matching, using an NLP framework like spaCy, which I'll talk about in just a second, has certain advantages, the main one being that you can use and leverage things that have been parsed syntactically or semantically. So things like the part of speech of a word, its dependencies, its coreference: these are things that the spaCy framework allows you to do off the shelf, and also to train into machine learning models and work into pipelines with rules. So that's kind of one aspect of NLP and one way it's used.
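To make that concrete, here is a minimal sketch of the kind of syntactic metadata spaCy gives you off the shelf. The example sentence is my own, and it assumes the small English model, en_core_web_sm, has already been downloaded, which is something we walk through in the install section in a moment.

import spacy

# Load the small English pipeline (assumes it has already been downloaded).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for one billion dollars.")

# Every token comes back with a part of speech and a dependency label attached.
for token in doc:
    print(token.text, token.pos_, token.dep_)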
Another way it's used is to read in data and classify it. This is known as text categorization, and we see that on the left-hand side of this image. Text categorization, or text classification (and we can include sentiment analysis in this for the most part as well), is a way we take information into a computer system, again unstructured data or raw text, and classify it in some way. You've actually seen this at work for many decades now with spam detection. Spam detection is nearly perfect; it needs to be continually updated, but for the most part it is a solved problem. The reason why you have emails that automatically go to your spam folder is because there's a machine learning model that sits on the back end of your email server. What it does is it actually looks at the emails, sees if an email fits the pattern for what it's seen as spam before, and assigns it a spam label. This is known as classification. This is
also used by researchers, especially in the legal industry. Lawyers oftentimes receive hundreds of thousands of documents, if not millions of documents, and they don't necessarily have the human time to go through and analyze every single document verbatim. It is important to kind of get a quick umbrella sense of the documents without actually having to go through and read them page by page. And so what lawyers will oftentimes do is use NLP to do classification and information extraction: they will find keywords that are relevant to their case, or they will find documents that are classified according to the relevant fields of their case. And that way, they can take a million documents and reduce them down to maybe only a handful, maybe a thousand, that they have to read verbatim. This is a real-world application of NLP, or natural language processing. And both of these tasks can be achieved through the spaCy framework. spaCy is a framework for doing NLP. Right now, as of 2021, it's only available, I believe, in Python; I think there is a community that's working on an R version, but I don't know that for certain. But spaCy
is one of many NLP frameworks that Python has available. If you're interested in looking at all of them, you can explore things like NLTK (the Natural Language Toolkit) and Stanza, which I believe comes out of the same program at Stanford. There are many out there, but I find spaCy to be the best of all of them for a couple of different reasons. Reason one is that they provide for you off-the-shelf models that benchmark very well, meaning they perform very quickly, and they also have very good accuracy metrics such as precision, recall, and F-score. I'm not going to talk too much about the way we measure machine learning accuracy right now, but know that they are quite good. Second, spaCy has the ability to leverage current natural language processing methods, specifically transformer models, usually known kind of collectively as BERT models, even though that's not entirely accurate; it allows you to use an off-the-shelf transformer model. Third, it provides the framework for doing custom training relatively easily compared to these other NLP frameworks that are out there. Finally, the fourth reason why I picked spaCy over other NLP frameworks is because it scales well. spaCy was designed by Explosion AI, and the entire purpose of spaCy is to work at scale. By AI at scale, we mean working with large quantities of documents efficiently, effectively, and accurately. spaCy scales well because it can process hundreds of thousands of documents with relative ease in a relatively short period of time, especially if you stick with more rules-based pipes, which we're going to talk about in part two of this video. So
those are the two things you really need to know about NLP and spaCy in general. We're going to talk about spaCy in depth as we explore it, both through this video and through the free textbook I provide to go along with this video, which is located at spacy.pythonhumanities.com and should be linked in the description down below. This video and the textbook are meant to work in tandem. Some stuff that I cover in the video might not necessarily be in the textbook because it doesn't lend itself well to text representation, and the same goes for the opposite: some stuff that I don't have the time to cover verbatim in this video, I cover in a little bit more depth in the book. I think that you should try to use both of these. What I would recommend is doing one pass through this whole video: watch it in its entirety and get an umbrella sense of everything that spaCy can do and everything that we're going to cover. I would then go back and try to replicate each stage of this process in a separate window or on a separate screen and try to kind of follow along and code. And then I would go back through a third time and try to watch the first part, where I talk about what we're going to be doing, and try to do it on your own without looking at the textbook or the video. If you can do that by your third pass, you'll be in very good shape to start using spaCy to solve your own domain-specific problems. NLP is a complex field, and applying NLP is really complex, but fortunately, frameworks like spaCy make this process a lot easier. I encourage you to spend a few hours with this video and get to know spaCy, and I think you're going to find that you can do things that you didn't think possible in relatively short order. So sit back, relax, and enjoy this video series on spaCy. In order to use spaCy, you're first going to have to install spaCy. Now
there are a few different ways to do this depending on your environment and your operating system. I recommend going to spacy.io/usage and kind of entering in the correct setup that you're working with. So if you're using macOS versus Windows versus Linux, you can go through, and in this very handy kind of user interface, you can select the different features that matter most to you. I'm working with Windows, I'm going to be using pip in this case, I'm going to be doing everything on the CPU, and I'm going to be working with English. So I've established all of those different parameters, and it goes through and tells me exactly how to install it using pip in the terminal. So I encourage you to pause the video right now and go ahead and install spaCy however you want to. I'm going to be walking through how to install it within the Jupyter Notebook that we're going to be moving to in just a second. I want you to not work with the GPU at all. Working with spaCy on the GPU requires a lot more understanding about what the GPU is used for, specifically in training machine learning models. It requires you to have CUDA installed correctly, and it requires a couple of other things that I don't really have the time to get into in this video, but we'll be addressing them in a more advanced spaCy tutorial video. So for right now, I recommend selecting your OS, selecting either pip or conda, and then selecting CPU. And since you're going to be working through this video with English texts, I encourage you to select English right now and go ahead and just install or download the en_core_web_sm model. This is the small model. I'll talk about that in just a second. So the first thing we're going to do in our
Jupyter Notebook is we're going to be using the exclamation mark to delineate in the cell that this is a terminal command, and we're going to say pip install spacy. Your output when you execute this cell is going to look a little different than mine. I already have spaCy installed in this environment, and so mine kind of goes through and looks like this; yours will actually go through and, instead of saying "requirement already satisfied," it'll actually be printing out the different things that it's installing to install spaCy and all of its dependencies. The next thing that you're going to do, again following the instructions, is python -m spacy download and then the model that you want to download. So let's go ahead and do that right now. So let's go ahead and say python -m spacy download; this is a spaCy terminal command. And we're going to download en_core_web_sm. And again, I already have this model downloaded, so on my end spaCy is going to look a little different than it's going to look on your end as it prints off in the Jupyter Notebook. And if we give it just a second, everything will go through, and it says that it's collected it and it's downloading it, and we are all very happy now.
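To recap those two cells, here is a minimal sketch of what I'm running in the notebook; the exact pip output will differ on your machine, and you can drop the exclamation marks if you run these directly in a terminal instead.

!pip install spacy
!python -m spacy download en_core_web_sm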
And so now that we've got spaCy installed correctly and we've got the small model downloaded correctly, we can go ahead and start actually using spaCy and make sure everything's correct. The first thing we're going
to do is import the spaCy library, as you would with any other Python library. If you're not familiar with this, a library is simply a set of classes and functions that you can import into a Python script so that you don't have to write a whole bunch of extra code. Libraries are massive collections of classes and functions that you can call. So when we import spaCy, we're importing the whole library of spaCy. And now that we've seen something like this, we know that spaCy has imported correctly; as long as you're not getting an error message, everything was imported fine.
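That cell is just a single line; as a quick sketch:

# Import the spaCy library; if this runs without an error message, the install worked.
import spacy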
The next thing that we need to do is make sure that our en_core_web_sm, or small English model, was downloaded correctly. So the next thing that we need to do is create an nlp object. I'm going to be talking a lot more about this as we move forward; right now, this is just troubleshooting to make sure that we've installed spaCy correctly and we've downloaded our model correctly. So we're going to use the spacy.load command. This is going to take one argument, a string that corresponds to the model that you've installed, in this case en_core_web_sm. And if you execute this cell and you have no errors, you have successfully installed spaCy correctly and you've downloaded the en_core_web_sm model correctly.
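Here is a minimal sketch of that troubleshooting cell, using the model we just downloaded:

# Create the nlp object by loading the small English model.
nlp = spacy.load("en_core_web_sm")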
So go ahead, take the time, and get all of this stuff set up. Pause the video if you need to, and then pop back, and we're going to start actually working through the basics of spaCy. I'm now going to move into kind of an
overview of kind of what's within spaCy, why it's useful, and kind of some of the basic features of it that you need to be familiar with. And I'm going to be working from the Jupyter Notebook that I talked about in the introduction to this video. If we scroll down to the bottom of chapter one, the basics of spaCy, then once you get past the install section, you get to this section on containers. So what are containers? Well, containers within spaCy are objects that contain a large quantity of data about a text. There are several different containers that you can work with in spaCy: the Doc, the DocBin, the Example, the Language, the Lexeme, the Span, the SpanGroup, and the Token. We're going to be dealing with the Lexeme a little bit in this video series, and we're going to be dealing with the Language container a little bit in this video series. But really, the three big things that we're going to be talking about again and again are the Doc, the Span, and the Token. And I think when you first come to spaCy, there's a little bit of a learning curve about what these things are, what they do, and how they are structured hierarchically. And for that reason I've created this, in my opinion, kind of easy-to-understand image of what the different containers are. So if you think about spaCy as a pyramid, a hierarchical system, we've got all these different containers structured around, really, the doc object. Your Doc container, or your doc object, contains a whole bunch of metadata about the text that you pass to the spaCy pipeline, which we're going to see in practice in just a few minutes. The doc
object contains a bunch of different things. It contains attributes, and these attributes can be things like sentences. So if you iterate over doc.sents, you can actually access all the different sentences found within that doc object. If you iterate over each individual item, or index, in your doc object, you can get individual tokens. Tokens are going to be things like words or punctuation marks: something within your sentence or text that has a self-contained important value, either syntactically or semantically. So this is going to be things like words, a comma, a period, a semicolon, a quotation mark, things like this. These are all going to be your tokens. And we're going to see how tokens are a little different than just splitting words up with traditional string methods in Python.
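As a rough sketch of what that looks like in code, with an example sentence of my own (we'll do this properly on the Wikipedia text in a moment), and assuming the nlp object created above:

doc = nlp("Martin Luther King Jr. was a minister. He was born in Atlanta.")

# Iterate over the sentences the pipeline has detected.
for sent in doc.sents:
    print(sent.text)

# Iterate over the individual tokens: words and punctuation marks alike.
for token in doc:
    print(token.text)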
The next thing that you should be kind of familiar with is spans. Spans are important because they kind of exist both within and outside of the doc object. Unlike the token, which is an index of the doc object, a span can be a single token itself, but it can also be a sequence of multiple tokens; we're going to see that at play. So imagine you had spans and their categories: maybe span group one is places, so a single token might be a city like Berlin; but span group two could be something like full proper names, of people, for example. So this could be, as we're going to see, Martin Luther King: this would be a sequence of tokens, a sequence of three different items in the sentence, that make up one span, or one self-contained item. So Martin Luther King would be a person that is a sequence of individual tokens.
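Here is a rough sketch of that idea; the sentence and the slice indices are just for illustration:

doc = nlp("Martin Luther King Jr. was a minister from Atlanta.")

# A span is a slice of the doc: here, the first three tokens.
span = doc[0:3]
print(span.text)   # Martin Luther King
print(type(span))  # <class 'spacy.tokens.span.Span'>

# A single token is just one index of the doc.
token = doc[0]
print(token.text)  # Martin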
If that doesn't make sense right now, this image will be reinforced as we go through and learn more about spaCy in practice. For right now, I want you to just understand that the doc object is the thing around which all of spaCy sits. This is going to be the object that you create, the object that contains all the metadata that you need to access, and the object that you try to essentially improve with different custom components, factories, and pipelines as you go through and do more advanced things with spaCy. We're going to see in just a few seconds how that doc object is kind of similar to the text itself, but how it's very, very different and much more powerful. We're now going to be moving on to chapter two
of this textbook, which is going to deal with kind of getting used to the in-depth features of spaCy. If you want to, pause the video and keep this notebook, or this book, open, kind of separate from this video, and follow along as we go through and explore it in live coding. We're going to be talking about a few different things as we explore chapter two; this will be a lot longer than chapter one. We're going to be not only importing spaCy, but actually going through and loading up a model and creating a doc object with that model, so that we can work with the container in practice. And then we're going to see how that container stores a lot of different features, or metadata attributes, about the text, and how, while the container and the text look the same on the surface, they're actually quite different. So let's
go ahead and work within our same Jupyter Notebook, where we've imported spaCy and we have already created the nlp object. The first thing that I want to do is open up a text to start working with. Within this repo, we've got a data folder. Within this data subfolder, I've got a couple of different Wikipedia openings: I've got one on MLK that we're going to be using a little later in this video, and then I have one on the United States; this is wiki_us. That's going to be what we work with right now. So let's use our with operator and open up data/wiki_us.txt. We're going to just read that in as f, and then we're going to create this text object, which is going to be equal to f.read(). And now that we've got our text object created, let's go ahead and see what this looks like. So let's print the text.
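Here is a minimal sketch of that cell. The data/wiki_us.txt path comes from the course repo, so adjust it to wherever the file lives on your machine; I've also added an explicit encoding, which the video doesn't show but which is a safe default.

# Open the Wikipedia text from the repo's data folder and read it in as a string.
with open("data/wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(text)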
And we see that it's a standard Wikipedia article: it kind of follows that same introductory format, and it's about four or five paragraphs long with a lot of the features left in, such as the brackets that delineate some kind of a footnote. We're not going to worry too much about cleaning this up right now, because we're not interested in cleaning our data so much as just starting to work with the doc object in spaCy. So the first thing that you want to do is you're going to want to create a doc object.
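As a preview of that next step, here is a minimal sketch using the nlp object and the text string we created above:

# Pass the raw text through the pipeline to create a doc object.
doc = nlp(text)

# Printed, the doc looks like the original text, but it also carries tokens,
# sentences, and other linguistic annotations on top of it.
print(doc)
print(len(doc))  # length in tokens, not characters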