[MUSIC PLAYING] SPEAKER 1: OK, welcome back, everyone,
to our final topic in an introduction to artificial intelligence with Python. And today, the topic is language. So thus far in the class,
we've seen a number of different ways of interacting
with AI, artificial intelligence, but it's mostly been happening in
the way of us formulating problems in ways that AI can understand--
learning to speak the language of AI, so to speak, by trying to take a problem
and formulate it as a search problem, or by trying to take a problem and make
it a constraint satisfaction problem-- something that our AI
is able to understand. Today, we're going to try and come
up with algorithms and ideas that allow our AI to meet us
halfway, so to speak-- to be able to allow AI to be able to
understand, and interpret, and get some sort of meaning
out of human language-- the type of language,
the spoken language, like English, or some other
language that we naturally speak. And this turns out to be a
really challenging task for AI. And it really encompasses a
number of different types of tasks all under the broad heading of
natural language processing, the idea of coming up
with algorithms that allow our AI to be able to process
and understand natural language. So these tasks vary in
terms of the types of tasks we might want an AI to perform,
and therefore, the types of algorithms that we might use. Them but some common
tasks that you might see are things like automatic summarization. You give an AI a long document,
and you would like for the AI to be able to summarize
it, come up with a shorter representation of the same idea, but
still in some kind of natural language, like English. Something like information extraction--
given a whole corpus of information in some body of documents
or on the internet, for example, we'd like for
our AI to be able to extract some sort of meaningful semantic
information out of all of that content that it's able to look at and read. Language identification--
the task of, given a page, can you figure out what language
that document is written in? This is the type of thing you might
see if you use a web browser where, if you open up a page
in another language, that web browser might ask you, oh,
I think it's in this language-- would you like me to translate into
English for you, for example? And that language
identification process is a task that our AI needs to be able to
do, which is then related then to machine translation, the process
of taking text in one language and translating it into another
language-- which there's been a lot of research
and development on really over the course of the
last several years. And it keeps getting
better, in terms of how it is that AI is able to
take text in one language and transform that text into
another language as well. In addition to that, we have topics
like named entity recognition. Given some sequence of text, can you
pick out what the named entities are? These are names of companies,
or names of people, or names of locations for example, which
are often relevant or important parts of a particular document. Speech recognition as a related task
not to do with the text that is written, but text that is spoken-- being able
to process audio and figure out, what are the actual words
that are spoken there? And if you think about smart
home devices, like Siri or Alexa, for example, these are
all devices that are now able to listen to when we
are able to speak, figure out what words we are saying, and draw some
sort of meaning out of that as well. We've talked about how you
could formulate something, for instance, as a hidden
Markov model to be able to draw those sorts of conclusions. Text classification, more
generally, is a broad category of types of ideas, whenever we
want to take some kind of text and put it into some sort of category. And we've seen these
classification type problems and how we can use statistical
machine learning approaches to be able to solve them. We'll be able to do something
very similar with natural language, though we may need to make a couple
of adjustments that we'll see soon. And then something like
word sense disambiguation, the idea that, unlike in
the language of numbers, where AI has very precise
representations of everything, words are a little bit fuzzy,
in terms of their meaning, and words can have multiple
different meanings-- and natural language is
inherently ambiguous, and we'll take a look at some of
those ambiguities in due time today. But one challenging
task, if you want an AI to be able to understand
natural language, is being able to
disambiguate or differentiate between different possible
meanings of words. If I say a sentence like, I went to
the bank, you need to figure out, do I mean the bank where I
deposit and withdraw money or do I mean the bank like the river bank? And different words can
have different meanings that we might want to figure out. And based on the context
in which a word appears-- the wider sentence,
or paragraph, or paper in which a particular word appears-- that might help to
inform how it is that we disambiguate between different
meanings or different senses that a word might have. And there are many other topics
within natural language processing, many other algorithms
that have been devised in order to deal with and
address these sorts of problems. And today, we're really just
going to scratch the surface, looking at some of the fundamental ideas
that are behind many of these ideas within natural language processing,
within this idea of trying to come up with AI algorithms that are able to do
something meaningful with the languages that we speak everyday. And so to introduce this idea,
when we think about language, we can often think about it in
a couple of different parts. The first part refers to
the syntax of language. This is more to do with just
the structure of language and how it is that that structure works. And if you think about natural
language, syntax is one of those things that, if you're a native
speaker of a language, it comes pretty readily to you. You don't have to think
too much about it. If I give you a sentence from Sir
Arthur Conan Doyle's Sherlock Holmes, for example, a sentence like this-- "just before nine o'clock, Sherlock
Holmes stepped briskly into the room"-- I think we could probably
all agree that this is a well-formed grammatical sentence. Syntactically, it makes
sense, in terms of the way that this particular
sentence is structured. And syntax applies not just to natural
language, but to programming languages as well. If you've ever seen a syntax error
in a program that you've written, it's likely because you
wrote some sort of program that was not syntactically well-formed. The structure of it was
not a valid program. In the same way, we can look at
English sentences, or sentences in any natural language, and
make the same kinds of judgments. I can say that this sentence
is syntactically well-formed. When all the parts are put together,
all these words are in this order, it constructs a grammatical sentence, or
a sentence that most people would agree is grammatical. But there are also grammatically
ill-formed sentences. A sentence like, "just
before Sherlock Holmes 9 o'clock stepped briskly the room"-- well, I think we would all agree that
this is not a well-formed sentence. Syntactically, it doesn't make sense. And this is the type of thing that,
if we want our AI, for example, to be able to generate
natural language-- to be able to speak to us the way
like a chat bot would speak to us, for example-- well then our AI is going to need to be
able to know this distinction somehow, is going to be able to know what
kinds of sentences are grammatical, what kinds of sentences are not. And we might come up with rules or ways
to statistically learn these ideas, and we'll talk about some
of those methods as well. Syntax can also be ambiguous. There are some sentences that are
well-formed and not well-formed, but certain way-- there are certain
ways that you could take a sentence and potentially construct multiple
different structures for that sentence. A sentence like, "I saw the man on
the mountain with a telescope," well, this is grammatically well-formed--
syntactically, it makes sense-- but what is the structure
of the sentence? Is it the man on the mountain
who has the telescope, or am I seeing the man on the mountain and
I am using the telescope in order to see the man on the mountain? There's some interesting ambiguity
here, where it could have potentially two different types of structures. And this is one of the ideas
that will come back to also, in terms of how to think about dealing
with AI when natural language is inherently ambiguous. So that then is syntax,
the structure of language, and getting an
understanding for how it is that, depending on the order
and placement of words, we can come up with different
structures for language. But in addition to language having
structure, language also has meaning. And now we get into the world
of semantics, the idea of, what it is that a word,
or a sequence of words, or a sentence, or an entire
essay actually means? And so a sentence like, "just
before 9:00, Sherlock Holmes stepped briskly into the
room," is a different sentence from a sentence like, "Sherlock Holmes
stepped briskly into the room just before 9:00." And yet they have
effectively the same meaning. They're different
sentences, so an AI reading them would recognize them as
different, but we as humans can look at both the
sentences and say, yeah, they mean basically the same thing. And maybe, in this case, it was just
because I moved the order of the words around. Originally, 9 o'clock was near
the beginning of the sentence. Now 9 o'clock is near
the end of the sentence. But you might imagine that I could come
up with a different sentence entirely, a sentence like, "a few minutes
before 9:00, Sherlock Holmes walked quickly into the room." And OK, that also has
a very similar meaning, but I'm using different words
in order to express that idea. And ideally, AI would
be able to recognize that these two sentences, these
different sets of words that are similar to each other,
have similar meanings, and to be able to get
at that idea as well. Then there are also ways that a
syntactically well-formed sentence might not mean anything at all. A famous example from linguist Noam
Chomsky is this sentence here-- "colorless green ideas sleep furiously." Syntactically, that
sentence is perfectly fine. Colorless and green are adjectives
that modify the noun ideas. Sleep is a verb. Furiously is an adverb. These are correct constructions,
in terms of the order of words, but it turns out this
sentence is meaningless. If you tried to ascribe meaning to
the sentence, what does it mean? And it's not easy to
be able to determine what it is that it might mean. Semantics itself can also be ambiguous,
given that different structures can have different types of meanings. Different words can have
different kinds of meanings, so the same sentence
with the same structure might end up meaning
different types of things. So my favorite example
from the LA times is a headline that was in the Los
Angeles Times a little while back. The headline says, "Big rig carrying
fruit crashes on 210 freeway, creates jam." So depending on how it is
you look at the sentence-- how you interpret the sentence-- it
can have multiple different meanings. And so here too are challenges in this
world of natural language processing, being able to understand
both the syntax of language and the semantics of language. And today, we'll take a
look at both of those ideas. We're going to start
by talking about syntax and getting a sense for how it
is that language is structured, and how we can start by coming
up with some rules, some ways that we can tell our computer,
tell our AI what types of things are valid sentences, what types
of things are not valid sentences. And ultimately, we'd like
to use that information to be able to allow our AI to
draw meaningful conclusions, to be able to do
something with language. And so to do so, we're going
to start by introducing the notion of formal grammar. And what formal grammar is
all about is this: a formal grammar is a system of rules that
generate sentences in a language. I would like to know what are
the valid English sentences-- not in terms of what they mean-- just in terms of their structure--
their syntactic structure. What structures of English
are valid, correct sentences? What structures of
English are not valid? And this is going to apply in a very
similar way to other natural languages as well, where language follows
certain types of structures. And we intuitively know
what these structures mean, but it's going to be helpful to
try and really formally define what the structures mean as well. There are a number of different
types of formal grammar all across what's known as the
Chomsky hierarchy of grammars. And you may have seen
some of these before. If you've ever worked with
regular expressions before, those belong to a class
of regular languages. They correspond to regular languages,
which is a particular type of language. But also on this hierarchy
is a type of grammar known as a context-free grammar. And this is the one we're
going to spend the most time on taking a look at today. And what a context-free grammar
is it is a way of taking-- of generating sentences
in a language or via what are known as rewriting rules--
replacing one symbol with other symbols. And we'll take a look in a
moment at just what that means. So let's imagine, for example,
a simple sentence in English, a sentence like, "she saw the city"-- a valid, syntactically
well-formed English sentence. But we'd like for some way for our
AI to be able to look at the sentence and figure out, what is the
structure of the sentence? If you imagine an AI in a
question answering format-- if you want to ask the AI a
question like, what did she see, well, then the AI wants to be
able to look at this sentence and recognize that what she saw is the
city-- to be able to figure that out. And it requires some
understanding of what it is that the structure of
this sentence really looks like. So where do we begin? Each of these words--
she, saw, the, city-- we are going to call terminal symbols. They are symbols in our language--
where each of these words is just a symbol-- where this is ultimately what
we care about generating. We care about generating these words. But each of these words
we're also going to associate with what we're going to
call a non-terminal symbol. And these non-terminal symbols initially
are going to look kind of like parts of speech, if you remember
back to like English grammar-- where she is a noun,
saw is a V for verb, the is a D. D stands for determiner. These are words like the,
and a, and an, for example. And then city-- well, city is
also a noun, so an N goes there. So each of these-- N, V, and D-- these are what we might
call non-terminal symbols. They're not actually
words in the language. She saw the city-- those are
the words in the language. But we use these non-terminal symbols
to generate the terminal symbols, the terminal symbols which
are like, she saw the city-- the words that are actually
in a language like English. And so in order to translate these
non-terminal symbols into terminal symbols, we have what are
known as rewriting rules, and these rules look
something like this. We have N on the left side
of an arrow, and the arrow says, if I have an N
non-terminal symbol, then I can turn it into any of these
various different possibilities that are separated with a vertical line. So a noun could translate
into the word she. A noun could translate into the
word city, or car, or Harry, or any number of other things. These are all examples
of nouns, for example. Meanwhile, a determiner, D, could
translate into the, or a, or an. V for verb could translate
into any of these verbs. P for preposition could translate
into any of those prepositions-- to, on, over, and so forth. And then ADJ for adjective can translate
into any of these possible adjectives as well. So these then are rules in
our context-free grammar. When we are defining what
it is that our grammar is, what is the structure of the English
language or any other language, we give it these types of
rules saying that a noun could be any of these possibilities, a verb
could be any of those possibilities. But it turns out we can then begin
to construct other rules where it's not just one non-terminal
translating into one terminal symbol. We're always going to have one
non-terminal on the left-hand side of the arrow, but on the
right-hand side of the arrow, we could have other things. We could even have other
non-terminal symbols. So what do I mean by this? Well, we have the idea of nouns-- like
she, city, car, Harry, for example-- but there are also noun phrases-- phrases that work as nouns-- that are not just a single word,
but there are multiple words. Like the city is two words,
that together, operate as what we might call a noun phrase. It's multiple words, but they're
together operating as a noun. Or if you think about a more complex
expression, like the big city-- three words all operating
as a single noun-- or the car on the street-- multiple words now, but that entire set
of words operates kind of like a noun. It substitutes as a noun phrase. And so to do this, we'll
introduce the notion of a new non-terminal symbol called
NP, which will stand for noun phrase. And this rewriting rule says that
a noun phrase could be a noun-- so something like she is
a noun, and therefore, it can also be a noun phrase-- but a noun phrase could also be a
determiner, D, followed by a noun-- so two ways we can have a noun
phrase in this very simple grammar. Of course, the English language
is more complex than just this, but a noun phrase is either a noun or
it is a determiner followed by a noun. So for the first example, a
noun phrase that is just a noun, that would allow us to
generate noun phrases like she, because a noun phrase is
just a noun, and a noun could be the word she, for example. Meanwhile, if we wanted to look at
one of the examples of these, where a noun phrase becomes a
determiner and a noun, then we get a structure like this. And now we're starting to
see the structure of language emerge from these rules in a
syntax tree, as we'll call it, this tree-like structure that represents
the syntax of our natural language. Here, we have a noun
phrase, and this noun phrase is composed of a determiner and a noun,
where the determiner is the word the, according to that rule,
and noun is the word city. So here then is a noun phrase that
consists of multiple words inside of the structure. And using this idea of taking one symbol
and rewriting it using other symbols-- that might be terminal
symbols, like the and city, but might also be non-terminal symbols,
like D for determiner or N for noun-- then we can begin to construct
more and more complex structures. In addition to noun phrases, we
can also think about verb phrases. So what might a verb phrase look like? Well, a verb phrase might
just be a single verb. In a sentence like "I
walked," walked is a verb, and that is acting as the
verb phrase in that sentence. But there are also more complex verb
phrases that aren't just a single word, but that are multiple words. If you think of the sentence like
"she saw the city," for example, saw the city is really
that entire verb phrase. It's capturing what it is
that she is doing, for example. And so our verb phrase
might have a rule like this. A verb phrase is either
just a plain verb or it is a verb followed
by a noun phrase. And we saw before that a
noun phrase is either a noun or it is a determiner
followed by a noun. And so a verb phrase
might be something simple, like a verb phrase that is just a verb. And that verb could be the
word walked for example. But it could also be
something more sophisticated, something like this one, where we
begin to see a larger syntax tree, where the way to read the
syntax tree is that a verb phrase is a verb and
a noun phrase, where that verb could be something like saw. And this is a noun phrase we've
seen before, this noun phrase that is the city-- a noun phrase composed
of the determiner the and the noun city all put together to
construct this larger verb phrase. And then just to give one
more example of a rule, we could also have a rule like this-- sentence S goes to noun
phrase and a verb phrase. The basic structure of
a sentence is that it is a noun phrase followed by a verb phrase. And this is a formal grammar
way of expressing the idea that you might have learned when you
learned English grammar, when you read that a sentence is like a subject
and a verb, subject and action-- something that's happening
to a particular noun phrase. And so using this structure,
we could construct a sentence that looks like this. A sentence consists of a noun
phrase and a verb phrase. A noun phrase could just be
a noun, like the word she. The verb phrase could be
a verb and a noun phrase, where-- this is something we've
seen before-- the verb is saw and the noun phrase is the city. And so now look what we've done here. What we've done is, by
defining a set of rules, there are algorithms that we
can run that take these words-- and the CYK algorithm, for example, is
one example of this if you want to look into that-- where you start with a set of terminal
symbols, like she saw the city, and then using these rules,
you're able to figure out, how is it that you go from a
sentence to she saw the city? And it's all through
these rewriting rules. So the sentence is a noun
phrase and a verb phrase. A verb phrase could be a verb and
a noun phrase, so on and so forth, where you can imagine
taking this structure and figuring out how it is that
you could generate a parse tree-- a syntax tree-- for that set of
terminal symbols, that set of words. And if you tried to do this for a
sentence that was not grammatical, something like "saw the city
she," well, that wouldn't work. There'd be no way to
take a sentence and use these rules to be able to
generate that sentence that is not inside of that language. So this sort of model
can be very helpful if the rules are expressive
enough to express all the ideas that you might want to
express inside of natural language. Of course, using just the
simple rules we have here, there are many sentences that we
won't be able to generate-- sentences that we might agree are grammatical
and syntactically well-formed, but that we're not going to be able
to construct using these rules. And then, in that case,
we might just need to have some more complex rules in
order to deal with those sorts of cases. And so this type of
approach can be powerful if you're dealing with a
limited set of rules and words that you really care about dealing with. And one way we can actually
interact with this in Python is by using a Python library called
NLTK, short for natural language toolkit, which we'll see
a couple of times today, which has a wide variety of
different functions and classes that we can take
advantage of that are all meant to deal with natural language. And one such algorithm that
it has is the ability to parse a context-free grammar, to
be able to take some words and figure out according to
some context-free grammar, how would you construct
the syntax tree for it? So let's go ahead and
take a look at NLTK now by examining how we might construct
some context-free grammars with it. So here inside of cfg0-- cfg's short for context-free grammar-- I have a sample context-free grammar
which has rules that we've seen before. So sentence goes to noun phrase
followed by a verb phrase. Noun phrase is either a
determiner and a noun or a noun. Verb phrase is either a verb
or a verb and a noun phrase. The order of these things
doesn't really matter. Determiners could be the
word the or the word a. A noun could be the
word she, city, or car. And a verb could be the word saw
or it could be the word walked. Now, using NLTK, which I've
imported here at the top, I'm going to go ahead
and parse this grammar and save it inside of this
variable called parser. Next, my program is going
to ask the user for input. Just type in a sentence,
and dot split will just split it on all of the
spaces, so I end up getting each of the individual words. We're going to save that inside
of this list called sentence. And then we'll go ahead and try to parse
the sentence, and for each sentence we parse, we're going to
pretty print it to the screen, just so it displays in my terminal. And we're also going to draw it. It turns out that NLTK has
some graphics capacity, so we can really visually see
what that tree looks like as well. And there are multiple different
ways a sentence might be parsed, which is why we're putting
it inside of this for loop. And we'll see why that can
be helpful in a moment too.
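Before running it, here is a minimal sketch of roughly what a program like cfg0.py could contain; the exact rule set and variable names here are assumptions rather than the actual source:

```python
import nltk

# A toy context-free grammar using the rules discussed above: a sentence
# is a noun phrase followed by a verb phrase.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> D N | N
    VP -> V | V NP
    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)

# Split the user's sentence on spaces to get the individual words.
sentence = input("Sentence: ").split()

try:
    # A sentence can sometimes be parsed in more than one way, so loop
    # over every tree the parser finds, printing and drawing each one.
    for tree in parser.parse(sentence):
        tree.pretty_print()
        tree.draw()
except ValueError:
    # Raised if the sentence contains a word the grammar doesn't know.
    print("No parse tree possible.")
```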
All right, now that I have that, let's go ahead and try it. I'll cd into cfg, and we'll
go ahead and run cfg0. So it then is going to prompt
me to type in a sentence. And let me type in a very
simple sentence-- something like she walked, for example. Press Return. So what I get is, on
the left-hand side, you can see a text-based
representation of the syntax tree. And on the right side here-- let
me go ahead and make it bigger-- we see a visual representation
of that same syntax tree. This is how it is that my computer has
now parsed the sentence she walked. It's a sentence that consists of
a noun phrase and a verb phrase, where each phrase is just a single
noun or verb, she and then walked-- same type of structure
we've seen before, but this now is our
computer able to understand the structure of the
sentence, to be able to get some sort of structural understanding
of how it is that parts of the sentence relate to each other. Let me now give it another sentence. I could try something like she
saw the city, for example-- the words we were dealing
with a moment ago. And then we end up getting
this syntax tree out of it-- again, a sentence that has a
noun phrase and a verb phrase. The noun phrase is fairly simple. It's just she. But the verb phrase is more complex. It is now saw the city, for example. Let's do one more with this grammar. Let's do something like she saw a car. And that is going to look very similar-- that we also get she. But our verb phrase is now different. It's saw a car, because there
are multiple possible determiners in our language and
multiple possible nouns. I haven't given this grammar
rule that many words, but if I gave it a larger
vocabulary, it would then be able to understand more and
more different types of sentences. And just to give you a sense of some
added complexity we could add here, the more complex our grammar,
the more rules we add, the more different
types of sentences we'll then have the ability to generate. So let's take a look
at cfg1, for example, where I've added a whole number
of other different types of rules. I've added the adjective phrases, where
we can have multiple adjectives inside of a noun phrase as well. So a noun phrase could be an adjective
phrase followed by a noun phrase. If I wanted to say
something like the big city, that's an adjective phrase
followed by a noun phrase. Or we could also have a noun
and a prepositional phrase-- so the car on the street, for example. On the street is a
prepositional phrase, and we might want to combine those two ideas
together, because the car on the street can still operate as something
kind of like a noun phrase as well. So no need to understand all of
these rules in too much detail-- it starts to get into the
nature of English grammar-- but now we have a more complex way of
understanding these types of sentences.
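To give a rough idea of what that added complexity might look like in code, here is an illustrative guess at a richer grammar in the spirit of cfg1; the actual rules and vocabulary in the course's file may differ:

```python
import nltk

# Adjective phrases (AP) and prepositional phrases (PP) can now appear
# inside noun phrases and verb phrases.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    AP -> A | A AP
    NP -> N | D NP | AP NP | N PP
    PP -> P NP
    VP -> V | V NP | V NP PP
    A -> "big" | "wide" | "blue"
    D -> "the" | "a"
    N -> "she" | "city" | "street" | "dog" | "binoculars"
    P -> "on" | "with"
    V -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(input("Sentence: ").split()):
    tree.pretty_print()
```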
So if I run Python cfg1-- and I can try typing something like she saw the wide street, for example-- a more complex sentence. And if we make that larger, you can
see what this sentence looks like. I'll go ahead and
shrink it a little bit. So now we have a sentence like
this-- she saw the wide street. The wide street is one
entire noun phrase, saw the wide street is
an entire verb phrase, and she saw the wide street ends
up forming that entire sentence. So let's take a look at one more example
to introduce this notion of ambiguity. So I can run Python cfg1. Let me type a sentence like
she saw a dog with binoculars. So there's a sentence, and here
now is one possible syntax tree to represent this idea-- she saw, the noun phrase a dog,
and then the prepositional phrase with binoculars. And the way to interpret the sentence is
that what it is that she saw was a dog. And how did she do the seeing? She did the seeing with binoculars. And so this is one possible
way to interpret this. She was using binoculars. Using those binoculars, she saw a dog. But another possible way
to parse that sentence would be with this tree over
here, where you have something like she saw a dog with binoculars,
where a dog with binoculars forms an entire noun phrase of its own-- same words in the same order, but
a different grammatical structure, where now we have a dog with binoculars
all inside of this noun phrase, meaning what did she see? What she saw was a dog,
and that dog happened to have binoculars with it-- so
different ways to parse the sentence-- structures for the sentence-- even given
the same possible sequence of words. And NLTK's algorithm and
this particular algorithm has the ability to find
all of these, to be able to understand the
different ways that you might be able to parse a sentence and be able
to extract some sort of useful meaning out of that sentence as well. So that then is a brief
look at what we can do in getting at the structure of language,
using these context-free grammar rules to be able to describe
the structure of language. But what we might also
care about is understanding how it is that these
sequences of words are likely to relate to each other in
terms of the actual words themselves. The grammar that we saw before could
allow us to generate a sentence like, I ate a banana, for example, where I
is the noun phrase and ate a banana is a verb phrase. But it would also allow
for sentences like, I ate a blue car, for example, which
is also syntactically well-formed according to the rules, but is probably
a less likely sentence that a person is likely to speak. And we might want for our
AI to be able to encapsulate the idea that certain sequences of words
are more or less likely than others. So to deal with that, we'll
introduce the notion of an n-gram, and an n-gram, more generally,
just refers to some sequence of n items inside of our text. And those items might take
various different forms. We can have character n-grams,
which are just a contiguous sequence of n characters--
so three characters in a row, for example, or four
characters in a row. We can also have word n-grams,
which are a contiguous sequence of n words in a row
from a particular sample of text. And these end up proving
quite useful, and you can choose n to decide how long
our sequence is going to be. So when n is 1, we're just looking at
a single word or a single character. And that is what we might
call a unigram, just one item. If we're looking at two
characters or two words, that's generally called
a bigram-- so an n-gram where n is equal to 2, looking at
two words that are consecutive. And then, if there are
three items, you might imagine we'll often call those
trigrams-- so three characters in a row or three words that happen
to be in a contiguous sequence. And so if we took a
sentence, for example-- here's a sentence from,
again, Sherlock Holmes-- "how often have I said
to you that, when you have eliminated the
impossible, whatever remains, however improbable, must be the truth." What are the trigrams that we
can extract from the sentence? If we're looking at
sequences of three words, well, the first trigram
would be how often have-- just a sequence of three words. And then we can look
at the next trigram, often have I. The next
trigram is have I said. Then I said to, said to you,
to you that, for example-- those are all trigrams of words,
sequences of three contiguous words that show up in the text. And extracting those bigrams and
trigrams, or n-grams more generally, turns out to be quite
helpful, because often, when we're dealing with
analyzing a lot of text, it's not going to be particularly
meaningful for us to try and analyze the entire text at one time. But instead, we want to segment
that text into pieces that we can begin to do some analysis of-- that our AI might never have
seen this entire sentence before, but it's probably seen the
trigram to you that before, because to you that is something that
might have come up in other documents that our AI has seen before. And therefore, it knows a little
bit about that particular sequence of three words in a row-- or
something like have I said, another example of another sequence
of three words that's probably quite popular, in terms of where you
see it inside the English language. So we'd like some way to be able
to extract these sorts of n-grams. And how do we do that? How do we extract
sequences of three words? Well, we need to take our
input and somehow separate it into all of the individual words. And this is a process generally
known as tokenization, the task of splitting up some
sequence into distinct pieces, where we call those pieces tokens. Most commonly, this refers to
something like word tokenization. I have some sequence of text
and I want to split it up into all of the words
that show up in that text. But it might also come up
in the context of something like sentence tokenization. I have a long sequence of text
and I'd like to split it up into sentences, for example. And so how might word tokenization
work, the task of splitting up our sequence of characters into words? Well, we've also already seen this idea. We've seen that, in word
tokenization just a moment ago, I took an input sequence and I just called
Python's split method on it, where the split method took
that sequence of words and just separated it based on where
the spaces showed up in that string. And so if I had a sentence like,
whatever remains, however improbable, must be the truth, how
would I tokenize this? Well, the naive approach is just
to say, anytime you see a space, go ahead and split it up. We're going to split up this particular
string just by looking for spaces. And what we get when we do
that is a sentence like this-- whatever remains, however
improbable, must be the truth. But what you'll notice here is
that, if we just split things up in terms of where the spaces are, we
end up keeping the punctuation around. There's a comma after the word remains. There's a comma after
improbable, a period after truth. And this poses a little
bit of a challenge, when we think about trying to tokenize
things into individual words, because if you're comparing
words to each other, this word truth with a period after it-- if you just string
compare it, it's going to be different from the word
truth without a period after it. And so this punctuation can
sometimes pose a problem for us, and so we might want some way of dealing
with it-- either treating punctuation as a separate token altogether or maybe
removing that punctuation entirely from our sequence as well. So that might be
something we want to do.
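As a quick, hedged illustration of that difference, here is a small sketch comparing a plain split with NLTK's word tokenizer, which treats punctuation as separate tokens (the punkt models may need to be downloaded once first):

```python
import nltk

# The tokenizer relies on NLTK's "punkt" models, which may need to be
# downloaded once before first use.
nltk.download("punkt", quiet=True)

text = "Whatever remains, however improbable, must be the truth."

# Naive approach: splitting on spaces keeps punctuation attached to the
# neighboring words ("remains," and "truth.").
print(text.split())

# NLTK's word tokenizer separates punctuation into its own tokens.
print(nltk.word_tokenize(text))
```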
But there are other cases where it becomes a little bit less clear. If I said something like,
just before 9:00 o'clock, Sherlock Holmes stepped
briskly into the room, well, this apostrophe in o'clock-- after the o in o'clock-- is
that something we should remove? Should we split based on that
as well, into o and clock? There's some interesting
questions there too. And it gets even trickier if you begin
to think about hyphenated words-- something like this, where we
have a whole bunch of words that are hyphenated and then you
need to make a judgment call. Is that a place where you're
going to split things apart into individual words, or are you going
to consider frock-coat, and well-cut, and pearl-grey to be
individual words of their own? And so those tend to pose challenges
that we need to somehow deal with and something we need to
decide as we go about trying to perform this kind of analysis. Similar challenges arise when it comes
to the world of sentence tokenization. Imagine this sequence of
sentences, for example. If you take a look at this
particular sequence of sentences, you could probably imagine you could
extract the sentences pretty readily. Here is one sentence and
here is a second sentence, so we have two different sentences
inside of this particular passage. And the distinguishing feature
seems to be the period-- that a period separates
one sentence from another. And maybe there are other
types of punctuation you might include here as well-- an exclamation point, for
example, or a question mark. But those are the types of
punctuation that we know tend to come at the end of sentences. But it gets trickier again if you look
at a sentence like this-- not just, say, talking to Sherlock, but
instead of talking to Sherlock, talking to Mr. Holmes. Well now, we have a
period at the end of Mr. And so if you were just
separating on periods, you might imagine this
would be a sentence, and then just Holmes
would be a sentence, and then we'd have a
third sentence down below. Things do get a little
bit trickier as you start to imagine these sorts of situations. And dialogue too starts to
make this trickier as well-- that if you have these sorts of lines
that are inside of something that-- he said, for example-- that he said this
particular sequence of words and then this particular
sequence of words. There are interesting
challenges that arise there too, in terms of how it is
that we take the sentence and split it up into
individual sentences as well. And these are just things that
our algorithm needs to decide. In practice, there are usually some
heuristics that we can use. We know there are certain
occurrences of periods, like the period after Mr.,
or in other examples where we know that is not the
beginning of a new sentence, and so we can encode
those rules into our AI to allow it to be able to
do this tokenization the way that we want it to.
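Here is a small sketch of that kind of sentence tokenization with NLTK's sent_tokenize, whose pretrained English model builds in heuristics of this sort, so the period after Mr. typically does not end a sentence; the example passage is made up for illustration:

```python
import nltk

nltk.download("punkt", quiet=True)

passage = ("He was not talking to Sherlock. He was talking to Mr. Holmes. "
           "Just before nine o'clock he stepped briskly into the room.")

# sent_tokenize should split this into three sentences, treating the
# period after "Mr." as part of an abbreviation rather than an ending.
for sentence in nltk.sent_tokenize(passage):
    print(sentence)
```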
So once we have this ability to tokenize a particular passage-- take the passage, split it
up into individual words-- from there, we can begin to extract
what the n-grams actually are. So we can actually take
a look at this by going into a Python program that will
serve the purpose of extracting these n-grams. And again, we can use NLTK, the
Natural Language Toolkit, in order to help us here. So I'll go ahead and go into ngrams
and we'll take a look at ngrams.py. And what we have here
is we are going to take some corpus of text, just
some sequence of documents, and use all those documents and extract
what the most popular n-grams happen to be. So in order to do so, we're going to
go ahead and load data from a directory that we specify in the
command line argument. We'll also take in a number
n as a command line argument as well, in terms of what
our number should be-- that is, how many words
we're going to look at in sequence. Then we're going to go ahead and
just count up all of the nltk.ngrams. So we're going to look at all of
the grams across this entire corpus and save it inside this variable ngrams. And then we're going to
look at the most common ones and go ahead and print them out. And so in order to do so,
I'm not only using NLTK-- I'm also using Counter, which is built
into Python as well, where I can just count up how many times these
various different n-grams appear.
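The file itself isn't reproduced in this transcript, so here is a hedged sketch of roughly what ngrams.py might look like; the command-line handling and the token filtering are assumptions:

```python
import os
import sys
from collections import Counter

import nltk

# Assumed usage: python ngrams.py n corpus_directory
n = int(sys.argv[1])
corpus_dir = sys.argv[2]

# Read every document in the corpus and tokenize it into lowercase
# alphabetic words (word_tokenize needs NLTK's punkt models).
words = []
for filename in os.listdir(corpus_dir):
    with open(os.path.join(corpus_dir, filename), encoding="utf-8") as f:
        words.extend(
            token.lower()
            for token in nltk.word_tokenize(f.read())
            if token.isalpha()
        )

# Count every contiguous sequence of n words and print the most common.
ngrams = Counter(nltk.ngrams(words, n))
for gram, count in ngrams.most_common(10):
    print(f"{count}: {gram}")
```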
So we'll go ahead and show that. We'll go into ngrams, and I'll say something like python ngrams--
for the unigrams, sequences of one word inside of a corpus. And the corpus that
I've prepared is I have all of the-- or some of these
stories from Sherlock Holmes all here, where each one is just
one of the Sherlock Holmes stories. And so I have a whole bunch of
text here inside of this corpus, and I'll go ahead and provide that
corpus as a command line argument. And now what my program
is going to do is it's going to load all of the
Sherlock Holmes stories into memory-- or all the ones that I've
provided in this corpus at least-- and it's just going to look
for the most popular unigrams, the most popular sequences of one word. And it seems the most popular one is
just the word the, used some 9,700 times; followed by I, used 5,000 times;
and, used about 5,000 times-- the kinds of words you might expect. So now let's go ahead and check for
bigrams, for example, ngrams 2, holmes. All right, again, sequences of two
words now that appear multiple times-- of the, in the, it was, to the, it
is, I have-- so on and so forth. These are the types
of bigrams that happen to come up quite often inside this
corpus, inside of the Sherlock Holmes stories. And it probably is true
across other corpora as well, but we could only find out
if we actually tested it. And now, just for good
measure, let's try one more-- maybe try three, looking now
for trigrams that happen to show up. And now we get it was the, one
of the, I think that, out of the. These are sequences of
three words now that happen to come up multiple times
across this particular corpus. So what are the
potential use cases here? Now we have some sort of data. We have data about how often
particular sequences of words show up in this particular
order, and using that, we can begin to do some
sort of predictions. We might be able to say that, if
you see the words it was, there's a reasonable
chance the word that comes after it should be the word a. And if I see the words one of,
it's reasonable to imagine that the next word might be
the word the, for example, because we have this data about
trigrams, sequences of three words and how often they come up. And now, based on two
words, you might be able to predict what the
third word happens to be. And one model we can use for that is
a model we've actually seen before. It's the Markov model. Recall again that the
Markov model really just refers to some sequence
of events that happen one time step after another,
where every unit has some ability to predict what the next
unit is going to be-- or maybe the past two units predict
what the next unit is going to be, or the past three predict what
the next one is going to be. And we can use a Markov
model and apply it to language for a very
naive and simple approach at trying to generate natural
language, at getting our AI to be able to speak English-like text. And the way it's going to work is we're
going to say something like, come up with some probability distribution. Given these two words,
what is the probability distribution over what the
third word could possibly be, based on all the data? If you see it was, what are the
possible third words we might have, and how often do they come up? And using that information,
we can try and construct what we expect the third word to be.
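To make that concrete, here is a small sketch of turning trigram counts into a next-word distribution; the toy corpus is made up purely for illustration:

```python
from collections import Counter, defaultdict

import nltk

# A tiny stand-in corpus; in practice this would be the tokenized text
# of the Sherlock Holmes stories from before.
words = ("it was the best of times it was the worst of times "
         "it was a dark and stormy night").split()

# For each pair of consecutive words, count which third words follow.
following = defaultdict(Counter)
for w1, w2, w3 in nltk.ngrams(words, 3):
    following[(w1, w2)][w3] += 1

# Given the two words "it was", estimate a distribution over the next word.
counts = following[("it", "was")]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word} | it was) = {count / total:.2f}")
```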
And if you keep doing this, the effect is that our Markov model
can effectively start to generate text-- can be
able to generate text that was not in the original
corpus, but that sounds kind of like the original corpus. It's using the same sorts of rules
that the original corpus was using. So let's take a look
at an example of that as well, where here now, I have
another corpus that I have here, and it is the corpus of all of
the works of William Shakespeare. So I've got a whole bunch of stories
from Shakespeare, and all of them are just inside of this big text file. And so what I might like to do is
look at what all of the n-grams are-- maybe look at all the trigrams
inside of shakespeare.txt-- and figure out, given
two words, can I predict what the third word is likely to be? And then just keep
repeating this process-- I have two words-- predict the third word; then,
from the second and third word, predict the fourth word; and
from the third and fourth word, predict the fifth word, ultimately
generating random sentences that sound like Shakespeare, that are
using similar patterns of words that Shakespeare used, but that never
actually showed up in Shakespeare itself. And so to do so, I'll
show you generator.py, which, again, is just going to
read data from a particular file. And I'm using a Python library
called markovify, which is just going to do this process for me. So there are libraries out there that
can just train on a bunch of text and come up with a Markov
model based on that text. And I'm going to go
ahead and just generate five randomly generated sentences.
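Here is a hedged sketch of roughly what generator.py might look like, using markovify's Text model; the exact file contents are assumptions:

```python
import sys

import markovify

# Train a Markov model on the text file named on the command line,
# e.g. python generator.py shakespeare.txt
with open(sys.argv[1], encoding="utf-8") as f:
    text = f.read()

model = markovify.Text(text)

# Print five randomly generated sentences. make_sentence can occasionally
# return None if it fails to produce one, which is left unhandled here
# to keep the sketch short.
for _ in range(5):
    print(model.make_sentence())
    print()
```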
So we'll go ahead and go into markov. I'll run the generator on shakespeare.txt. What we'll see is it's going to load
that data, and then here's what we get. We get five different
sentences, and these are sentences that never showed
up in any Shakespeare play, but that are designed to
sound like Shakespeare, that are designed to just
take two words and predict, given those two words, what would
Shakespeare have been likely to choose as the third word that follows it. And you know, these sentences
probably don't have any meaning. It's not like the AI is trying to
express any sort of underlying meaning here. It's just trying to understand,
based on the sequence of words, what is likely to come
after it as a next word, for example. And these are the types of sentences
that it's able to come up with, just generating. And if you ran this multiple times, you
would end up getting different results. I could run this again and
get an entirely different set of five different
sentences that also are supposed to sound kind of like the way
that Shakespeare's sentences sounded as well. And so that then was
a look at how it is we can use Markov models to be able to
naively attempt to generate language. The language doesn't mean
a whole lot right now. You wouldn't want to use the
system in its current form to do something like
machine translation, because it wouldn't be able
to encapsulate any meaning, but we're starting to see now that
our AI is getting a little bit better at trying to speak our
language, at trying to be able to process natural language
in some sort of meaningful way. So we'll now take a look
at a couple of other tasks that we might want our
AI to be able to perform. And one such task is text
categorization, which really is just a classification problem. And we've talked about
classification problems already, these problems where we would
like to take some object and categorize it into a
number of different classes. And so the way this comes up in text
is anytime you have some sample of text and you want to put it inside of a
category, where I want to say something like, given an email, does it belong
in the inbox or does it belong in spam? Which of these two
categories does it belong in? And you do that by looking
at the text and being able to do some sort of analysis on that
text to be able to draw conclusions, to be able to say that, given the
words that show up in the email, I think this is probably
belonging in the inbox, or I think it probably
belongs in spam instead. And you might imagine
doing this for a number of different types of classification
problems of this sort. So you might imagine that another
common example of this type of idea is something like sentiment
analysis, where I want to analyze, given a sample of text, does
it have a positive sentiment or does it have a negative sentiment? And this might come up in the case
of product reviews on a website, for example, or feedback on a website,
where you have a whole bunch of data-- samples of text that are
provided by users of a website-- and you want to be able to quickly
analyze, are these reviews positive, are the reviews negative,
just to get a sense for what it is that people are saying, to be able to categorize text into
one of these two different categories. So how might we approach this problem? Well, let's take a look at
some sample product reviews. Here are some product reviews
that we might come up with. My grandson loved it. So much fun. Product broke after a few days. One of the best games I've
played in a long time. Kind of cheap and flimsy. Not worth it. Different product reviews that you
might imagine seeing on Amazon, or eBay, or some other website where people
are selling products, for instance. And we humans can pretty
easily categorize these into positive sentiment
or negative sentiment. We'd probably say that the
first and the third one, those are positive sentiment messages. The second one and the fourth
one, those are probably negative sentiment messages. But how could a computer
do the same thing? How could it try and take these
reviews and assess, are they positive or are they negative? Well, ultimately, it
depends upon the words that happen to be in this particular--
these particular reviews-- inside of these particular sentences. For now we're going to
ignore the structure and how the words are
related to each other, and we're just going to focus
on what the words actually are. So there are probably some key
words here, words like loved, and fun, and best. Those probably show up in more
positive reviews, whereas words like broke, and cheap, and flimsy-- well, those are words
that probably are more likely to come up inside of negative
reviews, instead of positive reviews. So one way to approach this
sort of text analysis idea is to say, let's, for now, ignore the
structures of these sentences-- to say, we're not going to care about how it
is the words relate to each other. We're not going to try and parse
these sentences to construct the grammatical structure
like we saw a moment ago. But we can probably just rely
on the words that were actually used-- rely on the fact that
the positive reviews are more likely to have words
like best, and loved, and fun, and that the negative reviews are
more likely to have the negative words that we've highlighted there as well. And this sort of model-- this approach
to trying to think about language-- is generally known as
the bag of words model, where we're going to model a sample of
text not by caring about its structure, but just by caring about the
unordered collection of words that show up inside of a sample--
that all we care about is what words are in the text. And we don't care about what
the order of those words is. We don't care about the
structure of the words. We don't care what noun
goes with what adjective or how things agree with each other. We just care about the words. And it turns out this
approach tends to work pretty well for doing classifications
like positive sentiment or negative sentiment. And you could imagine doing
this in a number of ways. We've talked about different approaches
to trying to solve classification style problems, but when it
comes to natural language, one of the most popular approaches
is the naive Bayes approach. And this is one approach to trying to
analyze the probability that something is positive sentiment
or negative sentiment, or just trying to categorize
some text into possible categories. And it doesn't just work for text--
it works for other types of ideas as well-- but it is quite
popular in the world of analyzing text and natural language. And the naive Bayes approach
is based on Bayes' rule, which you might recall back from when
we talked about probability, that the Bayes' rule looks like this-- that the probability of
some event b, given a can be expressed using
this expression over here. Probability of b given a is the
probability of a given b multiplied by the probability of b divided
by the probability of a.
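Written out, the rule being described is:

```latex
P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}
```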
And we saw that this came about as a result of just the definition of conditional probability and
looking at what it means for two events to happen together. This was our formulation
then of Bayes' rule, which turned out to be quite helpful. We were able to predict one
event in terms of another by flipping the order of those events
inside of this probability calculation. And it turns out this approach
is going to be quite helpful-- and we'll see why in a moment-- for being able to do this
sort of sentiment analysis, because I want to say you
know, what is the probability that a message is positive,
or what is the pop probability that the message is negative? And I'll go ahead and simplify
this just using the emojis just for simplicity-- probability of
positive, probability of negative. And that is what I
would like to calculate, but I'd like to calculate
that given some information-- given information like
here is a sample of text-- my grandson loved it. And I would like to know not just what
is the probability that any message is positive, but what is the probability
that the message is positive, given my grandson loved it
as the text of the sample? So given this information that inside
the sample are the words my grandson loved it, what is the probability
then that this is a positive message? Well, according to the bag of
words model, what we're going to do is really ignore the
ordering of the words-- not treat this as a single sentence
that has some structure to it, but just treat it as a whole
bunch of different words. We're going to say something
like, what is the probability that this is a positive
message, given that the word my was in the message, given that the
word grandson was in the message, given that the word loved was in
the message, and given the word it was in the message? The bag of words model here-- we're treating the entire simple
sample as just a whole bunch of different words. And so this then is what I'd like
to calculate, this probability-- given all those words,
what is the probability that this is a positive message? And this is where we can
now apply Bayes' rule. This is really the probability
of some b, given some a. And that now is what
I'd like to calculate. So according to Bayes' rule, this
whole expression is equal to-- well, it's the probability-- I switched the order of them-- it's the probability
of all of these words, given that it's a positive
message, multiplied by the probability that it is
a positive message, divided by the probability of
all of those words. So this then is just an
application of Bayes' rule. We've already seen where I want to
express the probability of positive, given the words, as related to
somehow the probability of the words, given that it's a positive message. And it turns out that--
as you might recall, back when we talked about probability,
this denominator is going to be the same. Regardless of whether we're looking
at positive or negative messages, the probability of these
words doesn't change, because we don't have a
positive or negative down below. So we can just say that,
rather than saying that this expression up here is
equal to this expression down below, it's really just proportional
to the numerator. We can ignore the denominator for now. Using the denominator would
get us an exact probability. But it turns out that
what we'll really just do is figure out what the probability
is proportional to, and at the end, we'll have to normalize the
probability distribution-- make sure the probability distribution
ultimately sums up to the number 1. So now I've been able to
formulate this probability-- which is what I want to care about-- as proportional to multiplying these two
things together-- probability of words, given positive message, multiplied by
the probability of positive message. But again, if you think back
to our probability rules, we can calculate this really as just
a joint probability of all of these things happening-- that the probability
of positive message multiplied by the probability of these words,
given the positive message-- well, that's just the joint
probability of all of these things. This is the same thing
as the probability that it's a positive message, and my
is in the sample, and grandson is in the sample,
and loved is in the sample, and it is in the sample. So using that rule for the
definition of joint probability, I've been able to say that
this entire expression is now proportional to this sequence-- this joint probability of these
words and this positive that's in there as well. And so now the interesting
question is just how to calculate that joint probability. How do I figure out
the probability that, given some arbitrary message, it is
positive, and the word my is in there, and the word grandson is in there,
and the word loved is in there, and the word it is in there? Well, you'll recall that we can
calculate a joint probability by multiplying together all of
these conditional probabilities. If I want to know the
probability of a, and b, and c, I can calculate that as
the probability of a times the probability of b, given a, times
the probability of c, given a and b. I can just multiply these
conditional probabilities together in order to get the overall joint
probability that I care about. And we could do the same thing here. I could say, let's
multiply the probability of positive by the probability of the
word my showing up in the message, given that it's positive, multiplied
by the probability of grandson showing up in the message, given
that the word my is in there and that it's positive, multiplied
by the probability of loved, given these three things,
multiplied by the probability of it, given these four things. And that's going to end up being a
fairly complex calculation to make, one that we probably
aren't going to have a good way of knowing the answer to. What is the probability that
grandson is in the message, given that it is positive and the
word my is in the message? That's not something we're really
going to have a ready answer to, and so this is where the naive
part of naive Bayes comes about. We're going to simplify this notion. Rather than compute exactly what
that probability distribution is, we're going to assume
that these words are going to be effectively
independent of each other, if we know that it's
already a positive message. If it's a positive message, it
doesn't change the probability that the word grandson
is in the message, if I know that the word loved
is in the message, for example. And that might not necessarily
be true in practice. In the real world, it
might not be the case that these words are
actually independent, but we're going to assume
it to simplify our model. And it turns out that
simplification still lets us get pretty good
results out of it as well. And what we're going to assume is that
the probability that all of these words show up depends only on whether
it's positive or negative. I can still say that loved
is more likely to come up in a positive message than a negative
message, which is probably true, but we're also going to say that it's
not going to change whether or not loved is more likely or less likely to
come up if I know that the word my is in the message, for example. And so those are the assumptions
that we're going to make. So while the top expression is
proportional to this bottom expression, we're going to say it's naively
proportional to this expression, probability of being a positive message. And then, for each of the words
that show up in the sample, I'm going to multiply what's
the probability that my is in the message, given
that it's positive, times the probability of grandson
being in the message, given that it's positive-- and
then so on and so forth for the other words that happen
to be inside of the sample. And it turns out that these are
numbers that we can calculate. The reason we've done all of this
math is to get to this point, to be able to calculate this probability
distribution that we care about, given these terms that we can actually calculate. And we can calculate them,
given some data available to us. And this is what a lot of
natural language processing is about these days. It's about analyzing data. If I give you a whole bunch of
data with a whole bunch of reviews, and I've labeled them
as positive or negative, then you can begin to calculate
these particular terms. I can calculate the probability
that a message is positive just by looking at my data
and saying, how many positive samples were there, and divide
that by the number of total samples. That is my probability
that a message is positive. What is the probability that the
word loved is in the message, given that it's positive? Well, I can calculate
that based on my data too. Let me just look at how many positive
samples have the word loved in it and divide that by my total
number of positive samples. And that will give me
an approximation for, what is the probability that loved is
going to show up inside of the review, given that we know that
the review is positive.
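As a concrete illustration, here is a minimal sketch of how those counts might be turned into probabilities, using a couple of made-up reviews-- this is not the lecture's code, just the counting idea:

```python
# A minimal sketch of estimating these terms from labeled data.
# The two lists of reviews here are made up purely for illustration.
positives = ["it was great", "so much fun", "my grandson loved it"]
negatives = ["not worth it", "kind of cheap", "really bad"]

def p_word_given_label(word, samples):
    # Fraction of samples in which the word appears at least once
    return sum(word in sample.split() for sample in samples) / len(samples)

# P(positive): number of positive samples, divided by the total number of samples
p_positive = len(positives) / (len(positives) + len(negatives))
print(p_positive)                              # 0.5 for this tiny corpus
print(p_word_given_label("loved", positives))  # about 0.33 for this tiny corpus
```

And so this then allows us to be able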
to calculate these probabilities. So let's now actually
do this calculation. Let's calculate for the
sentence, my grandson loved it. Is it a positive or negative review? How could we figure out
those probabilities? Well, again, this up here is the
expression we're trying to calculate. And here is the
data that is available to us. And the way to interpret
this data in this case is that, of all of the messages, 49%
of them were positive and 51% of them were negative. Maybe online reviews tend to be a
little bit more negative than they are positive-- or at least
based on this particular data sample, that's what I have. And then I have distributions for
each of the various different words-- that, given that it's
a positive message, how many positive messages
had the word my in them? It's about 30%. And for negative messages, how many
of those had the word my in them? About 20%-- so it seems like the word
my comes up more often in positive messages-- at least slightly more
often based on this analysis here. Grandson, for example--
maybe that showed up in 1% of all positive messages
and 2% of all negative messages had the word grandson in it. The word loved showed up in 32%
of all positive messages, 8% of all negative messages, for example. And then the word it showed up in
30% of positive messages, 40% of negative messages-- again, just
arbitrary data here just for example, but now we have data with which we can
begin to calculate this expression. So how do I calculate multiplying
all these values together? Well, it's just going to
be multiplying probability that it's positive times the
probability of my, given positive, times the probability of
grandson, given positive-- so on and so forth for
each of the other words. And if you do that multiplication and
multiply all of those values together, you get this, 0.00014112. By itself, this is not
a meaningful number, but it's going to be meaningful
if you compare this expression-- the probability that it's positive times
the probability of all of the words, given that I know that
the message is positive, and compare it to the same thing,
but for negative sentiment messages instead. I want to know the probability
that it's a negative message times the probability
of all of these words, given that it's a negative message. And so how can I do that? Well, to do that, you just multiply
probability of negative times all of these conditional probabilities. And if I take those five values,
multiply all of them together, then what I get is this value
for negative 0.00006528-- again, in isolation, not a
particularly meaningful number. What is meaningful is treating these
two values as a probability distribution and normalizing them, making it so
that both of these values sum up to 1, the way a probability
distribution should. And we do so by adding these two up
and then dividing each of these values by their total in order to
be able to normalize them. And when we do that, when we normalize
this probability distribution, you end up getting something like
this, positive 0.6837, negative 0.3163.
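Here's a quick sketch of that whole calculation in Python, just plugging in the example numbers from the slide:

```python
# Multiply P(sentiment) by P(each word | sentiment), using the example numbers above
p_positive = 0.49 * 0.30 * 0.01 * 0.32 * 0.30   # 0.00014112
p_negative = 0.51 * 0.20 * 0.02 * 0.08 * 0.40   # 0.00006528

# Normalize so the two values sum to 1, the way a probability distribution should
total = p_positive + p_negative
print(f"positive: {p_positive / total:.4f}")    # 0.6837
print(f"negative: {p_negative / total:.4f}")    # 0.3163
```

It seems like we've been able to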
conclude that we are about 68% confident-- we think there's
a probability of 0.68 that this message is a positive
message-- my grandson loved it. And why are we 68% confident? Well, it seems like we're more
confident than not because the word loved showed up in 32%
of positive messages, but only 8% of negative messages. So that was a pretty strong indicator. And for the others, while
it's true that the word it showed up more often
in negative messages, it wasn't enough to
offset that loved shows up far more often in positive
messages than negative messages. And so this type of analysis is
how we can apply naive Bayes. We've just done this calculation. And we end up getting not just a
categorization of positive or negative, but I get some sort of confidence level. What do I think the probability
is that it's positive? And I can say I think it's positive
with this particular probability. And so naive Bayes can be quite
powerful at trying to achieve this. Using just this bag of words
model, where all I'm doing is looking at what words
show up in the sample, I'm able to draw these
sorts of conclusions. Now, one potential drawback-- something
that you'll notice pretty quickly if you start applying
this rule exactly as is-- is what happens if
0's are inside this data somewhere. Let's imagine, for example, this same
sentence-- my grandson loved it-- but let's instead imagine that this
value here, instead of being 0.01, was 0, meaning inside of
our data set, it has never before happened that in a positive
message the word grandson showed up. And that's certainly possible. If I have a pretty small data
set, it's quite likely that not all the messages are
going to have the word grandson. Maybe it is the case that no
positive messages have ever had the word grandson in
it, at least in my data set. But if it is the case that
2% of the negative messages have still had the word
grandson in it, then we run into an interesting challenge. And the challenge is this-- when I
multiply all of the positive numbers together and multiply all the negative
numbers together to calculate these two probabilities, what I end up getting
is a positive value of 0.000. I get pure 0's, because when I
multiply all of these numbers together-- when I
multiply something by 0, doesn't matter what the other numbers
are-- the result is going to be 0. And the same thing can be said
of negative numbers as well. So this then would seem to be a problem
that, because grandson has never showed up in any of the positive
messages inside of our sample, we're able to say-- we seem to
be concluding that there is a 0% chance that the message is positive. And therefore, it must be negative,
because the only cases where we've seen the word grandson come
up is inside of a negative message. And in doing so, we've totally
ignored all of the other probabilities that a positive message is much more
likely to have the word loved in it, because we've multiplied
by 0, which just means none of the other probabilities
can possibly matter at all. So this then is a challenge
that we need to deal with. It means that we're
likely not going to be able to get the correct results if
we just purely use this approach. And it's for that reason there
are a number of possible ways we can try and make sure that we
never multiply something by 0. It's OK to multiply
something by a small number, because then it can
still be counterbalanced by other larger numbers, but multiplying
by 0 means it's the end of the story. You multiply a number
by 0, and the output's going to be 0, no matter how big any
of the other numbers happen to be. So one approach that's fairly
common in naive Bayes is this idea of additive smoothing, adding
some value alpha to each of the values in our distribution just to
smooth the data a little bit. One such approach is
called Laplace smoothing, which basically just means adding one
to each value in our distribution. So if I have 100 samples and zero
of them contain the word grandson, well then I might say
that, you know what? Instead, let's pretend that I've had
one additional sample where the word grandson appeared and one additional
sample where the word grandson didn't appear. So I'll say all right,
now I have 1 out of 102-- so one sample that does have the
word grandson out of 102 total. I'm basically creating two
samples that didn't exist before. But in doing so, I've been able to
smooth the distribution a little bit to make sure that I never have
to multiply anything by 0. By pretending I've seen one more value
in each category than I actually have, this gets us that result
of not having to worry about multiplying a number by 0.
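Here's a small sketch of that add-one idea in code-- the helper function and its alpha parameter are just illustrative, not the lecture's code, but the numbers match the example above:

```python
def smoothed_probability(word_count, total_samples, alpha=1):
    # Pretend we saw `alpha` extra samples with the word and `alpha` extra without it
    return (word_count + alpha) / (total_samples + 2 * alpha)

print(smoothed_probability(0, 100))    # 1/102, instead of 0
print(smoothed_probability(32, 100))   # 33/102, barely changed from 32/100
```

So this then is an approach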
that we can use in order to try and apply naive
Bayes, even in situations where we're dealing with words that we
might not necessarily have seen before. And let's now take a look at how we
could actually apply that in practice. It turns out that NLTK, in addition
to having the ability to extract n-grams and tokenize
things into words, also has the ability to be able to apply
naive Bayes on some samples of text, for example. And so let's go ahead and do that. What I've done is, inside of sentiment,
I've prepared a corpus of just some reviews that I've generated, but
you can imagine using real reviews. I just have a couple of
positive reviews-- it was great. So much fun. Would recommend. My grandson loved it. Those sorts of messages. And then I have a whole bunch of
negative reviews-- not worth it, kind of cheap, really bad,
didn't work the way we expected-- just one on each line. A whole bunch of positive
reviews and negative reviews. And what I'd like to do now
is analyze them somehow. So here then is sentiment.py,
and what we're going to do first is extract all of the positive
and negative sentences, create a set of all of the words that
were used across all of the messages, and then we're going to go ahead and
train NLTK's naive Bayes classifier on all of this training data. And what the training data effectively is, is that I take all of the positive messages
and give them the label positive, all the negative messages and
give them the label negative, and then I'll go ahead and apply
this classifier to it, where I'd say, I would like to take all
of this training data and now have the ability to
classify it as positive or negative. I'll then take some input from the user. They can just type in
some sequence of words. And then I would like to
classify that sequence as either positive or
negative, and then I'll go ahead and print out what the
probabilities of each happened to be. And there are some helper functions here
that just organize things in the way that NLTK is expecting them to be. But the key idea here is that
I'm taking the positive messages, labeling them, taking
the negative messages, labeling them, putting them
inside of a classifier, and then now trying to classify
some new text that comes about.
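Here's a rough sketch of the kind of thing sentiment.py is doing with NLTK's naive Bayes classifier-- the reviews and the features helper here are made up for illustration, and the lecture's actual helper functions are organized differently:

```python
import nltk

positives = ["it was great", "so much fun", "my grandson loved it"]
negatives = ["not worth it", "kind of cheap", "really bad"]

def features(text):
    # Bag of words: just record which words appear in the sample
    return {word: True for word in text.lower().split()}

# Label every positive sample "Positive" and every negative sample "Negative"
training = ([(features(s), "Positive") for s in positives] +
            [(features(s), "Negative") for s in negatives])

classifier = nltk.NaiveBayesClassifier.train(training)

# Classify some new text and print the probability of each label
result = classifier.prob_classify(features("this was great"))
for label in result.samples():
    print(label, result.prob(label))
```

So let's go ahead and try it. I'll go ahead and go into sentiment,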
and we'll run python sentiment.py, passing in as input that
corpus that contains all of the positive
and negative messages-- because depending on the corpus, that's
going to affect the probabilities. The effectiveness of
our ability to classify is entirely dependent on how good our
data is, and how much data we have, and how well they happen to be labeled. So now I can try something and say-- let's try a review
like, this was great-- just some review that I might leave. And it seems that, all right,
there is a 96% chance it estimates that this was a positive message-- 4% chance that it was a negative,
likely because the word great shows up inside of
the positive messages, but doesn't show up inside
of the negative messages. And that might be something that
our AI is able to capitalize on. And really, what it's going to look
for are the differentiating words-- that if the probability
of words like this and was is pretty similar between positive and negative messages, then the naive Bayes
classifier isn't going to end up using those values as
having some sort of importance in the algorithm. Because if they're the
same on both sides, you multiply that value for
both positive and negative, you end up getting about the same thing. What ultimately makes the
difference in naive Bayes is when you multiply by a value
that's much bigger for one category than for another category--
when one word like great is much more likely to show
up in one type of message than another type of message. And that's one of the nice
things about naive Bayes is that, without me
telling it that great is more important to care
about than this or was. Naive Bayes can figure
that out based on the data. It can figure out that this shows
up about the same amount of time between the two, but great,
that is a discriminator, a word that can be different
between the two types of messages. So I could try it again-- type in a sentence like,
lots of fun, for example. This one it's a little less sure about-- 62% chance that it's positive, 37%
chance that it's negative-- maybe because there aren't
as clear discriminators or differentiators inside of this data. I'll try one more-- say kind of overpriced. And all right, now
95%, 96% sure that this is a negative sentiment--
likely because of the word overpriced, because it's shown up
in a negative sentiment expression before, and therefore, it thinks,
you know what, this is probably going to be a negative sentence. And so naive Bayes has now given
us the ability to classify text. Given enough training data,
given enough examples, we can train our AI to be able to
look at natural language, human words, figure out which words
are likely to show up in positive as opposed to
negative sentiment messages, and categorize them accordingly. And you could imagine
doing the same thing anytime you want to take text
and group it into categories. If I want to take an email
and categorize it-- as a good email or as a spam email,
you could apply a similar idea. Try and look for the
discriminating words, the words that make it more
likely to be a spam email or not, and just train a naive Bayes
classifier to be able to figure out what that distribution is and to be
able to figure out how to categorize an email as good or as spam. Now, of course, it's not going to be
able to give us a definitive answer. It gives us a probability
distribution, something like 63% positive, 37% negative. And that might be why our spam filters
and our emails sometimes make mistakes, sometimes think that a good
email is actually spam or vice versa, because ultimately,
the best that it can do is calculate a probability distribution. If natural language is
ambiguous, we can usually just deal in the world of
probabilities to try and get an answer that is reasonably good, even
if we aren't able to guarantee for sure that it is the number that we
actually expect for it to be. That then was a look at how
we can begin to take some text and to be able to analyze the text and
group it into some sorts of categories. But ultimately, in addition to just being
able to analyze text and categorize it, we'd like to be able to figure
out information about the text, get some sort of meaning
out of the text as well. And this starts to get us
in the world of information, of being able to try and
take data in the form of text and retrieve information from it. So one type of problem is known
as information retrieval, or IR, which is the task of finding relevant
documents in response to a query. So this is something like you type
in a query into a search engine, like Google, or you're
typing in something into some system that's going to look
for-- inside of a library catalog, for example-- that's going to
look for responses to a query. I want to look for documents that are
about the US constitution or something, and I would like to get a
whole bunch of documents that match that query back to me. But you might imagine that what
I really want to be able to do is, in order to solve
this task effectively, I need to be able to take
documents and figure out, what are those documents about? I want to be able to say what is it
that these particular documents are about-- what of the topics
of those documents-- so that I can then more effectively
be able to retrieve information from those particular documents. And this refers to a set of tasks
generally known as topic modeling, where I'd like to discover what the
topics are for a set of documents. And this is something
that humans could do. A human could read a document
and tell you, all right, here's what this document
is about, and maybe give a couple of topics, or tell you who are the
important people in this document, what are the important objects in the
document-- can probably tell you that kind of thing. But we'd like for our AI to
be able to do the same thing. Given some document, can you
tell me what the important words in this document are? What are the words that
set this document apart that I might care about if
I'm looking at documents based on keywords, for example? And so one instinctive idea-- an
intuitive idea that probably makes sense-- is let's just use term frequency. Term frequency is just
defined as the number of times a particular term appears in a document. If I have a document with 100 words and
one particular word shows up 10 times, it has a term frequency of 10. It shows up pretty often. Maybe that's going to
be an important word. And sometimes, you'll
also see this framed as a proportion of the total number
of words, so 10 words out of 100. Maybe it has a term frequency of
0.1, meaning 10% of all of the words are this particular
word that I care about. Ultimately, that doesn't
change how relatively important the words are for any one particular document-- it's the same idea. The idea is to look for words that show
up more frequently, because those are more likely to be the important
words inside of a corpus of documents. And so let's go ahead
and give that a try. Let's say I wanted to find out what
the Sherlock Holmes stories are about. I have a whole bunch of
Sherlock Holmes stories and I want to know, in
general, what are they about? What are the important characters? What are the important objects? What are the important parts of
the story, just in terms of words? And I'd like for the AI to be able
to figure that out on its own, and we'll do so by looking
at term frequency-- by looking at, what are the words
that show up the most often? So we'll go ahead, and I'll go ahead
and go into the tfidf directory. You'll see why it's
called that in a moment. But let's first open up tf0.py, which
is going to calculate the top 10 term frequencies-- or maybe
top five term frequencies for a corpus of documents,
a whole bunch of documents where each document is just
a story from Sherlock Holmes. We're going to load all
the data into our corpus and we're going to figure out,
what are all of the words that show up inside of that corpus? And we're going to
basically just assemble all of the term frequencies. We're going to calculate, how
often do each of these terms appear inside of the document? And we'll print out the top five. And so there are some data
structures involved that you can take a look at if you'd like to. The exact code is not so important,
but it is the idea of what we're doing. We're taking each of these
documents and first sorting them. We're saying, take all
the words that show up and sort them by how
often each word shows up. And let's go ahead and just, for
each document, save the top five terms that happen to show up
in each of those documents. So again, some helper functions you can
take a look at if you're interested. But the key idea here is
that all we're going to do is run tf0 on the
Sherlock Holmes stories. And what I'm hoping to get out of this
process is I am hoping to figure out, what are the important words in
Sherlock Holmes, for example? So we'll go ahead and run
this and see what we get. And it's loading the data. And here's what we get. For this particular story, the
important words are the, and and, and I, and to, and of. Those are the words that
show up more frequently. In this particular story, it's
the, and and, and I, and a, and of. This is not particularly useful to us. We're using term frequencies. We're looking at what words show
up the most frequently in each of these various different
documents, but what we get naturally are just the words that
show up a lot in English. Words like the and of happen
to show up a lot in English, and therefore, they happen
to show up a lot in each of these various different documents. This is not a particularly
useful metric for us to be able to analyze
what words are important, because these words are just part of
the grammatical structure of English. And it turns out we can categorize words
into a couple of different categories. These words happen to be known as what
we might call function words, words that have little meaning
on their own, but that are used to grammatically connect
different parts of a sentence. These are words like am, and
by, and do, and is, and which, and with, and yet-- words that,
on their own, what do they mean? It's hard to say. They get their meaning
from how they connect different parts of the sentence. And these function words are what we
might call a closed class of words in a language like English. There's really just some
fixed list of function words, and they don't change very often. There's just some list of
words that are commonly used to connect other grammatical
structures in the language. And that's in contrast with what
we might call content words, words that carry meaning independently--
words like algorithm, category, computer, words that
actually have some sort of meaning. And these are usually the
words that we care about. These are the words where
we want to figure out, what are the important
words in our document? We probably care about
the content words more than we care about the function words. And so one strategy we could apply is
just ignore all of the function words. So here in tf1.py, I've
done the same exact thing, except I'm going to load a whole bunch
of words from a function_words.txt file, inside of which are just a whole
bunch of function words in alphabetical order. These are just a whole
bunch of function words that are just words that are used
to connect other words in English, and someone has just compiled
this particular list. And these are the words
that I just want to ignore. If any of these words-- let's just
ignore it as one of the top terms, because these are not words
that I probably care about if I want to analyze what the
important terms inside of a document happen to be. So in tf1, what we're ultimately doing is, if the word is in my
set of function words, I'm just going to skip over it, just
ignore any of the function words by continuing on to
the next word and then just calculating the frequencies
for those words instead. So I'm going to pretend the
function words aren't there, and now maybe I can get
a better sense for what terms are important in each of the
various different Sherlock Holmes stories.
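A small sketch of that filtering step might look like this, assuming a function_words.txt file with one function word per line-- the real tf1.py is organized differently, but the idea is the same:

```python
from collections import Counter

with open("function_words.txt") as f:
    function_words = set(f.read().split())

def top_terms(words, n=5):
    # Count term frequencies, skipping anything in the function word list
    frequencies = Counter(word for word in words if word not in function_words)
    return frequencies.most_common(n)

# In practice, `words` would come from tokenizing one Sherlock Holmes story
print(top_terms("the case that holmes solved was the case of the missing man".split()))
```

So now let's run tf1 on the Sherlock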
Holmes corpus and see what we get now. And let's look at, what is the most
important term in each of the stories? Well, it seems like,
for each of the stories, the most important word is Holmes. I guess that's what we would expect. They're all Sherlock Holmes stories. And Holmes is not a function word. It's not the, or a, or
an, so it wasn't ignored. But Holmes and man-- these are probably not what I mean when
I say, what are the important words? Even though Holmes does
show up the most often it's not giving me a whole
lot of information here about what each of the different
Sherlock Holmes stories are actually about. And the reason why is because Sherlock
Holmes shows up in all the stories, and so it's not meaningful for me to
say that this story is about Sherlock Holmes I want to try and
figure out the different topics across the corpus of documents. What I really want to know
is, what words show up in this document that show up less
frequently in the other documents, for example? And so to get at that idea, we're
going to introduce the notion of inverse document frequency. Inverse document frequency
is a measure of how common, or rare, a word happens to be
across an entire corpus of words. And mathematically, it's
usually calculated like this-- as the logarithm of the
total number of documents divided by the number of
documents containing the word. So if a word like Holmes shows
up in all of the documents, well, then total documents
is how many documents there are, and the number of documents containing Holmes is going to be that same number. So when you divide these two together, you'll get 1, and the logarithm of 1 is just 0. And so what we get is, if Holmes
shows up in all of the documents, it has an inverse
document frequency of 0. And you can think now of
inverse document frequency as a measure of how
rare the word is that shows up in this particular document-- that if a word doesn't show up across many documents at all, this number is going to be much higher. And this then gets us to
a model known as tf-idf, which is a method for ranking what
words are important in the document by multiplying these two ideas together. Multiply term frequency, or TF, by
inverse document frequency, or IDF, where the idea here now is
that how important a word is depends on two things. It depends on how often it
shows up in the document using the heuristic that, if a
word shows up more often, it's probably more important. And we multiply that by
inverse document frequency IDF, because if the word is rarer,
but it shows up in the document, it's probably more important
than if the word shows up across most or all of the documents,
because then it's probably a less important factor in
what the different topics across the different documents
in the corpus happen to be.
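Here's a minimal sketch of that computation, assuming a dictionary that maps each document name to its list of words-- the corpus here is just a toy stand-in, not the lecture's data:

```python
import math

corpus = {
    "story1": ["holmes", "said", "the", "adventure", "holmes"],
    "story2": ["holmes", "found", "the", "treaty"],
}

def idf(word):
    # log(total documents / documents containing the word)
    containing = sum(1 for words in corpus.values() if word in words)
    return math.log(len(corpus) / containing)

def tfidf(word, document):
    # Term frequency in this document, multiplied by inverse document frequency
    return corpus[document].count(word) * idf(word)

print(tfidf("holmes", "story1"))     # 0.0, because holmes appears in every document
print(tfidf("adventure", "story1"))  # higher, because adventure is rarer
```

And so now let's go ahead and apply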
this algorithm on the Sherlock Holmes corpus. And here's tfidf. Now what I'm doing is,
for each of the documents, for each word, I'm
calculating its TF score, term frequency, multiplied
by the inverse document frequency of that word-- not just
looking at a single value, but multiplying these
two values together in order to compute the overall values. And now, if I run tfidf
on the Holmes corpus, this is going to try and get us
a better approximation for what's important in each of the stories. And it seems like it's
trying to extract here probably like the names
of characters that happen to be important in the
story-- characters that show up in this story that don't
show up in the other story-- and prioritizing the more
important characters that happen to show up more often. And so this then might be a better
analysis of what types of topics are more or less important. I also have another corpus, which
is a corpus of all of the Federalist Papers from American history. If I go ahead and run tfidf
on the Federalist Papers, we can begin to see what
the important words in each of the various different
Federalist Papers happen to be-- that in Federalist Paper Number 61,
it seems like it's a lot about elections. In Federalist Paper 66, it's about the Senate and impeachments. You can start to extract what
the important terms and what the important words are just by
looking at what things show up across-- and don't show up across
many of the documents, but show up frequently enough
in certain of the documents. And so this can be a
helpful tool for trying to figure out this
kind of topic modeling, figuring out what it is that
a particular document happens to be about. And so this then is starting to get
us into this world of semantics, what it is that things actually mean
when we're talking about language. Now, we're no longer going to just think in terms of the bag of words, where we treat a sample of text as a whole bunch of words and don't care about the order. Now, when we get into
the world of semantics, we really do start to care about what
it is that these words actually mean, how it is these words
relate to each other, and in particular, how we can
extract information out of that text. Information extraction is
somehow extracting knowledge from our documents-- figuring
out, given a whole bunch of text, can we automate the process of having
an AI, look at those documents, and get out what the useful or relevant
knowledge inside those documents happens to be? So let's take a look at an example. I'll give you two samples
from news articles. Here up above is a sample of a news
article from the Harvard Business Review that was about Facebook. Down below is an example of a
Business Insider article from 2018 that was about Amazon. And there's some information
here that we might want an AI to be able to extract-- information, knowledge
about these companies that we might want to extract. And in particular, what I
might want to extract is-- let's say I want to know data
about when companies were founded-- that I wanted to know that
Facebook was founded in 2004, Amazon founded in 1994-- that that is important information
that I happen to care about. Well, how do we extract that
information from the text? What is my way of being
able to understand this text and figure out, all right,
Facebook was founded in 2004? Well, what I can look for are
templates or patterns, things that happened to show up across multiple
different documents that give me some sense for what this
knowledge happens to mean. And what we'll notice
is a common pattern between both of these passages,
which is this phrasing here. When Facebook was
founded in 2004, comma-- and then down below, when Amazon
was founded in 1994, comma. And those two templates end up giving
us a mechanism for trying to extract information-- that this notion, when
company was founded in year comma, this can tell us something about
when a company was founded, because if we set our
AI loose on the web, let it look at a whole bunch of papers
or a whole bunch of articles, and it finds this pattern-- when blank was founded in blank, comma-- well, then our AI can
pretty reasonably conclude that there's a good chance that this
is going to be like some company, and this is going to be like the year
that company was founded, for example-- might not be perfect, but at
least it's a good heuristic. And so you might imagine
that, if you wanted to train an AI to be able
to look for information, you might give the AI
templates like this-- not only give it a template like when
company blank was founded in blank, but give it like, the book blank
was written by blank, for example. Just give it some templates
where it can search the web, search a whole big corpus of documents,
looking for templates that match that, and if it finds that, then
it's able to figure out, all right, here's the
company and here's the year.
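As a toy illustration of that kind of template, here is a hand-written pattern applied to some text-- this isn't the lecture's search.py, just the general idea of extracting (company, year) pairs from text that matches the template:

```python
import re

text = ("When Facebook was founded in 2004, the site was just for students. "
        "Back when Amazon was founded in 1994, it only sold books.")

# The template: when <something> was founded in <year>, comma
template = re.compile(r"[Ww]hen (\w+) was founded in (\d{4}),")

for company, year in template.findall(text):
    print(company, year)   # Facebook 2004, then Amazon 1994
```

But of course, that requires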
us to write these templates. It requires us to figure out, what
is the structure of this information likely going to look like? And it might be difficult to know. The different websites are, of
course, going to do this differently. This type of method isn't going to be
able to extract all of the information, because if the words are
slightly in a different order, it won't match on that
particular template. But one thing we can do is, rather
than give our AI the template, we can give AI the data. We can tell the AI, Facebook was founded
in 2004 and Amazon was founded in 1994, and just tell the AI those
two pieces of information, and then set the AI loose on the web. And now the idea is that the AI can begin to look for, where do Facebook and 2004 show up together, where do Amazon and 1994 show up together,
templates for itself. It can discover that
this kind of phrasing-- when blank was founded in blank-- tends to relate Facebook to 2004,
and it released Amazon to 1994, so maybe it will hold the same
relation for others as well. And this ends up being--
this automated template generation ends up being quite
powerful, and we'll go ahead and take a look at that now as well. What I have here inside
of templates directory is a file called companies.csv,
and this is all of the data that I am going to give to my AI. I'm going to give it the pair
Amazon, 1994 and Facebook, 2004. And what I'm going to
tell my AI to do is search a corpus of
documents for other data-- these pairs like this--
other relationships. I'm not telling the AI that this
is a company and the date that it was founded. I'm just giving it Amazon,
1994 and Facebook, 2004 and letting the AI do the rest. And what the AI is going to do is
it's going to look through my corpus-- here's my corpus of documents-- and it's going to find, like
inside of Business Insider, that we have sentences like, back when
Amazon was founded in 1994, comma-- and that kind of phrasing is going to be
similar to this Harvard Business Review story that has a sentence like,
when Facebook was founded in 2004-- and it's going to look across
a number of other documents for similar types of patterns to be able
to extract that kind of information. And what it will do is,
if I go ahead and run, I'll go ahead and go into templates. So I'll say python search.py. I'm going to look for the data
like the data in companies.csv inside of the companies directory, which
contains a whole bunch of news articles that I've curated in advance. And here's what I get-- Google 1998, Apple
1976, Microsoft 1975-- so on and so forth-- Walmart 1962, for example. These are all of the pieces
of data that happened to match that same template that
we were able to find before. And how was it able to find this? Well, it's probably because, if
we look at the Forbes article, for example, that it has a phrase in it
like, when Walmart was founded in 1962, comma-- that it's able to
identify these sorts of patterns and extract information from them. Now, granted, I have curated
all these stories in advance in order to make sure that there
is data that it's able to match on. And in practice, it's not always
going to be in this exact format when you're seeing a company related
to the year in which it was founded, but if you give the AI access to enough
data-- like all of the data of text on the internet-- and just have
the AI crawl the internet looking for information, it can very
reliably, or with some probability, try and extract information
using these sorts of templates and be able to generate
interesting sorts of knowledge. And the more knowledge it
learns, the more new templates it's able to construct,
looking for constructions that show up in other locations as well. So let's take a look at another example. And then I'll here show
you presidents.csv, where I have two presidents
and their inauguration date-- so George Washington 1789,
Barack Obama 2009 for example. And I also am going to give
to our AI a corpus that just contains a single
document, which is the Wikipedia article for the list of presidents
of the United States, for example-- just information about presidents. And I'd like to extract from this raw
HTML document on a web page information about the president. So I can say search in presidents.csv. And what I get is a whole
bunch of data about presidents and what year they were likely
inaugurated, by looking for patterns that matched-- Barack Obama 2009, for example-- looking for these sorts
of patterns that happened to give us some clues as to what it
is that a story happens to be about. So here's another example. If I open up inside the olympics,
here is a scraped version of the Olympic home page
that has information about various different Olympics. And maybe I want to extract
Olympic locations and years from this particular page. Well, the way I can do that is
using the exact same algorithm. I'm just saying, all right, here are two
Olympics and where they were located-- so 2012 London, for example. Let me go ahead and
just run this process, python search.py, on olympics.csv,
look at all the Olympic data set, and here I get some information back. Now, this information--
not totally perfect. There are a couple of examples
that are obviously not quite right, because my template might
have been a little bit too general. Maybe it was looking for
a broad category of things and certain strange things happened to
capture on that particular template. So you could imagine adding rules to try
and make this process more intelligent, making sure the thing on the left
is just a year, for instance, and doing
other sorts of analysis. But purely just based
on some data, we are able to extract some interesting
information using some algorithms. And all search.py is really doing here
is it is taking my corpus of data, finding templates that match it-- here, I'm filtering down to just the
top two templates that happen to match-- and then using those templates
to extract results from the data that I have access to, being able
to look for all of the information that I care about. And that's ultimately
what's going to help me, to print out those results to figure
out what the matches happen to be. And so information extraction
is another powerful tool when it comes to trying
to extract information. But of course, it only works
in very limited contexts. It only works when I'm able to
find templates that look exactly like this in order to come up
with some sort of match that is able to connect this
to some pair of data, that this company was
founded in this year. What I might want to do, as we start
to think about the semantics of words, is to begin to imagine some way
of coming up with definitions for all words, being able to relate
all of the words in a dictionary to each other, because that's ultimately
what's going to be necessary if we want our AI to be able to communicate. We need some representation
of what it is that words mean. And one approach of doing this,
this famous data set called WordNet. And what WordNet is is a human-curated resource-- researchers have curated
together a whole bunch of words, their definitions, their
various different senses-- because the word might have
multiple different meanings-- and also how those words
relate to one another. And so what we mean by this is-- I can show you an example of WordNet. WordNet comes built into NLTK. Using NLTK, you can
download and access WordNet. So let me go into WordNet,
and go ahead and run WordNet, and extract information about a
word-- a word like city, for example. Go ahead and press Return. And here is the information
that I get back about a city. It turns out that city has
three different senses, three different meanings,
according to WordNet. And it's really just kind
of like a dictionary, where each sense is associated with its
meaning-- just some definition provided by a human. And then it's also got categories,
for example, that a word belongs to-- that a city is a type
of municipality, a city is a type of administrative district. And that allows me to
relate words to other words. So one of the powers of WordNet
is the ability to take one word and connect it to other related words. If I do another example, let me
try the word house, for instance. I'll type in the word house
and see what I get back. Well, all right, the house
is a kind of building. The house is somehow
related to a family unit. And so you might imagine
trying to come up with these various different
ways of describing a house. It is a building. It is a dwelling. And researchers have just
curated these relationships between these various different words to
say that a house is a type of building, that a house is a type
of dwelling, for example. But this type of
approach, while certainly helpful for being able to
relate words to one another, doesn't scale particularly well. As you start to think
about language changing, as you start to think about all
the various different relationships that words might have to one another,
this challenge of word representation ends up being difficult.
What we've done is just define a word as a sentence that
explains what it is that that word is, but what we really
would like is some way to represent the meaning
of a word in a way that our AI is going to be able
to do something useful with it. Anytime we want our AI to
be able to look at texts and really understand
what that text means, to relate text and
words to similar words and understand the
relationship between words, we'd like some way that a computer
can represent this information. And what we've seen all
throughout the course multiple times now is
the idea that, when we want our AI to
represent something, it can be helpful to have the AI
represent it using numbers-- that we've seen that we can
represent utilities in a game, like winning, or losing,
or drawing, as a number-- 1, negative 1, or a 0. We've seen other ways that
we can take data and turn it into a vector of features,
where we just have a whole bunch of numbers that represent
some particular piece of data. And if we ever want to past
words into a neural network, for instance, to be able
to say, given some word, translate this sentence
into another sentence, or to be able to do interesting
classifications with neural networks on individual words, we need
some representation of words just in terms of vectors-- some way to represent words, just
by using individual numbers to define the meaning of a word. So how do we do that? How do we take words and
turn them into vectors that we can use to represent
the meaning of those words? Well, one way is to do this. If I have four words that I want
to encode, like he wrote a book, I can just say, let's let
the word he be this vector-- 1, 0, 0, 0. Wrote will be 0, 1, 0, 0. A will be 0, 0, 1, 0. Book will be 0, 0, 0, 1. Effectively, what I have here is what's
known as a one-hot representation or a one-hot encoding, which
is a representation of meaning, where meaning is a vector that has a
single 1 in it and the rest are 0's. The location of the 1 tells
me the meaning of the word-- that 1 in the first
position, that means he-- 1 in the second position,
that means wrote. And every word in the
dictionary is going to be assigned to some representation
like this, where we just assign one place in the vector
that has a 1 for the word and 0 for the other words. And now I have
representations of words that are different for a whole
bunch of different words. This is this one-hot representation.
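A quick sketch of what those one-hot vectors look like in code, for just this four-word vocabulary:

```python
vocabulary = ["he", "wrote", "a", "book"]

# Each word gets a vector with a single 1 in its own position and 0's elsewhere
one_hot = {
    word: [1 if i == index else 0 for i in range(len(vocabulary))]
    for index, word in enumerate(vocabulary)
}

print(one_hot["he"])    # [1, 0, 0, 0]
print(one_hot["book"])  # [0, 0, 0, 1]
```

So what are the drawbacks of this? Why is this not necessarily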
a great approach? Well, here, I am only
creating enough vectors to represent four words in a dictionary. If you imagine a dictionary with 50,000
words that I might want to represent, now these vectors get enormously long. These are 50,000 dimensional
vectors to represent a vocabulary of 50,000 words--
that he is 1 followed by all these. Wrote has a whole bunch of 0's in it. That's not a particularly tractable
way of trying to represent words, if I'm going to have to deal
with vectors of length 50,000. Another problem-- a subtler problem-- is that ideally, I'd
like for these vectors to somehow represent meaning
in a way that I can extract useful information out of-- that if
I have the sentence he wrote a book and he authored a novel, well, wrote
and authored are going to be two totally different vectors. And book and novel are going to be
two totally different vectors inside of my vector space that have
nothing to do with each other. The one is just located
in a different position. And really, what I would
like to have happen is for wrote and
authored to have vectors that are similar to one
another, and for book and novel to have vector representations
that are similar to one another, because they are words
that have similar meanings. Because their meanings are
similar, ideally, I'd like for-- when I put them in vector form and
use a vector to represent meanings, I would like for those vectors to
be similar to one another as well. So rather than this
one-hot representation, where we represent a word's meaning
by just giving it a vector that is one in a particular location,
what we're going to do-- which is a bit of a strange
thing the first time you see it-- is what we're going to call
a distributed representation. We are going to represent
the meaning of a word as just a whole bunch of different values--
not just a single 1 and the rest 0's, but a whole bunch of values. So for example, in he wrote a book,
he might just be a big vector. Maybe it's 50 dimensions, maybe it's
100 dimensions, but certainly fewer than tens of thousands, where
each value is just some number-- and same thing for
wrote, and a, and book. And the idea now is that, using
these vector representations, I'd hope that wrote and authored
have vector representations that are pretty close to one another. Their distance is not too far
apart-- and same with the vector representations for book and novel. So this is going to be the goal of a
lot of what statistical machine learning approaches to natural
language processing is about is using these vector
representations of words. But how on earth do we define
a word as just a whole bunch of these sequences of numbers? What does it even mean to talk
about the meaning of a word? The famous quote that
answers this question is from a British linguist in the
1950s, JR Firth, who said, "You shall know a word by the company it keeps." And what we mean by
that is the idea that we can define a word in terms of the words
that show up around it, that we can get at the meaning of a word based on the
context in which that word happens to appear. That if I have a sentence like
this, four words in sequence-- for blank he ate-- what goes in the blank? Well, you might imagine
that, in English, the types of words that might fill in
the blank are words like breakfast, or lunch, or dinner. These are the kinds of words
that fill in that blank. And so if we want to define,
what does lunch or dinner mean, we can define it in terms
of what words happened to show up around it--
that if a word shows up in a particular context and
another word happens to show up in very similar context, then
those two words are probably related to each other. They probably have a similar
meaning to one another. And this then is the
foundational idea of an algorithm known as word2vec, which is a
model for generating word vectors. You give word2vec a corpus of
documents, just a whole bunch of texts, and what word2vec will produce is vectors for each word. And there are a number of
ways that it can do this. One common way is through what's known
as the skip-gram architecture, which basically uses a neural network
to predict context words, given a target word-- so
given a word like lunch, use a neural network to try and
predict, given the word lunch, what words are going to show up around it. And so the way we
might represent this is with a big neural
network like this, where we have one input cell for every word. Every word gets one node
inside this neural network. And the goal is to use this
neural network to predict, given a target word, a context word. Given a word like lunch, can I predict
the probabilities of other words, showing up in a context of one word
away or two words away, for instance, in some sort of window of context? And if you just give the AI, this neural
network, a whole bunch of data of words and what words show up in context,
you can train a neural network to do this calculation, to be able
to predict, given a target word-- can I predict what those context
words ultimately should be? And it will do so using
the same methods we've talked about-- back propagating
the error from the context word back through this neural network. And what you get is, if
we use a single layer-- just a single layer of hidden nodes-- what I get is, for every single
one of these words, I get-- from this word, for example, I
get five edges, each of which has a weight to each of
these five hidden nodes. In other words, I get five
numbers that effectively are going to represent this
particular target word here. And the number of hidden nodes I
choose in this middle layer here-- I can pick that. Maybe I'll choose to have 50
hidden nodes or 100 hidden nodes. And then, for each of
these target words, I'll have 50 different values
or 100 different values, and those values we can
effectively treat as the vector numerical representation of that word. And the general idea here is
that, if words are similar, two words show up in similar contexts--
meaning, using the same target words, I'd like to predict
similar contexts words-- well, then these vectors and these
values I choose in these vectors here-- these numerical values
for the weight of these edges are probably going to be similar,
because for two different words that show up in similar
contexts, I would like for these values that are
calculated to ultimately be very similar to one another. And so ultimately, the high-level
way you can picture this is that what this word2vec
training method is going to do is, given a whole
bunch of words, where initially, recall, we initialize these
weights randomly and just pick random weights that we choose. Over time, as we train
the neural network, we're going to adjust these weights,
adjust the vector representations of each of these words
so that gradually, words that show up in similar
contexts grow closer to one another, and words that show up
in different contexts get farther away from one another. And as a result, hopefully
I get vector representations of words like breakfast, and lunch, and
dinner that are similar to one another, and then words like book,
and memoir, and novel are also going to be similar
to one another as well. So using this algorithm, we're
able to take a corpus of data and just train our computer, train this
neural network to be able to figure out what vector, what sequence
of numbers is going to represent each of these words-- which
is, again, a bit of a strange concept to think about representing a word
just as a whole bunch of numbers. But we'll see in a moment just
how powerful this really can be.
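As an aside, the demo below relies on a pretrained words.txt file, but a library like gensim can train word2vec-style vectors from scratch. A tiny sketch-- with a toy corpus far too small to produce meaningful vectors-- might look like this:

```python
from gensim.models import Word2Vec

sentences = [
    ["for", "breakfast", "he", "ate", "eggs"],
    ["for", "lunch", "he", "ate", "a", "sandwich"],
    ["he", "wrote", "a", "book"],
]

# sg=1 selects the skip-gram architecture described above
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["lunch"])                   # the learned vector for "lunch"
print(model.wv.most_similar("breakfast"))  # nearby words in the vector space
```

So we'll go ahead and go into vectors,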
and what I have inside of vectors.py-- which I'll open up now-- is I'm opening up words.txt, which
is a pretrained model that just-- I've already run word2vec
and it's already given me a whole bunch of vectors for
each of these possible words. And I'm just going to
take like 50,000 of them and go ahead and save their vectors
inside of a dictionary called words. And then I've also defined
some functions called distance, closest_words, so it'll get me what are
the closest words to a particular word, and then closest_word, that just gets
me the one closest word, for example. And so now let me try doing this. Let me open up the Python
interpreter and say something like, from vectors import star-- just import everything from vectors. And now let's take a look at
the meanings of some words. Let me look at the
word city, for example. And here is a big array that is the
vector representation of the word city. And this doesn't mean anything, in
terms of what these numbers exactly are, but this is how my
computer is representing the meaning of the word city. We can do a different
word, like the word house, and here then is the vector
representation of the word house, for example-- just a
whole bunch of numbers. And this is encoding somehow
the meaning of the word house. And how do I get at that idea? Well, one way to measure how
good this is is by looking at, what is the distance between
various different words? There a number of ways
you can define distance. In context of vectors,
one common way is what's known as the cosine distance
that has to do with measuring the angle between vectors. But in short, it's just
measuring, how far apart are these two vectors from each other?
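A small sketch of cosine distance, assuming the word vectors are numpy arrays like the ones stored in the words dictionary:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 minus the cosine of the angle between the two vectors:
    # 0 means they point the same way, values near 1 mean very dissimilar
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.2, 0.9, 0.1])
v = np.array([0.25, 0.8, 0.05])
print(cosine_distance(u, u))  # essentially 0
print(cosine_distance(u, v))  # small, since the vectors point in similar directions
```

So if I take a word like the word book,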
how far away is it from itself-- how far away is the
word book from book-- well, that's zero. The word book is zero
distance away from itself. But let's see how far away word
book is from a word like breakfast, where we're going to say one is
very far away, zero is not far away. All right, book is about
0.64 away from breakfast. They seem to be pretty far apart. But let's now try and calculate
the distance from the word book to the word novel, for example. Now, those two words are
closer to each other-- 0.34. The vector representation
of the word book is closer to the vector
representation of the word novel than it is to the vector
representation of the word breakfast. And I can do the same thing and,
say, compare breakfast to lunch, for example. And those two words are
even closer together. They have an even more
similar relationship between one word and another. So now it seems we have some
representation of words, representing a word using vectors, that
allows us to be able to say something like words that are
similar to each other ultimately have a smaller distance
that happens to be between them. And this turns out to be
incredibly powerful to be able to represent the meaning of
words in terms of their relationships to other words as well. I can tell you as well-- I have a function called
closest words that basically just takes
a whole bunch of words and gets all the closest words to it. So let me get the closest
words to book, for example, and maybe get the 10 closest words. We'll limit ourselves to 10. And right. Book is obviously closest
to itself-- the word book-- but is also closely related to books,
and essay, and memoir, and essays, and novella, and anthology. And why did it compute that these words are close to it? Well, based on the corpus of information that this algorithm was trained on, the vectors arose based on which words show up in similar contexts-- the word book shows up in contexts similar to words like memoir and essays, for example. And if I do something like-- let me get the closest words to city-- you end up getting city,
town, township, village. These are words that happen to show up
in a similar context to the word city.
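And the nearest-neighbor queries look something like this-- again assuming the closest_words helper sketched earlier, and slicing the result to keep just the ten nearest words.

```python
# Sketch of the closest-words queries above.

from vectors import closest_words, words

# The ten words whose vectors are nearest to "book"
print(closest_words(words["book"])[:10])   # book, books, essay, memoir, essays, novella, anthology, ...

# And the ten nearest to "city"
print(closest_words(words["city"])[:10])   # city, town, township, village, ...
```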
Now, where things get really interesting is that, because these are vectors, we can do mathematics with them. We can calculate the relationships
between various different words. So I can say something like, all
right, what if I had man and king? These are two different vectors,
and this is a famous example that comes out of word2vec. I can take these two vectors and
just subtract them from each other. This line here, the distance
here, is another vector that represents like king minus man. Now, what does it mean to take a
word and subtract another word? Normally, that doesn't make sense. In the world of vectors, though,
you can take some vector, some sequence of numbers, subtract
some other sequence of numbers, and get a new vector, get
a new sequence of numbers. And what this new sequence of
numbers is effectively going to do is it is going to tell me, what do I
need to do to get from man to king? What is the relationship
then between these two words? And this is some vector
representation of what takes us from man to king. And we can then take this value
and add it to another vector. You might imagine that the
word woman, for example, is another vector that exists
somewhere inside of this space, somewhere inside of this vector space. And what might happen if I
took this same idea, king minus man-- took that same vector
and just added it to woman? What will we find around here? It's an interesting
question we might ask, and we can answer it very easily,
because I have vector representations of all of these things. Let's go back here. Let me look at the
representation of the word man. Here's the vector representation of the word man. Let's look at the
representation of the word king. Here's the representation
of the word king. And I can subtract these two. What is the vector
representation of king minus man? It's this array right here-- whole bunch of values. So king minus man now represents the
relationship between king and man in some sort of numerical vector format. So what happens then
if I add woman to that? Whatever took us from man to king,
go ahead and apply that same vector to the vector representation
of the word woman, and that gives us this vector here. And now, just out of curiosity,
let's take this expression and find, what is the closest
word to that expression? And amazingly, what we get
is we get the word queen-- that somehow, when you take the
distance between man and king-- this numerical representation
of how man is related to king-- and add that same notion, king minus man, to the vector representation of the word woman, what we get is the vector representation, or something close to the vector representation, of the word queen, because this distance somehow encoded
the relationship between these two words. And when you run it
through this algorithm, it's not programmed to do this,
but if you just try and figure out how to predict words
based on context words, you get vectors that are able to
make these SAT-like analogies out of the information that has been given.
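As a sketch, the arithmetic from that walkthrough looks like this-- the subtraction and addition happen componentwise on the vectors, and closest_word then finds whichever known word lies nearest to the resulting point.

```python
# Sketch of the man : king :: woman : ? analogy described above.

from vectors import closest_word, words

# "king" minus "man" is a vector capturing, roughly, what takes us from man to king
difference = words["king"] - words["man"]

# Applying that same difference to "woman" lands near the vector for "queen"
print(closest_word(words["woman"] + difference))   # queen, or something close to it
```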
So there are more examples of this. We can say, all right, let's figure out, what is the distance between
Paris and France? So Paris and France are words. They each have a vector representation. This then is a vector representation of
the distance between Paris and France-- what takes us from France to Paris. And let me go ahead and add the vector
representation of England to that. So this then is the
vector representation of going Paris minus
France plus England-- so the distance between
France and Paris as vectors. Add the England vector, and let's go ahead and find the closest word to that. And it turns out to be London. You take this relationship, the
relationship between France and Paris. Go ahead and add the England vector
to it, and the closest vector to that happens to be the vector
for the word London. We can do more examples. I can say, let's take
the word for teacher-- that vector representation
and-- let me subtract the vector representation of school. So what I'm left with is, what
takes us from school to teacher? And apply that vector to a
word like hospital and see, what is the closest word to that-- turns out the closest word is nurse. Let's try a couple more examples--
take the word for ramen, for example, and subtract the word for Japan. So what is the relationship between Japan and ramen? Add the word for America to that. Want to take a guess as to what you might get as a result? It turns out you get burritos. If you do the subtraction, do the addition, this is the answer that you happen to get as a consequence of this as well.
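The other analogies follow the same pattern. Whether the pretrained vocabulary here is lowercased is an assumption on my part, but the shape of each query is the same.

```python
# Sketch of the remaining analogies, using the same vector arithmetic.

from vectors import closest_word, words

# Paris is to France as ___ is to England
print(closest_word(words["paris"] - words["france"] + words["england"]))      # london

# teacher is to school as ___ is to hospital
print(closest_word(words["teacher"] - words["school"] + words["hospital"]))   # nurse

# ramen is to Japan as ___ is to America
print(closest_word(words["ramen"] - words["japan"] + words["america"]))       # burritos
```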
So these very interesting analogies arise in the relationships between these words-- that if you just map out all of
these words into a vector space, you can get some pretty interesting
results as a consequence of that. And this idea of representing
words as vectors turns out to be incredibly useful
and powerful anytime we want to be able to do
some statistical work with regards to natural language,
to be able to have-- represent words not just
as their characters, but to represent them as numbers,
numbers that say something or mean something about
the words themselves, and somehow relate the meaning
of a word to other words that might happen to exist-- so many tools then for being able to work inside of this world of natural language. Natural language is tricky. We have to deal with the syntax of
language and the semantics of language, but we've really just seen the
beginning of some of the ideas that are underlying a lot of natural language
processing-- the ability to take text, extract information out of it, get
some sort of meaning out of it, generate sentences maybe by having some
knowledge of the grammar or maybe just by looking at probabilities of what
words are likely to show up based on other words that have
shown up previously-- and then finally, the
ability to take words and come up with some distributed
representation of them, to take words and represent them as
numbers, and use those numbers to be able to say something
meaningful about those words as well. So this then is yet another
topic in this broader heading of artificial intelligence. And just as I look back
at where we've been now, we started our conversation by
talking about the world of search, about trying to solve problems
like tic-tac-toe by searching for a solution, by exploring our
various different possibilities and looking at what algorithms we
can apply to be able to efficiently try and search a space. We looked at some simple algorithms
and then looked at some optimizations we could make to those
algorithms, and ultimately, that was in service of trying to get our
AI to know things about the world. And this has been a lot of what
we've talked about today as well, trying to get knowledge out
of text-based information, the ability to take information, draw
conclusions based on that information. If I know these two things
for certain, maybe I can draw a third conclusion as well. That then was related to
the idea of uncertainty. If we don't know
something for sure, can we predict something, figure out
the probabilities of something? And we saw that again
today in the context of trying to predict whether
a tweet or whether a message is positive sentiment
or negative sentiment, and trying to draw that
conclusion as well. Then we took a look at
optimization-- the sorts of problems where we're looking
for a global or local maximum or minimum. This has come up time and time
again, especially most recently in the context of neural networks, which
are really just a kind of optimization problem where we're trying to
minimize the total amount of loss based on the setting of our
weights of our neural network, based on the setting of what
vector representations for words we happen to choose. And those ultimately helped
us to be able to solve learning-related problems-- the
ability to take a whole bunch of data, and rather than us tell
the AI exactly what to do, let the AI learn patterns
from the data for itself. Let it figure out what makes an inbox
message different from a spam message. Let it figure out what
makes a counterfeit bill different from an
authentic bill, and being able to draw that analysis as well. And one of the big tools
in learning that we used were neural networks,
these structures that allow us to relate inputs to outputs
by training these internal networks to learn some sort of function that
maps us from some input to some output-- ultimately yet another model in this
language of artificial intelligence that we can use to
communicate with our AI. Then finally today,
we looked at some ways that AI can begin to communicate
with us, looking at ways that AI can begin to get an
understanding for the syntax and the semantics of language to
be able to generate sentences, to be able to predict things about
text that's written in a spoken language or a written
language like English, and to be able to do interesting
analysis there as well. And there's so much more
in active research that's happening all over the areas within
artificial intelligence today, and we've really only just seen the
beginning of what AI has to offer. So I hope you enjoyed this
exploration into this world of artificial intelligence with Python. A big thank you to the course's
teaching staff and the production team for making this class possible. This was an Introduction to
Artificial Intelligence with Python.