(moderator)
Hello. This is Structured Data
from Unstructured Text. I'd like to introduce
our speaker, Van Lindberg. [applause] (Van Lindberg)
Hello everyone. Wow! That's quite a bit louder
than I expected. I am happy to be here today. I hope that you are happy about it,
because a number of you were probably coming expecting
to see Smitha Milli. Smitha had an issue that -- a travel issue, I believe,
and wasn't able to make it. And so, I saw her abstract
and I saw her title, and this was something
that I have been passionate about for a long time, and I said,
"Well, I'll give that talk. "I don't have her slides
but I'll write something." And they said, "Okay." And so, I saddled myself
with a talk 24 hours ago. But this is something that I love. But I tell you,
I will not be offended if any of you walk out right now. Still here? Good. Okay. So why are we here? Why do we care
about getting unstructured -- getting structured data
from unstructured text? Well, I think it's because
we are creatures of language. We have -- we are surrounded
by masses of text. You know, there is an explosion
of publicly available information right now,
and most of it is in text format. And we want to use that text,
we being who we are, as input to a function. You know, this is
very trendy right now. Half of the startups out there
are about applying machine learning to some sort of text database
or some sort of images, and people are throwing
money at them. We want to make
our processes more efficient. But I think it's about
a little bit more than that. As Dr. Barba said
in her keynote yesterday, we are creatures of language. We act through language. If we can teach computers
to take all of this text that surrounds us
and turn it into information that we can use in various ways,
then we have done something that is actually very profound. We have taught computers
to read for us, to understand, and to teach us back
the things that they understood. We have made computers
a little more human. That is a pretty profound thing. So, I think as soon as we start
talking about generating... Ah, I'm not even keeping up
with my slides. As soon as we start talking about
generating structured data from unstructured text, I think
that we need a couple definitions. These are not dictionary definitions,
these are my definitions. And I'm not going to read them
to you, but I want to highlight one thing about them, and that is:
it is all about ambiguity. As humans, we have the capability of --
we have lots of context and we have lots of things
that we use for clues in order to reduce the ambiguity in English sentences
or other natural languages and come up with something approaching
what we think that other person meant. But computers don't have
that same advantage. And so, we --
what we are trying to do when we're generating structured data
is we're trying to reduce the -- use the structure of the language,
all the clues that we have, and teach a computer to reduce
the ambiguity so that we are able to then compute about it. So, to start with, we need some data. Now, I'm going to switch over
to a Jupyter notebook and may the demo gods
all be with us today. And I have a CSV -- oops, wrong way. I have a CSV that we exported actually
from the PyCon site that has tutorial names and --
talk and tutorial names and abstracts, and as far as possible I'm going to be
using that for some of the data today. Now, this is unstructured text. I can show it to you right here. And you can see
it's just in a CSV file. It's not very interesting. But even --
this is unstructured text, but I have to warn you
that I am cheating, because already, I have it
in a machine-parsable format. I have it in a CSV. Really, this information comes
from a database dump, or even you can get it off the HTML
by scraping the PyCon website. A lot of times, the data that you get
is going to come in a messy format embedded in some sort of document,
and half of machine learning is about the pipes that move the data around
and cleaning the data. And so, I'm going to start by getting the data,
and I'll run this here. And if you look at it, I went over this once.
I removed all the -- it's in UTF-8, but I removed all the characters
that have an ordinal number higher than 128,
because I really only needed the ASCII data.
I had some messy stuff, but you'll see I didn't actually do it very carefully,
because I've got a bunch of spurious newlines in there.
So I'm going to actually clean it a little bit more,
and I'm going to replace some of these newlines.
And I'm going to also just make a little index that I'm going to save,
so I can refer back to these documents later.
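The notebook cell itself isn't captured in the transcript, but the loading and cleanup he describes would look roughly like this. It's a sketch only: the filename "talks.csv" and its column names are assumptions, not details given in the talk.

    import csv

    # A rough sketch of the step described above; "talks.csv" and its column
    # names are assumptions, not the actual file used in the demo.
    documents = []
    with open("talks.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text = row["title"] + " " + row["description"]
            # Keep only ASCII, dropping anything with a high ordinal number.
            text = "".join(ch for ch in text if ord(ch) < 128)
            # Replace the spurious newlines with spaces.
            text = text.replace("\n", " ").replace("\r", " ")
            documents.append(text)

    # A little index so we can refer back to a particular document later.
    index = {i: doc for i, doc in enumerate(documents)}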
So, you can see -- you saw these newlines and everything.
Let's make it look just a little bit nicer here.
And, well, it is... It's harder to tell at this resolution;
however, the spurious newlines are gone. But... You may think
that we are ready to go, but we actually have
one more important step, and that is: we need to tokenize. We need to decide
what we actually mean when we say, "We have something that we're ready
to compute about. We have data." Now, when you're thinking about
words and language, you may think, "Well, obviously you would split
along the words." But that is not nearly as obvious
as it might seem. The reason why
is because there are lots of interesting things
that you can do -- for example, splitting along every character.
How many of you have heard of neural networks,
in particular, recurrent neural networks? A lot of people.
There was a great demo that came around a little while ago,
where they decided, "We're going to generate text,
and we're going to have a bunch of words,
but we're not going to look at them as words.
Instead, we're going to look at them on a character-by-character basis,
and we're going to figure out what is the most likely character,
after looking at all this stuff, based upon the inputs
that we've had so far." And so I've got --
you know, "Python is..." Snd how many characters?
I thought obviously we should do 42. "Python is the legend
of John Newton at Storm. Sit." [laughter] And so, the amazing thing,
when you think about it, this was generated
character by character. It had no idea that it was
generating English words. It was computing on this idea
of one piece of text anticipates the next,
anticipates the next. And we can actually
redo this over and over and get new funny --
new funny things. We'll give it a few more. "Python is locked panel:
The current De Zoli (GSR) Cit." The point is that you may want
to think about: is a word really the smallest piece of information
that you need?
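The demo he ran was a recurrent neural network trained character by character, which isn't reproduced here. The underlying idea, though -- predict a plausible next character from the characters seen so far, then feed it back in -- can be illustrated with a much simpler character-level Markov sketch (my stand-in, not the actual demo code):

    import random
    from collections import defaultdict

    def train_char_model(text, order=4):
        # Map each `order`-character context to the characters that follow it.
        model = defaultdict(list)
        for i in range(len(text) - order):
            model[text[i:i + order]].append(text[i + order])
        return model

    def generate(model, seed, length=42, order=4):
        # Repeatedly sample a plausible next character given the trailing context.
        out = seed
        for _ in range(length):
            followers = model.get(out[-order:])
            if not followers:
                break
            out += random.choice(followers)
        return out

    # corpus = " ".join(documents)   # e.g. the cleaned abstracts from earlier
    # print(generate(train_char_model(corpus), seed="Python is"))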
Going the other way, you may want to also look at things
that are bigger than one single word.
For example, bigrams are sequences of two words.
You would have those in cases where you have --
especially for something like a proper noun --
where "the queen" has a very separate and distinct meaning
from just the word "queen." And by throwing away the word "the",
the information that it was next to the word "queen",
you actually throw away a lot of information.
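A minimal illustration of the bigram idea (the sentence is made up for the example):

    tokens = "the queen opened the session of parliament".split()

    # Pair each token with the one that follows it.
    bigrams = list(zip(tokens, tokens[1:]))
    print(bigrams)
    # [('the', 'queen'), ('queen', 'opened'), ('opened', 'the'), ...]
    # Keeping ('the', 'queen') together preserves information that a plain
    # bag of single words throws away.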
In fact, language models -- I don't know how many of you
know Doug Napoleone. He works for Nuance,
which does a lot of cutting-edge language research and work.
And he said that they have a six-way model -- a 6-gram is what they call it --
for predicting the next part of speech. But we are going to be
completely boring and we are going to tokenize
just upon words. And one of the things
that we have in Python is something called NLTK.
It's the Natural Language Toolkit. I'm going to be using
a lot of NLTK today and I'm going to be using
something called gensim. These are excellent,
almost jack of all trades in terms of doing a lot of things
with natural language processing. There are a few others
that we'll talk about. Also, all of this, just as an aside,
is running in Python 3.5.1, and so everything here is available in Python 3.
So, we're going to import NLTK and we're going to tokenize our words.
And so, let's see -- I took a random sentence from one of our inputs.
Let's see what we've got. So let's run this.
So, "The syntax in many programming languages is complicated
and they use many characters rarely found in written English"
was the sentence, and you can see the way in which it tokenized it.
It broke it out mostly by white space,
and you'll notice it also separated out the period
as a separate character at the end,
because it changes what it is that you are thinking about.
It changes the meaning of both the word before and the word after.
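For reference, the tokenization step would look something like this with NLTK; this is a sketch rather than the exact notebook cell, and it assumes the punkt tokenizer data has been downloaded:

    import nltk

    nltk.download("punkt")  # one-time download of the tokenizer model

    sentence = ("The syntax in many programming languages is complicated "
                "and they use many characters rarely found in written English.")

    tokens = nltk.word_tokenize(sentence)
    print(tokens)
    # Mostly a whitespace split, but note that the final period comes out
    # as its own token: [..., 'written', 'English', '.']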
Now that we have our various tokens,
the first place that you might think about looking for ways
to reduce the ambiguity, to get information out of these sentences,
is by examining the structure of the sentence itself.
I don't know how many of you ever had to do this.
Did any of you ever have to do these? They were a pain, weren't they?
This is a sentence diagram, where the different parts of speech
and the relationships between the different words
are represented by the types of lines
and the connections between them. This is something that we can also do
with computers, and in fact, Google just released
something called SyntaxNet where they use their TensorFlow
system in order to do very high quality parsing,
tagging, tokenization. And it uses both a rule-based system
and a probabilistic system based upon learning many, many
sentences across the entire Internet. This pre-trained model
that they provide for parsing English sentences
is called Parsey McParseface. And you can look it up.
And they have a command-line interface to it,
and so you can see that it generates the same sort of tree.
Not exactly the same, but you can see the similarities.
And you can also see it allocating part-of-speech tags
to the various words: verbs, nouns, subjects, direct objects, etc.
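The SyntaxNet output itself isn't reproduced in the transcript. As a rough stand-in, NLTK can produce the same kind of part-of-speech tags (this is not Parsey McParseface, and it assumes the relevant NLTK data packages are installed):

    import nltk

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("Queen Elizabeth opened the new bridge.")
    print(nltk.pos_tag(tokens))
    # Roughly: [('Queen', 'NNP'), ('Elizabeth', 'NNP'), ('opened', 'VBD'),
    #           ('the', 'DT'), ('new', 'JJ'), ('bridge', 'NN'), ('.', '.')]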
So this is very useful for certain types of information extraction,
particularly if you want to know about the relationships within the sentence. But personally,
having done a lot of this, I found that it hasn't quite
lived up to my expectations. It hasn't been as useful
as I would have expected. And the reason why
is because it tells you a lot of things about
the structure of the sentence, but then it doesn't tell you about
anything relative to other sentences. It doesn't tell you about information
and how it relates to other symbols that represent things
in the real world. For example,
it is a lot more useful to know that Queen Elizabeth
is closely associated with England than that "Queen Elizabeth"
is a compound noun phrase. One ties two concrete facts together, the other ties a fact
to an abstract concept. Useful, but not always. The other thing is that Parsey
McParseface is state of the art. It can correctly diagram
and tag about 94% of sentences. I believe that that
is either the best or near the best that has been done.
This is actually just a hair below what a trained human linguist could do,
which is about 97%. But you'll notice there is still a lot
of ambiguity in there. The thing that strikes me is,
even at 97% of sentences, a trained linguist still
gets 3% of those sentences wrong. So what are we to do? Turns out that the answer, as so frequently is the case
these days, is data. More and more data. And this is something
that has really been pioneered by Google, and their -- they had an interesting thing
many years ago where they said -- they talked about the unreasonable
effectiveness of data. And so, let's start again then
with our tokenized sentence and see if there's another way
we can work with them. And that's by thinking of them
as a collection of features. Now a feature is anything
that a computer can recognize. In this case... In this case,
what we're going to do -- let me switch back here. Oh, by the way, NLTK -- oops. Nope, that's not what I want to do. NLTK can also do that. Hmm. All right. So, moving on. [laughter] So, the... We're going to look at a series
of features. And when we're looking at language,
the thing that we're going to say is, a feature is,
at least at the start, the presence or the absence
of the word in a certain context. Now this throws away
a whole lot of context -- so much that you would be
really surprised. They call it a bag of words. And the image is actually rather apt:
you pour everything into a bag and you jumble it,
and that is your set of features, and that completely characterizes
your document. And so, we're going to take our documents, our texts here,
and we're going to prepare a bag of words with them
by removing some of the most common ones.
We'll use our tokenizer here. You can see the text.
And we're going to throw out the ones that only appear once.
And then we're going to create a series of texts. So let's run this.
It looks like I have... I need to... Let me run all the ones above this. [laughter] Huh. It worked. Okay, there we go.
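The cell he runs here isn't visible, but the preparation he describes -- tokenize, drop very common words, drop words that appear only once -- would be roughly the following sketch (the NLTK stopword list is my assumption for "the most common ones"):

    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    nltk.download("punkt")
    nltk.download("stopwords")
    stops = set(stopwords.words("english"))

    # `documents` is the list of cleaned strings built earlier.
    tokenized = [[w for w in nltk.word_tokenize(doc.lower()) if w not in stops]
                 for doc in documents]

    # Throw out the words that only appear once across the whole collection.
    counts = Counter(w for doc in tokenized for w in doc)
    texts = [[w for w in doc if counts[w] > 1] for doc in tokenized]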
And so, now we're going to create a dictionary and a corpus.
A corpus is simply a collection of documents,
and then a dictionary is something that maps a particular position
in this vector to a particular word.
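In gensim, that step is roughly the following, building on the `texts` list from the previous sketch:

    from gensim import corpora

    # Map each distinct word to an integer id (its position in the vector).
    dictionary = corpora.Dictionary(texts)

    # Represent each document as a bag of words: a list of (word_id, count) pairs.
    corpus = [dictionary.doc2bow(text) for text in texts]

    print(list(dictionary.token2id.items())[:5])
    print(corpus[0])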
Now the nice thing about a vector is it's a long, one-dimensional matrix.
It's a 1-by-n matrix, where n is the number of different features that you have,
and the 1 is the presence or the absence of that feature,
at least in the bag-of-words model. And the thing that you --
and then you can create things later that use different representations
in that spot. But as long as you have that, you can do a lot
of really nifty matrix math on it, and that allows you to extract
some of the meaning from it. So we're going to use gensim here. You'll notice I've moved
to gensim in the text. And we are going to... Oops. Oh, dead kernel. [audience member inaudible] [laughter] Okay. [audience member inaudible] (audience member)
"Kernel" in the toolbar. (Van Lindberg)
Kernel in the toolbar. Restart. [applause] Thank you. And then we'll Run All Above. And let's see if we can. Do I have a dead kernel again? [audience member inaudible] Okay. So while we're waiting for this, I'm going to tell you
about what it's doing. So, we have this
"bag of words" model, and frankly it's not that useful
except as a starting place. But the thing that we can start to do
is we can start to talk about extracting information
by doing math on this vector. And the first thing
that we're going to do is we're going to look
at something called tf-idf, which stands for
"term frequency-inverse document frequency."
What this does is it looks at the frequency of each feature,
in this case each word, within a document, and it scales that down
by the number of documents that the feature appears in
across the whole collection. If you think about it,
this actually becomes a pretty intuitive way
of identifying things that are important
in a particular document. For example, the word "and"
is going to appear all over the place, but because it has a high divisor --
it appears in lots of documents, it has a high document frequency --
it ends up not being very important.
Whereas if you have something like "docker," well, "docker" is going to show up
quite a bit. Assuming that... You know what? I actually have all the text. So let's see if we can... No, not that one. Okay. This is not working. So, what we're going to do is... Man! Okay. So, after we look at term frequency-inverse document frequency,
what we have is a series of floats in each of those spaces
that talk about the relative importance of each of those features
in each document.
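The cell that wouldn't cooperate here would, in gensim, be something like this sketch, reusing the corpus and dictionary from above:

    from gensim import models

    # Learn idf weights from the bag-of-words corpus, then reweight each document.
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]

    # Each document is now a sparse list of (word_id, weight) pairs, where
    # words that appear in many documents are pushed toward zero.
    print(tfidf[corpus[0]])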
And once we have that, we have what is called a sparse vector.
Most of those things are going to be either zeros or very, very low weights.
It's just -- think about it: if you had
a long, long matrix that had a 1 for every single word
that was in a document. Now a document may have a lot of words,
but if the length of the vector is the number of all words in English,
or all the words in all the documents that you have,
most of those entries are going to be 0. Those words are not going to appear in that document at all. And so as a result, what happens is that you have
very, very high dimensional spaces. Typical models will use anywhere from
300 to 500 as sort of a golden range and can go up to thousands
or even millions of different features
in their vector space. This is simply too many
for us to meaningfully reason about. So one of the things that we do is
we take a look at ways in which we can do meaningful
dimensionality reduction, which means that we say,
"What are the most important pieces "of this vector? What are the things
that most correctly characterize it?" And we use those
because that ends up being a much more compact representation
of what is the essence of this particular document. So, how many of you have heard of
principal component analysis? Good.
The first thing that came out -- this was either the late '90s or the 2000s --
was the use of LSI, or latent semantic indexing,
in order to identify what a document is about, what other things are about,
by doing principal component analysis across the different vectors
associated with a particular document.
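A gensim sketch of that reduction step, run over the tf-idf vectors from earlier (the number of topics is an arbitrary choice for illustration, not a value from the talk):

    from gensim import models

    # Project the tf-idf vectors down to a smaller number of latent dimensions.
    lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)

    # Each "topic" is a weighted combination of words that tend to occur together.
    for topic in lsi.print_topics(3):
        print(topic)

    # A single document expressed in the reduced space.
    print(lsi[tfidf[corpus[0]]])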
Turns out that they were able to find that certain things
were very closely aligned in various ways,
and other things were less closely aligned.
And as a result, they were able to say,
well, if you have a word or a term like "Queen Elizabeth," that tends to be
highly associated with ships, with England, with the Crown,
with things like that. And because those things
all tended to cluster together, you were able to not only be able
to have them together but also to be able to... You know what? We are out of time. (unidentified speaker)
I was just going to say, do you want a quick hand
fixing that up? (Van Lindberg)
I talked for too long. (unidentified speaker)
Do you want a quick hand fixing that up? (Van Lindberg)
Sure, please fix that up. In the meantime... In the meantime,
I'll tell you, basically, there are different ways
in which you could do this -- latent semantic indexing was one --
and then you had something called
latent Dirichlet allocation -- I'm not even sure
I'm saying that right -- LDA, which allocates
a series of topics across a particular document.
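A gensim sketch of that topic allocation (the number of topics is an illustrative guess, not a value from the talk):

    from gensim import models

    # Learn a topic model over the bag-of-words corpus.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10)

    # Each document comes back as a mixture of topics: (topic_id, weight) pairs.
    print(lda[corpus[0]])
    for topic in lda.print_topics(3):
        print(topic)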
The most interesting things that have happened recently
are the creation of two things called word2vec and lda2vec.
And what this is, is the evaluation of a conditional probability of a word,
given the other words around it. And the interesting thing about this
is that it operates in a similar vector space
and allows you to do math on words.
For example, if you say "king, minus man, plus woman," what do you get? Queen.
And if you think about your vectors, when you add them in different ways,
turns out that with word2vec the arithmetic actually works out that way.
Lda2vec does the same thing for topics.
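As a sketch of that word arithmetic with gensim -- this assumes gensim's downloader and a large pretrained vector set, since a corpus of talk abstracts is far too small for the king/queen analogy to come out:

    import gensim.downloader as api

    # Load pretrained word vectors (several hundred MB on first use; the model
    # name here is an assumption, any large pretrained set will do).
    wv = api.load("glove-wiki-gigaword-100")

    # king - man + woman lands closest to "queen".
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))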
And with that, I hope you haven't been too disappointed.
But thank you for listening, and I'll take any questions as we go down.
Thank you. [applause]