Van Lindberg - Structured Data from Unstructured Text - PyCon 2016

Captions
(moderator) Hello. This is Structured Data from Unstructured Text. I'd like to introduce our speaker, Van Lindberg. [applause]

(Van Lindberg) Hello everyone. Wow! That's quite a bit louder than I expected. I am happy to be here today. I hope that you are happy about it, because a number of you were probably coming expecting to see Smitha Milli. Smitha had an issue that -- a travel issue, I believe, and wasn't able to make it. And so, I saw her abstract and I saw her title, and this was something that I have been passionate about for a long time, and I said, "Well, I'll give that talk. I don't have her slides, but I'll write something." And they said, "Okay." And so, I saddled myself with a talk 24 hours ago. But this is something that I love. But I tell you, I will not be offended if any of you walk out right now. Still here? Good. Okay.

So why are we here? Why do we care about getting unstructured -- getting structured data from unstructured text? Well, I think it's because we are creatures of language. We have -- we are surrounded by masses of text. You know, there is an explosion of publicly available information right now, and most of it is in text format. And we want to use that text, being who we are, as input to a function. You know, this is very trendy right now. Half of the startups out there are about applying machine learning to some sort of text database or some sort of images, and people are throwing money at them. We want to make our processes more efficient. But I think it's about a little bit more than that. As Dr. Barba said in her keynote yesterday, we are creatures of language. We act through language. If we can teach computers to take all of this text that surrounds us and turn it into information that we can use in various ways, then we have done something that is actually very profound. We have taught computers to read for us, to understand, and to teach us back the things that they understood. We have made computers a little more human. That is a pretty profound thing.

So, I think as soon as we start talking about generating... Ah, I'm not even keeping up with my slides. As soon as we start talking about generating structured data from unstructured text, I think that we need a couple of definitions. These are not dictionary definitions, these are my definitions. And I'm not going to read them to you, but I want to highlight one thing about them, and that is: it is all about ambiguity. As humans, we have the capability of -- we have lots of context and we have lots of things that we use for clues in order to reduce the ambiguity in English sentences or other natural languages and come up with something approaching what we think that other person meant. But computers don't have that same advantage. And so, we -- what we are trying to do when we're generating structured data is we're trying to reduce the -- use the structure of the language, all the clues that we have, and teach a computer to reduce the ambiguity so that we are able to then compute about it.

So, to start with, we need some data. Now, I'm going to switch over to a Jupyter notebook, and may the demo gods all be with us today. And I have a CSV -- oops, wrong way. I have a CSV that we exported actually from the PyCon site that has tutorial names and -- talk and tutorial names and abstracts, and as far as possible I'm going to be using that for some of the data today. Now, this is unstructured text. I can show it to you right here. And you can see it's just in a CSV file. It's not very interesting.
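[A minimal sketch of that loading step might look like the following. The file name and column names (pycon_talks.csv, title, description) are assumptions for illustration; the actual export used in the talk is not shown.]

```python
# Minimal sketch of loading the talk/tutorial export, assuming a CSV named
# "pycon_talks.csv" with "title" and "description" columns (hypothetical names).
import csv

with open("pycon_talks.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

documents = [row["description"] for row in rows]
print(len(documents), "abstracts loaded")
```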
But even -- this is unstructured text, but I have to warn you that I am cheating, because already, I have it in a machine-parsable format. I have it in a CSV. Really, this information comes from a database dump, or you can even get it off the HTML by scraping the PyCon website. A lot of times, the data that you get is going to come in a messy format embedded in some sort of document, and half of machine learning is about the pipes that move the data around and cleaning the data.

And so, I'm going to start by getting the data, and I'll run this here. And if you look at it, I just -- I went over this once. I removed all the -- it's in UTF-8, but I removed all the characters that have an ordinal number higher than 128 because I really only needed the ASCII data. I had some messy stuff, but you'll see I didn't actually do it very carefully, because I've got a bunch of spurious newlines in there. So I'm going to actually clean it a little bit more, and I'm going to replace some of these newlines. And I'm going to also just make a little index that I'm going to use to save, and I can refer back to these documents later. So, you can see the -- you saw these newlines and everything. Let's make it look just a little bit nicer here. And, well, it is... It's harder to tell at this resolution; however, those -- the spurious newlines are gone.

But... You may think that we are ready to go, but we actually have one more important step, and that is: we need to tokenize. We need to decide what we actually mean when we say, "We have something that we're ready to compute about. We have data." Now, when you're thinking about words and language, you may think, "Well, obviously you would split along the words." But that is not nearly as obvious as it might seem. The reason why is because there are lots of interesting things that you can do -- for example, splitting along every single character. How many of you have heard of neural networks, in particular, recurrent neural networks? A lot of people. There was a great demo the other day -- that came around a little while ago, where they decided, "We're going to generate text, and we're going to have a bunch of words, but we're not going to look at them as words. Instead, we're going to look at them on a character-by-character basis, and we're going to figure out what is the most likely character, after looking at all this stuff, based upon the inputs that we've had so far." And so I've got -- you know, "Python is..." And how many characters? I thought obviously we should do 42. "Python is the legend of John Newton at Storm. Sit." [laughter] And so, the amazing thing, when you think about it, is that this was generated character by character. It had no idea that it was generating English words. It was computing on this idea of one piece of text anticipates the next, anticipates the next. And we can actually redo this over and over and get new funny -- new funny things. We'll give it a few more. "Python is locked panel: The current De Zoli (GSR) Cit."

The point is that you may want to think about, is a word really the smallest piece of information that you need? Going the other way, you may want to also look at things that are bigger than one single word. For example, bigrams are sequences of two words. You would have those in cases where you have -- especially for something like a proper noun -- where "the queen" has a very separate and distinct meaning from just the word "queen."
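[The cleaning pass described here can be sketched roughly as follows; the variable documents is assumed to be the list of raw abstracts from the loading step, and the exact cleanup in the talk's notebook may differ.]

```python
# Rough sketch of the cleanup described above: keep only ASCII characters
# (ordinal below 128) and collapse spurious newlines into single spaces.
# "documents" is assumed to be the list of raw abstracts loaded earlier.
def clean(text):
    ascii_only = "".join(ch for ch in text if ord(ch) < 128)
    return " ".join(ascii_only.split())  # split() breaks on any whitespace, including newlines

cleaned = [clean(doc) for doc in documents]
index = {i: doc for i, doc in enumerate(cleaned)}  # small index to refer back to documents later
```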
And by throwing away the word "the" -- the fact, the information that it was next to the word "queen" -- you actually throw away a lot of information. In fact, language models -- I don't know how many of you know Doug Napoleone. He works for Nuance, which does a lot of cutting-edge language research and work. And he said that they have a six-way model -- a 6-gram is what they call it -- for predicting the next part of speech.

But we are going to be completely boring, and we are going to tokenize just upon words. And one of the things that we have in Python is something called NLTK. It's the Natural Language Toolkit. I'm going to be using a lot of NLTK today, and I'm going to be using something called gensim. These are excellent, almost jack-of-all-trades libraries in terms of doing a lot of things with natural language processing. There are a few others that we'll talk about. Also, all of this, just as an aside, is running in Python 3.5.1, and so everything here is available in Python 3. So, we're going to import NLTK and we're going to tokenize our words. And so, let's see what -- I took a random sentence from one of our inputs. Let's see what we've got. So let's run this. So, "The syntax in many programming languages is complicated and they use many characters rarely found in written English" was the sentence, and you can see the way in which it tokenized it. It broke it out mostly by white space, and you'll notice it also separated out the period as a separate character at the end, because it changes the -- it changes what it is that you are thinking about. It changes the meaning of both the word before and the word after.

Now, now that we have our -- now that we have our various tokens, the first place that you might think about looking for ways to reduce the ambiguity, to get information out of these sentences, is by examining the structure of the sentence itself. I don't know how many of you ever had to do this. Did any of you ever have to do these? They were a pain, weren't they? This is a sentence diagram, where the different parts of speech and the relationships between the different words are represented by the types of lines and the connections between them. This is something that we can also do with computers, and in fact, Google just released something called SyntaxNet, where they use their TensorFlow system in order to do very high-quality parsing, tagging, and tokenization. And it uses both a rule-based system and a probabilistic system, based upon learning from many, many sentences across the entire Internet. The pre-trained model that they provide for parsing English sentences is called Parsey McParseface. And you can look it up. And they have a command line interface to it, and so you can see that it generates the same sort of tree. Not exactly the same, but you can see the similarities. And you can also see it allocating part-of-speech tags to the various words: verbs, nouns, subjects, direct objects, etc.

So this is very useful for certain types of information extraction, particularly if you want to know about the relationships between -- within the sentence. But personally, having done a lot of this, I've found that it hasn't quite lived up to my expectations. It hasn't been as useful as I would have expected. And the reason why is because it tells you a lot of things about the structure of the sentence, but then it doesn't tell you about anything relative to other sentences. It doesn't tell you about information and how it relates to other symbols that represent things in the real world.
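[The word tokenization step shown earlier in the notebook corresponds roughly to the sketch below; nltk.word_tokenize needs the punkt tokenizer data downloaded once, and the printed token list is abbreviated here.]

```python
# Word tokenization with NLTK: the tokenizer breaks mostly on whitespace
# and also splits off punctuation, such as the final period.
import nltk

nltk.download("punkt", quiet=True)  # one-time download ("punkt_tab" on very recent NLTK versions)

sentence = ("The syntax in many programming languages is complicated "
            "and they use many characters rarely found in written English.")
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['The', 'syntax', 'in', 'many', ..., 'written', 'English', '.']
```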
For example -- for example, it is a lot more useful to know that Queen Elizabeth is closely associated with England than that "Queen Elizabeth" is a compound noun phrase. One ties two concrete facts together; the other ties a fact to an abstract concept. Useful, but not always. The other thing is that Parsey McParseface is state of the art. It can correctly diagram and tag about 94% of sentences. I believe that that is either the best or near the best that has been done. This is actually just a hair below what a trained human linguist could do, which is about 97%. But you'll notice there is still a lot of ambiguity in there. The thing that strikes me is, even a trained linguist still gets 3% of those sentences wrong.

So what are we to do? Turns out that the answer, as so frequently is the case these days, is data. More and more data. And this is something that has really been pioneered by Google, and their -- they had an interesting thing many years ago where they said -- they talked about the unreasonable effectiveness of data. And so, let's start again then with our tokenized sentences and see if there's another way we can work with them. And that's by thinking of them as a collection of features. Now, a feature is anything that a computer can recognize. In this case... In this case, what we're going to do -- let me switch back here. Oh, by the way, NLTK -- oops. Nope, that's not what I want to do. NLTK can also do that. Hmm. All right. So, moving on. [laughter]

So, we're going to look at a series of features. And when we're looking at language, the thing that we're going to say is, a feature is, at least at the start, the presence or the absence of a word in a certain context. Now, this throws away a whole lot of context -- so much that you would be really surprised. They call it a bag of words. And the image is actually rather apt: you pour everything into a bag and you jumble it, and that is your set of features, and that completely characterizes your document. And so, we're going to take our documents, our texts here, and we're going to prepare a bag of words with them by removing some of the most common ones. We'll use our tokenizer here. You can see the text. And we're going to throw out the ones that only appear once. And then we're going to create a series of texts. So let's run this. It looks like I have... I need to... Let me run all the ones above this. [laughter] Huh. It worked. Okay, there we go.

And so, now we're going to create a dictionary and a corpus. A corpus is simply a collection of documents, and then a dictionary is something that maps a particular position in this vector to a particular word. Now, the nice thing about a vector is it's a long, one-dimensional matrix. It's a 1-by-n matrix where the n is the number of different features that you have, and each entry is the presence or the absence of that feature, at least in the bag-of-words model. And then you can create things later that use different representations in that spot. But as long as you have that, you can do a lot of really nifty matrix math on it, and that allows you to extract some of the meaning from it. So we're going to use gensim here. You'll notice I've moved to gensim in the text. And we are going to... Oops. Oh, dead kernel. [audience member inaudible] [laughter] Okay. [audience member inaudible] (audience member) "Kernel" in the toolbar.
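[The bag-of-words preparation he describes, dropping very common words and words that appear only once and then building a gensim dictionary and corpus, might be sketched like this; the tiny stoplist and the variable cleaned (the cleaned abstracts from the earlier sketch) are assumptions, not values from the talk's notebook.]

```python
# Sketch of the bag-of-words preparation: strip a few very common words,
# drop words that appear only once, then build a gensim dictionary (word <-> id)
# and a corpus of sparse (word_id, count) vectors.
from collections import Counter
from gensim import corpora

stoplist = set("for a of the and to in is are".split())  # tiny illustrative stoplist

texts = [[word for word in doc.lower().split() if word not in stoplist]
         for doc in cleaned]

counts = Counter(word for text in texts for word in text)
texts = [[word for word in text if counts[word] > 1] for text in texts]

dictionary = corpora.Dictionary(texts)                  # maps each word to a vector position
corpus = [dictionary.doc2bow(text) for text in texts]   # one sparse vector per document
```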
(Van Lindberg) Kernel in the toolbar. Restart. [applause] Thank you. And then we'll Run All Above. And let's see if we can. Do I have a dead kernel again? [audience member inaudible] Okay. So while we're waiting for this, I'm going to tell you about what it's doing.

So, we have this "bag of words" model, and frankly it's not that useful except as a starting place. But the thing that we can start to do is we can start to talk about extracting information by doing math on this vector. And the first thing that we're going to do is we're going to look at something called tf-idf, which stands for "term frequency-inverse document frequency." What this does is it looks at the total frequency of each feature, in this case each word, and it scales that by the number of times that that particular feature appears across all documents. If you think about it, this actually becomes a pretty intuitive way of identifying things that are important in a particular document. For example, the word "and" is going to appear all over the place, but because it appears in lots of documents -- it has a high document frequency -- it ends up not being very important. Whereas if you have something like "docker," well, "docker" is going to show up quite a bit. Assuming that... You know what? I actually have all the text. So let's see if we can... No, not that one. Okay. This is not working. So, what we're going to do is... Man! Okay.

So, after we apply term frequency-inverse document frequency, what we have is a series of floats in each of those spaces that talk about the relative importance of each of those features in each document. And after -- and once we have that, we have what is called a sparse vector. Most of those entries are going to be either zeros or very, very low probabilities. Just think about if you had a long, long matrix that had a 1 for every single word that was in a document. Now, a document may have a lot of words, but if the length of the vector is the number of all words in English, or all the words in all the documents that you have, most of the time that entry is going to be 0. Those words are not going to exist at all. And so as a result, what happens is that you have very, very high-dimensional spaces. Typical models will use anywhere from 300 to 500 dimensions as sort of a golden range, and can go up to thousands or even millions of different features in their vector space. This is simply too many for us to meaningfully reason about. So one of the things that we do is we take a look at ways in which we can do meaningful dimensionality reduction, which means that we say, "What are the most important pieces of this vector? What are the things that most correctly characterize it?" And we use those, because that ends up being a much more compact representation of what is the essence of this particular document.

So, how many of you have heard of principal component analysis? Good. The first thing that came out -- this was either the late '90s or the 2000s -- was the use of LSI, or latent semantic indexing, in order to identify what a document is about, what other things are about, by doing principal component analysis across the different vectors associated with a particular document. Turns out that they were able to find that certain things were very closely aligned in various ways, and other things were not nearly -- were less closely aligned. And as a result... And as a result, they...
As a result, they were able to say, well, if you have a word that looks like -- if you have a word "Queen Elizabeth" or a term "Queen Elizabeth," that tends to be highly associated with ships, with England, with the Crown, with things like that. And because those things all tended to cluster together, you were able to not only be able to have them together but also to be able to... You know what? We are out of time.

(unidentified speaker) I was just going to say, do you want a quick hand fixing that up?

(Van Lindberg) I talked for too long.

(unidentified speaker) Do you want a quick hand fixing that up?

(Van Lindberg) Sure, please fix that up. In the meantime... In the meantime, I'll tell you, basically there are different ways in which you could -- there was latent semantic indexing, and then you had something called latent Dirichlet allocation -- I'm not even sure I'm saying that right -- LDA, in which case it allocates a series of topics across a particular document. The most interesting things that have happened recently are the creation of two things called word2vec and lda2vec. And what this is is the evaluation of a conditional probability of a word, given the other words around it. And the interesting thing about this is that it operates in the vector space, similar -- and allows you to do math on words. For example, if you say "king, minus man, plus woman," what do you get? Queen. And if you think about your vectors, when you add them in different ways, it turns out that with word2vec you end up with the same thing. Lda2vec does the same thing for topics.

And with that, I hope you haven't been too disappointed. But thank you for listening, and I'll take any questions as we go down. Thank you. [applause]
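[For reference, the tf-idf weighting and LSI dimensionality reduction walked through above could be sketched in gensim roughly as follows; num_topics=300 reflects the "300 to 500" range mentioned in the talk rather than a value from the actual notebook, and dictionary/corpus are the objects from the earlier bag-of-words sketch.]

```python
# Sketch of tf-idf weighting followed by LSI (latent semantic indexing) in gensim.
# "dictionary" and "corpus" come from the bag-of-words sketch above.
from gensim import models

tfidf = models.TfidfModel(corpus)        # learns document frequencies from the corpus
corpus_tfidf = tfidf[corpus]             # re-weights each bag-of-words vector

# Reduce the high-dimensional tf-idf space to 300 dimensions (an assumed value
# within the 300-500 range mentioned in the talk).
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf]           # each document as a dense topic vector
print(lsi.show_topics(num_topics=3, num_words=5))
```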
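[And the word2vec analogy at the end ("king minus man plus woman is queen") can be reproduced with gensim's KeyedVectors, assuming a local copy of some pretrained vectors; the file name below is a placeholder, not something used in the talk.]

```python
# The "king - man + woman = queen" analogy with pretrained word2vec vectors.
# The file path is a placeholder; any vectors in word2vec format will do.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Well-trained vectors are expected to rank "queen" first.
```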
Info
Channel: PyCon 2016
Views: 12,352
Id: -K-XtxSyyvU
Length: 27min 35sec (1655 seconds)
Published: Mon Jun 20 2016