Understanding Word2Vec

Captions
We're going to continue our discussion of how we can represent words as vectors, and today we're going to focus explicitly on one algorithm for creating distributed representations: word2vec. This is often connected with deep learning and neural networks, because deep learning needs these kinds of representations to make the generalizations that have proven so effective and useful. But word2vec is not actually a deep learning algorithm: there isn't really a hidden layer to speak of; it is just learning the representations. So what we'll be doing in this video is taking a look inside the black box, understanding why it produces the kinds of outputs that it does and why those outputs are useful for natural language processing tasks, specifically when you're using techniques like deep learning.

We're going to focus on a specific technique called word2vec. Word2vec typically refers to a particular piece of code written by Mikolov et al., not necessarily an abstract algorithm; while you can create your own implementations, people usually mean that one piece of code when they talk about word2vec. At a high level, the way word2vec works is that you feed in a bunch of text, for example from Wikipedia, you let it run on a machine for a couple of hours, and then you get out some vectors that represent the words. So how does word2vec create vectors that encode the meanings of words and are so useful and effective? The sorts of things you get out really do seem to capture meaning: "dog" is most similar to things like cat, dogs, dachshund, rabbit, puppy, poodle, mixed-breed, doberman, pig, and so on, and if you look through other words, you also see that the nearest neighbors in this vector space do seem to encode the meanings of words. Going back to what we were talking about before, this really does seem to be capturing the company that a word keeps; it is a way of learning the distributional semantics of words. And we didn't need any dictionaries or extensive knowledge bases: all of this happens just from raw text.

I just showed you a word and its neighbors in this vector space, but what does it mean for a word to be similar? Here we're going to use the cosine similarity, or the dot product, to measure similarity, just as we did for tf-idf vectors. You have a word representation in this vector space, and if everything has length one you can take the dot product to see how similar two words are; if the vectors haven't been normalized, you take the cosine to find the angle between them. Since the dot product is much faster, a good rule of thumb is to normalize the vectors before you load them into memory, so you only have to do the normalization once; otherwise, every time you compute the cosine you have to implicitly normalize the vectors. Say you wanted to compute the nearest neighbors of the word "dog": you take the matrix holding the representations of all the words in your vocabulary and the single vector representation of "dog", multiply them together, and the result is the similarity of "dog" to every other word. You look at that result, find the entries with the highest values, and that becomes your answer to the question "what are the most similar words to dog?" Because this is a single matrix computation, GPUs can do it very quickly, and there are fast libraries for computing it. This brings us to another reason these representations have become so popular: unlike symbolic representations, which are a little more expensive to work with, you can put these computations on a GPU and they happen very quickly, and when you have lots of things going on in a complicated NLP system this saves you quite a bit of time. Even for a very large vocabulary this happens on the order of milliseconds, so fast that a human wouldn't even notice. And this isn't just something that Google can do: you can do it on your own laptop. You load in the matrices, look up the vector that corresponds to "dog", take the dot product of your big word matrix with that vector, and find which words are most similar.
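As a minimal sketch of the nearest-neighbor computation just described, here is some numpy code. It assumes an already-loaded embedding matrix `W` (one row per vocabulary word) and a matching word list `vocab`; those names are placeholders, not part of any particular library.

```python
import numpy as np

# Hypothetical inputs: W is a (vocab_size x d) embedding matrix,
# vocab is the list of corresponding word strings.

def normalize_rows(W):
    """L2-normalize each row once, so dot products equal cosine similarities."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)

def nearest_neighbors(W_unit, vocab, query, k=10):
    """Return the k words whose vectors have the highest cosine similarity to `query`."""
    idx = vocab.index(query)
    sims = W_unit @ W_unit[idx]          # one matrix-vector product gives all similarities
    best = np.argsort(-sims)[: k + 1]    # +1 because the query itself ranks first
    return [(vocab[i], float(sims[i])) for i in best if i != idx][:k]

# Example usage, with W and vocab loaded from wherever your vectors live:
# W_unit = normalize_rows(W)
# print(nearest_neighbors(W_unit, vocab, "dog"))
```

Normalizing once up front is exactly the rule of thumb from the lecture: after that, similarity queries are a single matrix-vector product.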
Now suppose you wanted to find the similarities to a whole set of words, for example all the words that appear in a document. You could compute the similarities N times, once per document word, and then sum the results. But you can use a little factorization to get a more efficient answer: first sum the vectors of the words in the document, and then take the dot product of that sum with the word representations. That tells you which words are most similar to the overall document.
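Here is a short numpy sketch of that factorization trick, continuing with the hypothetical `W_unit` and `vocab` names from the previous snippet.

```python
import numpy as np

def document_similarities(W_unit, vocab, doc_words):
    """Similarity of every vocabulary word to a whole document.

    Instead of computing W_unit @ w for each document word and summing the
    results (one matrix-vector product per token), sum the token vectors first
    and do a single product -- the two are equal because matrix multiplication
    distributes over addition.
    """
    indices = [vocab.index(w) for w in doc_words if w in vocab]
    doc_vec = W_unit[indices].sum(axis=0)   # sum the document's word vectors once
    return W_unit @ doc_vec                 # one product instead of len(doc_words)
```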
These distributed representations are very efficient, but where do they come from? What we're going to talk about now is one specific word2vec algorithm that can produce these vectors. When most people talk about word2vec they talk about it generically, but if you actually look at the code I mentioned before, there are multiple ways of doing word2vec. One big decision point is what you are predicting, and here there are two flavors: the continuous bag of words model and the skip-gram model. In the continuous bag of words model you take your context, sum the context words together, and try to predict the word in the middle of that context; the word in the middle is often called the focus word. So you take a context, all the words around the focus word, and you try to predict the focus word from that context. The other way of doing it is to take your focus word and try to predict each of your context words one by one: here you have K different predictions, whereas the continuous bag of words model only had one.

To me, at least, the continuous bag of words model is a little more intuitive. It's like a fill-in-the-blank puzzle from a high-school test: "The man went to the bank to take out a ___ on his house." What fills in the blank: is it mortgage, marinara, matrimony, or muscle? You select "mortgage". That's the sort of thing word2vec should be doing; when you do those tests in high school, it shows you know what a word means, and when word2vec does them, it shows that it knows what a word means. That's what the continuous bag of words model does. The skip-gram model does something a little different: it goes the other direction. Okay, here's the word "mortgage"; what words appear in a sentence with "mortgage"? You need to predict all of them, so you might think: maybe "bank", maybe "house", maybe "financing", "interest", "rate", "points", "Fannie Mae", "Freddie Mac", and you list off all the words you can think of, giving higher weight to the words that appear more often in context with "mortgage". That's what the skip-gram model is doing. Even though the skip-gram model is a little less intuitive than the continuous bag of words model, it tends to work a little better for rarer words, and that's typically what you care about: frequent words are well modeled by the continuous bag of words model, but frequent words get all the love anyway, and you don't need to devote so much attention to them. The skip-gram model helps you understand rarer words a little better, so it's more commonly used, and as a result we're going to focus on the skip-gram model.

The other major choice you have to make in deciding which word2vec model you're using is how you're going to predict your words and how you're going to optimize that prediction when learning the parameters of the model. One option is called hierarchical softmax: you need to make a prediction over a very large vocabulary, so you build a binary tree over all the words and predict a path through it. We're not going to talk about that; instead we're going to focus on the more intuitive option, negative sampling. Even though we're focusing on negative sampling and the skip-gram model, once you understand the intuitions behind these two variants the others follow pretty easily, and you can actually read the paper and understand what's going on.

Okay, so how does word2vec work? In word2vec you have two matrices: a word matrix and a context matrix, and you're going to learn representations for each word in both of them. Every word in your vocabulary has a row in these really tall, skinny matrices, and the matrices are D wide, where D is the dimensionality of the representation we're going to learn; typically this is on the order of 300 or so. You initialize both matrices randomly and then start your learning process. Going back to what I said before, we're going to use the skip-gram model of prediction: you're trying to predict each context word from the focus word in your sentence. Both of these words have entries in the word matrix and the context matrix, and we're going to model the probability of a context word given the focus word using the dot product between the context word's vector and the focus word's vector. At this point this should remind you a lot of logistic regression: you basically have the features corresponding to one vector and the evidence corresponding to the other, and we can do something similar to optimize it. Our total objective function in log space looks something like this: we sum over all of our words and contexts and try to optimize the vector representations to give the highest probability to this expression. So that's the math.
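The slide with the formula isn't reproduced in the captions; a hedged reconstruction of the skip-gram objective with negative sampling, in the notation of the Levy and Goldberg analysis mentioned later in the video (D is the set of observed focus/context pairs, D' the sampled corrupted pairs, w a row of the word matrix W, c a row of the context matrix C), is:

$$\log \mathcal{L} \;=\; \sum_{(w,c)\,\in\, D} \log \sigma(\vec{c} \cdot \vec{w}) \;+\; \sum_{(w,c')\,\in\, D'} \log \sigma(-\vec{c'} \cdot \vec{w}), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

Maximizing the first sum pushes the dot products of observed pairs up; the second sum, over corrupted pairs like the "comet" example described next, pushes the dot products of unrelated pairs down.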
Let's now dig into the intuition a little more with an example. First we find some window; this gives us both a context and a focus word. In this case our focus word is "heifer" and our context is "a cow or close to calving". There are six context words, and we want to predict those context words from the focus word. We know what the true context actually is, so we have in essence six logistic regressions with true observations: one is "a", one is "cow", one is "or", one is "close", one is "to", one is "calving". These are the six predictions our model should make correctly, so we want these values to be high once you take the dot product and pass it through the sigmoid function.

But what does it mean to have a negative example in the case of word2vec? Here we corrupt the example by choosing a different focus word: say the focus word is "comet" instead of "heifer". We select that word and we want each of those values to be low. These are the negative examples for something that looks like logistic regression, and the true word, the original uncorrupted word, gives the positive examples: we want the probability of the negatives to be low and the probability of the positives to be high. Just like logistic regression, you can use stochastic gradient descent to optimize over these examples, drawing the context word and the corrupted negative sample from your data distribution.

One important detail of word2vec is that you sample your corrupted words from a very special distribution. Typically what's done in word2vec is that you take your frequency distribution over words, raise it to a power, in this case the 3/4 power is the one most commonly used, and then renormalize. So what does raising the distribution to the 3/4 power and normalizing actually do? If you compare the original distribution with the reweighted one (the y-axis has the same scale in both plots), you can see that the most frequent terms are brought down and the infrequent terms are brought up a little bit. As a result you're focusing on words that are not super frequent: you're not going to replace every focus word with "the"; you're going to replace the focus word with things from the middle range of your distribution, and you can explore more of the long tail. Remember, at the very beginning we talked about how most languages have a Zipfian distribution over words; this reweighting makes the tail a little bit fatter and a little bit longer.
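A small sketch of that reweighting, assuming only a hypothetical `counts` dict mapping each vocabulary word to its raw corpus frequency:

```python
import numpy as np

def negative_sampling_distribution(counts, power=0.75):
    """Raise the unigram distribution to the 3/4 power and renormalize."""
    words = list(counts)
    freqs = np.array([counts[w] for w in words], dtype=np.float64)
    probs = freqs ** power
    probs /= probs.sum()                  # renormalize so it is a valid distribution
    return words, probs

def sample_negatives(words, probs, k=5, rng=None):
    """Draw k corrupted words to use as negative examples."""
    rng = rng or np.random.default_rng()
    return list(rng.choice(words, size=k, p=probs))

# Illustrative toy counts: the 3/4 power damps very frequent words like "the"
# and boosts the tail, so negatives come more often from the middle of the distribution.
# words, probs = negative_sampling_distribution({"the": 1000, "mortgage": 30, "heifer": 2})
# print(sample_negatives(words, probs))
```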
The end result is that the dot product between good context-word pairs is high and is low for pairs that never appear together, so words with similar meanings are brought close together. At the end of the day you just take the W matrix and throw away C, because W is where the useful information lives: the information in C sits in a different vector space, and in any case it is redundant; W has all of the information. So we take that matrix, and that's the one you'll use in downstream applications.

But imagine that we didn't throw C away. Say we had both matrices and took the product of the W matrix and the transpose of the C matrix. That product recreates a word-by-context matrix corresponding to how often words appear together: each cell is implicitly the association between a word and its context. Now suppose we added one additional square matrix inside this matrix product; what would that be? You may want to pause the video and think about this for a second. If you add one additional square matrix in between, this is equivalent to a singular value decomposition (SVD) over the co-occurrence matrix of your vocabulary. There's a paper by Levy and Goldberg that shows the relationship between the well-known SVD and word2vec: in fact, word2vec is recreating something that looks a lot like pointwise mutual information. There's a follow-on paper by Levy, Goldberg, and Dagan showing that you can get SVD to work about as well as word2vec with a couple of tricks, i.e. using ideas like negative sampling in the context of SVD to transform your probability assumptions.

So if that's the case, why are we using word2vec? SVD is really expensive computationally: you have to do giant matrix operations, it's inefficient, and you need big machines. Word2vec, on the other hand, can run on your laptop; you can build, train, and test models all on a single machine with multiple threads, and it scales easily to very large vocabularies on relatively modest hardware. This is why word2vec is still popular, still often used, and why most work on word representations builds on top of things like word2vec rather than things like SVD. Hopefully we've given you an intuition for what it means to compute distributed representations of words. Throughout the rest of this class, and the rest of your interaction with natural language processing, you'll see how useful these representations are and how well they encode word meaning. We'll also see some of the limitations of these representations and talk about ways to improve them later in the class.
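As a concrete illustration of the "runs on your laptop" point above, here is a sketch of training and querying a skip-gram model with negative sampling using the gensim library (4.x API); the file name "corpus.txt" is a placeholder for any file with one whitespace-tokenized sentence per line.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")   # placeholder corpus: one sentence per line
model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality D of the learned vectors
    sg=1,              # 1 = skip-gram (0 = continuous bag of words)
    negative=5,        # number of negative samples per positive pair
    window=5,          # context window size
    min_count=5,       # ignore words rarer than this
    workers=4,         # multithreaded training on a single machine
)

# Nearest neighbors by cosine similarity, as in the "dog" example earlier.
print(model.wv.most_similar("dog", topn=10))
```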
Info
Channel: Jordan Boyd-Graber
Views: 51,664
Rating: 4.8594027 out of 5
Id: QyrUentbkvw
Length: 17min 51sec (1071 seconds)
Published: Sun Feb 17 2019