Lecture 3 | GloVe: Global Vectors for Word Representation

[MUSIC] Stanford University. >> All right! Hello, everybody. Welcome to lecture three. I'm Richard, and today we'll talk a little bit more about word vectors. But before that, let's do three little organizational items. First, we'll have our first coding session this week. Next, problem set one has a bunch of programming for you, as the first and only one where you will do everything from scratch. So do get started early on it. The coding session is mostly to help you chat with other people, go through small bugs, and make sure you have everything set up properly, your environments and everything, so you can get into the exciting deep learning parts right away. Then there's the career fair, the computer science forum. They're excited to help you find companies to work at and talk about your career. And then my first project advice office hours are today. I'll just grab a quick dinner after this and then I'll be back here in the Huang basement to chat, mostly about projects. We encourage you to think about your projects early, so we'll start that today. I'm very excited to chat with you, and if you wanna just bounce off ideas in the beginning, that will be great. Any questions around organization? Yes. I think just outside, yeah, you can't miss it, right here in front of the class. Any other organizational questions? Yeah. He will hold office hours too. And we have a calendar on the website, and you can find all our office hours on the calendar. Okay, we'll fix that. We'll add the names of who's doing the office hours, especially for Chris's and mine. All right, great. So we'll finish word2vec. But then where it gets really interesting is, we'll actually ask what word2vec really captures. We have these objective functions we're optimizing, and we'll take a bit of a look and analyze what's going on there. And then we'll try to actually capture the essence of word2vec a little more effectively. And then we'll also look at our first analysis of intrinsic and extrinsic evaluations for word vectors. So it'll be really exciting. By the end, you'll actually have a good sense of how to evaluate word vectors, and you'll have at least two methods under your belt for how to train them. So let's do a quick review of word2vec. We ended with the following equation, where we wanted to basically predict the outside words from the center word, so let's just recap really quickly what that meant. Let's say I have the beginning of a corpus, and it says something like: I like deep learning and NLP. Now, what we wanna do is compute a probability. Let's say we start with these word vectors, and this is our first center word, and that's deep. So we wanna first compute the probability of the first outside word, I, given the word deep, and that was the exponential of u_o transpose v_c, where the u vector is the outside word, in our case I, transposed with the v vector for deep. And then we had this big sum in the denominator, and the sum is always the same for a given v_c, the center word. Now, how do we get this v and this u? We basically have a large matrix with all the different word vectors for all the different words. So it starts with the vector for aardvark, and a, and so on, all the way to maybe the vector for zebra. And we have basically all our center word vectors v in here. And then we have one more large matrix, where we have again all the vectors, starting with aardvark and a, and so on, all the way to zebra.
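To make the probability just described concrete, here is a minimal NumPy sketch with small random vectors standing in for the center-word matrix V and the outside-word matrix U; the toy vocabulary, dimensionality, and variable names are illustrative, not the course's actual code:

```python
import numpy as np

# Toy corpus vocabulary; vectors are random stand-ins for the V (center)
# and U (outside) matrices described above.
vocab = ["I", "like", "deep", "learning", "and", "NLP"]
word2id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
dim = 10                                        # e.g. 100 in practice
V = rng.uniform(-0.5, 0.5, (len(vocab), dim))   # center-word vectors
U = rng.uniform(-0.5, 0.5, (len(vocab), dim))   # outside-word vectors

def prob_outside_given_center(outside, center):
    """Softmax P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    v_c = V[word2id[center]]
    scores = U @ v_c                    # one inner product per vocabulary word
    scores -= scores.max()              # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[word2id[outside]] / exp_scores.sum()

# First window, center word "deep": score each outside word.
for o in ["I", "like", "learning", "and"]:
    print(f"P({o} | deep) = {prob_outside_given_center(o, 'deep'):.3f}")
```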
And when we start our first window through this corpus, we basically take that vector for deep, this vector v, plug it in, and then we wanna maximize this probability. And now we'll take the u vectors for all these different outside words: I, like, learning, and and. So the next one would be the probability of like given deep, and that'll be the exponential of u_like transpose v_deep. And again, we have to divide by this pretty large sum over the entire vocabulary. So it's essentially little classification problems all over. So that's the first window of this corpus. Now, when we move to the next window, we basically move one over, and now the center word is learning, and we wanna predict these outside words. So for this second window here, this was the first window, this the second, we'll now take the vector v for learning and the u vectors for like, deep, the word and, and NLP. So that was the skip-gram model that we talked about in the last lecture, just explained again with the same notation. But basically, you take one window at a time, you move that window, and you keep trying to predict the outside words next to the center word. Are there any questions around this? Cuz we'll move on, yep? That's a good question, so how do you actually develop that? You start with all the numbers, all these vectors, just random. Little small random numbers, often sampled uniformly between two small numbers. And then you take the derivatives with respect to these vectors in order to increase these probabilities. And you essentially take the gradient here, of each of these windows, with SGD. And so when you take the derivatives that we went through at length last lecture, with respect to all these different vectors here, you get this very, very large sparse update. Cuz all your parameters are essentially all the word vectors, and basically these two matrices with all these different column vectors. And so let's say you have 100-dimensional vectors, and you have a vocabulary of, let's say, 20,000 words. So that's a lot of different numbers that you have to optimize. And so these updates are very, very large. But they're also very sparse, cuz in each window you usually only see five words if your window size is two. Yeah? >> [INAUDIBLE] >> That's a good question. We'll get to that once we look at the evaluation of these word vectors. This cost function is not convex. It doesn't matter, sorry, I should repeat all the questions, sorry for the people on the video. So the first question was, how do we choose the dimensionality? We'll get to that very soon. And then this question here was, how do we start, and how much does it matter? It turns out most of the objective functions, pretty much all of them in this lecture, are not convex, and so initialization does matter. And we'll go through tips and tricks on how to circumvent getting stuck in very bad local optima. But it turns out in practice, as long as you initialize with small random numbers, especially for these word vectors, it does not tend to be a problem. All right, so we basically run SGD, this is just a recap of last lecture. We run SGD, and we update our cost function here at each window as we move through the corpus, right? And so when you think about these updates and you think about implementing that, which you will do very soon for problem set one, you'll realize, well, if I have this entire matrix, this entire vector here, sorry.
This vector of all these different numbers and I explicitly actually keep around these zeros, you have very, very large updates, and you'll run out of memory very quickly. And so what instead you wanna do is either have very sparse matrix operations where you update only specific columns. For this second window, you only have to update the outside vectors for like, deep and NLP and inside vector for learning. Or you could also implement this as essentially a hash where you have keys and values. And the values are the vectors, and the keys are the word strings. All right, now, when I told you this is the skip-gram model, I actually kind of lied a little bit to teach it to you one step at a time. It turns out when you do this computation here, the upper part is pretty simple, right? This is just the hundred-dimensional vector, and you multiply that with another hundred-dimensional vector. So that's pretty fast. But at each window, and again you go through an entire corpus, right? You do this one step at a time, one word at a time. And for each window, you do this computation. And you do also this gigantic sum. And this sum goes over the entire vocabulary. Again, potentially 20,000 maybe even a million different words in your whole corpus. All right, so each window, you have to make 20,000 times this inner product down here. And that's not very efficient. And it turns out, you also don't teach the model that much. At each window you say, deep learning, or learning does not co-occur with zebra. It does not co-occur of aardvark. It does not co-occur with 20,000 other words. And it's kind of repetitive, right? Cuz most words don't actually appear with most other words, it's pretty sparse. And so the main idea behind skip-gram is a very neat trick, which is we'll just train a couple of binary logistic regressions for the true pairs. So we keep this idea of wanting to optimize and maximize this inner product of the center word and the outside words. But instead of going through all, we'll actually just take a couple of random words and say, how about these random words from the rest of the corpus don't co-occur. And this leads us to the original objective function of the skip-gram model, which sort of as a software package is often called Word2vec. And the original paper title was Distributed Representations of Words and Phrases, and their compositionality. And so the overall objective function is as follows. Let's walk through this slowly together. Basically, you go again through each window. So T here corresponds to each window as you go through the corpus, and then we have two terms here. The first one is essentially just a log probability of these two center words and outside words co-occurring. And so the sigmoid here is a simple element wise function. We'll become very good friends. We'll use the sigmoid function a lot. You'll have to really be able to take derivatives of it and so on. But essentially what it does, it just takes any real number and squashes it to be between zero and one. And that's for you learning people, good enough to call it a probability. If you're reading statistics, you wanna have proper measures and so on, so it's not quite that much, but it's a number between zero and one. We'll call it a probability. And then we basically can call this here a term that we basically wanna maximize the log probability of these two words co-occurring. Any questions about the first term? This is very similar to before, but then we have the second term here. 
And the original description was this expected value here. But really, we can have some clear notation that essentially just shows that we're going to randomly sub sample a couple of the words from the corpus. And for each of these, we will essentially try to minimize their probability of co-occurring. And so one good exercise is actually for you in preparation for midterms. And what not to prove to yourself that one of sigmoid of minus x is the same as one minus sigmoid of x. That is a nice little quick proof to get into the zone. And so basically this is one minus the probability of this. So we'd subsample a couple of random words from our corpus instead of going through all the different ones saying an aardvark doesn't appear. Zebra doesn't appear with learning and so on. We'll just sample five, or ten, or so, and then we minimize their probabilities. And so usually, we take and this is again a hyperparameter, one that will have to evaluate how much it matters. I will take k negative samples for the second part here of the objective functions for each window. And then we minimize the probability that these random words appear around the center word. And then the way we sample them is actually from a simple uniform or unigram distribution here. We basically look at how often do the words generally appear, and then we sample them based on that. But we also take the power of three-fourth. It's kind of a hacky term. If you play around with this model for long enough, you say, well, maybe it should more often sample some of these rare words cuz otherwise, it would very, very often sample THE and A and other stop words. And would probably never, ever sample aardvark and zebra in our corpus, so you take this to the power of three-fourth. And you don't have to implement this function, we'll just give it to you cuz you kind of have to compute the statistics of how often each word appears in the corpus. But we'll give this to you in the problem set. All right, so any questions around the skip-gram model? Yeah? That's right, so the question is, is it a choice of how to define p of w? And it is a choice, you could do a lot of different things there. But it turns out a very simple thing, like just taking the unigram distribution. How often does this word appear works well enough. So people haven't really explored more complex versions than that. That's a good question. Should we make sure that the random samples here aren't the same as exactly this word? Yes, but it turns out that the probability for a very large corpora is so tiny that the very, very few times that ever happens is kind of irrelevant. Cuz we randomly sub-sample so much that it doesn't change. Orders of magnitude for which part? K, it's ten. It's relatively small, and it's an interesting trade-off that you'll observe in actually several deep learning models. Often, As you go through the corpus, you could do an update after each window, but you could also say let's go through five windows collect the updates and then make a really, a step in your... Mini batch of your stochastic gradient descent and we'll go through a lot these kind of options later in the class. All right, last question on skip gram What does Jt(theta) represent? It's a good question. So theta is often a parameter that we use for all the variables in our model. So in our case here for the skip-gram model, it's essentially all the U vectors and all the V vectors. 
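Here is a minimal sketch of the negative-sampling term just described, for a single center/outside pair; theta is exactly the U and V arrays below, and the toy counts, the choice of k, and the 3/4-power sampling distribution follow the description above. This is an illustration, not the word2vec package's code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "like", "deep", "learning", "and", "NLP", "zebra", "aardvark"]
word2id = {w: i for i, w in enumerate(vocab)}
dim = 10
U = rng.uniform(-0.5, 0.5, (len(vocab), dim))   # outside vectors
V = rng.uniform(-0.5, 0.5, (len(vocab), dim))   # center vectors

# Toy unigram counts; real code would count them once over the whole corpus.
counts = np.array([50, 40, 5, 5, 60, 3, 1, 1], dtype=float)
p_neg = counts ** 0.75
p_neg /= p_neg.sum()            # unigram^(3/4) sampling distribution

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(center, outside, k=5):
    """log sigma(u_o . v_c) plus, for k random negatives,
    log sigma(-u_neg . v_c) -- the quantity we want to maximize."""
    v_c, u_o = V[word2id[center]], U[word2id[outside]]
    positive = np.log(sigmoid(u_o @ v_c))
    negs = rng.choice(len(vocab), size=k, p=p_neg)
    negative = np.log(sigmoid(-U[negs] @ v_c)).sum()
    return positive + negative

print(neg_sampling_objective("deep", "learning"))
```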
Later on, when we use theta, it might have other parameters of the neural network, layers and so on. And J is just our cost function, and t is the t-th time step or the t-th window as we go through our corpus. So in the end, our overall objective function that we actually optimize is the sum of all of them. But again, we don't wanna do one large update over the entire corpus, right? We don't wanna go through all the windows, collect all the updates, and then make one gigantic step, cuz that usually doesn't work very well. So, good question. I think last lecture we talked a lot about minimization. Here we have these log probabilities, and in the paper you wanna maximize that. And it's often very intuitive, right? Once you have probabilities, you usually wanna maximize the probability of the actual thing that you see in your corpus happening. And then other times, when we call it a cost function, we wanna minimize the cost, and so on. All right, so in word2vec there's another model, which you won't have to implement unless you want to get bonus points. But we will ask you to take derivatives of it, and so it's good to understand it at least at a very simple conceptual level. And it's very similar to the skip-gram model. Basically, we want to predict the center word from the sum of the surrounding words. So very simply here, we sum up the vector of and, of NLP, of deep, and of like, and we have the sum of these vectors. And then we have an inner product with just the vector of the inside word. And basically, that's called the continuous bag of words model. You'll learn all about the details and the definition of that in the problem set. So what actually happens when we train these word vectors, right? We optimize this objective function and we take gradients, and after a while something kind of magical happens to these word vectors. And that is that they actually start to cluster around similar kinds of meaning, and sometimes also similar kinds of syntactic functions. So when we zoom in, and again, usually these vectors are 25 to even 500 or a thousand dimensional, this is just a PCA visualization of these vectors. And what we'll observe is that Tuesday and Thursday and weekdays cluster together, number terms cluster together, first names cluster together, and so on. So basically, words that appear in similar contexts turn out to often have similar meaning, as we discussed in the previous lecture. And so they essentially get similar vectors after we train this model for a sufficient number of steps. All right, let's summarize word2vec. Basically, we went through each word in the corpus. We looked at the surrounding words in the window. We predict the surrounding words. Now, what we are essentially doing there is trying to capture the co-occurrence of words. How often does this word co-occur with the other word? And we did that one count at a time. It's like, I see that deep and learning co-occur, I make an update to both of these vectors. And then you go on over the corpus, and you probably will eventually see deep and learning co-occurring again, and you make again a separate update step. When you think about that, it's not very efficient, right? Why don't we just go through the entire corpus once, count how often deep and learning co-occur, how often these two words co-occur, and then make one update step that captures the entire count, instead of one sample at a time? And yes, we can do that, and that is actually a method that came historically before word2vec.
And there are different options of how we can do this. The simplest one or the one that is similar to word2vec at least is that we again use a window around each word and we basically just go through the entire corpus. We don't update anything, we don't do any SGD. We just collect the counts first. And once we have the counts, then we do something to that matrix. And so when we look at just the window of length maybe two, like in this example here, or maybe five, some small window size around each word, what we'll do is we'll capture, not just the semantics, but also some of the syntactic information of each word. Namely, what kind of part of speech tag is it. So verbs are going to be closer to one another. Then the verbs are to nouns, for instance. If, on the other hand, we look at co-occurrence counts that aren't just around the window, but entire document, so I don't just look at each window. But i say, this Word appears with all these other words in this entire Wikipedia article, for instance, or this entire Word document. Then, what you'll capture is actually more topics, and this is often called Latent Semantic Analysis, a big popular model from a while back. And basically what you'll get there is, you'll ignore the part of speech that you ignore any kind of syntactic information and just say, well swimming and boat and water and weather and the sun, they're all kind of appear in this topic together, in this document together. So we won't go into too many details for these cuz they turn out for a lot of other downstream tasks like machine translation or so and we really want to use these windows, but it's good knowledge to have. So let's go over a simple example of what we would do if we had a very small corpus and wanna collect these windows and then compute word vectors from that. So it is technically not cosine cuz we are not normalizing over the length, and technically we are not optimizing inner products of these probabilities and so on. But continue. That's right. So the question is, in all these visualizations here, we kind of look at Euclidean distance. And it's true, we're actually often are going to use inner products kinds of similarities. So yes, in some cases, Euclidean distance works reasonably well still, despite not doing this in fact we'll see one evaluation that is entirely based or partly based on Euclidean distances and partly inner products. So it turns out both work well despite our objective function only having this. And even more surprising there're a lot of things that work quite well on this despite starting with this kind of objective function. We often yeah, so if despite having only this inner product optimizations, we will actually also do often very well in terms of Euclidean distances. Yep. Well, it get's complicated but there are some interesting relationships between the ratios of the co-occurence counts We don't have enough time to dive into the details, but if you are interested in that I will talk about a paper. I mentioned the title of the paper in five or ten slides, that will help you understand that a little better and gain some more intuition, yep. All right, so, window based co-occurrence matrices. So, let's say, we have this corpus here, and that's to find our window length as just 1, for simplicity. Usually, we have more commonly 5 to 10 windows around there. And we assume we have a symmetric window so, we don't care if a word is to the left or to the right of our center word. And we have this corpus. 
So, this is essentially what a window-based co-occurrence matrix would be for this very, very simple corpus. We just look at the word I and then look at which words appear next to I. And so we look at I, we see like twice, so we have the number two here. And we see enjoy once, so we put the count one here. And then we have the word like. And so like co-occurs twice with the word I on its left, and once with deep, and once with NLP. And so essentially we go through all the words in a very large corpus and compute all these counts, super simple. Now, you could say, well, that's a vector already, right? You have a list of numbers here, and that list of numbers now represents that word. And you already kinda capture things like, well, like and enjoy have some overlap, so maybe they're more similar. So you already have a word vector, right? But it's not a very ideal word vector, for a couple of reasons. The first one is, if you have a new word in your vocabulary, that word vector changes. So if you have some downstream machine learning models that now take that vector as input, they always have to change, and there's always some parameter missing. Also, this vector is going to be very high-dimensional. Of course, for this tiny corpus it's small, but generally we'll have tens of thousands of words. So it's a very high-dimensional vector, and you'll have sparsity issues if you try to train a machine learning model on this afterwards, and that makes for much less robust downstream models. And so the solution to that is, let's again use a similar idea to word2vec and just not store all of the co-occurrence counts, every single number, but just store most of the important information in a fixed, small number of dimensions. Similar to word2vec, those will be somewhere around 25 to 1,000 dimensions. And then the question is, okay, how do we now reduce the dimensionality? We have these very large co-occurrence matrices here. In a realistic setting, we'll have 20,000 by 20,000 or even a million by a million, a very large sparse matrix, so how do we reduce the dimensionality? And the answer is, we'll just use very simple SVD. So, who here is familiar with singular value decomposition? All right, good, the majority of people. If you're not, then I strongly suggest you go to the office hours and brush up on your linear algebra. But basically, we'll have here this X hat matrix, which is going to be our best rank-k approximation to our original co-occurrence matrix X. And we'll have basically these three simple matrices: U, with orthonormal columns, which we often also call our left singular vectors; S, the diagonal matrix containing all the singular values, usually from largest to smallest; and our matrix V here, with orthonormal rows. And so in code this is also extremely simple, we can literally implement it in just a few lines. If this is our corpus here and this is our co-occurrence matrix X, then we can simply run SVD with one line of Python code, and we get this matrix U. And now we can take the first two columns of U and plot them, right? And if we do this in the first two dimensions here, we'll actually get a similar kinda visualization to all these other ones I've shown you, right? But this is a few lines of Python code to create that kinda word vector. And now it's kinda reading tea leaves; for none of these dimensions can we really say this dimension is the noun-ness or verb-ness of a word, or something like that.
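The few lines of Python mentioned above might look roughly like this sketch: build the window-1 co-occurrence matrix for the toy corpus, run one line of SVD, and keep the first couple of columns of U as word vectors. Whether to scale by the singular values is a choice, noted in a comment:

```python
import numpy as np

# Toy corpus from the slide, window size 1, symmetric.
corpus = [["I", "like", "deep", "learning", "."],
          ["I", "like", "NLP", "."],
          ["I", "enjoy", "flying", "."]]

vocab = sorted({w for sent in corpus for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}

# Build the window-based co-occurrence matrix X.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):          # one word to the left and right
            if 0 <= j < len(sent):
                X[word2id[w], word2id[sent[j]]] += 1

# One line of SVD, then keep the first k columns of U as word vectors.
U, s, Vt = np.linalg.svd(X)
k = 2
vectors = U[:, :k]                        # some people also scale by s[:k]
for w in ["like", "enjoy", "deep", "NLP"]:
    print(w, vectors[word2id[w]].round(3))
```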
But as you look at these long enough, you'll definitely observe some kinds of patterns. So for instance, I and like are very frequent words in this corpus and they're a little further to the left so, that's one. Like and enjoy are nearest neighbors in this space so that's another observation, they're both verbs, and so on. So, the things that were being liked, flying and deep and other things are closer together and so on. So, such a very simple method you get a first approximation to what word vectors can and should capture. Are there any questions around this SVD method in the co-occurrence matrix? It's a good question, is the window always symmetric? And the answer is no, we can actually evaluate asymmetric windows and symmetric windows, and I'll show you the result of that in a couple of slides. All right, now, once you realize, wow, this is so simple and it works kinda well, and you're a researcher, you always wanna try to improve it a little bit. And so, there are a lot of different hacks that we can make to this co-occurrence matrix. So, instead of taking the raw counts, for instance, as you do this, you realize, well, a lot of representational power in this word vectors is now captured by the fact that the and he and has and a lot of other very, very frequent words co-occur with almost all the nouns. Like the appears in the window of pretty much every noun out there. And it doesn't really give us that much information that it does over and over and over again. And so, one thing we can do is actually just cap it and say, all right, whatever the co-occurs with the most, and a lot of other one of these function words, we'll just maximize the count at 100. Or, I know some people do this also, we just ignore a couple of the most frequent words cuz they really, we have a power law distribution or Zipf's law where basically, the most frequent words appear much, much more frequently than other words and then, it peters out. And then, there's a very long tail of words that don't appear that often but those very rare words often have a lot of semantic content. Then, another way we can change this, the way we compute these counts is by not counting all the words equally. So, we can say, well, words that appear right next to my center word get a count of one. Or words that appear and they're five steps away, five words away only you get a count of 0.5. And so, that's another hack we can do. And then, instead of counts we could compute correlations and set them to 0. You get the idea, you can play a little around with this matrix of co-occurrence counts in a variety of different ways and sometimes they help quite significantly. So, in 2005, so quite a long time ago, people used this SVD method and compared a lot of different ways of hacking the co-occurrence matrix and modifying it. And basically found quite surprising and awesome results. And so, this is another way we can try to visualize this very high dimensional space. Again, these vectors are usually around 100 dimensions or so, so it's hard to visualize it. And so, instead of projecting it down to just 2D, here they just choose a couple of words and look at the nearest neighbours and which word is closest To what other word and they find that wrist and ankle are closest to one another. And next closest word is shoulder. And the next closest one is arm and so on. 
And so different extremities cluster together, we'll see different cities clustering together, and American cities are closer to one another than cities from other countries, and country names are close together, and so on. So it's quite amazing, right? Even with something as simple as SVD around these windows, you capture a lot of different kinds of information. In fact, it even extends to syntactic and grammatical kinds of patterns that are captured by this SVD method. So show, showed, shown, or take, took, taken, and so on are always close together, in often similar kinds of patterns. And it goes further and gets even more semantic: verbs and the nouns that are related to them often appear at roughly similar kinds of Euclidean distances. So swim and swimmer, clean and janitor, drive and driver, teach and teacher, they all basically have a similar kind of vector difference. And intuitively you would think, well, they often have similar kinds of contexts in which they appear. And there's some intuitive sense of why this would happen, as you're trying to capture these co-occurrence counts. Does the language matter? Yes, in what way? Great question. So if it was German instead of English. So it's actually a sad truth of a lot of natural language processing research that the majority of it is in English, and only a few people do this for other languages. It turns out this works for a lot of other languages. But people often don't have as good evaluation metrics for these other languages, and evaluation data sets, which we'll get to in a bit. But we would believe that it works for pretty much all languages. Now, there's a lot of complexity because some languages like Finnish or German have potentially a lot of different words, cuz they have much richer morphology, right? German has compound nouns. And so you get more and more rare words, and the rarer the words are, the less good counts you have for them, and the harder it is to use this method in a vanilla way. Which eventually, in the limit, will get us to character-based natural language processing, which we'll get to in a couple of weeks. But in general, this works for pretty much any language. Great question. So now, what's the problem here? Well, SVD, while being very simple and one nice line of Python code, is actually computationally not always great, especially as we get larger and larger matrices. We essentially have this quadratic cost here in the smaller dimension. So whether it's a word-by-word co-occurrence matrix or even a word-by-document one, you can assume this gets very, very large. And then it also gets hard to incorporate new words or documents into this whole model, cuz you have to rerun this whole PCA, or sorry, the SVD, the singular value decomposition. And then, on top of that, SVD and how we optimize it is quite different from a lot of the other downstream deep learning methods that we'll use, like neural networks and so on. It's a very different kind of optimization. The word2vec objective function, in contrast, looks at one window at a time, you make an update step, and that is very similar to how we optimize most of the other models in this lecture and in deep learning for NLP. And so basically what we came up with, Jeffrey Pennington, who was a post-doc in Chris's group, me, and Chris, is a method that tries to combine the best of both worlds. So let's summarize what the advantages and disadvantages are of these two different kinds of methods.
Basically, we have these count-based methods, based on SVD and the co-occurrence matrix, and we have the window-based or direct prediction methods like the skip-gram model. The advantage of the count-based methods, like this PCA or SVD approach, is that they're relatively fast to train, unless the matrix gets very, very large, and we're making very efficient usage of the statistics that we have, right? We only have to collect the statistics once, and we could in theory throw away the whole corpus. And then we can try a lot of different things on just these co-occurrence counts. Sadly, when you do this, it captures mostly word similarity and not various other patterns that the word2vec model captures, and we'll show you what those are in the evaluation. And we often give disproportionate importance to these large counts, and we can try various ways of lowering the importance that these function words and very frequent words have. The disadvantage of the skip-gram model is that it scales with the corpus size, right? You have to go through every single window, which is not very efficient, and hence you also don't really make very efficient usage of the statistics that you have overall, of the data set. However, we actually get, in many cases, much better performance on downstream tasks. And we don't know yet what those downstream tasks are, that's why we have the whole lecture series for this whole quarter. But for a variety of different problems like named entity recognition or part-of-speech tagging and so on, things that you'll implement in the problem sets, it turns out the skip-gram-like models work slightly better. And we can capture various complex patterns, some of which are very surprising and which we'll get to in the second part of this lecture. And so, basically, what we tried to do here is combine the best of both of these worlds. And the result of that was the GloVe model, our Global Vectors model. So let's walk through this objective function a little bit. Again, theta here will be all our parameters. So in this case, again, we have these u and these v vectors. But they're even more symmetric now; we basically just go through all pairs of words that might ever co-occur. So we go through this very large co-occurrence matrix that we computed in the beginning, which we call P here. And for each pair of words in this entire corpus, we basically want to minimize the distance between the inner product here and the log count of these two words. So again, this is just this kind of matrix here that we're going over. We're going over all elements of this kind of co-occurrence matrix. But instead of running the large SVD, we'll basically just optimize one such count at a time here. So we have the square of this distance, and then we also have this term here, f, which allows us to weight even lower some of these very frequent kinds of co-occurrences. So the, for instance, will have a maximum amount that it can be weighted inside this overall objective function. All right, so what this allows us to do is essentially train very quickly. Cuz instead of saying, all right, we'll optimize that deep and learning co-occur in this one window, and then a couple of windows later they co-occur again and we update again, we just ask how often deep and learning co-occur in this entire corpus. Which could now be all of Wikipedia or, in our case, all of Common Crawl, which is most of the Internet, and that's kind of amazing. It's a gigantic corpus with billions of tokens.
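As a rough sketch of the objective just walked through, here is the weighted least-squares term f(P_ij) * (u_i . v_j - log P_ij)^2 in NumPy. The capped 3/4-power form of f and the omission of the paper's bias terms are simplifications for illustration, not the released GloVe implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "like", "deep", "learning", "NLP"]
dim = 10
U = rng.uniform(-0.5, 0.5, (len(vocab), dim))   # column ("outside") vectors
V = rng.uniform(-0.5, 0.5, (len(vocab), dim))   # row ("inside") vectors

# Toy global co-occurrence counts, collected once over the whole corpus.
X = rng.integers(0, 20, (len(vocab), len(vocab))).astype(float)

def f(x, x_max=100.0, alpha=0.75):
    """Weighting that caps the influence of very frequent co-occurrences."""
    return min((x / x_max) ** alpha, 1.0)

def glove_loss(U, V, X):
    """Sum over nonzero pairs of f(X_ij) * (u_i . v_j - log X_ij)^2
    (the published model also adds bias terms, omitted here)."""
    loss = 0.0
    for i in range(len(vocab)):
        for j in range(len(vocab)):
            if X[i, j] > 0:
                loss += f(X[i, j]) * (U[i] @ V[j] - np.log(X[i, j])) ** 2
    return loss

print(glove_loss(U, V, X))
# After training, the final word vectors are typically taken as the sum U + V.
```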
And we just say, all right, deep and learning in these billions of documents co-occur 536 times or something like that. Probably now a lot more often. And then we'll just optimize this inner product to be close in its value to the log of that overall count. And because of that, it scales to very large corpora, which is great, and it also allows us to capture the semantics of even very rare words, which don't appear very often. And because of the efficient usage of the statistics, it turns out to also work very well on small corpora and even with smaller vector sizes. So now you might be confused, because in the visualizations we keep showing you a single vector per word, but here, again, just like with the skip-gram model, we have two sets of vectors, the outside vectors and the inside vectors. And so let's get rid of that confusion and basically tell you that there are a lot of different options for how you eventually get just a single vector from having these two vectors. You could concatenate them, but it turns out what works best is just to sum them up. They essentially both capture co-occurrence counts, and if we just sum them, that turns out to work best in practice. And so that also destroys some of the intuitions of why certain things should happen, but it turns out in practice this works best, yeah? >> [INAUDIBLE] >> What are u and v again? So u here are again just the vectors of all the words. And so here, just like with skip-gram, we had the inside and the outside vectors. Here, u and v are just the vectors in the columns and the vectors in the rows. They're essentially interchangeable, and because of that, it makes even more sense to sum them up. You could even say, well, why don't you just have one set of vectors? But then you'd have a less well-behaved objective function, because you have the inner product between two of the same sets of parameters. And it turns out, in terms of the optimization, having the separate vectors during optimization and combining them at the very end just was much more stable. That's right. Even for skip-gram, that's the question, is it common also for skip-gram to sum them up? It is. And it's good, whenever you have these choices and they seem a little arbitrary, also for all your projects. The best thing to always do is, well, there are two things. You could just come to me and say, hey, what should I do, X or Y? And the true answer, especially as you get closer to your project and to more research and novel kinds of applications, the best answer is always: try all of them. And then have a real metric, a quantitative measure, of how well all of them do, and then have a nice little table in your final project description that tells you very concretely what it is. And once you do that many times, you'll gain some intuitions, and you'll realize, all right, by the fifth project, that summing them up usually works best, so I'm just going to continue doing that. Especially as you get into the field, it's good to try a lot of these different knobs and hyperparameters. >> [INAUDIBLE] >> That's right, they're all on the same scale here. Really, they are quite interchangeable, especially for the GloVe model. Is that a question? All right, I will try to repeat it. So in theory here you're right. So the question is, does the magnitude of these vectors matter? Good paraphrase? And so you are right, it does. But in the end you will see them basically in very similar contexts, a lot of times.
And so in this log here, they will eventually have to capture the log count, right? So they will have to go to a certain size of what these log counts usually are. And then the model just figures out that they are in the end roughly in the same place. There's nothing in the optimization that pushes some vectors to get really, really large, except of course, the vectors of words that appear very frequently, and that's why we have exactly this term here, to basically cap the importance of the very frequent words. Yes, so the question is, and I'll just phrase it the way it is, which is right. The skip-gram model tries to capture co-occurrences one window at a time. And the Glove model tries to capture the counts of the overall statistics of how often these words appear together, all right. One more question? I think there was one. No? Great. So now we can look at some fun results. And, basically, we found, the nearest neighbors for frog were all these various words. And we're first a little worried, but then we looked them up. And realize, alright, those are actually quite good. So you'll see here even for very rare words, Glove will give you very, very good nearest neighbors in this space. And so next, we will do the evaluation, but before that we'll do a little intermission with Arun. Take it away. >> [SOUND] Cool, so we've been talking about word vectors. I'm gonna take a brief detour to talk about Polysemy. So far we've seen that word vectors encode similarity, we see that similar concepts are even distributed in Euclidean space near each other. And the question I want you to think about is, what do we do about polysemy? Suppose you have a word like tie. All right, tie could mean something like a tie in a game. So maybe it should be near this cluster. Over here. It could be a piece of clothing, so maybe it should be near this cluster, or it could be an action like braid twist, should be near this cluster. Where should it lie? So this paper by Sanjeev Arora and the entire group, they seek to answer this question. And one of the first things they find is that if you have an imaginary you could split up tie into these polysemous vectors. You had tie one every time you talk about this sport event. Tie two every time you talked about the garment of clothing. Then, you can show that the actual tie that is a combination of all of these words lies in the linear superposition of all of these vectors. You might be wondering, how is this vector close to all of them, but that's because we're projecting this into a 2D plane and so it's actually closer to them in other dimensions. Now that we know that this tie lies near or in the plane of the different senses we might be curious to find out, can we actually find out what the different senses of a word are. Suppose we can only see this word tie, could we computationally find out to some core logistics that tie had a meaning about sport clothing etc. So the second thing that they're able to show is that there's an algorithm called sparse coding. That is able to recover these. I don't have time to discuss exactly what sparse coding how the algorithm works but let me describe the model. The model says that every word vector you have is composed as the sum of a small selected number of what are called context vectors. So these context vectors, there are only 2,000 that they found for their entire corpus, are common across every word. But every word like tie is only composed of a small number of these context vectors. 
So, the context vector could be something like sports, etc. There's some noise added in, but that's not very important. And so, if you look at the type of output that you get for something like tie, you see something to do with clothing, with sports. Very interestingly you also see output about music. Some of you might realize that actually makes sense. And now, we might wonder how this is qualitative. Is there a way we can quantitatively evaluate how good the senses we recover are? So it turns out, yes you can, and here's the sort of experimental set-up. So, for every word that was taken from WordNet, a number of about 20 sets of related senses were picked up. So, a bunch of words that represent that sense, like tie, blouse, or pants, or something totally unrelated, like computer, mouse, and keyboard. And so now they asked a bunch of grad students, because they're guinea pigs, to differentiate if they could find out which one of these words correspond to tie. And they also asked the algorithm if it could make that distinction. The interesting thing is that, the performance of this method that I alluded to earlier, is about at the same level as the non-native grad students that they had surveyed. Which I think is interesting. The native speakers do better on the task. So in summary, word vectors can indeed capture polysemy. It turns out these polysemies, the word vectors, are in the linear superposition of the polysemy vectors. You can recover the senses that a polysemous word has wIth sparse coding. And the senses that you recover are almost as good as that of a non-native English speaker. Thank you. >> Awesome, thank you Arun. >> [APPLAUSE] >> All right, so now on to evaluating word vectors. So we've had gone through now a bunch of new machinery. And you say, well, how well does this actually work? I have all these hyperparameters. What's the window size? What's the vector size? And we already came up with these questions. How much does it matter how do we choose them? And these are all the answers now. Well, at least some of them. So, in a very high level, and this will be true for a lot of your projects as well, you can make a high level decision of whether you will have an intrinsic or an extrinsic evaluation of whatever project you're doing. And in the case of word vectors, that is no different. So intrinsic evaluations are usually on some specific or intermediate subtask. So we might, for instance, look at how well do these vector differences or vector similarities and inner products correlate with human judgments of similarity. And we'll go through a couple of these kinds of evaluations in the next couple of slides. The advantage of intrinsic evaluations is that they're going to be very fast to compute. You have your vectors, you run them through this quick similarity correlation study. And you get a number out and you then can claim victory very quickly. And then or you can modify your model and try 50,000 different little knobs and combinations and tune this very quickly. It sometimes helps you really understand very quickly how your system works, what kinds of hyperparameters actually have an impact on this metric of similarity, for instance. However, there's no free lunch here. It's not clear, sometimes, if your intermediate or intrinsic evaluation and improvements actually carry out to be a real improvement in some task real people will care about. And real people is a little tricky definition. 
I guess real people, usually, we'll assume are like normal people who just want to have a machine translation system or a question answering system or something like that. Not necessarily linguists and natural language processing researchers in the field. And so sometimes you actually observe people trying to optimize their intrinsic evaluations a lot, and they spend years of their life optimizing them. And other people later find out, well, it turns out those improvements on your intrinsic task, when I actually applied your better word vectors or whatever to named entity recognition or part-of-speech tagging or machine translation, I don't see an improvement. So then the question is, well, how useful is your intrinsic evaluation task? So as you go down this route, and a lot of you will for your projects, you always wanna make sure you establish some kind of correlation between these. Now, the extrinsic one is basically evaluation on a real task. And that's really where the rubber hits the road, or the proof is in the pudding, or whatever. The problem with that is that it can take a very long time. You have your new word vectors and you're like, I took the Pearson correlation instead of the raw counts of my co-occurrence matrix, I think that's the best thing ever. Now I wanna evaluate whether that word vector really helps for machine translation. And you say, all right, now I'm gonna take my word vectors and plug them into this machine translation system. And that turns out to take a week to train. And then you have to wait a long time, and now you have ten other knobs, and before you know it, the year is over. And you can't really just do that every time you have a tiny little improvement on your first early word vectors, for instance. So that's the problem, it takes a long time. And then people will often make the mistake of tuning a lot of different subsystems and then putting them all together into the full system, the real task, like machine translation. And something overall has improved, but now it's unclear which part actually gave the improvement. Maybe of two parts, one was actually really good and the other one was bad, and they cancel each other out, and so on. So when you use extrinsic evaluations, you basically wanna be very certain that you only change one thing that you came up with, or one aspect of your word vectors, for instance. And if you then get an improvement on your overall downstream task, then you're really in a good place. So let's be more explicit and go through some of these intrinsic word vector evaluations. One that was very popular and came out just very recently with the word2vec paper was these word vector analogies. Where basically they found, which was initially very surprising to a lot of people, that you have amazing kinds of semantic and syntactic analogies that are captured through these cosine distances in these vectors. So for instance, you might ask, man is to woman as king is to what other word? And basically, a simple analogy: man is to woman as king is to queen. That's right. And so it turns out that when you just take the vector of woman, you subtract the vector of man, and you add the vector of king, and then you try to find the vector that has the largest cosine similarity to that result, it turns out the vector of queen is actually the vector with the largest cosine similarity to this term. And so that is quite amazing, and it works for a lot of different kinds of very intuitive patterns. So let's go through a couple of them.
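A minimal sketch of that arithmetic: subtract, add, and take the nearest word by cosine similarity. The vectors below are random stand-ins just so the snippet runs; with trained vectors, the expected answer for man : woman :: king : ? is queen:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vectors):
    """Return the word d maximizing cos(d, b - a + c), excluding the inputs:
    'a is to b as c is to d', e.g. man : woman :: king : ?"""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [(cosine(v, target), w) for w, v in vectors.items()
                  if w not in (a, b, c)]
    return max(candidates)[1]

# Random stand-ins; real use would load trained GloVe or word2vec vectors here.
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}
print(analogy("man", "woman", "king", vectors))
```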
So you'd have similar things like, sir is to madam as man is to woman, or heir to heiress, or king to queen, or emperor to empress, and so on. So they all have a similar kind of relationship that is captured very well by these cosine distances and these simple vector subtractions and additions. It gets even more specific. You have companies and their CEO names, and you can take a company, subtract its CEO, add another company, and you get to the vector of the name of the CEO of that other company. And it works not just for semantic relationships but also for syntactic relationships. So slow, slower, and slowest in these GloVe vectors have very similar kinds of differences to short, shorter, and shortest, or strong, stronger, and strongest. You can have a lot of fun with this, and people did, so here are some even more fun ones, like sushi - Japan + Germany goes to bratwurst, and so on. Which, as a German, I'm mildly offended by. And of course, it's very intuitive in some ways. But it's also questionable. Maybe it should have been [INAUDIBLE] or whatever, other typical German foods. While this is very intuitive for some people, in terms of the actual semantics that are captured here, you might really wonder why this happens. And there is no mathematical proof of why this has to fall out, but intuitively you can kind of make sense of it a little bit. Superlatives, for instance, might appear next to certain words very often, in similar kinds of ways. Maybe most, for instance, appears in front of a lot of superlatives. Or barely might appear in front of certain words like slower or shorter. It's barely shorter than this other person. And since in these vectors you're capturing these co-occurrence counts, as you basically subtract one word's co-occurrences and add another's, intuitively, and it's a little hand-wavy, again this is not a nice mathematical proof, but intuitively you can see how similar kinds of words appeared, you subtract those counts, and hence you arrive at similar kinds of places in the vector space. Now, first you try a couple of these, and you're surprised that this works well. And then you want to make it a little more quantitative. All right, so this was a qualitative subsample of some words where this works incredibly well. It's also true that when you really play around with it for a while, you'll find some things that are like Audi minus German goes to some crazy sushi term or something. It doesn't always make sense, but there are a lot of them where it really is surprisingly intuitive. And so people essentially then came up with a data set to try to see how often this really happens and whether it really works this well. And so they basically collected this word vector analogies task, and these are some examples. You can download all of them at this link here. This is, again, the original word2vec paper that discovered and described these linear relationships. And they basically look at Chicago and Illinois, and Houston and Texas. And you can basically come up with a lot of different analogies where this city appears in that state. Of course, there are some problems, and as you optimize this metric more and more, you will observe that, well, maybe that city name actually appears in multiple different states, or different cities have the same name. And then it kind of depends on the corpus that you're training on whether or not this has been captured or not.
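And a sketch of how accuracy over such an analogy data set could be computed: loop over (a, b, c, d) questions like (chicago, illinois, houston, texas), take the nearest neighbor of b - a + c excluding the question words, and count how often it equals d. The helper and data layout here are assumptions for illustration, not the official evaluation script:

```python
import numpy as np

def analogy_accuracy(questions, M, word2id):
    """questions: list of (a, b, c, d) tuples. M holds one vector per row.
    Returns the fraction of questions where argmax cosine(b - a + c) equals d."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-normalize rows
    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in word2id for w in (a, b, c, d)):
            continue                                   # skip out-of-vocab items
        target = M[word2id[b]] - M[word2id[a]] + M[word2id[c]]
        sims = M @ target
        for w in (a, b, c):                            # exclude the question words
            sims[word2id[w]] = -np.inf
        correct += int(np.argmax(sims) == word2id[d])
        total += 1
    return correct / max(total, 1)

# Tiny smoke test with random vectors; a real evaluation would load trained
# vectors and the several-thousand-question analogy file mentioned above.
rng = np.random.default_rng(0)
words = ["chicago", "illinois", "houston", "texas", "dallas"]
word2id = {w: i for i, w in enumerate(words)}
print(analogy_accuracy([("chicago", "illinois", "houston", "texas")],
                       rng.normal(size=(len(words), 50)), word2id))
```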
But still, a lot of people, it makes a lot of sense for most of them to optimize these at least for a little bit. Here are some other examples of analogies that are in this data set that are being captured, and just like the capital and the world, of course you know as those change if it doesn't change in your corpus that's also problematic. But in many cases the capitals of countries don't change, and so it's quite intuitive and here's some examples of syntactic relationships and analogies that are basically in this data set to evaluate. We have several thousands of these analogies and now, we compute our word vectors, we've tuned some knob, we changed the hyperparameter instead of 25 dimensions, we have 50 dimensions and then we evaluate which one is better for these analogies. And again, here is another syntactic one with past tense kinds of relationships. Dancing to danced should be like going to went. Now, we can basically look at a lot of different methods, and we don't know all of these in the class here, but we know the skip gram SG and the Glove model. And here is the first evaluation that is quantitative and basically looks at the semantic and the syntactic relationships, and then just average, in terms of the total. And just says, how often is exactly this relationship true, for all these different analogies that we have here in the data set. And it turns out that when both of these papers came out in 2013 and 14 basically GloVe was the best at capturing these relationships. And so we observe a couple of interesting things here. One, it turns out sometimes more dimensions don't actually help in capturing these relationships better, so thousand dimensional vectors work worst than 300 dimensional vectors. Another interesting observation and that is something that is somewhat sadly true for pretty much every deep learning model ever is more data will work better. If you train your word vectors on 42 billion tokens, it will work better than on 6 billion tokens. By you know, 4% or so. Here we have the same 300 dimensions. Again, we only want to change one thing to understand whether that one change actually has an impact. And we'll see here a big gap. It's a good question. How come the performance sometimes goes down? It turns out it also depends on what you're training your word vectors on. It turns out, Wikipedia for instance, is really great because Wikipedia has very good descriptions of all these capitals in all the world. But now if you take news, and let's say if you take US news and in US news you might not have Abuja and Ashgabat mentioned very often. Well, then the vectors for those words will also not capture their semantics very well and so you will do worse. And so some not, bigger is not always better it also depends on the quality of the data that you have. And Wikipedia has less misspellings than general Internet texts and so on. And it's actually a very good data set. And so here are some of the evaluations and we have a lot of questions of like how do we choose this hyperparameter the size and so on. This is I think a very good and careful analysis that Geoffrey had done here three years ago on a variety of these different hyperparameters that we've observed and kind of mentioned in passing. And so this is also a great sort of way that you should try to emulate for your projects. Whenever I see plots like this I get a big smile on my face and your grades just like improve right away. >> [LAUGH] >> Unless you make certain mistakes in your plots. 
But let's go through them. Here we look at basically the symmetric context; the asymmetric context is where we only count words that come after the current word and ignore the ones before. But it turns out symmetric usually works better. And so the vector dimension here is a good one to evaluate. It's pretty fundamental: how high-dimensional should these be? And we basically observe that when they're very small, it doesn't work as well in capturing these analogies, but then after around 200 or 300 it kind of peters out and doesn't get much better. In fact, overall it's pretty flat between 300 and 600. And this is good. So the main number we often look at here is the overall accuracy, and that's in red here, and that's flat. So, one mistake you could make when you create such a plot is: you have some hyperparameter and you have some kind of accuracy. This could be the vector size, and you create a nice plot and you say, look, things got better. And then my comment, if I see a plot like this, would be, well, why didn't you go further in this direction? It seems to just be going up and up. So that is not good. You should extend your plots until they actually kind of peter out, and you can say, all right, now I really found the optimal value for this hyperparameter. So another important thing to evaluate here is the window size, and there are sometimes considerations around this. So for the word vector size, for instance, maybe 300 works slightly better than 200. But larger word vectors also mean more RAM, right? Your software now needs to store more data, and you might want to ship it to a cellphone. And now, yes, you might get a 2% improvement on this intrinsic task, but you also have 30% higher RAM requirements. And maybe you say, well, I don't care about that 2% or so improvement in accuracy on this intrinsic task, I'll still choose a smaller word vector. So that's a legit argument, but in general here, we're just trying to optimize this metric, and so we wanna look carefully at what these are. All right, now, window size, again, this is how many words to the left and to the right of each of the center words we wanna predict and compute the counts for. It turns out around eight or so you get the highest accuracy. But again, that also increases the complexity and the training time. The longer the windows are, the more times you have to compute these kinds of expressions. And then for the asymmetric context, it's actually a slightly different window size that works best. All right, any questions around these evaluations? Great. Now, it's actually very hard to compare GloVe and the skip-gram model, cuz they have very different kinds of training regimes. One goes through one window at a time; the other one first computes all the counts and then works on the counts. So this is kind of us trying to do well and answer a reviewer question of what happens when you compare them directly. So what we did here is we looked at the negative samples. So remember, we had that sum in the objective function for the skip-gram model, of how many words we want to push down the probability of cuz they don't appear in that window, and so that is one way to increase training time and in theory do better on that objective. Versus, for GloVe, different numbers of iterations of how often we go over these co-occurrence counts to optimize each pair in the co-occurrence matrix. And in this evaluation, GloVe did better regardless of how many hours you trained both models.
And this is the "more data helps" argument I already made, especially with Wikipedia. Gigaword here is, I think, mostly a news corpus, and news, despite being more data, actually does not work quite as well overall, and especially not for the semantic relationships and analogies; but Common Crawl, which is a very large data set of 42 billion tokens, works best. All right, so these amazing analogies of king minus man plus woman and so on were very exciting. Before that, people often just used human similarity judgements. Basically they asked a bunch of people, often grad students, to rate on a scale of one to ten how similar they think two words are. So for tiger and cat, when you ask three or five humans how similar they are on a scale from one to ten, one might say seven, another eight, another six, and then you average. And you get a similarity score: computer and internet are around seven, but stock and CD are not very similar at all, so a bunch of people will say it's only 1.3 on average. >> [INAUDIBLE] >> And now we can try to train word vectors such that their distances, be it cosine similarity or Euclidean distance, or you can try other distance metrics too, correlate well with these human judgements. Here's one such example: you take the word Sweden, you look at cosine similarity, and you find the words that are closest by, the ones with the largest cosine similarity, and you get Norway and Denmark very close by. And if you have a lot of these kinds of data sets, and this one, WordSim353, has 353 such pairs of words, you can look at how well your vector distances correlate with the human judgements. The higher the correlation, the more intuitive we think the distances in this large vector space are. And again, GloVe does very well here across a whole host of different data sets like WordSim353, and again, the largest training data set did best for GloVe. Any questions on word vector similarities and correlations? No, good, all right. Now, intrinsic evaluations have this huge problem, right? We have these nice similarities, but who knows, maybe that doesn't actually improve the real tasks we care about in the end. And so the best kinds of evaluations, though they are very expensive, are those on real tasks, or at least on subsequent downstream tasks. One such example is named entity recognition. It's a good one because it's relatively simple but actually useful enough: you might want to run a named entity recognition system over a bunch of your corporate emails to understand which person is related to what company, where they live, the locations of different people, and so on. It's a useful system to have. We'll go through the actual models for doing named entity recognition in the next lecture. But as we plug different word vectors into these downstream models, we'll observe that for many of them GloVe vectors again do very, very well on these downstream tasks. All right. Any questions on extrinsic methods? We'll go through the actual model that works here later. That's right. Well, you're not optimizing anything here, you're just evaluating; you're not training anything.
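A minimal sketch of this correlation-based evaluation, assuming SciPy's Spearman rank correlation is used as the correlation measure; the three word pairs below are invented stand-ins for a full data set like WordSim353, and the random placeholder vectors would in practice be replaced by trained GloVe or word2vec vectors.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented stand-in for human similarity judgements on a 1-10 scale;
# a real evaluation would load all 353 pairs of WordSim353.
pairs = [("tiger", "cat", 7.35),
         ("computer", "internet", 7.58),
         ("stock", "cd", 1.31)]

rng = np.random.default_rng(0)
words = {w for a, b, _ in pairs for w in (a, b)}
vecs = {w: rng.standard_normal(50) for w in words}  # placeholder vectors

human = [score for _, _, score in pairs]
model = [cosine(vecs[a], vecs[b]) for a, b, _ in pairs]
rho, _ = spearmanr(human, model)
print(f"Spearman correlation with human judgements: {rho:.2f}")
```

The single number reported per model and data set in these tables is exactly this kind of rank correlation: higher means the vector-space distances order word pairs the way humans do.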
You've trained your word vectors with your objective function from skip-gram, you fix them, and then you just evaluate them. What you're evaluating is, for instance, that Sweden and Norway have a certain distance between them, and you also have a human measure of how similar people think these two words are. You want these human judgements of similarity to correlate well with the cosine distances of the vectors. When they correlate well, you think the vectors are capturing the same kinds of intuitions that people have, and hence they should be good. Intuitively it would also make sense that if Sweden has sensible cosine similarities and you plug the vectors into some downstream system, that system will also get better at capturing named entities: maybe at training time it sees the vector of Sweden and you tell it Sweden is a location, and at test time it sees the vector of Norway and is more likely to correctly identify Norway or Denmark also as a location, because they're close by in the vector space. We'll actually go through examples of how we train word vectors, or rather train downstream tasks, in the next lecture. I think we have until 5:50, so we've got 8 more minutes. So let's look briefly at simple, single-word classification. We talked about these word vectors, and I showed you the difference between starting with very simple co-occurrence counts, giving very sparse, large vectors, versus having small, dense vectors like word2vec. The major benefit is that because similar words cluster together, we'll be more robust in classifying kinds of words that we might not see in the training data set. For instance, because countries cluster together, if our goal is to classify location words, we'll do better if we initialize all these country words to be in a similar part of the vector space. It turns out later we'll actually fine-tune these vectors too. So far we learned an unsupervised objective function, unsupervised in the sense that we don't have human labels assigned to each input; we just took a large corpus of words and learned with these unsupervised objective functions. But there are other tasks where that doesn't work as well. For instance, sentiment analysis turns out not to be a great downstream task for some word vectors, because good and bad might actually appear in similar contexts: I thought this movie was really good, or really bad. So when your downstream task is sentiment analysis, it turns out that maybe you can just initialize your word vectors randomly. This is kind of a bummer after listening to us for many hours on how word vectors should be trained, but fret not: in many cases word vectors are helpful as the first step of your deep learning model, just not always. And again, that is something you can evaluate: can I just initialize my words randomly, or should I initialize them with the word2vec or the GloVe model? So as we're trying to classify words, what we'll use is the softmax. You've seen this equation already at the very beginning, on the first slide of the lecture, but we'll change the notation a little bit, because all the math that follows will be easier to go through with this notation. So this is going to be the softmax that we'll optimize.
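Written out in the usual textbook form (this is a standard formulation, not copied verbatim from the slides), with $W_y$ denoting the row of the class matrix $W$ for class $y$:

```latex
p(y \mid x) \;=\; \frac{\exp(W_{y}\, x)}{\sum_{c=1}^{C} \exp(W_{c}\, x)}
```

Each class gets a score from the inner product of its row with the input vector $x$, and the exponentials are normalized so the class probabilities sum to 1.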
It's essentially just a different term for logistic regression. In general we'll have a matrix W here for our different classes. And x, for instance, could in the simplest form just be a word vector: we're classifying single word vectors with no context, asking, is this a location or not? It's not very useful, but for pedagogical reasons, let's assume our input x is just a word vector, and I want to classify whether it is a location or not. We give the classifier the word vectors we computed, for instance for Sweden and Norway, and then we want to classify whether Finland and Switzerland are also locations, yes or no. So that's the task. In the simplest case our softmax might have just two classes, but more generally let's say we have multiple different classes, and each class has one row vector here. The notation y is essentially the index of the row, the specific row that we pick out. We take the inner product of that row vector with this column vector x, and then we normalize, just like we always do for logistic regression, to get an overall vector over all the different classes that sums to 1. So W, in general for classification, will be a C-by-d dimensional matrix, where d is the dimensionality of our input and C is the number of classes that we have. And again, logistic regression is just a different term for softmax classification. The nice thing about the softmax is that it generalizes nicely to multiple classes. This is also something we've already covered: loss function, cost function, and objective function are terms we use pretty much interchangeably, and we'll keep using them in all the subsequent lectures. And what we'll use to optimize the softmax is the cross-entropy loss. And with just a minute left, I'll give you that extra minute back, because if we start now it'll be too late. So that's it, thank you. >> [APPLAUSE]
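To make the setup at the end of the lecture concrete, here is a minimal NumPy sketch of softmax classification of a single word vector with a cross-entropy loss; the dimensions, class labels, and variable names are made up for illustration, and the random vectors stand in for trained GloVe or word2vec vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 50, 2                      # word-vector dimension, number of classes
W = rng.standard_normal((C, d))   # one row vector per class (C-by-d)
x = rng.standard_normal(d)        # placeholder for a word vector, e.g. "Sweden"
y = 1                             # true class, say 1 = LOCATION

# Softmax: inner product of each class row with x, then normalize.
scores = W @ x
scores -= scores.max()            # subtract max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()

# Cross-entropy loss: negative log-probability of the correct class.
loss = -np.log(probs[y])
print(probs, loss)
```

Training the classifier means adjusting W (and, if you fine-tune, the word vectors themselves) to drive this loss down over the labeled examples, which is where the next lecture picks up.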
Info
Channel: Stanford University School of Engineering
Views: 195,555
Rating: 4.853786 out of 5
Keywords: Natural Language Processing, Global Vectors for Word Representation, GloVe, hyperparameters, word vector distances, Window classification
Id: ASn7ExxLZws
Length: 78min 39sec (4719 seconds)
Published: Mon Apr 03 2017