Word Embeddings

Captions
Today I'm going to be talking about word embeddings, which I think are one of the coolest things you can do with machine learning right now. To explain why I think that, I'm just going to jump right in and give you an example of something you can do with word embeddings. A few months ago I set up a program that downloads tweets as people write them on Twitter and saves them to a file. After running this program for about a month or two, I had collected a massive file, over 5 gigabytes of tweets after I compressed it. So I had a massive amount of data, and it's just raw text that people typed on the internet. After getting all this data, I fed it directly into a word embedding algorithm, which was able to figure out tons of relationships between words just from the raw things that people happened to type. For example, if I put in a color, it can tell me a bunch of other colors. It never actually knew the idea of color going in; it didn't really know anything, and it pieced together that all these words are related. I can put in other things, like a kind of food, and I get other kinds of food out. I'll have a link in the description so you can try this out yourself and see how well it really learned the relationships between words.

You're probably wondering how this actually worked, because it's kind of baffling: all I gave it was things that people typed on the internet. I didn't give it a dictionary, and I didn't give it a list of synonyms. I chose English, but I could have chosen any language and it would have been just as successful. So it's pretty impressive that it was able to do this from raw text alone, and in this video I want to explain why it works and how it works, because I think it's super cool and I want to share it with you.

Pretty much every word embedding algorithm uses the idea of context, and to show you what I mean, I'll give you a really simple example. Here is a sentence with a word missing: "I painted the bench ___", and we're expected to fill in the blank. The obvious thing to put here is a color, right? "I painted the bench red", "I painted the bench green", something like that. Already we can see that if a word can show up in this context, it's likely to be a color. Unfortunately that's not always true; you could also say "I painted the bench today", and "today" is not a color. But the main takeaway is that context is closely related to meaning.

That was an example where multiple different words could go into the same context, and we presume that those words are somehow related, or at least a lot of them are. There's another way context can help us, though, and that's when two words tend to appear in the same context at once. Here are three sentences that illustrate the idea. In the first sentence I actually have two examples: "Donald" and "Trump" are likely to appear together, because one is the first name of a person and the other is the last name of that same person, so those words are closely related. We also have "United States", which is really one logical word broken up into two smaller words, so "United" and "States" are likely to appear together. In the second and third sentences, "joke" and "laughs" are related words, since you laugh at a joke, so they're also likely to appear in the same context.

Now, there's one subtle thing I'd like to point out in this example: "laughs" and "laughed" are technically different words. "Laughs" is the present tense and "laughed" is the past tense, and likewise we could think about "joke" versus "jokes", where one is singular and one is plural. These are different forms of the same word, and ideally, since our algorithm knew nothing about English going in, our word embedding will have to learn that different forms of the same word are related; it has to learn that "laughed" is somehow related to "laughs". These sentences give you an idea of how the model might be able to do that: "laughed" appears with the word "joke" in the second sentence, and "laughs" appears with the word "joke" in the third sentence, so ideally the word embedding would figure out that "laughs" and "laughed" are related, since they're both related to "joke".
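To make "appearing in the same context" concrete, here is a minimal Python sketch of one way to collect context pairs from a piece of text. It is only an illustration: the window of two words on each side is an arbitrary choice, and the video doesn't specify exactly how its contexts were extracted.

```python
# Illustrative only: collect (word, context word) pairs using a small window.
def context_pairs(text, window=2):
    words = text.lower().split()
    pairs = []
    for i, center in enumerate(words):
        # Look a few words to the left and right of the center word.
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(context_pairs("he laughed at the joke"))
# [('he', 'laughed'), ('he', 'at'), ('laughed', 'he'), ('laughed', 'at'), ...]
```

Counting how often each pair shows up across millions of tweets is essentially the raw material the methods described below work from.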
So that's where a word embedding gets its knowledge: it learns through context, by seeing which words occur near which other words. But what does a word embedding actually do? I still have to formalize what we're after. In one sentence, a word embedding converts words into vectors. You might put in a word like "hamburger" and get out a list of, say, 64 numbers, and those numbers describe the word.

For a word embedding to be good, we require that the vectors carry some meaning. If I put "hamburger" and "cheeseburger" into my model, I want those two vectors to be very close to each other, because they're very related words. Whereas if I put in something else, like "Ferrari", a kind of car totally unrelated to hamburgers, I want the vector for "Ferrari" to be far away from the vector for "hamburger". All these distances are relative, of course, but you can see what I mean: we want the closeness of the vectors to reflect the closeness of the words they represent.

In addition to closeness, we might also want even more structure. For example, take the vector for "man" minus the vector for "woman" (to subtract vectors, we just subtract each number from the corresponding number of the other vector). I want that difference to somehow represent the difference between male and female, so that if I add it to the vector for "queen", I get something very close to the vector for "king". So I want the vectors to be related, and I want the differences between vectors to carry meaning too. I might add other constraints, but the idea is that I want to encode as much meaning as I can into the vectors.
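Here's a rough sketch of what those two requirements look like numerically. The four-dimensional vectors below are made up purely for illustration; real embeddings have more dimensions (64 in the video's example) and their values come from training on data.

```python
import numpy as np

# Made-up toy vectors, not real trained embeddings.
vectors = {
    "hamburger":    np.array([0.90, 0.80, 0.10, 0.00]),
    "cheeseburger": np.array([0.85, 0.75, 0.15, 0.05]),
    "ferrari":      np.array([0.05, 0.10, 0.90, 0.80]),
}

def cosine(a, b):
    # Cosine similarity: close to 1 for vectors pointing the same way,
    # near 0 for unrelated directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["hamburger"], vectors["cheeseburger"]))  # high
print(cosine(vectors["hamburger"], vectors["ferrari"]))       # much lower

# The analogy from the video would look like this with trained vectors:
# target = vectors["man"] - vectors["woman"] + vectors["queen"]
# and you would then search for the word whose vector is nearest to target,
# hoping to find "king".
```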
So how are we actually going to do that? How are we going to solve for vectors for all of the words that ever appear on Twitter, and produce a set of vectors that works? The first approach I'm going to talk about is known as word2vec, and it's probably the most famous kind of word embedding, because it was the first to get the kind of impressive results that state-of-the-art word embeddings get today. Essentially, word2vec is just a really simple neural network, and if you've seen my other videos on neural networks you might already be able to implement it, but I'm going to describe it here at a high level to give you an idea of how it works.

Here's a simple picture of what a word2vec neural network looks like. You feed in a word, it produces in the middle a small vector, which is the word embedding, and then it produces as output something like a context. To describe this in a little more detail, I'll give an example of something we might ask a word2vec network to do. I picked out a random tweet from my corpus and a random word from within that tweet, in this case "yellow". I'm going to feed the word "yellow" in as input, and I'm going to try to get the network to output all the other words that were in the tweet. So the word2vec network in this case is just trying to predict context words from a word.

How exactly do I feed in the word "yellow" and get out all these context words? How do I represent that for the neural network? Basically, the network has a different input neuron for each different word. I take the neuron for whatever word I want to feed in, set that neuron to 1, and set all the other neurons to 0. The network then uses regular neural network machinery to produce a small vector; that's just a hidden layer with 64 nodes. Then, with more of the usual neural network machinery, it produces an output vector with maybe a hundred thousand components, one for each word in the vocabulary. I want every output neuron whose word is in the context to be set, and every neuron whose word is not in the context to not be set.

So why does this work? Why do we expect the middle layer, where the word gets turned into a small vector, to actually be meaningful? The answer is that the small vector is all the network has to figure out the context; it goes straight from that small vector to the context. If two words have very similar contexts, it's really helpful for the network if the small vectors for those two words are similar, because those two words need to produce similar outputs, and a similar vector makes that easy. Essentially, this model forces the middle layer of the network to correspond to meaning: words with similar contexts get close vectors in the middle of the network, just because that's what's easiest for the network to do. That's a really general overview of how word2vec works, and there's a lot more to it, so I'll have a link to the original word2vec paper in the description if you want to read more about it.
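As a rough sketch of that architecture (my own illustration, with made-up sizes; a real Twitter vocabulary would be closer to a hundred thousand words), here is what the forward pass looks like. The training step that actually adjusts the weights is omitted.

```python
import numpy as np

vocab_size = 10_000     # stand-in for a real ~100,000-word vocabulary
embedding_dim = 64      # the size of the small vector in the middle

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # word -> embedding
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # embedding -> context scores

def forward(word_id):
    # Multiplying a one-hot input by W_in just selects one row,
    # so that row is the word's embedding.
    embedding = W_in[word_id]
    # One score per vocabulary word, saying how likely it is to be in the context.
    scores = embedding @ W_out
    # Softmax turns the scores into probabilities.
    exp = np.exp(scores - scores.max())
    return embedding, exp / exp.sum()

embedding, context_probs = forward(word_id=42)   # 42 standing in for the id of "yellow"
print(embedding.shape, context_probs.shape)      # (64,) (10000,)
```

Training would repeatedly nudge W_in and W_out so that the words actually observed in each context get high probability; after enough tweets, the rows of W_in are the word embeddings.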
Besides word2vec, there are a bunch of other ways to generate word embeddings, and the majority of them are based on something called the co-occurrence matrix. Here's a really simple example of what this looks like. Both the rows and the columns correspond to words, and the entry at any given point in the matrix counts how many times those two words appeared in the same context. You can imagine how we might generate this with Twitter data: loop through all the tweets, go through all the words in each tweet, and every time two words occur in the same tweet, add one to the corresponding entry in the matrix. Different methods use this matrix in different ways, but pretty much all of them rely on some amount of linear algebra, so I'm going to be talking about matrices, matrix multiplications, dot products, things like that. If you don't know linear algebra you won't get much from this, which is why I left it for the end of the video; some people will get something out of it, and I think it's quite interesting.

Probably the simplest approach to generating word embeddings from the co-occurrence matrix is to decompose it into the product of two much smaller matrices. I've drawn out the picture here: you get the massive square matrix, our co-occurrence matrix, by multiplying a tall skinny matrix and a short wide matrix. If you think about how many entries are in the big matrix, it's a hundred thousand squared, which is a lot more information than is stored on the right side of the equation, two relatively small matrices multiplied together. So by decomposing the big co-occurrence matrix into these smaller ones, we're clearly compressing information, and to do that compression we hopefully have to extract a lot of the meaning in the matrix, which should let us generate at least decent embeddings.

I haven't said exactly how we might find this decomposition, but there are plenty of methods in linear algebra for decomposing a matrix, like singular value decomposition, or you could use gradient descent or something like that. Once you have the decomposition, you get word vectors pretty much for free. In the big co-occurrence matrix, each row and each column corresponds to a word, so if I go into the tall skinny matrix and grab the row for a certain word, that gives me a small vector, in this case 64 components, and I can call that the word embedding for that word. Of course, I didn't have to take it from the tall skinny matrix; I could have taken the corresponding column of the short wide matrix, or even averaged the two vectors and used that as the embedding.

There's actually a good reason to expect these vectors to capture a decent amount of meaning: an entry in the big co-occurrence matrix is approximated by the dot product of a word vector from the tall skinny matrix and a word vector from the short wide matrix. So if I use these word vectors, the dot product tells me how likely two words are to co-occur. I've gotten structure into my vectors: correlation between vectors corresponds to correlation in context. That's why you might expect matrix decompositions to give you good embeddings.
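Here's a rough sketch of that simplest approach on a toy corpus, using a truncated SVD from SciPy to factor the co-occurrence matrix. The four example "tweets", the tiny vocabulary, and the 2-dimensional embeddings are all just so the example runs; a real run would use the full tweet dump and something like 64 dimensions.

```python
import numpy as np
from itertools import combinations
from scipy.sparse.linalg import svds

# Toy corpus standing in for the real tweet data.
tweets = [
    "i painted the bench red",
    "i painted the bench green",
    "he laughed at the joke",
    "she laughs at every joke",
]

# Build the vocabulary and the co-occurrence matrix: two words co-occur
# whenever they appear in the same tweet, as described above.
vocab = sorted({w for t in tweets for w in t.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for t in tweets:
    for a, b in combinations(t.split(), 2):
        X[index[a], index[b]] += 1
        X[index[b], index[a]] += 1

# Factor X into a tall-skinny matrix times a short-wide matrix (truncated SVD).
k = 2  # embedding size, tiny only because the toy vocabulary is tiny
U, S, Vt = svds(X, k=k)
embeddings = U * S  # each row is now a k-dimensional word vector

print(embeddings[index["laughed"]])
print(embeddings[index["laughs"]])
# With a real corpus, words like "laughed" and "laughs" end up with similar vectors.
```

Note that this factors the raw counts directly; the method described next works with weighted logarithms of the counts instead.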
Now I'll tell you a little bit about the particular method I used to generate the word embeddings at the beginning of this video. The method is known as GloVe, short for "global vectors", and it's a kind of co-occurrence decomposition method. It's a little unique in that it decomposes the logarithm of the co-occurrence matrix instead of the matrix itself, and it's weighted: it uses a model where certain entries in the co-occurrence matrix matter more than others. You use gradient descent to learn the embedding, so it's similar to training a neural network, it gets really good results, and it's extremely fast. I like GloVe; I had a lot more fun implementing GloVe than I did implementing word2vec. I'll certainly have a link to the GloVe paper in the description, because it's an excellent paper: it explains why word2vec works as well as why GloVe works, and it covers a bunch of other word embedding methods.

That's pretty much all I had planned for today. I hope I got you interested in word embeddings, and if you want to know more, I highly recommend reading the GloVe paper linked in the description. I'll try to link other resources as well, because this is a really interesting topic and I think a lot of people will find it cool. Anyway, thanks for watching, subscribe, and goodbye.
Info
Channel: macheads101
Views: 122,188
Rating: 4.9358349 out of 5
Keywords: mac, apple, programming, computers, internet, terminal, machine learning, artificial intelligence, word2vec, glove, embeddings, word embeddings, natural language processing, nlp
Id: 5PL0TmQhItY
Length: 14min 27sec (867 seconds)
Published: Sat Jul 15 2017