Vectoring Words (Word Embeddings) - Computerphile

Video Statistics and Information

Reddit Comments

Does anyone know if Rob is using an available open source library in this video?

1 point · u/crobz · Nov 19 2019
Captions
If we're moving from cat to dog, which are similar things, so we go away from cat and towards dog - let me go further, go beyond in that direction. Yes, so the first result is "dogs", which is kind of a nonsense result; the second is "pitbull", so that's the "dog" use of dogs, right - the least cat-like dog. That feels about right, yeah. Well, if you go the other way - what's the most cat-like cat, the most un-dog-like? Let's find out. It's going to be kitten, right? It's going to be cats, feline, kitten... it's not really giving us anything much to work with.

I thought I would talk a little bit about word embeddings - word2vec, and word embeddings in general. The way I was introduced to word embeddings, or the context that I'm most familiar with them in, is: how do you represent a word to a neural network?

Well, it's a set of characters, isn't it? It needn't be more than the set of characters that make it up, right? So you can do that, but remember the thing we were talking about before in language models: you have a problem of how far back you can look. I would much rather be able to look back 50 words than 50 characters. And if you're training a character-based model, a lot of the capacity of your network is going to be used up just learning what characters count as valid words - what combinations of characters are words - so if you're trying to learn something more complicated than that, you're spending a lot of your training time just learning what words are, and a lot of your network capacity is being used for that as well. But this isn't a hard problem: we know what the words are. You can give the thing a dictionary, and that gives it a jump start.

The point is, neural networks view things as a vector of real numbers - or a vector of floats, which is like a subset of the real numbers. If you think about something like an image, representing an image in this way is fairly straightforward: you just take all of the pixels and put them in a long row, and if a pixel is black it's 0, if it's white it's 1, and you have greyscale in between, for example. So you end up with a vector that represents that image, and it's a reasonably good representation: it reflects some elements of the structure of what you're actually talking about. If you take the same image and make it a little bit brighter, for example, that is just making that vector a bit longer - a point in that configuration space that's a bit further from the origin. You can make it darker by moving closer to the origin, by reducing the length of that vector. If you take an image and apply a small amount of noise to it, that's just like jiggling that vector around slightly in that configuration space. So you've got a sense in which two vectors that are close to each other are actually kind of similar images, and in which some of the directions in the vector space are meaningful in terms of something that would make sense for images.

The same is true with numbers and whatever else, and this is very useful when you're training, because it allows you to say: if your neural network is trying to predict a number, and the value you're looking for is ten and it gives you nine, you can say "no, but that's close", and if it gave you seven thousand you can say "no, and it's not close". That gives more information that allows the system to learn, and in the same way you can say "yeah, that's almost the image that I want".
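Before moving on to words, here is a tiny Python sketch (not something shown in the video) of the image-as-vector idea; the 28x28 size and the random pixel values are just stand-ins:

```python
import numpy as np

# Toy sketch: a greyscale image as one long vector of floats in [0, 1].
# The 28x28 size and the random pixel values are placeholders.
image = np.random.rand(28, 28)
v = image.flatten()                      # 784 numbers in a row

brighter = np.clip(v * 1.2, 0.0, 1.0)    # brightening ~ scaling the vector up
darker = v * 0.8                         # darkening ~ shrinking it towards the origin
noisy = np.clip(v + 0.05 * np.random.randn(v.size), 0.0, 1.0)  # noise ~ a small jiggle

# Nearby vectors correspond to visually similar images.
print(np.linalg.norm(v), np.linalg.norm(brighter), np.linalg.norm(noisy - v))
```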
Whereas if you give the thing a dictionary of words - say you've got your ten thousand words - the usual way of representing this is with a one-hot vector: if you have ten thousand words, you have a vector that's ten thousand long, ten thousand dimensions, and all of the values are zero apart from one of them, which is one. So the first word in the dictionary, if it's "a", is represented by a 1 followed by the rest of the 10,000 as zeros; the second word is a 0, then a 1, then all zeros; and so on. But there you're not giving any of those clues: if the thing is looking for one word and it gets a different word, all you can say is "yes, that's the correct one" or "no, that's not the correct one".

Something that you might try, but shouldn't because it's a stupid idea, is rather than giving it as a one-hot vector, you could just give each word as a number. But then you've got this implication that two words that are next to each other in the dictionary are similar, and that's not really true. If you have a language model and you're trying to predict the next word in "I love playing with my pet ___", and the word you're looking for is "cat" and the word it gives you is "car", lexicographically they're pretty similar, but you don't want to be saying to your network "close, that was very nearly right", because it's not very nearly right - it's a nonsense prediction. But if it said "dog", you should be able to say "no, but that's close", because that is a plausible completion for that sentence.

The reason that makes sense is that cat and dog are similar words. So what does it mean for a word to be similar to another word? The assumption that word embeddings use is that two words are similar if they are often used in similar contexts. If you look at all of the instances of the word "cat" in a giant corpus of text, and all of the instances of the word "dog", they're going to be surrounded by words like "pet" and "feed" and "play" and "cute", et cetera, and that gives some indication that these are similar words. The challenge that word embeddings are trying to solve is: how do you represent words as vectors such that two similar vectors are two similar words - and possibly so that directions have some meaning as well - because that should allow our networks to better understand what we're talking about in text.
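To make the one-hot problem concrete, here is a minimal sketch (again, not code from the video) with a five-word stand-in vocabulary; in a one-hot encoding every pair of distinct words is exactly the same distance apart, so "cat" is no closer to "dog" than it is to "car":

```python
import numpy as np

# A stand-in vocabulary; a real dictionary would have ~10,000 entries.
vocab = ["a", "car", "cat", "dog", "pet"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# Every pair of distinct one-hot vectors is sqrt(2) apart:
print(np.linalg.norm(one_hot("cat") - one_hot("dog")))  # ~1.414
print(np.linalg.norm(one_hot("cat") - one_hot("car")))  # ~1.414
```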
So the thing people realized was: if you have a language model that's able to get good performance at predicting the next word in a sentence, and the architecture of that model is such that it doesn't have that many neurons in its hidden layers, it has to be compressing that information down efficiently. You've got the inputs to your network - let's say, for the sake of simplicity, your language model is just taking one word and trying to guess the next word, so we only have to deal with one word of input. So our input is this very tall thing, 10,000 tall, and it feeds into a hidden layer which is much smaller - more than the five units I've drawn, but maybe a few hundred, let's say 300 - with every input connected to every hidden unit. Then coming out the other end you're back out to ten thousand again, because your output is going to make one of those units high, and you do something like softmax to turn that into a probability distribution. So you give it a word from your dictionary, it does something, and what comes out the other end is a probability distribution where you can just look at the highest value on the output, and that's what it thinks the next word will be - and the higher that value is, the more confident it is.

But the point is, you're going from 10,000 to 300 and back out to 10,000. So if this is doing well at its task, this 300 has to be encoding - sort of compressing - information about the word, because the information is passing through something that's only 300 wide. In order to be good at this task, it has to be doing this. So then they were thinking: well, how do we pull that knowledge out? It's kind of like an egg drop competition - is this where you have to devise some method of safely getting the egg to the floor? Right. It's not like the teachers actually want to get an egg safely to the ground; they've chosen the task such that if you can do well at it, you have to have learned some things about physics and engineering - and probably teamwork. Right, exactly - it's the friends you make along the way.

So the way that they build this is, rather than trying to predict the next word - although that will work, it will actually give you word embeddings, but they're not that good because it's only the immediately adjacent word - you look around the word. You give it a word, then you sample another word randomly from the neighbourhood of that word, and you train the network to predict that. The idea is that at the end, when this thing is fully trained, you give it any word and it gives you a probability distribution over all of the words in your dictionary: how likely is each of these words to show up within five words of the first word, or within ten, or something like that. If the system can get really good at this task, then the weights of this hidden layer in the middle have to encode something meaningful about that input word. If you imagine the word "cat" comes in: in order to do well, the probability distribution of surrounding words is going to end up looking pretty similar to the output that you would want for the word "dog", so it's going to have to put those two words close together if it wants to do well at this task.

And that's literally all you do. It's absurdly simple, but if you run it on a large enough dataset and give it enough compute to actually perform really well, it ends up giving you, for each word, a vector whose length is however many units you have in your hidden layer, and for which the nearness of those vectors expresses something meaningful about how similar the contexts are that those words appear in. Our assumption is that words that appear in similar contexts are similar words, and it's slightly surprising how well that works and how much information it's able to extract.
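The video doesn't show any training code, but the skip-gram setup described above can be sketched with the gensim library roughly as below; the toy corpus is made up, and the parameter names assume gensim 4.x (older versions use size instead of vector_size):

```python
from gensim.models import Word2Vec

# A tiny stand-in corpus; the real thing would be millions of sentences.
sentences = [
    ["i", "love", "playing", "with", "my", "pet", "cat"],
    ["i", "love", "playing", "with", "my", "pet", "dog"],
    ["we", "feed", "the", "cat", "every", "morning"],
    ["we", "feed", "the", "dog", "every", "morning"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # width of the hidden layer, i.e. the embedding length
    window=5,         # sample context words from within 5 words either side
    sg=1,             # 1 = skip-gram: predict nearby words from the centre word
    min_count=1,      # keep every word in this toy corpus
)

# Each word now maps to a 300-dimensional vector.
print(model.wv["cat"].shape)
print(model.wv.most_similar("cat", topn=3))
```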
It ends up being a little bit similar, actually, to the way that a generative adversarial network does things, where we're training it to produce good images from random noise, and in the process of doing that it creates this mapping from the latent space to images, where doing basic arithmetic - just adding and subtracting vectors in the latent space - actually produces meaningful changes in the image. What you end up with is that same principle, but for words.

So if you take, for example - and it's required by law that all explanations of word embeddings use the same example to start with - the vector for "king", subtract the vector for "man", and add the vector for "woman", you get another vector out, and if you find the nearest point in your word embeddings to that vector, it's the word "queen". There's a whole giant swath of ways that ideas about gender are encoded in the language which are all kind of captured by this vector - which we won't get into, but it's interesting to explore.

I have this running and we can play around with some of these vectors and see where they end up. I have it running in Google Colab, which is very handy, and I'm using word embeddings that were found with the word2vec algorithm on Google News; each word is mapped to 300 numbers. Let's check whether what we've got satisfies our first condition: we want "dog" and "cat" to be relatively close to each other, and we want "cat" to be further away from "car" than it is from "dog". We can just measure the distance between these different vectors - I believe you can just do model.distance. The distance between "car" and "cat" is 0.784, and the distance between, let's say, "dog" and "cat" is 0.23. Right, so dog and cat are closer to each other - this is a good start.

In fact, let's find all of the words that are closest to "cat", for example. Okay, so the most similar word to "cat" is "cats" - makes sense - followed by "dog", "kitten", "feline", "beagle", "puppy", "pup", "pet", "felines" and "chihuahua". So this is already useful - it's already handy that you can throw any word at this and it will give you a list of the words that are similar - whereas if I put in "car", I get "vehicle", "cars", "SUV", "minivan", "truck". So this is working.

The question of directions is pretty interesting. So yes, let's do the classic example: if you take the vector for "king", subtract the vector for "man", and add the vector for "woman", what you get, somewhat predictably, is "queen". If you put in "boy" here you get "girl"; if you put in "father" you get "mother"; and if you put in "shirt" you get "blouse". So this is reflecting something about gender that's in the dataset it's using. This reminds me a little bit of the unicorn thing, where the transformer was able to infer all sorts of things - or appear to have knowledge about the world - because of language. Right, but the thing that I like about this is that that transformer is working with 1.5 billion parameters, and here we're literally just taking each word and giving it 300 numbers.

If I go from "London" and subtract "England" and then add, I don't know, "Japan", we'd hope for Tokyo. We hope for Tokyo, and we get Tokyo - we get Tokyo twice, weirdly? Tokyo, Tokyo, why is... oh, sorry, no we don't: we get "Tokyo" and "Toyko", a typo I guess. And with "USA" we get "New York". Ah, okay, interesting - maybe it's thinking largest city? Yeah, right - the exact relationship here isn't clear; we haven't specified it. What does it give us for Australia? I bet it's... yes, Sydney. Sydney, Melbourne. So it's not doing capital city, it's just doing largest city. But that's cool - it's cool that we can extract "largest city", and this is completely unsupervised: it was just given a huge number of news articles, I suppose, and it's pulled out that there's this relationship and that you can follow it for different things.
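The demo appears to use gensim with the pre-trained Google News word2vec vectors; a sketch along those lines is below. The file name is the standard public release of those vectors (the path is an assumption), and you only get the exact numbers quoted above if you load the same file:

```python
from gensim.models import KeyedVectors

# Pre-trained 300-dimensional word2vec vectors trained on Google News.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(wv.distance("car", "cat"))   # larger  (~0.78 quoted above)
print(wv.distance("dog", "cat"))   # smaller (~0.23 quoted above)

print(wv.most_similar("cat", topn=10))  # cats, dog, kitten, feline, ...
print(wv.most_similar("car", topn=5))   # vehicle, cars, SUV, minivan, truck

# king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# London - England + Japan ~ Tokyo
print(wv.most_similar(positive=["London", "Japan"], negative=["England"], topn=3))
```

Note that most_similar with positive and negative arguments is doing exactly the king - man + woman arithmetic and then returning the nearest neighbours of the result, excluding the input words themselves.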
You can take the vector from "pig" to "oink", right? Okay. And then you put "cow" in there - that's "moo"; you put "out" in there and you get "knowing"; you put "dog" in there and you get "barks" - right, close enough for me. Yeah. But then it gets surreal: you put "Santa" in there - "ho ho ho". What does the fox say? "Phoebe"? What? So it doesn't know, basically - although the second thing is "chittering". Do foxes chitter? I don't know. "Gobble"? They go "ring-ding-ding-ding-dingeringeding"... not in this dataset.
Info
Channel: Computerphile
Views: 107,305
Rating: 4.9636803 out of 5
Keywords: computers, computerphile, computer, science, Computer Science, Rob Miles, Robert SK Miles, Ai, Machine Learning, Neural Networks, Word Embedding
Id: gQddtTdmG_8
Length: 16min 56sec (1016 seconds)
Published: Wed Oct 23 2019