A Complete Overview of Word Embeddings

Captions
Word embeddings are mathematical representations of text. Of course, that is easier said than done, so in this video let's learn what word embeddings are, how they are created, and how you can start using them.

The first question, of course, is why we need text embeddings at all. The problem is that when you are working with NLP models, you are working with text, and machine learning models cannot deal with text directly; what they can deal with is numbers. That is why you have to represent your text numerically. There are several ways to represent text data, and embeddings are not the only one: you can use a one-hot encoded approach, a count-based approach, or embeddings. Before we get into embeddings, let's look at what one-hot encoding and count-based approaches are.

You might have heard of one-hot encoding in other contexts too. For text representation, it creates one very long vector that is as long as the number of words in your vocabulary. To represent a word, you fill this vector with zeros except for the cell that corresponds to that word. As you can see, this creates a very sparse vector, which is not the most efficient use of space. So let's see how else we can represent text.

Count-based representation techniques generally try to squeeze a whole sentence into a single vector. There are a few different approaches under this umbrella. One of them is bag of words: you ignore the order of the words in a sentence, look at how many times each word occurs, and build a vector from those counts. The n-gram approach is quite similar to bag of words, but instead of single words you take groups of n words and count how often each group occurs in a sentence. Then there is the TF-IDF approach, where you keep track of how many times a word occurs in a document (also called a sentence here) and how many times it occurs in the other documents throughout the training data. This way it aims to differentiate words that are just commonly used, like "the", "of" and "and", from words that are specifically important for a certain sentence or document.

These approaches, even though they have been really helpful in NLP for years, have some serious shortcomings. They do not take any context into account, they cannot deal with words they did not see in the training examples, and the representations they produce are very sparse, so they are not the most efficient use of space.
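The video illustrates these representations on slides rather than in code. As a rough sketch only, here is how the one-hot and count-based representations above could look in Python, using NumPy and scikit-learn (both the libraries and the toy sentences are assumptions, not something shown in the video):

```python
# One-hot, bag-of-words, n-gram and TF-IDF representations of a toy corpus.
# NumPy/scikit-learn and the example sentences are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "i enjoy tea at breakfast",
    "i enjoy coffee at breakfast",
    "the fair was so much fun",
]

# One-hot: a vector as long as the vocabulary, all zeros except one cell.
vocab = sorted({word for sentence in corpus for word in sentence.split()})
one_hot = {word: np.eye(len(vocab), dtype=int)[i] for i, word in enumerate(vocab)}
print(one_hot["tea"])                      # sparse: mostly zeros

# Bag of words: count how often each word occurs, ignoring word order.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())

# n-grams: count groups of n consecutive words (here n = 2) instead of single words.
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(corpus)
print(bigrams.get_feature_names_out())

# TF-IDF: down-weights words that occur in many documents ("the", "at", ...)
# and up-weights words that are specific to one document.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```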
Now let's talk about embeddings. The goal of word embeddings is to represent a word as a dense vector while making sure that similar words are close to each other in the embedding space. Okay, let's break that down. What does a dense vector mean? It means the vector representing the word does not mostly consist of zeros, and typically the embedding has fewer dimensions than the number of words in your vocabulary. And what does it mean for words to be similar? Similar words are words that are used in similar or the same contexts; most of the time you would see them used around the same words. For example, you can think of tea and coffee as similar words, because you would often see them used around words such as breakfast, drink or enjoy. Whereas pea, which is a legume, and tea are not really similar words, even though they are spelled very similarly, because they are used in vastly different contexts. So once we have the embeddings of these words, we would expect the vectors for tea and coffee to be much closer to each other than the vectors for tea and pea.

And lastly, what is the embedding space? The embedding space is where your embedded data lives. Let's say you have 10 different data points and their one-dimensional embeddings correspond to these numbers. If you place them in the embedding space, the distance between two points gives you the similarity between those two data points. If instead you embed your data into two values, you can represent them in a two-dimensional space, and now they look like vectors with a direction. Beyond the third dimension we have trouble visualizing the vectors, but they follow the same logic: if we embed a word into a vector of length 32, we are turning it into a vector in a 32-dimensional space. We cannot visualize it anymore, of course, but we can still calculate the distance between two vectors or embeddings, like we did in the one-dimensional example, using measures such as cosine similarity (the dot product of the two vectors divided by the product of their lengths).

Here is what we would want the embedding space to look like if we trained a successful word embedding. Since we cannot visualize a 32-dimensional space, let's use a representative 2D space. In this embedding space we want similar words to be close to each other, for example king, queen, sovereign, kingdom; another group could be cat, dog, pet, bird, animal. In some cases it is even possible for the relative distances between words to capture contextual information. In a very commonly used example, if this is how "man" and "woman" are positioned relative to each other, this is how "king" and "queen" would be related: by subtracting the vector for man from the vector for king and adding the vector for woman, you get the vector for queen.

Okay, this has all been cool, but how are word embeddings made? Word embeddings are learned from big corpora, which is basically just a lot of text. There are a few different approaches to how this is done, so let's look into them now.

One thing you can do is have a custom embedding layer in your model. Let's say this is the core of your model, where the actual learning happens. Before you feed the text to your model, you can add an embedding layer, initialize it with random weights, and let it learn how best to represent the words during the training of your actual model. The advantage of doing it this way is that you get an embedding that is very specific to your use case and specialized to your dataset. The problem is that to get a well-performing embedding layer, you need to train it with a lot of data, and it will probably take a long time. The Transformer architecture, for example, does exactly this: before the core of the model, before the encoder and the decoder, it has an embedding layer that takes text and turns it into numbers. If you would like to learn more about the Transformer architecture and how that model works, you can check out our video on Transformers.
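The video describes this embedding layer conceptually rather than in code. As an illustration only, here is a minimal sketch of a model with its own trainable embedding layer, written in PyTorch (the framework, vocabulary size, classifier head and all names are assumptions, not something shown in the video):

```python
# A trainable embedding layer in front of a small model, sketched in PyTorch.
# The framework, the vocabulary size and the classifier head are illustrative assumptions.
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embedding_dim=32, num_classes=2):
        super().__init__()
        # Starts with random weights and is updated together with the rest
        # of the model during training, as described above.
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.head = nn.Linear(embedding_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        vectors = self.embedding(token_ids)   # (batch, seq_len, embedding_dim)
        pooled = vectors.mean(dim=1)          # average the word vectors
        return self.head(pooled)

model = TinyTextClassifier()
batch = torch.randint(0, 10_000, (4, 12))     # 4 "sentences" of 12 token ids
print(model(batch).shape)                     # torch.Size([4, 2])
```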
Another approach is word2vec. Word2vec takes the one-hot encoded versions of the words and creates embeddings for them by also using the context of the sentence. There are two variants of word2vec: one is called continuous bag of words (CBOW) and the other is called skip-gram. Here is how it works. Given a corpus of text, so a lot of sentences, we divide these sentences into groups of n words; for the sake of this example, let's say three. In a sentence we take three words at a time. With CBOW, or continuous bag of words, we take the two words surrounding the middle word, feed them to a neural network, and try to guess the word that should be in the middle. With skip-gram we do the exact opposite: we take the middle word and try to guess the words that should surround it. We keep training this model by sliding the window to the right each time. In this model the neural network has only one hidden layer, and the number of neurons in this hidden layer is the size of the embedding. Once the network performs well, we can extract the embeddings of the words from it: for CBOW these are the output weights, and for skip-gram these are the incoming weights corresponding to a word.

Next we have GloVe. GloVe stands for Global Vectors, and it is an extension of the word2vec approach we just saw: it not only looks at the local dependencies between words but also at the global context, by taking the co-occurrence matrix of the corpus into consideration. The training procedure of GloVe is a little too involved to get into in this video, but all you need to know is that the training objective is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence.

Next we have fastText. FastText is again an extension of the word2vec algorithm, but in a slightly different way: instead of training a skip-gram model on whole words, it splits the words into subwords of length n and trains the model on these subwords. Thanks to the subword approach, it works really well with rare words or words that were not seen in the training data, and this is honestly a huge advantage of fastText over the previous approaches. Another advantage of using subwords is that it does really well with morphologically rich languages like German or Turkish.

And lastly we have ELMo. ELMo is one of the more recent innovations in the area of word embeddings. With ELMo, the embedding of a word depends on its context, so in a way the embedding is created dynamically. In the paper they describe it by saying: "Our representations differ from traditional word embeddings in that each token is assigned a representation that is a function of the entire input sentence." ELMo representations are derived from a bidirectional LSTM that is trained on a language modeling task of predicting the next and the previous words in a sentence. By training on a language modeling task, ELMo takes the context of the sentence into account while it creates the embeddings, and this way it is able to distinguish homonyms. For example, the embedding of the word "fair" would be different in these two sentences: "He was known to be fair." and "The fair was so much fun." And because ELMo's first layer works on characters instead of whole words, it is really good at dealing with misspelled words and typos.
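The video does not walk through training code, but as a small illustrative sketch, this is roughly how a CBOW or skip-gram model could be trained with the gensim library (gensim 4.x, the toy corpus and the hyperparameters below are assumptions):

```python
# Training a word2vec model on a toy corpus with gensim (gensim >= 4.0 assumed).
from gensim.models import Word2Vec

sentences = [
    ["i", "enjoy", "tea", "at", "breakfast"],
    ["i", "enjoy", "coffee", "at", "breakfast"],
    ["we", "drink", "tea", "in", "the", "afternoon"],
]

# sg=0 trains CBOW (predict the middle word from its neighbours),
# sg=1 trains skip-gram (predict the neighbours from the middle word).
model = Word2Vec(
    sentences,
    vector_size=32,   # size of the embedding, i.e. the hidden layer
    window=2,         # how many words on each side count as context
    sg=1,
    min_count=1,      # keep every word, even if it appears only once
    epochs=50,
)

print(model.wv["tea"].shape)                 # (32,)
print(model.wv.similarity("tea", "coffee"))  # cosine similarity of the two vectors
```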
All in all, ELMo needs its own video to fully explain what it does and how it is trained; if you want to see that video, leave a comment and let us know.

Okay, so we have seen some cool ways of making word embeddings, but how can you use them in your project? There are mainly two options: you can either train a word embedding yourself from scratch, or you can use pre-trained word embeddings. It is not that hard to make your own, because for all of the algorithms and models we talked about there is a library that offers a pre-made model you just need to push data through. The problem, as we also mentioned for the custom embedding layer, is that you need a lot of training data, and it will probably take a long time to reach good performance. At the end, though, you will have a word embedding that is specific to your use case and very relevant to the words in your dataset.

The other option is to use a pre-trained word embedding. There are libraries that offer pre-trained word embeddings, and the research groups that come up with new ways of embedding words usually release their models for public use. These word embeddings will of course not be specific to your use case, but they will save you a lot of time and effort. There are two ways to use pre-trained word embeddings, depending on the embedding you pick: you can use them statically, so not update them during your training, just plug and play; or you can make them part of your training process and fine-tune them as you train your model.

Before we wrap up, I want to show you how to import pre-trained word embeddings from the gensim library, because I think it is quite cool to see how they work and to poke around a little bit. The first thing you need to do, of course, is install the library; you can find more detailed instructions on how to do that on their website. After importing the library, the first thing I want to look at is the different pre-trained word embeddings they offer, and you can see immediately that they have fastText, GloVe and word2vec. For GloVe and word2vec they even have pre-trained embeddings trained on different types of data; you can read more there about what kind of data they use, but you can see they have Twitter data, Wikipedia data and Google News, for example. That is quite handy: you can try and see which one works better for your specific project. I then load these different word embeddings into my project so we can look at them in detail.

The first thing I want to see is what word2vec thinks is closest to "tea". Well, I wrote it in lowercase, and it thinks the closest things are "teas", "Tea" with a capital T, and some other terms I do not fully recognize. If I try the same thing with GloVe, it says coffee, milk, wine, cream, ice, juice and so on, which is actually quite accurate: these are all beverages or have something to do with tea. If you try the same thing with fastText, it returns things like tea, coffee, teas and tea bags, again quite acceptable.

You can also check the distance between two terms. Like the example I gave earlier in the video: the distance between tea and coffee comes out to 0.43 for word2vec, whereas the distance between tea and pea is 0.7. That is what we want: similar words are closer together, and words that are not that similar are farther apart, even if they are spelled very similarly.

Another thing you can do, an example people use all over the place, is to take the word king, subtract man from it and add woman to it; you want to arrive at a vector that says queen. Here is how we can set that up: we say the positive words are king and woman, and the negative word is man, so we subtract man from king and add woman to it, and we arrive at queen. The closest result is indeed queen, which is amazing. This example has been tried a lot, so I also wanted to try something I came up with myself: if from the word restaurant I subtract dinner and add cocktail, I want to arrive at a place where you would drink a cocktail, so maybe a bar or something like that. Let's see what word2vec comes up with: eatery, bartender, bartenders. Close enough. Of course, you are not pre-determining these relationships when you make the word embeddings; the model extracts them from the text itself, so you cannot always expect it to work perfectly. GloVe thinks it is parasol, espresso, brewery, again not exactly what I was looking for, but fastText does a little better: it says things like bar restaurant, restaurant bar, cocktail making, wine bar and nightclub. As far as I have read, fastText also does better on these analogy tasks compared to GloVe and word2vec. Either way, I think it is quite cool to explore word embeddings like this.
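Here is a sketch of the gensim exploration described above. The video does not show the exact code or say which pre-trained files it loads, so the model names below (Google News word2vec, Wikipedia GloVe, fastText subwords) are assumptions:

```python
# Exploring pre-trained embeddings with gensim's downloader API.
# The specific model names are assumptions about what the video used.
import gensim.downloader as api

print(list(api.info()["models"]))                 # list the available pre-trained embeddings

w2v = api.load("word2vec-google-news-300")        # word2vec trained on Google News
glove = api.load("glove-wiki-gigaword-100")       # GloVe trained on Wikipedia + Gigaword
ft = api.load("fasttext-wiki-news-subwords-300")  # fastText with subword information

# Nearest neighbours of "tea" in each embedding space.
for name, model in [("word2vec", w2v), ("glove", glove), ("fasttext", ft)]:
    print(name, model.most_similar("tea", topn=5))

# Distances: similar words should be closer than dissimilar ones,
# even when the words are spelled similarly.
print(w2v.distance("tea", "coffee"))   # smaller
print(w2v.distance("tea", "pea"))      # larger

# Analogies: king - man + woman should land near queen,
# and restaurant - dinner + cocktail near a bar-like place.
print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(w2v.most_similar(positive=["restaurant", "cocktail"], negative=["dinner"], topn=3))
```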
This is not how word embeddings are normally used, of course; we were just poking around and exploring. They are normally used in combination with a bigger, core model where you are trying to do some sort of NLP task. If you are interested in that, if you would like us to make a video on, for example, training a sentiment analysis model using a pre-trained word embedding, let us know in the comments section. But for now, thanks for watching. I hope this video was helpful, and I will see you in the next one.
Info
Channel: AssemblyAI
Views: 24,730
Id: 5MaWmXwxFNQ
Length: 17min 17sec (1037 seconds)
Published: Sun May 01 2022