Transformers for beginners | What are they and how do they work

Captions
Transformers came into our lives just a couple of years ago, but they have been taking the NLP area by storm. Libraries like Hugging Face have made it very easy for everyone to use transformers, and implementations like BERT or GPT-3 are the reason everyone is talking about them. But what are they and how do they work? In this video we will look closely into transformers and understand their working principles. This video is part of the Deep Learning Explained series by AssemblyAI, a company that is making a state-of-the-art speech-to-text API. If you want to use AssemblyAI for free, get your free API token using the link in the description.

Before transformers came along, we were using RNNs to deal with text data, or any sequence data really. The problem with RNNs is that when you give them a very long sentence, they tend to forget the beginning of the sentence by the time they reach the end. And because they rely on recurrence (well, it's in the name: recurrent neural network), they cannot be parallelized. Then we started using LSTMs. LSTMs are a little bit more sophisticated and tend to remember information for longer, but they take very long to train. And then we have transformers. Transformers rely only on attention mechanisms to remember things; they do not have any recurrence at all, and thanks to this they are faster, because we can parallelize them and train them in a parallel way.

Okay, but what is this attention? We can definitely make another video to talk about that, and if you're interested, comment and let me know. Generally, though, attention is the ability of a model to pay attention to the important parts of a sentence, an image, or any kind of input really. If it's a sentence, this is what it would look like. Let's say we have an English sentence, "The agreement on the European Economic Area was signed in August 1992," and on the other side is the French translation of it (I do not know the first thing about French, so I'm not even going to try to pronounce it). As you can see in this chart, the lighter the color of a square, the more attention our model is paying to the word in that row or column. And it does not always go in a diagonal way: when it is translating "European Economic Area," because the word order is reversed in French, it is paying attention in a reversed way. If this were an image, and let's say we are looking for dogs in images and trying to classify different breeds of dogs, then you can see what your model is paying attention to. Is it the noses of the dogs? Is it the ears? What exactly in an image is the model attending to in order to tell dog breeds apart?

All right, now that we've briefly looked at what attention is, let's look into how transformer networks learn and what their architecture is. This is what a transformer network more or less looks like, but we will start from the higher levels and then break everything down to understand how it works together. On a very high level, transformers have an encoder part and a decoder part; in fact they have six encoders and six decoders, with the left-hand side being the encoders and the right-hand side being the decoders. Each encoder has one self-attention layer, which pays attention to the sentence itself, and one feed-forward neural network layer, and every decoder has two attention layers and one feed-forward neural network layer.
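As a rough structural sketch of that layer composition (my own illustration with placeholder functions, not code from the paper or the video): an encoder layer is a self-attention step, where the whole sentence is processed together, followed by a feed-forward step applied to each word separately, and six such layers are stacked.

```python
import numpy as np

# Minimal structural sketch only; the real sub-layers are sketched further below.
# The stub functions here are placeholders, not actual attention or feed-forward code.

d_model = 512                      # embedding size used in the original paper

def self_attention_stub(x):        # placeholder: words exchange information here
    return x

def feed_forward_stub(x):          # placeholder: applied to each word separately
    return x

def encoder_layer(x):
    x = self_attention_stub(x)
    return feed_forward_stub(x)

def encoder_stack(x, num_layers=6):
    for _ in range(num_layers):    # six encoders, as described above
        x = encoder_layer(x)
    return x

sentence = np.random.randn(10, d_model)   # 10 words, each a 512-dim embedding
print(encoder_stack(sentence).shape)      # (10, 512)
```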
The parallelization comes from how we feed the data into this network: we feed all the words of the sentence at the same time, specifically to the encoder. Inside the first step, the self-attention sub-layer, all the words of the sentence are compared to all the other words, so there is some communication between the words. In the next step, the feed-forward neural network, the words are passed through separately, so there is no information exchange between them; the feed-forward networks they are passed through are the same within the same layer, but as we said there are six encoders, and in each of these six encoders the neural networks are different.

Okay, so that has been the middle part of the network; we also have the inputs and the outputs. All the raw inputs that go into either the encoder or the decoder are embedded first. What are embeddings? That's a little bit of a longer topic for this video (again, if you'd like us to make a video on it, leave a comment), but what you need to know for now is that embeddings are a way to represent words as vectors of length n. In this specific transformer architecture they use vectors of length 512; that's what the original paper uses, but it is a hyperparameter you can change. On top of these word embeddings we add positional encodings. If you remember, we said transformers do not have any recurrence, so the model has no way of knowing which word comes first, which comes second, or where each word sits in the sentence. By adding a positional encoding, you are injecting information into each word that tells the model where in the sentence this word comes.

And lastly, for the output, we have a linear layer and a softmax layer at the end of the decoders, so the output of the decoders can be transformed into something we can understand. What they turn it into is a vector with the length of our vocabulary, and each cell tells us how likely it is that the word in that cell is the next word in our sequence.

Those are the main components, but there are two little things that make transformers a little bit better. One of them is the normalization layers. If you notice, between the sub-layers (the self-attention layers and the feed-forward neural networks) we have some "Add & Norm" layers, and what they do is normalize the output that comes from the sub-layer. The normalization technique used there is called layer normalization, which is basically an improvement over batch normalization; if you don't know what batch normalization is, we already made a video about that (linked somewhere here) that will help you understand batch and layer normalization a little better. The second little detail is the skip connections. If you look at the original architecture image, you see arrows going around all of the sub-layers: some of the information bypasses the self-attention or feed-forward sub-layer and is sent directly to the normalization layer. Inside these Add & Norm layers, what we do is add the output that went through the sub-layer to the input that skipped around it, and then normalize them together. This helps the model not forget things and helps it carry information that is important further into the network.
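Here is a minimal sketch of that "Add & Norm" step (an illustration with my own names, not the paper's code, and without the learnable scale and shift a full layer norm would have): the sub-layer's input skips around it, is added to the sub-layer's output, and the sum is layer-normalized.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each word vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # x: (num_words, d_model); sublayer: e.g. self-attention or feed-forward
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)                 # 10 words, 512-dim embeddings
out = add_and_norm(x, lambda v: v * 0.5)     # dummy sub-layer for illustration
print(out.shape)                             # (10, 512)
```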
And that's all there is to the architecture of transformers. If you look into it, most of the things inside this architecture have been known about for a long time: linear transformations, softmax layers, word embeddings, feed-forward neural networks. But there are two really novel ideas in the original transformer paper that made the difference, and those are positional encodings and multi-headed attention. So let's take a closer look at how they work.

Let's start with the multi-headed attention layers. If you look at the original architecture, you see that there are two different types: one is just multi-headed attention and the other is masked multi-headed attention. It actually does the same thing whether it's called masked or not, and whether it's in an encoder or a decoder; the only difference is that in a normal multi-headed attention layer, all the words are compared with all the other words in the sentence, whereas in a masked multi-headed attention layer, only the words that come before a word are compared to that word. What I mean by "comparing" will make more sense in a second.

In the attention layer, something called scaled dot-product attention is used, and it is done multiple times to create that multi-headed effect. Of course, everything is done with matrices to make things faster, but I will show you how attention is calculated using just the vectors of individual words. What we have in the beginning are the embeddings of the words: if you remember, we embedded the words into vectors and added positional encodings, and then this is fed to the first encoder, and first of all to the multi-headed attention sub-layer of that encoder. In there, the first thing that is done is to multiply these embedding vectors with some matrices. These are called the query, key, and value matrices; they are initialized randomly and learned during training, kind of like the weights and biases we have in neural networks. As a result of this multiplication we get a query, key, and value vector for each word, and from this point on we use these vectors to keep going with the calculation.

The first thing we want to do is calculate a score for each word against all the other words in the sentence. For this, we take the dot product of the query vector of each word with the key vectors of all the other words. So if you want the score of the first word against the first word, you take the dot product of the query vector of word one with the key vector of word one. If you want the score of the first word against the second word, you take the dot product of the query vector of the first word with the key vector of the second word. Once we have the dot products of the first word's query vector with the key vectors of all the words, we have the scores of the first word against all the other words; all of these scores belong to the first word. If we want the scores for the second word, we multiply its query vector with the key vectors of all the other words, and so on for every word.
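Here is a minimal sketch of that step (variable names and sizes are my own illustration, not the original implementation): each word embedding is multiplied by the learned query, key, and value matrices, and the raw scores are the dot products of each word's query vector with every word's key vector.

```python
import numpy as np

d_model, d_k = 512, 64                 # sizes used in the original paper
num_words = 10

rng = np.random.default_rng(0)
X = rng.standard_normal((num_words, d_model))    # embeddings + positional encodings

# the query/key/value matrices, randomly initialized and learned during training
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # one q, k, v vector per word

# score of word i against word j = dot(query_i, key_j); done for all pairs at once
scores = Q @ K.T                                  # shape (num_words, num_words)
print(scores.shape)
```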
This is all done in parallel; that's why we do not have any recurrence, and we don't have to wait for earlier words to be processed before processing the words further along in the sentence: we can do these calculations for all the words at the same time.

Once we have all the scores of all the words against all the other words, we divide them by eight. That might sound like a very random number, but it's the square root of 64, and 64 is the length of the query, key, and value vectors; that's why the authors of the original transformer paper use that number. After we divide everything by 8, we pass all these values through a softmax layer. We do this to normalize them, so the score values of one word against all the other words now sum up to one. The resulting numbers serve kind of like weights. From this point on, we multiply the value vectors of all the words by these weights, and finally we sum up all the weighted value vectors to create the output of the self-attention layer for this one word we were doing the calculations for. Then the same calculations are done for all the other words (again, simultaneously), and at the end you have the output of the attention layer.

As I mentioned, multi-headed attention does this eight times, so effectively it is training eight different sets of query, key, and value matrices — not the vectors for the words, but the matrices that we multiply the input embeddings with. This way the model is able to pay attention not just to one other word but to many other words in the sentence. In the paper they use the number eight, but you can change it if you like. Let's look again at the example from the beginning of this video: as you can see, some of the cells are really bright, but there are also cells that are just kind of gray, which means our model was paying a little bit of attention to those other words too, beyond the primary word it is attending to. This multi-headed attention is also one of the reasons why transformers deal so seamlessly with sentences of different lengths.

One thing you might catch here is that if we do the same thing eight times, we end up with eight different resulting matrices, each with one row per word — so how do we deal with that? What they propose in the paper is to concatenate them all together and then multiply them by yet another weight matrix, which produces a matrix that looks like just one output of the attention layer. This weight matrix is, of course, yet another thing to train inside the transformer, on top of the query, key, and value matrices that we multiply all of the word embeddings with.
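Putting those pieces together, here is a minimal sketch of scaled dot-product attention and the multi-head trick just described (illustrative names, not the original implementation): scores are divided by sqrt(d_k) = 8, softmaxed into weights, used to compute a weighted sum of the value vectors, and the eight heads are concatenated and projected with one more learned matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]                                # 64 in the original paper
    scores = Q @ K.T / np.sqrt(d_k)                  # divide by 8
    weights = softmax(scores)                        # each row sums to one
    return weights @ V                               # weighted sum of value vectors

def multi_head_attention(X, heads, W_O):
    # heads: list of (W_Q, W_K, W_V) triples, one per head (eight in the paper)
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O    # concatenate, then project back

d_model, d_k, num_heads, num_words = 512, 64, 8, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((num_words, d_model))
heads = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
         for _ in range(num_heads)]
W_O = rng.standard_normal((num_heads * d_k, d_model))
print(multi_head_attention(X, heads, W_O).shape)     # (10, 512)
```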
Next up is positional encoding. As I mentioned before, positional encodings are a way to inject information into the word embeddings we created, to show where in the sentence a word sits — basically the location information of a word. You can use either learned positional encodings or fixed positional encodings, but in the original paper they recommend fixed positional encodings, because they have the advantage of being able to handle sentence lengths that were never seen in the training set.

You might ask why we need any sophisticated solution for this at all — why can't we just assign each word a number specifying where in the sentence it is? Well, that wouldn't really work. Say you assign numbers that go from zero to one: then you cannot tell how many words are in the sentence just by looking at one word, and the value would not be consistent between examples. Another option would be to assign integers, starting from zero or one and going up to however long the sentence is, but the problem there is that those numbers can get very high — with a very long sentence it could get out of control — and on top of that, there could be sentence lengths that do not appear in the training data, which could cause problems with generalization.

So what they did as a solution to this positional problem in the original transformer paper was to use sine and cosine functions at different frequencies. Of course I don't expect you to just know what that means, so let's look at it. This is what sine and cosine functions at different frequencies look like: the colors show values ranging from -1 to 1, the x-axis is the length of the word embedding (512 in the transformer, as I mentioned before), and the y-axis is the position of the token, of the word. So if I want the positional encoding of the word in, say, the 20th position, I take the horizontal line that corresponds to 20 on the y-axis. The nice thing about this positional encoding is that it is unique: no other horizontal line in this graph has the same composition of values. Another nice thing is that the difference between the encodings of two positions is consistent, no matter where in the sentence those two positions are.

One thing that really helped me understand this concept was to look at binary representations of integers. As you count upward, the lowest bit of the binary representation flips between zero and one with every new integer, whereas the second bit changes every two integers (zero, zero, then one, one, then zero, zero again), and this pattern continues for the higher bits. All of these binary representations are unique — no two are the same — and on top of that, you can always tell the distance between two integers by looking at their binary representations. This would also be a perfectly usable positional encoding for us, but it is only ones and zeros, and we would not be using the information that continuous values can provide; that's why we use sine and cosine functions instead.

Okay, so what do we do once we have these encodings — say, an encoding of 512 values that we extracted from this graph? We simply add them: we add the word embedding and the positional encoding together, and then we feed the result to the encoders.
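A minimal sketch of those fixed sinusoidal positional encodings, as described in the original paper (function and variable names are my own): even embedding dimensions use a sine and odd dimensions a cosine, each at a different frequency, and the result is simply added to the word embeddings.

```python
import numpy as np

def positional_encoding(max_positions, d_model=512):
    positions = np.arange(max_positions)[:, None]            # (max_positions, 1)
    dims = np.arange(d_model)[None, :]                        # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions
    return pe

pe = positional_encoding(max_positions=100)
word_embeddings = np.random.randn(10, 512)                    # a 10-word sentence
encoder_input = word_embeddings + pe[:10]                     # just add them together
print(encoder_input.shape)                                    # (10, 512)
```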
All right, so we have learned everything we need about the architecture: there are encoders (specifically six of them) and decoders (again, six of them), we have the final processing and the output, and we have the embeddings and the positional encodings at the input. But how does it all work together? To bring it all together: you first take your inputs, run them through the embeddings, add the positional encodings, and then run them through the six layers of encoders to get an output. This output is fed to all of the decoders — we have six layers of decoders, as we mentioned — but it is only fed to their second sub-layer, the (unmasked) multi-headed attention sub-layer of the decoders. The first, masked multi-headed attention sub-layer of the decoders gets its input from what the decoder section of the model output in the previous time step. That way, the decoders take into consideration both the word produced at the previous position and the context learned during the encoding process in order to create the output. The decoders all work together and produce an output vector. This output vector is sent through a linear transformation that creates a logits vector, which is as long as the number of words in our vocabulary and holds scores for how likely the next word is to be one word or another. We then pass this through a softmax to get the probability of each word — these probabilities add up to one, so it's basically a normalized version of the logits vector — and the output of the softmax layer tells us what the next word is going to be (see the sketch below).

And that's all there is to know about transformers, really. It's quite simple, even though it looks a bit complicated at first. All you need to know is that there are encoders and decoders, and that the two novel ideas that came into our lives with transformers are the positional encodings and the multi-headed attention layers. To fully understand transformers and how they work, you might need to watch this video multiple times, and maybe support your learning with some of the written resources out there; I've left links to my favorite resources in the description. If there was anything that was not clear, or if you have a question, leave a comment and let me know. If you liked this video, don't forget to give us a thumbs up, and maybe even subscribe to be one of the first to know when we make a new video. But before you leave, don't forget to grab your free token for AssemblyAI's speech-to-text API. I'll see you in the next video.
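To close out the walk-through above, a minimal sketch of that final linear-plus-softmax step (illustrative names and a made-up vocabulary size, not the original code): the decoder output for the current position is projected to a logits vector as long as the vocabulary, then softmaxed into next-word probabilities that sum to one.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_model, vocab_size = 512, 10000
rng = np.random.default_rng(0)

decoder_output = rng.standard_normal(d_model)          # vector for the current position
W_linear = rng.standard_normal((d_model, vocab_size))  # learned projection to vocabulary

logits = decoder_output @ W_linear                     # one score per vocabulary word
probs = softmax(logits)                                # normalized; sums to one
next_word_id = int(np.argmax(probs))                   # the most likely next word
print(probs.sum(), next_word_id)
```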
Info
Channel: AssemblyAI
Views: 133,470
Id: _UVfwBqcnbM
Length: 19min 59sec (1199 seconds)
Published: Sat Nov 27 2021