Transformer Neural Network: Visually Explained

Captions
The year was 2017 when the research paper introducing Transformers was published. Looking back, this was probably the moment that changed the field of AI forever, as it led to the development of generative pre-trained Transformers, or GPT. Yes, ChatGPT is an application that has GPT-3 at its core. All of this might look intimidating, but the concepts behind Transformers are very simple. This video has two main goals: explain Transformers and their various components, and then build and train a simple Transformer-based classifier from scratch. So, without further ado, let's get started.

At its core, a Transformer gets its superpowers from a mechanism called self-attention. In simple terms, self-attention is a sequence-to-sequence operation: it takes in a sequence, which can be a sentence or just a series of numbers, and returns a different sequence. How exactly does that happen? Let's look at the details.

Everything starts with the dataset. For this illustration I'm using the simple IMDb dataset for classifying a review as positive or negative. There are two main parts to this dataset: the actual review, which is in raw text form, and a corresponding label saying whether it's a good or a bad review. The label 1 in this case means the review is good. This is one example, but for this problem we have a total of 20,000 such reviews, some good and some bad.

We know that computers can only make sense of numbers, so we have to find a way to turn the text into numbers. A very common approach is to collect all the individual words in the dataset and store them to create a vocabulary; we can then assign each word a unique number. Using this vocabulary, we can map any review to a sequence of numbers, so we end up with a bunch of number sequences, each representing a sentence. But there's a problem: they are all of different lengths, and to feed them into a Transformer and exploit parallelism, we have to make sure they are the same length. This can be done by simply padding each sequence with zeros at the end.

To understand how self-attention works, let's look at just one of these sequences and only consider the first five words. Representing each word with a single number is okay, but it doesn't convey much information. It would be a lot better if we could write each word as a vector of numbers rather than a single number; I'm picking seven as the vector length, which is an arbitrary choice, and you can pick something else as well. This can be done using the Embedding class from PyTorch itself. There is a lot I could say about this Embedding class, but I won't go down that rabbit hole; for the purpose of this video, just think of it as another unique mapping, sized by the vocabulary, from an integer to a floating-point vector of the desired length. So, to recap: we have now transformed and stored the review as a tensor of shape 1 × 5 × 7, where 1 is because I'm only looking at a single review, 5 is the number of words in the review, and 7 is the embedding dimension. By the way, you can find the code for doing all of this in the video description.
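Here is a minimal sketch of that preprocessing step; the toy vocabulary, the example review, and the sizes are illustrative assumptions rather than the exact code from the video:

```python
import torch
import torch.nn as nn

# Toy vocabulary built from the dataset; index 0 is reserved for padding.
vocab = {"<pad>": 0, "this": 1, "movie": 2, "was": 3, "really": 4, "good": 5}

review = ["this", "movie", "was", "really", "good"]
ids = torch.tensor([[vocab[w] for w in review]])   # shape (1, 5): one review, five words

# Map each integer to a learned floating-point vector of length 7.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=7, padding_idx=0)
x = embedding(ids)                                 # shape (1, 5, 7)
print(x.shape)                                     # torch.Size([1, 5, 7])
```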
If you remember, self-attention is just a sequence-to-sequence operation; in this case, the elements of the sequence are the individual word vectors. What self-attention does is take a weighted average of these word vectors, and this can be seen as pre-multiplying the input matrix by a weight matrix. The output matrix is then the same size as the input matrix. Now we have a way to get the output, but what about this weight matrix? The weight matrix is calculated by simply multiplying the input with its transpose. In this case the input matrix is seven elements wide, but it can be wider, so it's better to scale the result by dividing it by the square root of the embedding dimension. Finally, since it is a weighting matrix, we would like each row to sum to one, and this can be done by applying softmax to each row of the matrix. Writing this as a Python function is pretty straightforward: for a given input tensor x, we can use PyTorch to define the matrix multiplications and return the output (a sketch appears at the end of this section).

Now there's a problem here. If I change the order of the words in a sentence, it can alter the meaning of the sentence; however, the output from self-attention does not change, it is only rearranged in the same order as the inputs. This is problematic, because the self-attention operation does not seem to care about the order of the words in a sentence. It can be solved by incorporating position encoding. Position encoding is done by simply adding another matrix which contains information about the positions of the words in the sentence. This position matrix can be generated using a PyTorch embedding layer as well: all we have to do is define the embedding dimension, which is seven in this case, and the maximum number of words in a sentence that we anticipate, and then we can generate the matrix and reshape it to match the shape of our input matrix.

As you can see, the self-attention operation isn't that complicated. However, you might be wondering: where is the machine learning? In theory, self-attention has nothing to do with machine learning, but we can modify the framework to incorporate it. If you look carefully, there are three places where the input matrix X is used in self-attention. Instead of using X as it is, if we send it through three separate neural nets and then apply self-attention to the outputs, we can incorporate learning from data into self-attention. By the way, these three output matrices are called query, key and value. Transformers carry some historical baggage that has led to these names; in any case, the names are irrelevant as far as the mathematics or the technique is concerned. So finally, with this, we have a self-attention mechanism that can actually learn from data.

Now I would like to take a moment and discuss why we are actually doing this. What exactly do we want to achieve by first passing, say, a sentence through three different neural nets and then performing these operations on the outputs? The input is represented by a matrix where each row is an individual word in vector form with a fixed length, and the output of each neural net has the same dimensions, so we can think of the outputs as transformed versions of the input, where the transformation is dictated by the dataset. The weight matrix is an interesting one: with dimensions of number-of-words by number-of-words, it essentially captures the relationships between the different words in a sentence. Here is an attention weight matrix from a Transformer trained on the IMDb dataset. We can see it capturing things like the word "screenplay" being related to "movie", and the word "enchanting" referring to the screenplay. However, it also misses a lot: there is no understanding of what is good in this sentence, or in fact of what is being said about the movie. Nevertheless, this weight matrix is then used to get the output, which has the same dimensions as the input. This output can be thought of as the input, modified by the attention mechanism to capture the important information.
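Here is a minimal sketch of the plain self-attention function and the positional-embedding trick described above; the maximum sentence length and the random stand-in token embeddings are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_attention(x):
    """x has shape (batch, t, k): a batch of sequences of t word vectors of length k."""
    k = x.size(-1)
    # Raw weights: dot product of every word with every other word, scaled by sqrt(k).
    w = torch.bmm(x, x.transpose(1, 2)) / (k ** 0.5)   # shape (batch, t, t)
    w = F.softmax(w, dim=2)                            # each row now sums to one
    return torch.bmm(w, x)                             # weighted averages, same shape as the input

# Positional information: a second (learned) embedding indexed by position in the sentence.
emb_dim, max_len = 7, 128                  # assumed values; max_len is the longest sentence we expect
pos_emb = nn.Embedding(max_len, emb_dim)

x = torch.randn(1, 5, emb_dim)             # stand-in for the token embeddings of one 5-word review
positions = torch.arange(x.size(1))        # tensor([0, 1, 2, 3, 4])
x = x + pos_emb(positions)[None, :, :]     # broadcast the position vectors over the batch dimension

out = self_attention(x)                    # shape (1, 5, 7)
```

Sending x through three separate linear layers first (to produce query, key and value) is what turns this fixed operation into a learnable one; those projections appear in the multi-head sketch further down.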
You might have noticed how quickly I glossed over that shortcoming of self-attention. As it turns out, we can do something to mitigate it using multi-head attention. We know that in single-head attention, an input matrix is sent through three neural nets to get query, key and value, and self-attention is then applied to these three matrices to get the output. In multi-head attention, we get query, key and value from the input just like in the single-head version; however, these large matrices are now split into several smaller matrices, one per head, and that many self-attention operations are performed in parallel, giving as many different output matrices. This is where the term "multi-head" comes from: we are performing several self-attentions in parallel. Afterwards, these output matrices are concatenated to form a larger output matrix, which is finally sent through one more matrix multiplication to unify all the heads. One thing to notice is that in this case we get one attention matrix per head; I trained a Transformer with four heads, and these are the four attention matrices from it. You can now see connections being made between a lot of different words.

As Feynman once said, "What I cannot create, I do not understand," so let's code self-attention. The input is a 3D tensor, with b representing the batch size (the number of reviews in this case), t the number of words in a review, and k the length of the vector representation of each word. Another thing we have to ensure is that the number of heads divides the embedding length evenly. Now we can define the three neural nets and pass the input tensor through them to get query, key and value. Next, we perform a slicing operation to split these large tensors into multiple heads. Remember that self-attention on the different heads can be performed in parallel; to facilitate this, I fold the dimension representing the heads into the batch dimension. Then comes the parallel self-attention, which gives the different output tensors, followed by the concatenation of the heads and, finally, the unifying layer, which gives the final output (a sketch of this module is shown below).

Now we can use this code to build a full-fledged Transformer and train it on the IMDb dataset. There are five steps: everything starts with the token embedding, where each word is represented as a vector; we then add position information to these vectors using a positional embedding; this is followed by the self-attention and then a fully connected network, which leads to the output. I trained such a Transformer on the IMDb dataset for 20 epochs, with the embedding dimension set to 32 and four heads for the self-attention. Here is a plot showing the accuracy on the test dataset after each update; you can see the accuracy increasing as the network trains. I'm providing the complete code for the Transformer neural network and would encourage you to try it out yourself, and please do let me know what you think about it down in the comments. That's it for today, and I'll see you next time.
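Below is a minimal multi-head self-attention sketch along the lines described in the captions (projecting to query, key and value, splitting them into heads, and folding the heads into the batch dimension); the layer names and example sizes are assumptions, not the video's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, k, heads=4):
        super().__init__()
        assert k % heads == 0, "the number of heads must divide the embedding length evenly"
        self.heads = heads
        # The three neural nets that produce query, key and value from the input.
        self.to_queries = nn.Linear(k, k, bias=False)
        self.to_keys = nn.Linear(k, k, bias=False)
        self.to_values = nn.Linear(k, k, bias=False)
        # The final layer that unifies the concatenated heads.
        self.unify = nn.Linear(k, k)

    def forward(self, x):
        b, t, k = x.size()
        h = self.heads
        s = k // h                                         # width of each head

        # Project, then split the last dimension into h heads of width s.
        q = self.to_queries(x).view(b, t, h, s)
        kk = self.to_keys(x).view(b, t, h, s)
        v = self.to_values(x).view(b, t, h, s)

        # Fold the head dimension into the batch dimension so all heads run in parallel.
        q = q.transpose(1, 2).reshape(b * h, t, s)
        kk = kk.transpose(1, 2).reshape(b * h, t, s)
        v = v.transpose(1, 2).reshape(b * h, t, s)

        w = torch.bmm(q, kk.transpose(1, 2)) / (s ** 0.5)  # (b*h, t, t) attention weights per head
        w = F.softmax(w, dim=2)

        out = torch.bmm(w, v)                              # (b*h, t, s)
        # Undo the fold and concatenate the heads back into vectors of length k.
        out = out.reshape(b, h, t, s).transpose(1, 2).reshape(b, t, k)
        return self.unify(out)

x = torch.randn(8, 5, 32)                    # 8 reviews, 5 words each, embedding dimension 32
attn = MultiHeadSelfAttention(k=32, heads=4)
print(attn(x).shape)                         # torch.Size([8, 5, 32])
```

The full classifier described in the captions then stacks a token embedding, a positional embedding, this attention block, and a small fully connected network that produces the positive/negative prediction.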
Info
Channel: Algorium
Views: 6,315
Id: 96KqiPQlP4s
Length: 10min 50sec (650 seconds)
Published: Sun Feb 25 2024