CS480/680 Lecture 19: Attention and Transformer Networks

Captions
Okay, so for the next set of slides I'm going to talk about attention in more detail. We already discussed attention in the context of machine translation, but attention has led to a new class of neural networks known as transformer networks. The material I'm presenting today is quite recent, so it's not described in any of the textbooks associated with this course. There is a paper published in 2017 called "Attention Is All You Need". It's a very interesting title, and it suggests that perhaps we don't need recurrent neural networks anymore, nor many of the building blocks that people designed for them, such as LSTM units, GRUs and so on. Essentially it suggests that we don't need any of that; all we need is really just attention. So let's see how this works.

It turns out that the concept of attention was first studied in the context of computer vision. Say we have some images and we want to recognize some objects. As humans, if we have a large image and we're looking for a tiny object, what we naturally do with our eyes is roughly scan the image, eventually focus on some regions of interest, and then identify the object; once we identify it, we focus precisely on the location where the object is. Some studies have shown that our eyes do indeed focus on the right regions when we do this type of identification. So an interesting question is: is there a similar mechanism for computer vision, and would it be beneficial? The answer is yes. Let's say we have an image (it doesn't appear very well on the screen; it looks better on my laptop), where you're supposed to see a scene with a house and a tower, so essentially two buildings, with the rest of the scene around them. Suppose we're doing object detection and we want to recognize buildings. What the slide shows is a heat map overlaid on top of the image, and the heat map shows that the pixels or regions where buildings are most likely to be are in the red part; if you look more carefully, there is indeed a house here and a tower there. Now, if you train a network to do classification, an interesting question is: when the network outputs a class, how can we trust that it came up with the right class? We could ask it: if you tell me there is a building in this image, can you show me where that building is? This would be a nice way of validating and understanding what it thinks a building is and whether it's correct. Here we can use attention to see which pixels are aligned with the concept of a building, and the heat map corresponds to the weights of the attention mechanism. You can imagine having an attention mechanism over the entire image: you have weights with respect to all of the pixels, and you try to see which pixels have an embedding, a semantic meaning, that is aligned with the notion of the object we're trying to recognize. Some researchers demonstrated that you can do this.
In fact, attention can be used to highlight the important parts of an image that contribute to the desired output, which is very nice for explaining the decision process, and it can also be used as a building block as part of the recognition process. So that was computer vision. Then in natural language processing, in 2015, we saw the work on machine translation where you can get your decoder to essentially peek back at the input sentence, so that it doesn't lose track of what it's translating and doesn't have to remember the whole sentence. This was an important breakthrough that allowed us to deal with sentences of arbitrary, and in particular very long, length. Going further, in 2017 some researchers showed that we can use attention to develop general language modeling techniques. Language modeling simply refers to developing a model that predicts the next word in a sequence, so it can generate words, and if a word is missing somewhere in a sequence it can also recover that missing word. A language model is essentially just a model that can predict or recover words in a sequence. A lot of tasks can be formulated as language modeling problems. When we do translation, you can imagine it as one sequence with a first part in one language and a second part in another language, and what we're really doing is continuing the sequence by predicting the words of the second part, which happen to be in the other language. Sentiment analysis can be thought of this way too; most tasks in natural language processing can be cast as some form of language modeling. What they showed is that we can design an architecture that uses pretty much exclusively attention blocks, and they called that architecture a transformer. We'll see these transformer networks in more detail now; they've become the state of the art for natural language processing and have surpassed recurrent neural networks.

If we compare a recurrent neural network with a transformer network, it turns out that the recurrent neural network has several challenges. The first problem, which we've discussed a lot, is how to deal with long-range dependencies, and there the solution was in fact to combine the recurrent neural network with an attention mechanism. It also suffers from gradient vanishing and gradient explosion. The number of steps needed to train a recurrent neural network can be quite large, and this has to do with the fact that a recurrent neural network can be thought of as an unrolled network that is very deep, because we have to unroll it for as many steps as the corresponding sequence requires. So recurrent neural networks are effectively arbitrarily deep networks; they have lots of parameters, and those parameters are correlated with each other because we tie the weights from one step to the next, so the optimization tends to be difficult and to require many steps. Training a recurrent neural network usually takes a lot longer than a convolutional or regular feed-forward neural net. Beyond the number of steps, there is also the question of parallelizing computation. Today GPUs have become key for working with large neural networks, and what they really do is enable us to do a lot of computation in parallel.
But if you have a recurrent neural network, the computation for the sequence of steps has to be done sequentially; we can't process all those steps in parallel because of the recurrence. So there's an inherent problem: we can't leverage GPUs as well with recurrent neural networks. Now, if we consider transformer networks, we'll see in a moment that because they use pretty much exclusively attention blocks, there is no recurrence, and attention also lets us draw connections between any parts of the sequence, so long-range dependencies are not a problem. In fact, long-range dependencies have the same likelihood of being taken into account as short-range dependencies, so that is no longer an issue. Gradient vanishing and explosion are also not an issue, because instead of computation that grows linearly with the length of the sequence into a deep network, as in a recurrent neural network, a transformer does the computation for the entire sequence simultaneously and just stacks a few layers on top, and there won't be that many layers in practice. So gradient vanishing and explosion are much less of an issue, it takes markedly fewer steps to train, and because there is no recurrence we can do the computation in parallel for every step. So these transformer networks have lots of great advantages.

To introduce transformer networks, let's briefly review attention, but more generally this time. We've seen it in the context of machine translation, but now let's think of attention as a form of approximation of a select that you would do in a database. In a database, if you want to retrieve some value based on a query and a key, there are operations where you use the query to identify the key that aligns well and then simply output the corresponding value. We can think of attention as mimicking this type of retrieval process, but in a more fuzzy or probabilistic way. Let me draw a picture to illustrate how things would normally work in a database, and then we'll see that in our case we mimic the same type of computation, but using this equation here. Say I've got a query, and I've got a database in which I have stored some keys with associated values. When I issue a query, I check how well the query aligns with the different keys; perhaps the right key is key number three, in which case I produce an output that corresponds to that value. So retrieval in a database more or less corresponds to this. An attention mechanism is essentially a neural architecture that mimics this process, and the way we mimic it is with the following equation: we measure the similarity between our query q and each key k_i, this similarity gives a weight, and then we produce an output that is a weighted combination of all the values in our database.
Normally with a database, when we do retrieval, we simply return one value, and that would correspond here to a similarity of one between the query and a single key, with a similarity of zero for all the other keys. If the similarity function essentially produces a one-hot encoding, then we effectively return just one value. In practice, because we want to embed this inside a neural network and be able to do back propagation, differentiating through that type of operation, it is useful to think of the similarity as instead computing a distribution, with weights between 0 and 1; even if multiple keys have some similarity, the idea is that we produce a value that is a weighted combination based on those weights. So we can think of this as a generalization of the database retrieval mechanism, where the retrieval process becomes a convex or weighted combination of the values whose keys have high similarity with the query.

Let's now draw a neural architecture that corresponds to this attention mechanism. It's essentially the same as what we've seen in machine translation, but now we make it domain agnostic, so you can see more generally what attention corresponds to. Say I've got keys k1, k2, k3 and k4, a query q, and nodes s1, s2, s3 and s4, where the query influences the computation of each of them. What I've drawn as a first layer starts with the keys and computes a similarity measure, so these s's correspond to similarities: s_i is some function f(q, k_i) of the query q and the key k_i. There are many functions we could consider, so let me suggest a few. The first is simply a dot product, s_i = q^T k_i. If we think of the query and the key as embedding vectors and we want to measure their similarity, a simple thing is to compute their dot product. A variant is the scaled dot product, s_i = q^T k_i / sqrt(d), where d is the dimensionality of each key. Something slightly more general is s_i = q^T W k_i, which I'll call a general dot product, and finally we can have s_i = w_q^T q + w_k^T k_i, which is an additive similarity. So this first layer computes the similarity between the query q and each key in our database or memory, and there are many choices for doing this. Common choices in practice are the dot product, or the scaled dot product where we divide by the square root of the dimensionality, which has the benefit of keeping the dot products on a certain scale. More generally, we can project the query into a new space using a weight matrix W and then take a dot product with k_i, and another option is to take a combination of q and k_i, known as an additive similarity. You could also think of other types of similarity; for instance, earlier in the course we talked about kernel methods, which also measure similarity by mapping two vectors into a new space through some nonlinear function, so we could use some form of kernel similarity here as well.
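To make these four options concrete, here is a minimal NumPy sketch (the function names and the random parameters are illustrative additions, not from the lecture; in a real network W, w_q and w_k would be learned by back propagation):

```python
import numpy as np

# Minimal sketch of the four similarity choices above; W, w_q and w_k are
# random here, but would be learned parameters in a real network.

def dot_product(q, k):
    return q @ k                              # s = q^T k

def scaled_dot_product(q, k):
    return q @ k / np.sqrt(k.shape[-1])       # s = q^T k / sqrt(d)

def general_dot_product(q, k, W):
    return q @ W @ k                          # s = q^T W k (project q, then compare)

def additive(q, k, w_q, w_k):
    return w_q @ q + w_k @ k                  # s = w_q^T q + w_k^T k

d = 4
q, k = np.random.randn(d), np.random.randn(d)
W = np.random.randn(d, d)
w_q, w_k = np.random.randn(d), np.random.randn(d)
print(dot_product(q, k), scaled_dot_product(q, k),
      general_dot_product(q, k, W), additive(q, k, w_q, w_k))
```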
[Student question] Okay, so in this case we're not going to have convolutions, but you're suggesting that instead of comparing the query to every key, we could do it with respect to a subset of the keys. As for this W, think of it more as simply transforming our query to be in the same space as the keys. To give a concrete example, say we're doing question answering and we have a database of possible answers; we have a query, and we have an embedding of every answer, which corresponds to a key. The query is a question, so we can embed it as well. The problem is that, depending on the type of embedding, if the embeddings simply capture the semantic meaning of the sentences, there is no reason to expect the question and the answer to have the same meaning; in fact they should have different meanings, because the answer is supposed to provide something the question doesn't have. But what you can do is map them into a new space where the question and the answer become things we can compare directly, and the matrix W serves that purpose. This is just high-level intuition, but the idea is that if you're not confident you can compute the similarity directly, you can let the neural network learn a mapping W; W is a set of weights, a matrix, that maps our query into a new space, and the network learns what that new space should be.

So that's the first step. The second layer computes the weights a1, a2, a3 and a4, and those weights depend on everything. This is done through a softmax: a_i = exp(s_i) / sum_j exp(s_j). It looks like a fully connected network, but not in the classical sense; I'm just showing which hidden nodes are used in computing these weights. It is a softmax, so there are no weights to learn here; all we do is compute that expression. After this we take the weighted combination: we multiply a1 by v1, add the product of a2 and v2, add the product of a3 and v3, and so on, and this produces the attention value, which is just sum_i a_i v_i. So you see this is a general scheme where we have a query and some keys, and we produce an output that is a linear combination of some values, where the weights come from a notion of similarity between our query and the keys.
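Putting the pieces together, a minimal sketch of this soft, retrieval-style attention might look as follows (scaled dot product as the similarity; all names and sizes are assumptions for illustration):

```python
import numpy as np

def attention(q, K, V):
    """Soft retrieval: weight each value v_i by how well the query matches key k_i."""
    s = K @ q / np.sqrt(K.shape[-1])            # similarity scores s_i (scaled dot product)
    a = np.exp(s - s.max())                     # softmax: a_i = exp(s_i) / sum_j exp(s_j)
    a = a / a.sum()
    return a @ V                                # attention value = sum_i a_i v_i

d = 8
K = np.random.randn(4, d)                       # four keys
V = np.random.randn(4, d)                       # four associated values
q = K[2] + 0.1 * np.random.randn(d)             # a query that nearly matches key 3
out = attention(q, K, V)                        # most of the weight falls on v_3
```

If the softmax were replaced by a one-hot weight vector, this would reduce to exact database lookup; the softmax makes it a differentiable, fuzzy version of that lookup.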
Any questions regarding this? [Student question] Right, so here the W matrix should span the space that we care about. We don't specify W; these are variables that get optimized by the neural network itself, and that's the beauty of it. W in general denotes weights, parameters of the neural network, and whatever the task, we do back propagation and these weights get adjusted, so the network learns on its own what a good space to project q into might be, so that we can then take a dot product with k. [Student question] Good question: yes, the a_i are scalars. The k_i are vectors, the s_i are scalars, the a_i are scalars, and the v_i are vectors. These scalars are our weights, and there will be one such weight per word: if we're doing machine translation and we're about to produce an output and we want to compute the attention with respect to the input words, then we'll have essentially one weight per input word, and the v_i will be the hidden vectors associated with each input word. In fact this is what I've got on this slide. As a concrete example, for the machine translation we discussed in a previous set of slides, I simply use as the query s_i, the hidden vector for the i-th output word, and for the keys and the values, in this particular setting, the same thing: both the keys and the values are the h_j, the hidden vectors for the input words. That allows me to compare my hidden vector for an output word to each of the hidden vectors for the input words, and then combine them to produce a context vector that reflects which words I'm interested in decoding, or translating, next. [Student question] Great question; we haven't talked yet about what a transformer is. All I'm doing so far is explaining the attention mechanism in a general form, but I believe it's coming up in a few slides.

All right, so we've discussed all kinds of networks, and more recently the focus has been on sequential data: we've seen hidden Markov models, then recurrent neural networks, and now we're talking about transformers. I want to go back and discuss in more detail the transformer network that was presented in 2017. This network is special because, as we discussed, it gets rid of recurrence, and this was a major thing. Recurrence means that the optimization tends to take longer for two reasons: the number of iterations, the number of gradient descent steps, will be higher, and recurrence also means several operations that have to be sequential, so we cannot parallelize them as easily. The beauty of a GPU is that in principle you can parallelize lots of operations, but if those operations are sequential then you cannot, so we would like to reduce recurrence as much as possible. On this slide we have a picture of the transformer network proposed in 2017, and this network has now displaced recurrent neural networks for sequential data pretty much entirely, so this was a major shift in how people think about sequential data. If we take the example of machine translation, this network, even though it's not obvious, has two parts: a first part that corresponds to an encoder and a second part that corresponds to a decoder. In machine translation you use the encoder to encode the initial sentence and the decoder to produce the translated sentence. The network processes an entire sentence in parallel, as opposed to a recurrent neural network that processes one word at a time. If you look carefully, the inputs here are actually the entire sequence of words; we feed them all in at once, they get embedded, and after that we add a positional encoding. I'll come back to this in a few slides.
It is important essentially to make sure that we can distinguish words that occur at different positions within a sentence. The problem is that without a positional encoding we effectively have a model that treats the input as a bag of words rather than a sequence, and we all know that in language the ordering of the words matters, the ordering carries meaning, so we need to capture that information; the positional encoding achieves that. Now, the important part of the transformer network is this block here, which consists of two sub-parts: a multi-head attention and then a feed-forward neural network. The multi-head attention is really where all the good stuff happens. The idea is that we feed in a vector consisting of sub-vectors for all the words in the sentence, and the multi-head attention computes the attention between every position and every other position. We have vectors that embed the words at each position, and we carry out an attention computation that treats each word as a query, finds keys that correspond to the other words in the sentence (here the values are the same as the keys), and takes a convex combination of the corresponding values to produce a better embedding. So the multi-head attention takes every word and combines it with some of the other words through the attention mechanism, producing a better embedding that merges together information from pairs of words. When we do this in one block, we essentially look at pairs of words; but this block is repeated N times, we have N stacked copies of it, so in the first block we look at pairs, in the second block pairs of pairs, in the third block pairs of pairs of pairs, and we end up combining not just two words but groups of words that get larger and larger. That's what the multi-head attention does. On top of it there is another layer, called "Add & Norm": it adds a residual connection that takes the original input of the multi-head attention, adds it to the output, and then normalizes the result. "Norm" here is a layer normalization, which we'll come back to in a moment; it essentially means we take all the entries and normalize them to have zero mean and unit variance. Then we feed this into a feed-forward network, again with a residual connection and a normalization. The whole block is repeated N times, so that we can combine not just pairs of words but pairs of pairs and so on, until eventually we can combine together all the words in the sentence. The output is again a sequence of embeddings, one embedding per position, and ideally the embedding at a position captures the original word at that position as well as information from the other words it attended to throughout the network. You can think of it as one large embedding of all those words, one per position, and that's our encoding of the input sentence.
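As a rough sketch of the data flow in one encoder block under the description above (residual connection plus layer normalization around each of the two sub-layers; the helper names and toy inputs are mine, not a particular library's API):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_block(x, self_attention, feed_forward):
    """One encoder block: Add & Norm around the attention, then around the FFN."""
    x = layer_norm(x + self_attention(x))        # residual connection + layer normalization
    x = layer_norm(x + feed_forward(x))          # same pattern around the feed-forward net
    return x

# Toy usage with placeholder sub-layers, just to show the shapes involved.
n, d = 5, 16                                      # 5 positions, 16-dimensional embeddings
x = np.random.randn(n, d)
W = np.random.randn(d, d) / np.sqrt(d)
out = encoder_block(x, self_attention=lambda h: h, feed_forward=lambda h: h @ W)
```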
After this we have the decoder, which does something similar, but obviously the main purpose of the decoder is to produce some output, not just an embedding. That's why there is additional machinery on top, with a softmax that produces probabilities for the output label at each position. Inside the decoder block, which is also repeated N times, we first have a multi-head attention that combines output words with the previous output words, then another multi-head attention that combines output words with the input words, and finally a feed-forward network again. So we have two layers of attention. The first layer is really just self-attention between the output words, and the problem with output words is that when you generate a sequence as output, you can only generate the next word based on the previous words, so when you do the attention you need to make sure you only attend to previous words; that's why this one is called a masked multi-head attention, because we mask the future words so that each word only attends to the previous words. The second multi-head attention makes each position in the output attend to positions in the input. This is a bit like machine translation, where when you want to produce an output it helps to peek back at the input sentence; here we look at the embeddings of each position in the input, which is why you see these arrows coming in from the encoder. This is repeated N times as well, so that we gradually build up combinations and get better and better embeddings, until we produce an output, and the output can be a distribution over the words in the dictionary: for every position there is a word we're trying to generate, and we compute a distribution over the words in the dictionary. Any questions regarding this slide? Okay, good.

In the transformer network, perhaps the most important part is the multi-head attention, so let me draw on the board what it corresponds to; mathematically, the multi-head attention is an expression that decomposes according to the following operations. As we talked about last class, whenever we want to design an attention mechanism, the general way to think about it is that we have key-value pairs, just like in a database, and a query that we compare to each key; the keys with the greatest similarity get the highest weights, and we take a weighted combination of the corresponding values to produce the output. So we feed V, K and Q each into a linear layer, then we compute a scaled dot-product attention, then we concatenate these outputs, then we apply another linear layer, and the output of that is the multi-head attention.
Now, this is called multi-head attention because in reality, and I haven't drawn this yet on the board, we compute multiple attentions. When we take a linear combination, we can think of it as a projection of the values V, and the same for the keys K and the query Q, and we can consider several such projections. Say I compute three different projections by taking three different linear combinations of the values, and the same for K and Q, so that I get three different projections; for each of them I can compute a scaled dot-product attention, so I get three scaled dot-product attentions. The way to think about these different linear combinations and the resulting scaled dot-product attentions is a bit like feature maps in a convolutional neural network: there you can compute multiple feature maps simply by having different filters, and these linear combinations are a bit like different filters, although here you can think of them more as projecting the values into different spaces. So the projections give us different spaces, a bit like multiple filters in a convolutional neural net, and for each projection the scaled dot-product attention is different, so we get multiple of them, which corresponds more or less to having multiple feature maps; it's the same intuition. The concat layer then concatenates these different scaled dot-product attentions, and finally we take a linear combination of them, which gives us the multi-head attention, because we have essentially computed multiple attentions. Here there are three of them; in general there are h of them, where h is the number of heads of the multi-head attention, and the idea is that there is one head per linear combination. That's where the name multi-head comes from. Any questions regarding this? Good.
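A minimal sketch of this multi-head computation (per-head projection matrices, a scaled dot-product attention per head, concatenation, then a final linear layer; the shapes, names and random weights are assumptions, since in practice all the W matrices are learned):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d_k) matrices, one row per position (self-attention).
    scores = Q @ K.T / np.sqrt(K.shape[-1])                     # (n, n) similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                       # row-wise softmax
    return w @ V                                                # weighted values

def multi_head_attention(X, heads, W_o):
    # heads: list of (W_q, W_k, W_v) projection triples, one triple per head.
    out = [scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
           for (W_q, W_k, W_v) in heads]
    return np.concatenate(out, axis=-1) @ W_o                   # concat heads, then linear

n, d, h = 5, 16, 4                                              # positions, model size, heads
d_k = d // h
X = np.random.randn(n, d)
heads = [tuple(np.random.randn(d, d_k) / np.sqrt(d) for _ in range(3)) for _ in range(h)]
W_o = np.random.randn(h * d_k, d) / np.sqrt(h * d_k)
out = multi_head_attention(X, heads, W_o)                       # (n, d) refined embeddings
```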
All right, so besides the regular multi-head attention, we also have in the decoder a masked multi-head attention. The idea is that some of the values should be masked, meaning their probabilities should be nullified so that we don't create certain combinations. For instance, in the decoder, say we're doing machine translation from English to French: we start producing the words in French, and when we produce a word it's fine for that word to depend on the previous words of the translation, because we generate them sequentially, but it doesn't make sense for it to depend on future words, because we haven't produced them yet. So we need to change the attention mechanism so that we nullify, or effectively remove, the links that would create dependencies on words we haven't generated yet, and this is what we call masked multi-head attention. The main difference is that in the normal attention mechanism we just compute a softmax according to this expression, but with masked attention we add a mask M inside the softmax that produces zero probabilities for the terms we don't want to attend to, because they are future terms. In a softmax we normally take the exponential divided by the sum of exponentials; if we add a mask, which is a matrix of zeros and minus infinities, then wherever we take the exponential of minus infinity we get zero, so the probabilities of those items are zero, which has the same effect as removing connections. At some level you can think of this as a form of dropout, but it's not dropout in the sense we saw at the beginning of the course: dropout for regularization removes connections at random according to some distribution, whereas here we remove connections that point at words we haven't produced yet. So it's more like a deterministic kind of dropout, if you wish, because we would never have those connections: instead of sampling connections from a distribution, we use a mask, and inside the softmax the mask nullifies some of the connections because the exponential of minus infinity is zero. Any questions regarding this? Good question: yes, in the paper they add a mask with values that are minus infinity. Perhaps the more intuitive approach would be to multiply the softmax output by a Hadamard product with values that are 0 and 1, but if we do that outside the softmax, the softmax first produces a distribution that adds up to 1, then we nullify some of those probabilities, and the probabilities that are left no longer add up to 1. If instead we do it inside, by adding a matrix with minus-infinity entries, then when we take the softmax those entries get zero probability while all the other values get probabilities that still sum to one, so we keep a proper distribution. And if we go back to the slide: when we produce an output, say my first word, that word gets fed as input for the next position; when I want to produce a word at a certain position it's fine to look at the previous words, and this is where the masked multi-head attention applies. The mask is a matrix that is essentially lower triangular, with zeros in the lower triangular part and minus infinity in the upper triangular part, to nullify everything that lies in the future.
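Here is a small sketch of that masked attention, with a lower-triangular mask of zeros and minus infinity above the diagonal added to the scores before the softmax (the names are mine):

```python
import numpy as np

def causal_mask(n):
    # Zeros on and below the diagonal (allowed), minus infinity above it (future words).
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_self_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)      # each row is still a proper distribution
    return w @ V                               # position i only mixes values v_1 .. v_i
```

Because the mask is added inside the softmax, each row of weights still sums to one, which is exactly the point made above about keeping a proper distribution.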
The other thing that might not be obvious is that the output appears to get fed back in as input here, which looks like it creates a recurrence. It does not create a recurrence per se, because there is a training method known as teacher forcing: when you train the network you have both the input sentence and the output sentence, so you can assume that the outputs are correct everywhere, feed the correct output words for the previous positions as input, and simply try to predict the next word based on that. With this scheme, known as teacher forcing, you can decouple the output here from the input here, and there is no recurrence relation during training. At test time you really do have to execute the network with the recurrence, but that's fine; training is what takes a long time, and if we can remove all recurrence relations so that all the computation can be done in parallel, it will be a lot faster. So through this teacher forcing trick we simply assume that we have the correct output words for the previous positions, feed them in as if they were given to us, and try to predict the next output. Any questions regarding this? Okay, let's continue.

The other important layers are the normalization layer and the positional embedding. The normalization layer is actually quite important and quite interesting; it's the layer we saw on top of every multi-head attention and feed-forward network. What it does is help reduce the number of steps needed by gradient descent to optimize the network. Whenever we have a network with multiple layers, we have weights in each layer, and those weights are trained by gradient descent; but if you look at the formula for the gradient, it's often the case that the gradient for one set of weights depends on the output of the layers below and also on what is computed in the layers above. The problem is that if we are still adjusting the weights below and above, then when we compute the gradient, things are not stable. At some level we'd rather wait until all the other layers have stabilized and then optimize the layer in the middle properly, but we can't, because we have to optimize all of the layers simultaneously; we change some weights, that affects the other layers, then we change those weights, which affects the layer we just changed, and so on. This makes the convergence quite slow because of all these interdependencies. There is no way to completely get rid of the interdependencies, because that would mean breaking the network into parts that are no longer connected, but one thing we can do is normalize. Normalizing ensures that the outputs of a layer, regardless of how we set the weights, are normalized: they have a mean of 0 and a variance of 1, so the scale of these outputs stays the same. To obtain that, for each hidden unit we subtract the mean, which is just the empirical average, and divide by the standard deviation, the square root of the empirical variance. There is also a variable g, known as the gain, that is added to compensate for the fact that we've just normalized. The idea is that if g is set to 1, h is always normalized with zero mean and variance 1, so if some gradient computation depends on the output of that layer, the outputs of that layer are always on the same scale; they will vary, but they remain on the same scale, and as a result the other gradients, when we compute them, don't have to adjust simply because we changed the scale of those outputs. That reduces the dependencies between the layers and tends to make the convergence faster. Any questions regarding normalization?
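Written out as code, the normalization described here might look like this (it repeats the earlier helper but includes the gain g; the lecture mentions only the gain, though a learned bias is also common in practice):

```python
import numpy as np

def layer_norm(h, g, eps=1e-6):
    # h: activations of the hidden units in one layer; g: the gain.
    mu = h.mean()                  # empirical average
    sigma = h.std()                # square root of the empirical variance
    return g * (h - mu) / (sigma + eps)

h = np.random.randn(16) * 5.0 + 3.0                      # arbitrary scale and offset
g = np.ones_like(h)                                      # with g = 1 ...
print(layer_norm(h, g).mean(), layer_norm(h, g).std())   # ... roughly 0 and 1
```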
Perhaps one thing I should say as well: some of you might have heard about batch normalization. This is closely related to batch normalization, but the main difference is that we do the normalization at the level of a layer, whereas batch normalization does it for one hidden unit, by normalizing across a batch of inputs. The advantage of layer normalization is that we don't need to worry about how large the batch is; batch normalization only works well with fairly large batches, whereas here we can feed in one data point at a time, we can have very small mini-batches, in fact we can be in an online or streaming setting where we feed in one data point at a time, and we can still do the normalization, with the same effect as batch normalization in terms of decoupling how the gradients evolve in different layers.

The other part that is important is the positional embedding. If I go back, we introduced a positional embedding right after the input embedding. The idea is that the attention mechanism doesn't care about the positions of the words; the words could be all shuffled, we could treat them as a bag of words, and if it weren't for the positional embedding we would get the same answer. At some level that's not good, because in sentences the ordering of the words is important to the meaning; the ordering carries meaning, so we need to capture some of that information. This is really an engineering hack; it's not clear that it's the best way to capture the ordering, but the idea is the following. We already have an embedding that is supposed to capture information about each word; let's make that embedding capture information about the word and also its position. We simply add a vector, known as the positional encoding, and that vector is different depending on the position; it's a vector that embeds the position, which is an integer, and we add it to the embedding of the word. The precise formula for the positional embedding is given here: the position is an integer, and we embed it into a vector with multiple entries, where each entry is either the sine of the position divided by 10000^(2i/d) or the cosine of the position divided by 10000^(2i/d). To illustrate, let me draw this on the board: we have the position, which is a scalar, and from it we compute a positional embedding, which is a vector. We already have an embedding for the word, which is a vector, and now we want to encode the position as well. There are obviously multiple ways to do this; the simplest might be to take an integer for the position and append or concatenate it to the embedding of the word, but in this work the authors chose instead to add a vector of the same dimensionality as the embedding of the word, and these vectors are often hundreds of entries long. So how do we go from a scalar to a vector? This is where the formula gives us a way to obtain a different value for each entry of the vector: for the even entries we compute the sine expression, and for the odd entries we compute the cosine expression.
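A small sketch of this positional encoding, following the sine/cosine formula above (the array layout and the assumption that d is even are mine):

```python
import numpy as np

def positional_encoding(pos, d):
    """Embed an integer position into a d-dimensional vector (d assumed even)."""
    pe = np.zeros(d)
    i = np.arange(d // 2)
    angle = pos / (10000.0 ** (2 * i / d))
    pe[0::2] = np.sin(angle)       # even entries: sine
    pe[1::2] = np.cos(angle)       # odd entries: cosine
    return pe

# The positional encoding is simply added to the word embedding at that position.
d = 8
word_embedding = np.random.randn(d)
x = word_embedding + positional_encoding(pos=3, d=d)
```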
Really, this is something that is debatable; we could consider different ways of coming up with a positional embedding, but the key is that it carries information about the position, which allows us to distinguish each word so that the sentence still retains ordering information. [Student comment] Yes, very good point: just adding the positional embedding to the word embedding could affect the information contained in the word embedding, and my gut feeling as well is that it might be better to concatenate so that we don't lose information; but in any case that's what the authors chose to do, and it seems to work relatively fine.

All right, so if we compare a transformer network with a recurrent or convolutional neural network, we get the following complexity estimates; here the transformer is the one called self-attention. In a self-attention network, a layer consists of n positions, for a sentence of size n, and the embedding computed for each position has dimensionality d, so the complexity of the computation in one layer is of order n²·d: every position attends to every other position, which is n² pairs, and for each pair we work with an embedding of dimensionality d. The benefit is that if we want to capture long-range dependencies, the maximal path length between any two positions is just one, because the attention mechanism combines every pair of words in a single operation, so information can flow between any pair of words in one step. We don't have to worry about information being lost, as in a recurrent neural network, where the first word gets embedded, but as we process additional words that embedding changes and eventually loses information from the first word. So a path length of one is great. The other important aspect is that there are no sequential operations: we have a sentence of size n, but we process all the words, and all the pairs of words, simultaneously. This is where the n² factor creeps in, but on the other hand all of it can be parallelized, and today with GPUs we want to exploit parallelization, so it's better not to process the words sequentially and instead do everything in parallel; even though we have a factor of n², in practice it might not be so bad, simply because we do a lot of those operations in parallel. In contrast, a recurrent neural network has the following complexity. The way to think about layers in a recurrent neural net is that you can have stacks of recurrent neural networks; we haven't talked about this, but it's something people commonly do in practice, and it makes the network even more complex and heavier to train. In one layer, one stack, the computation is n·d² for a sequence of length n, where the d² comes from the fact that we have an embedding of size d, and whether you use a GRU, an LSTM, or even just a linear unit that produces the next embedding of size d, you typically have a d-by-d matrix of weights multiplying the hidden vector to get the next hidden vector; that's how the d² shows up. There are n sequential operations, because we have to go through the entire sequence, and the path length can be up to n, because combining information from the first and the last word requires traversing a long path.
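As a rough back-of-the-envelope illustration of these orders of growth, with an assumed sentence length and embedding size (the numbers are purely illustrative):

```python
n, d = 70, 512                          # assumed: 70-token sentence, 512-dim embeddings

self_attention_per_layer = n * n * d    # every pair of positions, d-dimensional work
recurrent_per_layer = n * d * d         # n sequential steps, each a d-by-d matrix multiply

print(self_attention_per_layer)         # about 2.5 million
print(recurrent_per_layer)              # about 18 million
# Self-attention also needs O(1) sequential steps and has O(1) path length between
# any two positions, versus O(n) for both in a recurrent layer, and its n^2 term
# parallelizes well on a GPU.
```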
So in general this is quite advantageous; it helps reduce computation and improves scalability quite a bit. Any questions regarding this? Okay. Now, in the 2017 transformer paper there was a comparison for machine translation: they did translation between English and German as well as English and French, and they compared a bunch of models, with their own models at the bottom, a base transformer and a bigger transformer. If you look at the results, they are not really outstanding; at least for English-German they improve the accuracy a little bit. BLEU, if you recall, is a measure of precision where, roughly speaking, you look at the percentage of words in the output translation that appear in some human translation, so the higher the score the better. They outperform a little bit here and come close to the state of the art there. But what's beautiful is that they reduce the computation significantly. The numbers look horrible, because seeing 10^18 is scary, and 10^19 is worse, then 10^20 and 10^21; but the difference between 10^18 and 10^19 is a factor of 10, so something that would take 10 days might take one day; with respect to 10^20 that's a reduction by a factor of a hundred, and for 10^21 a reduction by a factor of a thousand. So this is a major reduction in training time. I don't recall from the paper whether that takes parallelism into account, but in any case it gives you a sense that a big advantage is the reduced computation while still achieving essentially the state of the art. Any questions regarding this? Yes, that's a good question: the training cost is different for the different language pairs. It might have to do with how much data they use for training, but then presumably that would have an effect on the other column too; I'm not sure, we'd have to look it up in the paper.

So the transformer was essentially the starting point of a new class of neural networks that do not rely on recurrence. Another important type of transformer is known as GPT, with an improved version known as GPT-2; these were proposed in 2018 and 2019. The idea when they were proposed was to do unsupervised language modeling.
Language modeling is a general task where you have a sequence of tokens, a sequence of words, and you simply predict what the next word is, and it turns out that a lot of tasks in natural language processing can be formulated as some form of language modeling. If you take machine translation and concatenate the input sentence and the output sentence into one long sequence, and you have a language model that simply predicts the next word in the sequence, and furthermore that model doesn't care whether it's English or French or any other language, it just predicts the next word, then you can train it to do translation: you feed it the input and it predicts the next words, which happen to be in the other language. A lot of tasks can be formulated this way: you create a sequence, and since the model predicts the next thing, if the next thing is what you care about, maybe a classification, maybe another sequence of words, then you can do all of those tasks with a language model. They did something interesting here: they trained a decoder-only transformer. Because they only predict the next word given the previous words, there is no need to separate an input sequence from an output sequence, so they did not really need the encoder part; they got rid of the encoder in the transformer architecture and worked only with the decoder. The decoder attends to the previous outputs, which can be considered the input, and it never attends to future outputs, so it can generate sequentially. That's the main change compared to the transformer network: they essentially just removed the encoder. The other thing they did is what they call zero-shot learning: they took a very large corpus, trained on it to predict the next word in the sequence irrespective of the task, and then applied the model to different tasks for which the network was not tailored or fine-tuned; it was just trained, generally speaking, to predict the next word in the sequence. They did this for tasks corresponding to reading comprehension, translation, summarization and question answering, and we can see their performance in blue: the performance improves as the number of parameters of the language model increases, and they compare against state-of-the-art techniques. Their approach is general, not trained specifically for a particular task, whereas techniques like PGNet, DrQA, DrQA+PGNet and so on are trained specifically for their task. What's beautiful is that in a completely unsupervised fashion, without being tailored to the task, they manage to come close to the state of the art, and the approach is fairly general, usable for many tasks. If you look at the results, it doesn't beat the state of the art on most of those tasks, but it does beat at least some techniques that were tailored to them, and the trend of those curves suggests that further scaling would lead to further improvements. Any questions regarding GPT?

Okay, let's continue. GPT was not the last one; there is another model called BERT that has become quite popular, and it was proposed this year. BERT stands for Bidirectional Encoder Representations from Transformers, so it's another variant of the transformer network.
The main advance proposed here is that instead of just predicting the next word in the sequence, why not predict a word based on both the previous words and the future words? There are lots of tasks, including machine translation, where, if you think about it, given a sentence as input there is no reason you have to produce the output sentence strictly sequentially, one word at a time; you could work on your translation by coming up with some sections of it and gradually building up the full translation, rather than doing it perfectly sequentially. A lot of tasks are like that, which means you can take advantage of what comes before and what comes after. It's a bit like bidirectional recurrent neural networks, which improve on unidirectional recurrent neural networks; here it's a bidirectional transformer, and naturally it does better than GPT. They tried it on a bunch of tasks, eleven tasks in fact, and what they did is unsupervised pre-training, just like GPT, but then, to really compete with the state of the art, they did some further fine-tuning with data specific to each task. So the proposal is: first train a general network, unsupervised, with lots of data, and then fine-tune the parameters by doing some further training with data specific to the task. When they did that, they obtained these results, and this is quite impressive, because they improved the state of the art on eleven tasks; if you look at some of those tasks, for instance this one, they improved the state of the art from 45 to 60, which is a major improvement. Any questions regarding BERT?

All right, BERT is again not the last network. There is another network that was just made public about a month ago called XLNet, and XLNet beats BERT as well. I don't have a slide for it, but roughly speaking the main difference is that BERT essentially assumes we have everything in the window before and after, whereas XLNet allows missing inputs in a sense, looking at different subsets of words before and after, and as a result it tends to generalize better and improves again on a lot of tasks. I don't remember the exact number of tasks, but in general it beats BERT across the board on most tasks. So this has been a fruitful direction, and it has become quite clear that these transformer networks can perform very well, both in terms of accuracy and in terms of speed, and it becomes questionable what the future of recurrent neural networks will be. Any questions regarding this? Okay, so this concludes this set of slides.
Info
Channel: Pascal Poupart
Views: 199,673
Id: OyFJWRnt_AY
Length: 82min 38sec (4958 seconds)
Published: Tue Jul 16 2019