Attention is all you need; Attentional Neural Network Models | Łukasz Kaiser | Masterclass

Video Statistics and Information

Captions
OK, so you've heard a great talk just before about how to use neural networks to translate, and this has been quite a big thing in all of natural language processing. People have been doing a lot of work in natural language processing — parsing, translation, finding entities — anything towards making machines understand text a little bit, and then came these neural networks. As you've seen in the previous talk, the main thing is that they generate the next word: they look at the past, maybe with some recurrent neural network, maybe some other mechanism, and the thing they're trained on is to generate the next word of the output. For the last year and a half or two years — which is like the distant past, prehistory, in deep learning — this has been done with RNNs: you have this LSTM cell and it generates the next word.

When I started looking at it, I didn't believe that you could train something that translates from English to German from raw data with such a generic model. You've seen the previous talk, all the kinds of mistakes it makes, the phrase-based and non-phrase-based systems — but to me, and I'm not sure if you got this vibe, this is something incredibly amazing, because it's a generic system, it's four lines of equations, and you give it data and it learns to translate; it learns to speak fluent French, and actually in context, from English. But this has been done, so what's missing?

Well, the translations you saw have some problems, some mistakes, and we were thinking for a long time: what is the gist of this problem, what is not so great about it? It turns out these neural networks translate sentence by sentence, on the level of tokens or words, so a sentence has maybe 40 words, a long one maybe 60. But when we speak, we think about context that goes way beyond that. When I go to see an old friend, I will start talking to him about things we talked about maybe 20 years ago, and I immediately recall what's needed there. This thing, which people call different names — long-range dependencies, or long-hop gradients — seems to be one really important thing to tackle to make these networks understand more. So if we train a network with something that can really look back many thousands of steps and remember it well, we could train a translation model — but translation is usually sentence to sentence. I'll tell you the details of the model in a moment, but imagine we just trained it to generate text, to write something, and since we want it to be longer-range, let's say we train it on Wikipedia articles.

So that's called a language model: something that just generates the next word. Since it's a model, it's probabilistic — it's a probability distribution over articles.
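To make "a probability distribution over the next word" concrete, here is a toy sketch of what sampling from a language model means. The tiny vocabulary and the stand-in distribution are made up for the example; the real model conditions on everything generated so far.

```python
import numpy as np

# a language model assigns a probability to the next word given everything so far;
# generating text is just repeated sampling from that distribution
vocab = ["the", "transformer", "are", "a", "japanese", "band", "."]
rng = np.random.default_rng(0)

def next_word_distribution(prefix):
    # stand-in for the real model: it ignores the prefix and returns random
    # probabilities; the trained network would condition on the whole prefix
    logits = rng.normal(size=len(vocab))
    return np.exp(logits) / np.exp(logits).sum()

prefix = ["the", "transformer"]          # conditioned only on the title, as in the talk
for _ in range(10):
    p = next_word_distribution(prefix)
    prefix.append(vocab[rng.choice(len(vocab), p=p)])
print(" ".join(prefix))
```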
So we trained one, and this is what it generates. The only thing it is conditioned on is the title, which is "The Transformer", because that's the name of our model. Let's see: "The Transformer are a Japanese hardcore punk band" — it's trained on markup, so it generates markup too — and then it goes on: the band was formed in '68, during the height of Japanese music history; among the legendary Japanese composers they prominently exemplified especially tasty lyrics and psychedelic intention. I don't know what the creative people here think of this, but it starts as a Japanese band and then it goes on: it generates a history, born from the heavy-metal revolution, it kind of goes off in places, they generally played around Japan, growing from the top-forty standard, and then in '81 the bassist joined and the members of the original lineup emerged.

The interesting part is that it makes up names — we went to Wikipedia and checked: these people don't exist, neither does this band, it's all totally made up. It makes up this bassist who is then the original founder, and then there is another member, and you can see that since it decided it's a Japanese band, it uses Japanese-sounding names, and it knows that the band should have a history, and so on. It goes on and on, and then the band has to split: in 2006 the band split up and the remaining members reformed under a new name. So this is totally made up by a language model, and it's not a hand-picked example, it's just sampled from the distribution. You sample again and you get another one: "The Transformer" is a book by a British author, set in a post-apocalyptic world that borders on a mysterious alien planet — well, the planet should probably have a different name — and you see it knows the book page should have a summary, and then the book covers this world, and the conceptual "Transformer universe", a setting with a man on some strange planet with a naturalistic device. I don't know, sometimes it gets too strange. But what's interesting is that it makes a quote from the book: it actually learned that inside Wikipedia pages about books you have quotes from the book, so it makes one up, about a man known as the greatest raider. It's a strange thing.

So this was a model that was trained for only about two days on 8 GPUs. We have better models now, and it's very interesting that the better models learn to make up poems in the middle of articles about lyrics, and they keep consistency: if it's a Japanese band, it keeps talking with Japanese names about Japan, it refers to the members, it refers to the years that came before. I want to give you the feeling that this is something exciting — this is not something computers were able to do a few years ago. (In response to a question:) Yes, it remembers what it generated before. The training data are whole Wikipedia articles, which are coherent, and it learns from that; it's trained on all of English Wikipedia.

So how does it work? You've heard about RNNs, but RNNs have the problem that if you try to generate 2,000 words, the state and the gating in the LSTM start to make the gradient vanish — it's called long short-term memory, but it's not that long. So people started trying to solve this with convolutions; that was last year. You take a long sequence and you make convolutions on convolutions. I think that's easier to see in a picture: an RNN needs to go one word at a time, so to get to here it has to step through maybe the thousand words before. Then came this model from DeepMind called WaveNet — for translation it was called ByteNet, but it's the same idea — where you make a convolution on the lower layer that jumps two steps, and then the next layer skips one at every point, so it skips one here and one there, and in the second layer you already see four words; the next layer skips two or three, so you see eight. So with a logarithmic number of layers, every word gets to see a wide context.
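As a rough illustration of that dilated-convolution idea — a toy sketch, not the WaveNet/ByteNet implementation; the layer shapes and names are made up for the example:

```python
import numpy as np

def dilated_causal_layer(x, w, dilation):
    """Kernel-size-2 causal convolution: each position mixes itself with the
    position `dilation` steps earlier. x: (length, channels), w: (2, channels, channels)."""
    shifted = np.zeros_like(x)
    shifted[dilation:] = x[:-dilation]          # the value `dilation` steps back (zeros at the start)
    return np.tanh(x @ w[0] + shifted @ w[1])   # combine the two taps

# Stack layers with dilation 1, 2, 4, ...: after L layers each position has seen
# roughly 2**L earlier positions, so reaching a context of n words takes only
# about log2(n) layers instead of n sequential RNN steps.
length, channels = 16, 8
x = np.random.randn(length, channels)
for layer, dilation in enumerate([1, 2, 4, 8]):
    w = np.random.randn(2, channels, channels) * 0.1
    x = dilated_causal_layer(x, w, dilation)
    print(f"layer {layer}: receptive field ~ {2 ** (layer + 1)} positions")
```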
So it's much better — it's not linear like with an RNN, it's logarithmic — and they had some success translating with this; it was not bad. But it still has a problem: it's very positional. Every layer works like: this position sees word number thirteen before me. Why is word number thirteen before me important? Maybe in this sentence the important one is number twelve. This scheme is referencing things very much by position, while referencing by content would be a much more natural thing.

So how do you reference by content? There is this mechanism called attention, which was introduced already with LSTMs, but there it would just look at one thing. Attention is something where you make a query with a vector, and then you basically look for similar things in your past. While the convolutions are quite positional — you have a different color for every position — attention looks at everything but retrieves the things that are similar: maybe this one is very similar to what I'm looking for and that one is not at all. That's the general idea. When you retrieve similar things you can look at a very long context: imagine you're producing the name of the band member right now — you can attend however many steps back and see, oh, earlier I said the bassist's name, let me say it again. With CNNs you could do this, but you'd need to know the exact position, you'd need to compress the data very much; with attention, since it's content-based querying, you can do this directly.

OK, so there are attentions. What I showed you is a language model, meaning it just decodes, it has no inputs. For translation you need an input, so you need an attention that goes from the output and attends to the input, an attention in the input which can attend everywhere, and an attention in the output which only attends to things before it. (Now, can anyone tell me what time it is? OK — when am I supposed to stop? — fine, I can go on.)

So let's look at the details of the model. We have some inputs, say the English sentence, then there is a multi-head attention — I'll go into detail about what that is later — then a feed-forward layer, so that every word gets processed; that's the processing of the input. Now, when we start generating output, we need this masked attention, and I'm not sure the picture does justice to the idea. In a recurrent neural network it's very clear that you go from one word to the next. In the more standard attention mechanisms that have been around for a while, you attend to everything, forwards and backwards. But if you want to make a language model you cannot do this, because at the end you're predicting the output: from this word you're predicting the next one, and if the model could simply attend to it, it would just look at it and copy it, and there would be no way to decode. That would be a model that trains immediately but has no way of doing inference — you could never generate anything. So the trick is that you want to attend only to things before you. How would you do this in an RNN? You can, but it takes one step for every word that comes in, which means that if you have 2,000 words you need to make 2,000 consecutive steps. That is a lot, it takes a lot of time, it would take you months to train, and you would not be too happy.
Now, in our model, the outputs come in, and the masked attention is a single matrix multiply combined with a mask, so all of these things are done in parallel. You have a thousand words — they come in as one tensor that's a thousand by something — and then every word i needs to attend to all the words before it, which is done by multiplying everyone with everyone and masking out half of it with zeros. So this step, which does attention for maybe two thousand words against two thousand words, everyone to everyone, is a single matrix multiply, and GPU hardware is extremely optimized for large matrix multiplies, so it happens very quickly. Then you do this again, but attending to the input, then a feed-forward layer, and then a prediction. In terms of neural network architecture this is a very simple feed-forward architecture; in a way it's much simpler than the RNN — it has no recurrence, and everything happens in a bunch of masked matrix multiplies.

So what exactly is this attention? We define it like this: there is a query, which is a vector, and then there are the key and value matrices, which are your memory. You can think of it as: the query is the current word I'm operating on, and the memory is all the past, all the words I've generated before. Keys and values can be the same thing, but they don't have to be — that's why we write it like this. What you want to do in attention is take the query, find the most similar key — as I said, it's a similarity thing — and then get the values that correspond to those similar keys. Of course it needs to be soft, it needs to be differentiable. How do you do this efficiently? You take the query and multiply it by the transposed keys — that's a matrix multiply — then you take a softmax, which is an exponentiation and normalization, and that gives you a probability distribution over keys that is peaked at the ones similar to the query. Then you matrix-multiply this distribution with the values, which is the same as summing the values weighted by it, and there you are — you have what you wanted. You've done two matrix multiplies and one softmax, which is very fast. Actually you need to normalize it a little bit to train well, but that's a technical detail.

So how costly is it? With a recurrent neural network you make n steps — every word needs a step — and in every step you have a dimension-by-dimension operation, maybe a bunch of them, but generally it's d squared, where d is the hidden dimension of the network. Attention is a little scary because it has this n-squared term — everyone attends to everyone before it — and that scared me for a while; I was saying maybe we shouldn't get into it, it's n squared, and if you do any theory you'd say that must be worse. But luckily at Google there is this guy, Noam — he never got his bachelor's, but he wrote most of these papers — and he said: well, d is usually at least a thousand in our large networks, while n is the number of words in a sentence, so that's like 70. So what are you talking about? One is 70 times a thousand squared, the other is 70 squared times a thousand — you're totally wrong, in most cases the attention term is the smaller one. And when we get to more than 2,000 words, we can find smarter ways to do things — let's first make this work.
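A minimal numpy sketch of the attention just described — a softmax over query-key similarities followed by a weighted sum of values, with the causal mask applied as one big matrix operation. Names and shapes are mine, not the paper's code, and the scaling constant follows the "normalize a little bit" remark above.

```python
import numpy as np

def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, with an optional
    mask so position i can only look at positions <= i (the masked decoder case)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # every query against every key: one big matmul
    if causal:
        n, m = scores.shape
        scores = np.where(np.tril(np.ones((n, m))) > 0, scores, -1e9)  # hide the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # distribution over keys, peaked at the similar ones
    return weights @ values                         # weighted sum of the values

# shapes only, random data: n words of dimension d, all positions processed in parallel
n, d = 8, 64
x = np.random.randn(n, d)
out = attention(x, x, x)                            # self-attention: queries, keys, values from the same words
print(out.shape)                                    # (8, 64)

# rough cost comparison from the talk: attention is ~n^2 * d multiplies per layer,
# an RNN pass is ~n * d^2; with d ~ 1000 and sentences of n ~ 70, n^2 * d is the smaller one.
```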
So we did it, and there is one problem with attention: since it's just a similarity measure, it works as if the input were just a set of words — it has no idea that this word comes after that one, because it only retrieves the most similar ones. But order really matters; the order of words is not arbitrary, you cannot just reorder a sentence and hope it will translate well or generate the right next word. So you need to add some timing signal, a little bit of positional information. So we have multi-head attention with positional signals — I will not go into the details of how this is done here; there are multiple attention heads looking at different words, and they know the positions. I'll have better pictures in a moment.
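The talk skips the implementation here, but the sine/cosine timing signal comes up again in the Q&A at the end. A minimal sketch of that idea follows; the function name and the constant 10000 are my rendering of the usual description and may differ from the released code.

```python
import numpy as np

def positional_signal(num_positions, depth):
    """Sine/cosine timing signal added to the word embeddings: each position gets a
    vector of sines and cosines at geometrically spaced frequencies, so nearby
    positions look similar and the model can attend by relative offset."""
    positions = np.arange(num_positions)[:, None]                     # (n, 1)
    frequencies = 1.0 / (10000 ** (np.arange(0, depth, 2) / depth))   # one frequency per channel pair
    angles = positions * frequencies[None, :]                         # (n, depth/2)
    signal = np.zeros((num_positions, depth))
    signal[:, 0::2] = np.sin(angles)
    signal[:, 1::2] = np.cos(angles)
    return signal

embeddings = np.random.randn(20, 64)            # 20 words, depth 64 (random stand-ins)
x = embeddings + positional_signal(20, 64)      # attention now has something to latch onto for order
```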
So we built this model — it has multiple layers of these attentions, as is standard now — and we trained it on translation. And then, you know, deep learning is a trade: you need to do all the tricks to make things actually work well. You need dropout in the right places, you need the Adam optimizer and some thought about your learning rate and how to decay it, we use something called label smoothing in the softmax, and when you start actually decoding from the model you need to pay some attention to what you're doing. So there is a bunch of technical items that we did — and this was, I think, the fifth time I had done them, also with other people, and I realized that the previous four times there were always some bugs. Outside of Google Brain — where you have maybe 20 people who have been doing this longer than almost anyone else and can tell you what you're doing wrong — it's very hard to get all of this right on your own. So we decided to make a library where you could at least look at the code and get these things right; I'll say a little later how that library came out.

Then we started on the machine translation task itself. I don't know if you caught the BLEU scores from the previous talk: the phrase-based systems, on this quite standard English-German benchmark, got around 20.6 BLEU. Then came GNMT, the Google production paper — it got about 24. A recent Facebook paper that used convolutions got about 25, and a mixture-of-experts paper — this kind of model with some extensions — got 26. That was considered very high: if you give the task to human translators, they get something around 30. Of course BLEU is a bad metric — you can get 30 with something that still makes totally stupid mistakes, and these networks are by no means perfect — but in terms of this metric it seemed quite high. So people said: you're doing yet another architecture for translation; what will you get — 26 and a half? You should be very happy. And we were. Then there were ensembles, which got around 26.4, and then we were extremely happy, because we trained this big Transformer model and it got 28.4, which is a very high number. We looked at the sentences and they really were better: things were reordered in a better way, the words come in better positions, and we tested which parts of the architecture mattered most.

But here is a funny thing. Who knows about the Winograd schemas? Winograd was a scientist who was thinking about what would be a good challenge for AI that has to do with language, and the idea is that there are sentences where you really need to understand the meaning of the words. Say I tell you, "He didn't put the trophy into the suitcase because it was too small." What was too small? The suitcase. And if I say "because it was too big," it was the trophy. This is obvious to everyone; no human will ever make a mistake here. But when you translate these into French, or into German, trophy and suitcase have different genders — there is simply no way to translate the two sentences correctly without a fairly deep understanding of what is happening in the world, and it's totally unclear how a neural network would get an understanding like that. And indeed, if you try a few Winograd schema sentences, no machine translation model did any good on them — not Google Translate, not the LSTM models, not phrase-based translation. So we threw them at our new model, not expecting too much, and it started doing some interesting things. It doesn't get them all, but here is a shorter version: "The animal didn't cross the street because it was too tired" — or, "because it was too wide." You see the same pattern. And because there are these attention heads, when the model is producing the word "it" we can check where it looks. There are something like 40 attention heads; the blue one shown is just one of them, and it looks at "animal" and "street". This is the attention mask — the query times the keys — and a stronger color means a bigger weight. You can see it looks at "animal" and "street", and in one case it focuses more on "animal", in the other it looks at both but focuses more on "street". It will not get everything right, but it shows that the model has the capacity to reason about relations, because it can retrieve these two things and, in the next layer, start processing them to decide what to output next. Here is a set of Winograd schema sentences: Google Translate gets one or the other wrong, while we get these two correct and this one wrong.

I think a very interesting thing about these attention heads is that some of them you can actually interpret. You can see that this head, whenever it produces pronouns, looks at what the pronoun should stand for — it basically does coreference resolution. There is always another head that, whatever word it is on, looks at the head noun or the head verb of the sentence — it just wants to know what we are talking about. And it's really striking: the bigger you make the model and the more data you have, the more interesting things you start seeing in these attention heads.
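As a toy illustration of reading off where one head attends — not the actual visualization tooling behind these figures — here is a sketch with random stand-in vectors; in the trained model the hidden states would put the weight on "animal" and "street" as described above.

```python
import numpy as np

# toy inspection of one attention head: which words does the query for "it" look at?
tokens = ["the", "animal", "did", "not", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
d = 16
rng = np.random.default_rng(0)
states = rng.normal(size=(len(tokens), d))     # stand-ins for the model's hidden states

q = states[tokens.index("it")]                 # query vector for the word "it"
scores = states @ q / np.sqrt(d)               # similarity of "it" to every position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                       # this head's attention distribution

# with random stand-ins these numbers are meaningless; the point is only the mechanics
for token, weight in sorted(zip(tokens, weights), key=lambda t: -t[1])[:3]:
    print(f"{token:>8s}  {weight:.2f}")
```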
OK, so that's the Transformer model. Another project was to think: it does very well on translation, it produces nice language — can we do multitasking with this, can we do a lot of things with one model? Because it's fun to generate language, but it would be even better if we didn't need to train one model for translation, one model for parsing, one model for entity resolution, one model for grammar correction — it becomes boring at some point. Can we just train one model? So we started thinking this way, and then we thought: if it's one model it should also do images and speech. But how do you put an image together with text? An image is 299 by 299 pixels and the text is tokens, so it didn't seem like you could just put them together in one place. So we made these small things we call modalities, which take the raw input, process it a bit, and return a compressed representation of similar size for all the different modalities. Then we trained it on eight different tasks: four translation tasks, the ImageNet image classification task, an image captioning task, and — what else — parsing and speech recognition. Let's give it a go. And it failed totally: it either did one of the tasks and not the others, or it didn't do any of them.

So we started working on it, and there were a number of things that turned out to be important, but one is this technique called mixture of experts, which lets you make your model big without making it slow. Normally when you make your model big you increase d, the hidden dimensionality, but then every weight matrix is d squared, so every time you run the model you pay those d-squared multiplies; if d is ten thousand, then ten thousand times ten thousand cannot be fast. A mixture of experts means you have, say, a hundred matrices, or maybe a thousand, and you have a gate that your model learns, and it picks only four of them: it multiplies by those four, ignores the rest, and gathers the result. For a different word, or a different image, or a different input, it multiplies by another four. So you get the capacity, you get the parameters — in the previous talk you heard about memorization versus generalization in deep networks, and in some sense these mixture-of-experts layers give you a lot of capacity to memorize, because they have a lot of parameters, without paying the cost in speed.
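A simplified sketch of that top-k gating idea — the real mixture-of-experts layer uses noisy gating and more bookkeeping, and the names and shapes here are mine:

```python
import numpy as np

def moe_layer(x, experts, gate_weights, k=4):
    """Sketch of a mixture-of-experts layer: a learned gate scores all experts,
    but each input goes through only the top-k of them, so the parameter count
    grows with the number of experts while the per-example compute does not."""
    outputs = np.zeros_like(x)
    for i, token in enumerate(x):
        scores = gate_weights @ token                        # gate: one score per expert
        top = np.argsort(scores)[-k:]                        # pick the k best experts for this token
        mix = np.exp(scores[top]) / np.exp(scores[top]).sum()
        for weight, expert_id in zip(mix, top):
            outputs[i] += weight * (experts[expert_id] @ token)  # only k matrices are touched
    return outputs

d, num_experts, n = 32, 100, 10
experts = np.random.randn(num_experts, d, d) * 0.1   # say, a hundred weight matrices
gate_weights = np.random.randn(num_experts, d) * 0.1
x = np.random.randn(n, d)
y = moe_layer(x, experts, gate_weights)
print(y.shape)        # (10, 32): same output shape, ~100x the parameters of a single matrix
```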
When you train this MultiModel, you have a single model trained in a single run; you just need to tell it what to output — whether it's English, a category, French, German, or a parse tree — and then it starts to work. You give it an image and say, please caption this, and it says something like "a man sitting in front of a suitcase". Maybe not perfect, but it's something. It can find categories, it can translate from English to French and to German, and it can give you a parse tree for a sentence. And some interesting things happen without any training for them: you train the model to translate English to German and German to English, English to French and French to English — that's the training data you have — but if you then give it French and ask for German, it will output a very reasonable translation, even though it has never seen a French-German pair in training; it has seen English on both sides and it manages to connect them. So I think this multitask idea is very appealing, because it especially improves tasks where you have little data. Deep learning is great, but "please come to me with 10 million sentences" — and getting 10 million human-translated sentences is not that easy.

OK, and then I also mentioned the library. Since we have been redoing this again and again, we decided to finally make it into a framework and a system, and we call it the Tensor2Tensor library — on top of TensorFlow, of course. It has all the tricks I was talking about — the optimizer, the label smoothing, the learning-rate decay schemes — and they're well tuned, so that you can change the model and play with it without having to change everything else. Because, of course, every training setup you download from anywhere will have some learning-rate decay, but you tweak the model and you might need to tweak the learning rate too; if the initialization depends on the depth of your network and you change the network a bit, you need to change the initializers. Everything is connected with everything, and then you get a model, it runs, you change one thing, it breaks. We try to avoid that; we try to work with pieces that are normalized so you can tweak, build, mix and match without breaking things. We've had a bunch of papers, we implemented some older models, and we try to make it reasonably easy to use and install — of course it has bugs, like all software — but if you're interested, give it a try; you can run it on translation.

The earlier translation models — for example the one currently in Google Translate — have a fairly large training cost, something on the order of 10 to the 20th operations, which is not little, while this model gets a better result at a cost roughly two orders of magnitude lower. A hundred times faster is a lot: that takes a month on many GPUs, while this you can train at home overnight, and that's fun, because you can tweak something and see that it works, or make it generate Italian. You can train a smaller version in about an hour; it will not get you the BLEU scores you want, but it will translate English to German. Since we want to try new architectures and experiment with more distant ideas, I find it valuable to be able to train for one hour and see — you very quickly find out when an idea is totally garbage and doesn't work at all, and you know what can be done in that time. So you're welcome to try it out, add a data set and a problem. Thank you for your attention. [Applause]

(Audience question, partly inaudible, about whether a model trained on one language pair at a time can generalize to many languages.) That works with the MultiModel, but if it's just language to language you don't even need modalities — it's the same model, you just need to make sure the vocabulary is shared. We train together on, say, four language pairs — English-German, German-English, English-French, French-English — in one training run and one model. (No, not for this experiment, but we can do it, and it also gets very good results; they will probably be slightly worse on every single language pair.) To compare things academically, to present results, you need the same data set that was used before, and all the previous papers are not multilingual — that's why we don't publish numbers for multilingual models — but you can do it.
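The talk doesn't spell out exactly how the model is "told what to output" in the language-only case; a common trick (an assumption here, not a confirmed detail of this system) is to prepend a target-language token to the source, which is also what makes the zero-shot direction expressible:

```python
# hypothetical data preparation for multilingual training with a shared vocabulary:
# a target-language token in front of the source tells the model what to produce
def make_example(source_sentence, target_sentence, target_language):
    tagged_source = f"<2{target_language}> {source_sentence}"   # e.g. "<2de> The house is small ."
    return tagged_source, target_sentence

print(make_example("The house is small .", "Das Haus ist klein .", "de"))
# Train one model on en<->de and en<->fr pairs tagged this way; at test time putting
# "<2de>" in front of a French sentence asks for the never-seen French->German direction.
```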
And actually the production version of Google Translate that you use on your phone does have multilingual models, because it's simply easier to launch them in production. There are 103 languages that Google supports — that's a lot — and they are grouped into sets of, say, eight languages, sometimes six, sometimes ten, it depends. Especially for languages like the many Indian languages, where we have some but not much data, they can benefit from transfer learning from the others: you group them together, train one model, and of course you need to check whether it gives good results. This multilingual training can be done with many models — with this model, with the GNMT model — and in general it's quite a good idea, although, for example, French might get a little worse, because French is one of the largest data sets in translation: even the public data is 40 million sentences, so adding German does not improve the French part of the model. We tried to examine this, and the hope was that languages that are similar, in the same language groups, would give bigger improvements. There is no clear correlation; sometimes strange things happen, like Vietnamese improving Czech for some reason, I don't know why. But the zero-shot capability is good, because in general we only have X-to-English and English-to-X data, while people sometimes want to translate between other pairs. It does need more parameters — you need a larger network, and one good way of making it larger is the mixture of experts, because making it larger in the naive way makes it slower.

(On how images enter the model:) When an image comes in, it first goes through the image-specific part, which is strided convolutions that bring it down to something small, like seven by seven, and that goes into the joint model. Text, if it's word-level, just gets embedded; if it's character-level it also gets compressed a bit, and then these small parts go into the common joint model. So when text comes in you embed the words; when an image comes in you use the pixel value as a number — you divide by 255 — and then you have three channels and you make a convolution from three channels to 32 channels.
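A rough sketch of an image modality along those lines — pixels scaled by 255, a 3-to-32-channel strided convolution, repeated until the feature map is small. This is an illustration of the idea, not the MultiModel code; the kernel size and weights are made up.

```python
import numpy as np

def conv2d_stride2(x, w):
    """Valid 3x3 convolution with stride 2. x: (H, W, C_in), w: (3, 3, C_in, C_out)."""
    H, W, _ = x.shape
    kh, kw, _, c_out = w.shape
    out_h, out_w = (H - kh) // 2 + 1, (W - kw) // 2 + 1
    y = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[2 * i:2 * i + kh, 2 * j:2 * j + kw, :]
            y[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return y

image = np.random.randint(0, 256, size=(299, 299, 3)).astype(np.float32)
x = image / 255.0                        # pixel values used as numbers, divided by 255
w1 = np.random.randn(3, 3, 3, 32) * 0.1  # 3 input channels -> 32 channels, as in the talk
x = np.maximum(conv2d_stride2(x, w1), 0)
print(x.shape)                           # (149, 149, 32); a few more strided layers shrink this toward ~7x7
```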
(In response to further questions:) Yes, we're experimenting with a bunch of things — there are a lot of directions and only a few of us, so we have thought about it, but it's also open-sourced for a reason. The library does everything: it will download the data for you, tokenize it, pre-process it, train the model, and give you something where you can type in text and decode. Yes, it's end to end, and as long as the data set is public — not all of them are, but for translation they are — it will download it for you. For speech it's harder, because many speech data sets require you to register, get a password, download, pay some money; we didn't do that, but we should.

(On document-level context:) One thing we did try: a problem with document-level or context-level translation is that its effect on the BLEU score is very small. We gave human translators some paragraph-level documents and also the same sentences scrambled, and the difference in BLEU was negligible. And when you build a system like this, it's not that you train the model once and it's wonderful — you spend weeks upon weeks tuning parameters and trying to improve, and to guide that process you need a metric where you're reasonably sure that improving it means progress. For context-level translation it's very hard to work with BLEU, and we don't have a better metric yet. But yes, I do think translation should happen in context; we just need an automated metric, and I think it will come.

(On word order:) Right — attention over a set is like a bag of words, it has no notion of order, so you need to add something. We tried a number of things. One was to have a separate learned vector for every position — "this is word number 65" — and that works a bit, but it doesn't generalize: if you get a sentence longer than anything in your training set, the model gets confused. What we ended up doing is adding sine and cosine curves: for every position i you make a position vector that contains sine of i, but also sine of i divided by 2, divided by 4, divided by 8, and so on — since the vector has depth d it can hold many numbers — and cosines as well. This vector lets the model look at different frequencies, because what the network really wants is to attend, for example, to the word just before it — at least one of the attention heads always attends to the previous word, because it's important to know what just came before — and this positional vector allows that: most of its components are very similar for neighboring positions, so the model can latch onto it and attend one position back, but it can also attend, say, 20 back. It seems to be a reasonable way to handle positions.

(Audience question, partly garbled in the captions, about trying this approach on other tasks, such as speech recognition.) There are a lot of problems in language, and even beyond language, and the people I listed also work on other things — there are really maybe three or four people doing this, which is why it's open source and why you should go and try it. But we did try it on a number of tasks. We tried it on parsing, and the results were exceptional: even with the standard parsing data set, which is forty thousand sentences — extremely tiny for a neural network, and a pure RNN or LSTM model cannot learn parsing from it — a model with some attention and tweaks gets something like 89, and this model gets 92, which is better than the Berkeley parser. And you need to understand: the Berkeley parser — Slav Petrov spent his PhD on it, he's a very smart guy, he spent years writing it, on top of other people who spent decades writing grammars, probabilistic grammars, latent inference models for them — and they get to 91 and a half. This thing, which is super simple — basically all the equations were on the slide — you just run it on this forty-thousand-sentence data set and you get 92. In a way it's depressing, because it means stupidity is enough — well, no, maybe not. We also tried it on summarization, and internally it worked better than any other model, so it seems that if you tweak it, it works really well. We tried it on images and it didn't work so well — sometimes you just need some other trick. [Applause]
Info
Channel: Pi School
Views: 87,278
Rating: 4.9360189 out of 5
Keywords: Attention is all you need, Neural Networks, Artificial Intelligence, Neural Network Models, AI, Tensors, Attentional Neural Network, Google Brain, Machine Learning
Id: rBCqOTEfxvg
Length: 48min 23sec (2903 seconds)
Published: Wed Oct 04 2017