Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention

Captions
So welcome to the Machine [NOISE] Translation lecture, which is kind of like a culmination [NOISE] of this sequence of three lectures on RNNs and related topics. So let's have a few announcements first. Uh, the first thing is, as you probably noticed when you came in, we're taking attendance today. Uh, so you need to sign in with the TAs who are outside the auditorium. Uh, if you missed it, don't get up now, it's fine. There will be time to sign in after the lecture. Uh, and then, if you have any kind of questions about special cases with the attendance policy, uh, you should check out a Piazza post that we put up last night with some clarifications. [NOISE] Uh, you have the reminder that Assignment 4 content is going to be covered today. So you're gonna have everything you need to do Assignment 4 at the end of today. [NOISE] And do get started early because the model takes 4 hours to train. The other announcement is that we're going [NOISE] to be sending out our mid-quarter feedback survey sometime in the next few days probably, uh, so please do fill it out. You'll get 0.5% credit, and you're also gonna help us to make the class better for the rest of the quarter. [NOISE] Okay. So here's the overview of what we're going to do today. [NOISE] Uh, today, first, we're going to introduce a new task in NLP, which is machine translation, [NOISE] and then, we're going to introduce a new neural architecture called sequence-to-sequence. And the connection here is that machine translation is a major use case of sequence-to-sequence. [NOISE] After that, we're going to introduce a new neural technique called attention, and this is something that improves sequence-to-sequence a lot. Okay. So Section 1 of this is gonna be about, uh, a bit of machine translation history, pre-neural machine translation. [NOISE] So machine translation or MT, uh, is the task of translating a sentence x, uh, which we call the source language, whatever language you're translating from, into a sentence y, which is in another language, which we call the target language. Uh, so here's an example. Let's suppose x is this French sentence. Um, [NOISE] could anyone in the audience, a French speaker, translate this to English for us? [NOISE] [BACKGROUND] Yeah. Um, the man is born free, and, uh, everywhere, he is in irons. Great. So that was something like, the man is born free, but everywhere, he's in irons. That was a fairly literal translation. It's usually translated, this quote by Rousseau is usually translated as, man is born free, but everywhere, he is in chains. But there's an ambiguity: [NOISE] should fers be, um, literally irons or chains? Also, you could choose to, uh, translate l'homme as man or maybe humankind. Uh, so this is an example of machine translation, and there's already, you know, quite a few choices you can make. [NOISE] So the beginning of machine translation as an AI task began in the early 1950s. So, um, in particular, there was a lot of work translating Russian to English, uh, because the West was very interested in listening to what the Russians were saying during the Cold War. And we've got a fun video here, [NOISE] which shows the state of machine translation in 1954. [MUSIC] They haven't reckoned with ambiguity when they set out to use computers to translate languages. A $500,000 simple calculator, most versatile electronic brain known, translates Russian into English. Instead of mathematical wizardry, a sentence in Russian is to be fed [OVERLAPPING]. 
One of the first non-numerical applications of computers, [BACKGROUND] it was hyped as the solution to the Cold War obsession of keeping tabs on what the Russians were doing. Claims were made that the computer would replace most human translators. [inaudible] you're just in the experimental stage. When you go in for full-scale production, what will the capacity be? We should be able to do about, with the help of a commercial computer, uh, about one to two million words, uh, an hour, and this will be quite an adequate speed to cope with the whole alphabet of the Soviet Union in just a few hours' computer time a week. When do you have to be able to achieve this feat? If our experiments go well, then perhaps within, uh, five years or so. So in this video, I think there's a number of interesting things. Uh, firstly, we can see an example of about how, uh, AI hype is nothing new. Even in 1954, [NOISE] they were talking this machine translation system as if it was an electronic brain, which I think, uh, overstates maybe how general it is. Uh, they were also, at least some of them, fairly optimistic that this [NOISE] machine translation system was going to be replacing humans, uh, anytime soon. Um, so yeah, that's, that's pretty interesting. And, um, [NOISE] the thing is that these systems actually were mostly rule-based, uh, by which I mean that they were mostly using a bilingual dictionary between Russian and English, and they were essentially mostly just looking up the Russian words, uh, looking up their English counterparts, and they were storing these big bilingual dictionaries on these large magnetic tapes. Um, so certainly, it was a [NOISE] huge technical feat at the time, uh, but they, uh, some people were probably too optimistic [NOISE] about how quickly it would replace humans. So jumping forward several decades in time, uh, now I want to tell you about statistical machine translation. So the core idea of statistical machine translation is that you're going to learn a probabilistic model from the data in order to do the translation. So as an example, uh, as before, suppose that we're translating from French to English. The idea is that you want to find the best English sentence y, given the French sentence x, [NOISE] and, uh, mathematically, you can formulate this as finding argmax y of this conditional probability of y, given x, [NOISE] and the model that you're learning is this probability distribution P. [NOISE] So what we usually do is, we break down this probability into, uh, its two components using Bayes' Rule. [NOISE] So this means that finding the y that maximizes, uh, probability of y, given x, is equivalent to finding the y that maximizes the probability of x, given y, times the probability of y. So the two components here, on the left, we have a translation model, and this is keeping track of how words and phrases should be translated. Uh, so the idea is that it knows, uh, how, uh, French words and an English word might be translated to each other or maybe small, small phrases and chunks of words should be translated. And this is learned from a lot of parallel data, and I'll be telling you later how we do that. The second compo- component P(y), [NOISE] this is just a language model. [NOISE] We learned about this last week. A language model is a system that can predict the next word, but it can also be thought of as a system [NOISE] that tells you the probability of a sequence of words. So here, if we're translating from French to English, P(y) is an English language model. 
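For reference, the decomposition just described can be written out as follows; it follows from Bayes' Rule, since the denominator P(x) does not depend on y:

\[
\hat{y} \;=\; \operatorname*{argmax}_{y} P(y \mid x)
       \;=\; \operatorname*{argmax}_{y} \frac{P(x \mid y)\,P(y)}{P(x)}
       \;=\; \operatorname*{argmax}_{y} \underbrace{P(x \mid y)}_{\text{translation model}} \;\; \underbrace{P(y)}_{\text{language model}}
\]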
[NOISE] So the idea is the, the reason why we want to break down this single conditiona- conditional probability distribution into the, the pr- product of two different ones is that this is a kind of division of labor. The idea is that instead of, uh, a single conditional probability distribution needing to understand how to translate, and how to write good English text, and understand sentence structure, and everything at once, the idea is that you separate it so that [NOISE] the translation model on the left in blue mostly just knows about local translation of small chunks of words and phrases, whereas the language model on the right mostly takes care of writing good English, good sentence structure, word order, and so on. [NOISE] So you already know how to learn a language model [NOISE] because we learned about that last time. You just need lots of monolingual data, in this case, English data. [NOISE] So I'm going to tell you more about how we would learn this translation model that needs to be learned from parallel data. [NOISE] So we need a large amount of parallel data in order to learn this translation model. And an early example of a parallel corpus is the Rosetta Stone. So this is a stone that has the same text written in three different languages. And this is a hugely important artifact for the people who were trying to understand ancient Egyptian. So in the 19th century, uh, scholars used this stone, and it helped them to figure out ancient Egyptian because there was this parallel text that had the same text in other languages that they did know. So this is, this is a really important parallel corpus, and if you're ever in London, you can go to the British Museum, and see this in person. So the idea is that you get your parallel data. Obviously, you need a larger amount than is on the stone, and hopefully it shouldn't be written on a stone either. But you can use this to learn your statistical machine translation model. So the idea is that you are trying to learn this conditional probability distribution of x given y. So what we do is we actually break this down even further. We actually want to consider the probability of x and a given y, where a is the alignment. So the idea of alignment is, this is how the words in the English sentence and the French sentence correspond to each other. So I'm gonna, uh, demonstrate this by an example. So in this example, we're translating the sentence 'Japan shaken by two new quakes' to French. You can see there is a pretty simple one-to-one alignment here, uh, of English words to French words, and also they appear in the exact same order. The only thing that doesn't conform to that is the word 'Le' in French, which we call a spurious word because it doesn't have a direct counterpart in the English sentence, and that's because in English we just say, 'Japan', but in French we say, 'Le Japon'. So alignment can be a bit more complicated than that. For example, alignment can be many-to-one. In this example, you have, uh, several French words, each of which has multiple English words that correspond to it. So this is what we call many-to-one alignment. Uh, it can go in the other direction too. Alignment can be one-to-many. So here we have a single English word, implemented, which has a one-to-many alignment because there is a three-word French phra- phrase that corresponds to it. So on the left and the right, we have two ways of depicting the same alignments. It's either a kind of chart or it can be a, a graph.
So here's another example, um, of a one-to-many, well, sorry, right. So we call, uh, this word implemented, that is one-to-many, we call it a fertile word because the idea is that it has many children in the, in the target sentence. So in fact, there are some words which are very fertile. Here's an example where the source sentence, 'il m'a entarté', means, 'he hit me with a pie', and here in French, this verb, 'entarter', means, uh, to hit someone with a pie, [LAUGHTER] and this word has no single word equivalent in English. We don't have a single verb that means to hit someone with a pie. [LAUGHTER] Which I think is really fun, that French has a word for that. You wonder, maybe they do it so often that they need a single word for that. I don't know. [LAUGHTER] So this is an example of a fertile word, right? Because it needs to have several corresponding English words to translate it. So we can have one-to-many, and many-to-one. You can also have many-to-many alignments. You could call that kind of phrase level translation, or phrase-to-phrase. So here, uh, the English sentence says, 'The poor doesn't- don't have any money', and here 'don't have any money' corresponds to the French phrase, 'sont démunis', and this is a many-to-many alignment because there is no obvious way to break down this phrase-to-phrase alignment into, um, smaller word-to-word alignments. Okay. So that's what alignment is. And if you remember, we were thinking about how you would learn this probability distribution of what the alignment is, uh, in order to do statistical machine translation. So the idea is that you learn the probability of x and a, given y, as a combination of many factors or many features. So you consider for example, what's the probability of a particular word aligning to another particular word? Like you know, this English word and this French word, how often do they align? But then, this also depends on for example, what's their position in the sentence? Like if they both appear near the end of the sentences, then it's more likely that they align, whereas, if one's at the beginning and one's at the end, that's less likely. You would also consider things like, uh, what's the probability of this particular French word having this particular fertility? Like, what's the probability of this word having three corresponding English words and so on? So all of these statistics are learned from your parallel data, and there's many other things that you would take into consideration. So we're looking at a kind of overview of statistical machine translation today. You're not going to understand it in full detail, but we're understanding an overview of how it works, because we're going to be, uh, comparing it to neural machine translation. Okay. So we're learning this SMT system, and so far, we've broken it down into these two main components. We've got the translation model, and we've got the language model, and we understand a little bit about how you might learn this translation model by breaking it down into alignments. So the question remains, how do you do the argmax over y? How do you find the English sentence y that maximizes this probability? So one kind of brute force solution is you could say, "let's enumerate every possible y." That's kind of every possible sequence of English words, maybe up to some length, and, uh, we'll calculate this probability for all of them, and it should be pretty clear that that is just a no go. That's way too expensive, and we're not going to be able to get anywhere with that.
So the answer for how you actually do this in practice is, you are going to use some kind of heuristic search algorithm, to search for the best translation, y. Uh, but along the way, you're going to discard hypotheses that are too low probability. So you're gonna search, you're going to discard, and prune the trees as you go to make sure that you're not keeping too many hypotheses, uh, on each step. So this process of finding your best sequence is also called decoding. So here is an overview of how that works for SMT. This is an example where you have this German sentence that translates to, 'He does not go home', and you can see that there is some kind of phrase-to-phrase alignment here. So, uh, an overview of how this decoding would work in SMT, is that you kind of consider lots of different hypotheses, for how you might translate these individual words, uh, and then you build it up to consider how you might translate, uh, individual phrases, and the phrases get bigger. So for example, you can see that on the top right, if it's not too small, you can see that the, uh, the German word for house, uh, could be translated into the English word, 'house' or 'home', or 'chamber', and so on. Uh, so we consider all of these different hypotheses, and look into how we might put those together to translate phrases, but you don't keep all of them all the time. You get rid of the ones that are too low probability. So this can also be depicted as a kind of a tree, where you are exploring different options. You are searching through the space of options, but then you prune the tree as you go. So I know this is a very, very high level, uh, description of how decoding might work. And in fact, later in this lecture, you're going to see a detailed explanation of how this kind of decoding works for neural machine translation. Okay. So, to wrap up our, um, overview of statistical machine translation: uh, was it effective? Uh, so SMT was a huge research field, uh, from the 1990s to about, maybe, uh, 2013. And the best systems during this time were extremely complex. They were extremely sophisticated and impressive systems and, uh, SMT made the best machine translation systems in the world. But they were very complex. So for example, you know, there were hundreds of important details that we haven't mentioned here at all. There were many, many techniques to make it, uh, more complex and more, um, sophisticated than what I've described today. In particular, the systems had to have many separately designed, uh, sub-components. So we already saw how you, uh, break down the translation model into two separate parts. Uh, but there was, you know, many more sub-components than that, and often they had to be learned separately. This meant the engineers had to do a lot of feature engineering. Uh, you had to design features to capture the particular language phenomena that you were interested in. So this meant that they required a lot of compiling and maintaining of extra resources, and in fact, you had to have, uh, different resources for different languages. So the work kind of multiplied the more languages you had. An example of this is you had to have, uh, tables of equivalent phrases. So for example if you're doing French and English translation, then, uh, they would be collecting these phrases of, uh, sorry, these tables of phrases that they considered similar, and those were learned from the data. But this was a lot of information that had to be stored and maintained.
So overall, this was just a lot of human effort to maintain. Uh, and again, yes, you had to put more human effort in if you wanted to learn an SMT system for a new language pair. Okay, are there any questions here about, uh, SMT? [NOISE] Okay. Uh, so moving on, that's SMT. [NOISE] Now, we're gonna move on to, uh, section two of this lecture. So I want to take you back to the year 2014, for a dramatic re-enactment of what happened in the world of machine translation research. So in 2014, something very dramatic happened and that thing that happened is called neural machine translation, and [LAUGHTER] I think it looks a little bit like this [NOISE] if I'm not being too dramatic. So what is neural machine translation? The idea is that NMT is a way to do machine translation but using just a single neural network. [NOISE] And the neural network architecture that they use is called Sequence-to-Sequence or sometimes just called seq2seq, uh, and it involves two RNNs. So, uh, it's called sequence-to-sequence, because you're mapping one sequence to the other: the source sentence [NOISE] to the target sentence. And you need two RNNs, basically, to handle those two different sentences. All right, let's look at the diagram to see what sequence-to-sequence is in detail. So we start off with our source sentence, and we're gonna use our example from before, il m'a entarté, which means, he hit me with a pie. So we, uh, feed this into our encoder RNN, and this is as you've seen before, I've drawn a uni-directional RNN, but this could be bi-directional. It also could be multi-layer. It could be vanilla, or it could be LSTM, and so on. Uh, another thing to note is that [NOISE] we are passing word embeddings into this encoder RNN, but I'm just not explicitly depicting that step. [NOISE] Okay. So the idea of the encoder RNN is that it's going to produce some kind of encoding of this source sentence. So for now, let's assume that the encoding of the source sentence is going to be, uh, the final hidden state of this encoder RNN. So what happens next is we pass this encoding of the source sentence. We pass it over to the decoder RNN, which is going to translate into English. So the decoder RNN is a language model. In particular, it's a conditional language model, like we talked about last time. So it's conditional because it's going to produce the target sentence, but conditioned on this encoding, and the encoding is that vector that has the orange box around it. So how does this work? Uh, we start off by feeding, uh, the start token into the decoder, and then, uh, we can get the first state of the decoder, because we're using, uh, the encoding of the source sentence as the initial hidden state for the decoder. So then we get our first output from the decoder, which is a probability distribution of what word might come next, and let's suppose that we take the argmax over that, and then that gets us the word, uh, he. Which in this case is correct, because that's probably the word you should start with. Okay, so then we just take the word, he, and then we feed it back into the decoder on the next step, and then we do the same thing again. We take argmax and we get a new word and we get he hit. So the idea here is you can co- uh, continue doing this operation and in that way, you're going to generate, uh, your target sentence, uh, which will be something like he hit me with a pie, and you stop once your decoder produces the end token.
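As a rough sketch of the test-time generation loop just described, here is a minimal, self-contained PyTorch example of greedy decoding with an encoder-decoder. The tiny model, vocabulary sizes, and token ids are made up for illustration; this is not the Assignment 4 model:

```python
# Minimal sketch of seq2seq greedy decoding at test time (illustrative only:
# random weights, toy vocabularies; not the Assignment 4 model).
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 100, 32, 64
START, END = 1, 2                                   # special token ids (assumed)

src_embed = nn.Embedding(SRC_VOCAB, EMB)
tgt_embed = nn.Embedding(TGT_VOCAB, EMB)
encoder = nn.LSTM(EMB, HID, batch_first=True)
decoder = nn.LSTM(EMB, HID, batch_first=True)
out_proj = nn.Linear(HID, TGT_VOCAB)                # hidden state -> vocab scores

def greedy_decode(src_ids, max_len=20):
    # Encode the source; the final hidden state is the "encoding of the source".
    _, enc_state = encoder(src_embed(src_ids))
    dec_state = enc_state                           # initialize the decoder with it

    prev = torch.tensor([[START]])                  # feed in the START token
    result = []
    for _ in range(max_len):
        dec_out, dec_state = decoder(tgt_embed(prev), dec_state)
        probs = out_proj(dec_out[:, -1]).softmax(dim=-1)
        next_id = probs.argmax(dim=-1)              # greedy: take the top word
        if next_id.item() == END:                   # stop once END is produced
            break
        result.append(next_id.item())
        prev = next_id.unsqueeze(0)                 # feed the word back in next step
    return result

print(greedy_decode(torch.tensor([[5, 6, 7, 8]])))  # e.g. a 4-word source sentence
```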
So an important thing to note here, is that this picture is showing you what happens at test time. This shows you how to generate text. Uh, this isn't what happens during training. I'll show you what happens [NOISE] during training later. Uh, but this thing with the, the pink dotted arrows where you feed the word back in. This is what you do to generate text at test time. Any questions on this? Uh, oh, another thing I should note is that you need two separate sets of word embeddings, right? You need word embeddings for French words, and you need English word embeddings, so that's kind of two separate sets, two separate vocabularies. Um, yeah. Okay. So as a side note, uh, this architecture called sequence-to-sequence is actually pretty versatile. It's not just a machine translation architecture. Uh, you can, uh, uh, phrase quite a few NLP tasks as sequence-to-sequence tasks. Uh, so for example, summarization is a sequence-to-sequence task because in goes your long text and out comes your short text. Uh, dialogue can [NOISE] be seq2seq because in goes the previous utterance and out comes your next utterance. Uh, parsing can even be thought of as a sequence-to-sequence task, because you could say in goes the input text and the output parse is going to be expressed as a sequence. This might not be the best way to do parsing but it is a way you can try. Lastly, you could even do something like code generation. So suppose you want to build a system that takes some kind of, uh, natural language input, such as "sum up the numbers from 1-10", and then it outputs, let's say, some Python code that says sum(range(10)) or something like that. Uh, so if you wanted to train, um, an assistant to do this, you could in a way view that as a translation task, where you're translating from English to Python. It's a pretty challenging translation task. It probably requires a lot more logic than just, uh, you know, French to English, [NOISE] but you can try and people have tried. There are research papers where people have used seq2seq to do this kind of task. Okay. So to recap, uh, seq2seq is an example of a conditional language model. Uh, it's a language model because the decoder is a language model that's predicting the next target words. But it's a conditional language model because it's also conditioning on your source sentence, which is represented by the encoding of the source sentence. So you could look, you could view it like this. NMT is directly calculating the probability of the target sentence y given the source sentence x. So if you look at this, you see that this is just, uh, breaking down the probability of the sequence y, which we suppose is of length, uh, capital T. You can break it down into the probability of the first word of y given x, times the probability of the second word of y given, uh, the words that came before, and x, and so on. So in fact, you can see that each of the terms in this product on the right, those are probabilities of the next target word given all the ones so far, and also the source sentence, and that's exactly the conditional probability that your language model produces. So the reason I'm highlighting this is because if you remember in SMT, uh, we didn't directly learn the translation model p of y given x, we broke it down into, uh, uh, smaller components. Whereas here in NMT, we are directly learning this model. And this is in some ways an advantage because it's simpler to do.
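Written out, the factorization just described is:

\[
P(y \mid x) \;=\; P(y_1 \mid x)\, P(y_2 \mid y_1, x)\, \cdots \, P(y_T \mid y_1, \dots, y_{T-1}, x)
          \;=\; \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)
\]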
You don't have to learn all of these different systems and optimize them separately. It's, uh, kind of, simpler and easier. So, uh, this is, this is the model that we're learning. Uh, the question is, how do we train this NMT system? So hopefully, you should already have a good idea of how this would work, given that we've already seen how you would train a language model. But here are the details just in case. So you get your big, uh, parallel corpus. Uh, and then, uh, let's say you have your sentence pair from your parallel corpus. Uh, so this is what happens during training. Uh, you feed your source sentence into the encoder RNN, uh, and then you feed your target sentence into the decoder RNN, and you're going to pass over that final hidden state to be the initial hidden state of the decoder. And then, uh, for every step of the decoder RNN, you're going to produce the, uh, probability distribution of what comes next, which is the, the y hats. And then from those, you can compute your loss. And the loss is just the same as we saw for, uh, unconditional language models. It's, uh, the cross entropy or you could also say negative log-likelihood of the true next word. So for example, on those selected ones, uh, the loss is the negative log probability of the correct next word. And then as before, we're going to average all of these losses to get the total loss for the example. Uh, so a thing you might notice people saying in, for example, research papers is this phrase end-to-end. And this is an example of learning a system end-to-end. And what we mean by this is that the backpropagation is happening end-to-end; one end is the loss function, and the other end I guess is kind of like the, the beginning of the encoder RNN. The point is that, uh, backpropagation, uh, flows throughout the entire system, and you learn the entire system with respect to this single, uh, loss. Yep? [inaudible] The question is, if the decoder RNN outputs the end token too early, then how can you measure the loss on, uh, the words that came after that? So this is the difference between training time and test time, which is pretty confusing. So, uh, during testing, we have this picture where you feed the token back in. So in this scenario, once you produce END, then you have to stop because you can't feed END in as the input to the next step. But in training, you don't feed the thing that you produced into the next step. During training, you feed the target sentence from the corpus. So like the gold target sentence into the model. So no matter what the, uh, the decoder predicts on a step, you kind of, you don't use that for anything other than computing loss. Any other questions? Yeah. Is there a reason why you would, uh, backpropagate end-to-end instead of maybe training an encoder like [inaudible] model and then [inaudible] together? The question is, is there a reason why you would want to train end-to-end when, for example, you might want to train the encoder and the decoder separately? Uh, so I think, uh, people view training end-to-end as favorable because the idea is that you can optimize the system as a whole. You might think that if you optimize the parts separately, then when you put them together, they will not be optimal together necessarily. So if possible, directly optimizing the thing that you care abou- about with respect to all of the parameters is more likely to succeed. However, there is a notion of pre-training.
And as you said, maybe you'd want to learn your, um, decoder RNN as a kind of a language model, an unconditional language model by itself. And that's something that people do. You might, uh, learn a very strong language model, and then use that to initialize your decoder RNN, and then fine-tune it on your task. That's a, a valid thing you might try to do. Yep. Are you always [inaudible] The question is, is the length of the source sentence and the length of the target sentence fixed? So for example, is the source sentence always length 4? Uh, no. That's definitely not true because in your parallel corpus, you're going to have sentences of all lengths. Uh, so this is more kind of an implementation or a practicality question. Uh, the idea is that this is what you mathematically want to be computing during training for each example, and you're going to have batches of examples. But the question is, how do you actually implement them in, uh, in practice? So what you usually do just because it's easier to assume that your batch is this kind of even-sized tensor where everything is the same length, is you pad any short sentences up to some predefined maximum length, or maybe the length of the maximum example in your batch, uh, and then you make sure that you don't use any hidden states that came from the padding. Yep. I believe two languages together [inaudible] possible to have a system [inaudible] that will be kind of universal with similar languages or something like that? Okay. So the question I think is, uh, it seems like sometimes you wouldn't want to train things end-to-end, and there are circumstances in which you might want to train things separately, and you mentioned, for example, having, uh, different languages mapped to each other. So this is a totally valid point, and in fact, uh, so far we've, kind of, assumed that you want to learn language A to language B as a pair, right? And that's different to language A to language C or even language B to language A. And, um, that does mean you have kind of n-squared many systems in the number of, uh, languages you're considering. So, yeah, that's actually a valid idea, and this is something that people have researched. The idea that maybe you could have a, kind of, mix and match with your encoders and decoders. And you could try to, uh, train a kind of general purpose, let's say English decoder and then match it up with your different encoders. Uh, but this is, I think fairly complex to train to, to make sure that they all work together. But that, that is certainly something that people have done. Let me just check on the time. Okay. Let's take one more question. Yep. So does the word embedding also come from the same corpus that we are training on? The question is, does the word embedding also come from the corpus that you're training on? So I think there's a few options just as we saw with language models; you could download, uh, pretrained word vectors like Word2Vec or GloVe, and you could use those. And then you can either, kind of, freeze them or you could fine-tune them as part of the end-to-end training, or you could just initialize your word vectors as, uh, you know, close to zero random and then learn them from scratch. All right. Okay, moving on. Uh, so now we understand how you would train a neural machine translation system. And we talked briefly about how you might, uh, do decoding or generation. 
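To recap the training procedure described above (teacher forcing, cross-entropy loss averaged over decoder steps, and padding within a batch), here is a minimal PyTorch-style sketch of one training step. The toy model, batch, and token ids are made up for illustration and are not the assignment's actual code; padding is handled by packing the source sequences and ignoring padded target positions in the loss, which is one common way to avoid using hidden states that came from the padding:

```python
# Minimal sketch of one seq2seq training step: teacher forcing + cross entropy,
# with <pad>=0 handled via pack_padded_sequence (encoder) and ignore_index (loss).
# Illustrative only; the tiny model and toy batch are made up.
import torch
import torch.nn as nn

V_SRC, V_TGT, EMB, HID, PAD = 50, 50, 16, 32, 0

src_embed, tgt_embed = nn.Embedding(V_SRC, EMB), nn.Embedding(V_TGT, EMB)
encoder = nn.LSTM(EMB, HID, batch_first=True)
decoder = nn.LSTM(EMB, HID, batch_first=True)
out_proj = nn.Linear(HID, V_TGT)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)    # padded positions add no loss
params = (list(src_embed.parameters()) + list(tgt_embed.parameters()) +
          list(encoder.parameters()) + list(decoder.parameters()) +
          list(out_proj.parameters()))
optimizer = torch.optim.SGD(params, lr=0.1)

# Toy batch of 2 sentence pairs, padded with 0 up to the longest in the batch.
src = torch.tensor([[4, 5, 6, 7], [8, 9, 0, 0]])
src_lens = torch.tensor([4, 2])
tgt = torch.tensor([[1, 11, 12, 13, 2], [1, 14, 15, 2, 0]])   # 1=START, 2=END

# Encode, packing so the final state doesn't come from <pad> steps.
packed = nn.utils.rnn.pack_padded_sequence(src_embed(src), src_lens,
                                           batch_first=True, enforce_sorted=False)
_, enc_state = encoder(packed)

# Teacher forcing: feed the gold target words, predict the next gold word.
dec_in, gold = tgt[:, :-1], tgt[:, 1:]
dec_out, _ = decoder(tgt_embed(dec_in), enc_state)
logits = out_proj(dec_out)                           # (batch, steps, vocab)

loss = criterion(logits.reshape(-1, V_TGT), gold.reshape(-1))
optimizer.zero_grad()
loss.backward()                                      # backprop end-to-end
optimizer.step()
print(loss.item())
```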
So what I showed you before is something called, uh, greedy decoding, which is this idea that on each step, you just choose the argmax, the top one best word, and then you feed that in on the next step. So this is called greedy decoding because you're just taking the best, uh, the best option that you can see right now, and then you really don't have a way to go back. So can anyone see a problem with this method? Maybe I've kind of given it away but, uh, yeah. [inaudible]. You said too expensive. Um, I guess I mean it is expensive in that you have to do it as a sequence, and a sequence is usually worse than something you can do in parallel. But I suppose, um, maybe what's wrong with the greediness? Can anyone suggest what's wrong with the greediness, yeah? [inaudible] [NOISE] That's not necessarily gonna give you the argmax over the entire sentence. That's exactly right. That's, uh, kind of what, uh, what greediness means. So in practice, this might give you something like this. Uh, we're trying to translate our running example sentence and let's suppose on the first step we say, "He," and then we say, "He hit," and then we say, "He hit a," oh no, that wasn't right. That wasn't the best thing to choose but we kinda have no way to go back now, right. We just have to continue and try to make the best of it after saying, "He hit a," which isn't gonna work out well. So that's the main problem with greedy decoding. There's kind of no way to backtrack, no way to go back. So how can we fix this? And this relates back to, uh, what I told you earlier about how we might use, uh, a kind of searching algorithm to do decoding in SMT. Uh, but first, you might, uh, think exhaustive search is a good idea. Well, probably not, because it's still a bad idea for the same reasons as before. So if you did want to do exhaustive search, and search through the space of all possible English translations, uh, then you would be again trying to consider which y maximizes, uh, this product of all of these individual probability distributions. So as before, if you try to do this, uh, then on each step T of the decoder, you're gonna be having to track V to the power of T possible partial translations, uh, where V is your vocabulary size. So here when I say partial translation, I just mean, uh, a kinda, you know, like, half of a sentence so far, or something like that. So, of course, this, uh, V-to-the-T complexity, which is exponential in T, is just far too expensive. So yes, we're gonna use some kind of search algorithm, and in particular, we're gonna use beam search decoding. So the core idea of beam search decoding is that on each step of the decoder, you're gonna be keeping track of the K most probable partial translations, and we call partial translations hypotheses, because we're kind of tracking multiple of them and we're not sure which one is best. So we're thinking about several. Here K is an integer and we call this the beam size, and in practice for NMT this is usually maybe 5 to 10. So you can think of, uh, K kind of as how big is your search space at any one time. So if you increase K, then you're going to be considering more different options on each step and you might hope that this will mean that you get the best quality solution in the end, though of course it'll be more expensive. So I said that we want to keep track of the K most probable partial translations, that is, hypotheses. So this means that we need some kind of notion of, you know, how probable is this hypothesis or what's its score.
So the score of a hypothesis, and, uh, we're representing that as Y_1 up to Y_T, um, is just its log probability. So, uh, the log probability of this partial translation, uh, according to the language model can be broken down as we saw before into the sum of the individual log probabilities of the words given everything that came before. So it's, if it's not obvious, uh, these scores are all negative because we're taking log of, uh, of a number between 0 and 1. Uh, and a higher score is better. Yes, because you want a higher probability of, uh, of the hypothesis according to the language model. So the idea is that we're gonna use this score, uh, and the search algorithm to search for high-scoring hypotheses and we're gonna track the top K on each step. So I'm gonna show you a detailed example in a moment, but the important thing is to know that beam search is not guaranteed to find an optimal solution. Uh, exhaustive search, the one where you enumerate, enumerate all V to the T possible translations, that is guaranteed to find the optimal solution but it is just completely infeasible because it's so expensive. So beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search of course. Okay. So, um, here's an example of beam search decoding in action. Uh, so let's suppose the beam size K, uh, is 2 and then as a reminder, we have, uh, this is the score that you apply to a partial translation, which is a hypothesis. So we start off with our starting token, and the idea is that we're going to compute the probability distribution of what word might come next. So having computed that probability distribution using our seq2seq model, then we just take the top K, that is top two possible options. So let's suppose that the top two are the words "He" and "I". So the idea is that we can compute the score of these two hypotheses, uh, by using the formula above. It's just the log probability of this word given the context so far. So here, let's say that "He" has a score of -0.7 and "I" has a score of -0.9. So this means that "He" is currently the best one. Okay. So what we do is, uh, we have our K equals 2 hypotheses, and then for each of those, we find the top K words that could come next. And we calculate their scores. So this means that for both "He" and "I" we find the top two words that could come next. And for each of these four possibilities, uh, the score of the hypothesis is equal to, uh, the log probability of this new word given the context so far plus the score so far, because you can accumulate the sum of log probabilities. You don't have to compute it from scratch each time. So here you can see that we have these four possibilities and that the top two scores are -1.6 and -1.7. So this means that hit and was are the two best ones. So the idea is that of these K squared equals 4 hypotheses, we're just gonna keep the K equals 2 top ones. And then we just keep doing the same thing. For these two, we expand to get the two next ones. And then of those we compute the scores, and then we keep the two best ones and discard the others and then of those, we expand. So we keep doing this again and again, expanding and then just keeping the top K and expanding like this until, uh, you get some kinda, uh, finished translation. I'm going to tell you more in a moment about what exactly the stopping criterion is. But let's suppose that we stop here.
Uh, looking at the four hypotheses that we have on the far right, the one with the top score is, uh, the top pie one with -4.3. So let's suppose that we're gonna stop now and we decide that this is the top hypothesis; then all we need to do is just backtrack through this tree in order to find the full translation, which is "He hit me with the pie." All right. So, um, let me tell you more detail about how exactly we decide when to stop. So if you remember in greedy decoding, usually we just keep decoding until the model produces the END token. So for example, this means that your model is actually producing the sequence, uh, I guess it doesn't produce START, you give it START but then it produces the sequence "He hit me with a pie" END. So the problem in beam search decoding is that you're considering all these different hypotheses, K different hypotheses at once and the thing is those hypotheses might produce END tokens at different times. So there's no one obvious place to stop. So what we do in practice, is when a hypothesis produces the END token, then we regard this hypothesis as complete and we kind of place it aside. We have a collection of completed hypotheses. So we kind of take it out of beam search, we no longer keep exploring it because it's finished, uh, and we, yeah, place it aside. And you continue exploring other hypotheses with beam search. So the remaining question is when do you stop doing beam search? When do you stop iterating through this algorithm? So there's, uh, uh, multiple possible stopping criteria but two common ones are: you might say, uh, we're gonna stop doing beam search once we reach time step T, where T is some, uh, predefined threshold that you choose. So you might say, uh, we're gonna stop beam search after 30 steps because we don't want any output sentences that are longer than 30 words for example, or you might say, "Uh, we're gonna stop doing beam search once we've collected at least N completed hypotheses." So you might say, "Uh, I want at least 10 complete translations before I stop doing beam search." Okay. So what's the final thing you have to do? Uh, we finished doing beam search, um, we have this collection of completed hypotheses. Uh, we want to choose the top one. Uh, the one that we're going to use as our translation. So, uh, how do we select the top one that has the highest score? Uh, you might think this is simple given that all of these hypotheses already have scores attached. But if we just look at this, uh, formula again, uh, for what the score is of each hypothesis. Uh, can anyone see a problem with this? If we have our sets of hypotheses, and then we're choosing the top one based on the one that has the best score, can anyone see a problem? Yeah. [NOISE] So the answer was you're gonna end up choosing the shortest one. The problem here is that longer hypotheses have lower scores in general because you're multiplying more probabilities so you're getting a smaller, a smaller overall value, or I guess if we're adding log probabilities we're gonna get more negative values. So it's not quite that you will definitely choose the shortest hypothesis, because a longer one could still have, uh, a higher overall score, but there's definitely going to be a bias towards shorter translations, uh, because longer ones will in general have lower scores. So the way you can fix this is pretty simple, you just normalize by length. So instead of using the formula we have above, you're going to use, uh, the score divided by the length, that is, the number of words. And then you use this to select the top one.
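Here is a minimal, self-contained sketch of the beam search procedure just described, including setting completed hypotheses aside and selecting the final output with the length-normalized score. The toy next_word_logprobs function is a made-up stand-in for a real decoder step, purely for illustration:

```python
# Minimal, self-contained sketch of beam search decoding with length
# normalization. The toy next_word_logprobs function stands in for a real
# seq2seq decoder step and is made up purely for illustration.
import math

VOCAB = ["he", "hit", "struck", "me", "with", "a", "pie", "<END>"]

def next_word_logprobs(hypothesis):
    # Placeholder for log P(next word | words so far, source sentence).
    # A real NMT system would run the decoder RNN here; this toy version
    # just prefers continuing "he hit me with a pie".
    target = ["he", "hit", "me", "with", "a", "pie", "<END>"]
    step = len(hypothesis)
    scores = {w: math.log(0.01) for w in VOCAB}
    if step < len(target):
        scores[target[step]] = math.log(0.8)
    return scores

def beam_search(k=2, max_steps=10, n_complete=2):
    beam = [([], 0.0)]              # (partial translation, sum of log probs)
    completed = []                  # finished hypotheses, set aside
    for _ in range(max_steps):
        candidates = []
        for words, score in beam:
            logps = next_word_logprobs(words)
            # expand each hypothesis with its top-k next words
            for w in sorted(logps, key=logps.get, reverse=True)[:k]:
                candidates.append((words + [w], score + logps[w]))
        # keep only the k highest-scoring candidates
        beam = []
        for words, score in sorted(candidates, key=lambda c: c[1], reverse=True)[:k]:
            if words[-1] == "<END>":
                completed.append((words[:-1], score))   # place it aside
            else:
                beam.append((words, score))
        if len(completed) >= n_complete or not beam:
            break
    # choose the best completed hypothesis by length-normalized score
    return max(completed, key=lambda c: c[1] / len(c[0]))[0]

print(" ".join(beam_search()))      # -> he hit me with a pie
```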
Any questions on this? [NOISE]. Yeah. Can we train with the END token so that it is possible to [inaudible] I didn't quite hear that, can you train with the END token? Yeah, like we had an END token. Yes. So you train with the END token, if that's your question. Um, because the whole point is you're relying on your language model, your decoder to produce the END token in order to know when to stop. So you need to train it to produce the END token by giving it examples of training sentences with END tokens. Yeah. Why don't we use this score being changed [inaudible] Great question. The question is, why don't we use this normalized score, the one at the bottom of the screen, during beam search in the first place? So the reason why that's not necessary, you could, but it's not necessary, is because during beam search, we only ever compare the scores of hypotheses that have the same length, right? So on each of these steps, where we look at, let's say, the top k squared and we want to choose which ones are the top k, we're comparing the scores of four different hypotheses that are all of the same length (one, two, three, four, five, so length five here). So, um, it's true that these scores are getting lower and lower, but in the same way, because they're all length five right now. [NOISE] Okay. So we now understand how you would train an NMT system and how you would use your trained NMT system to generate your translations using, let's say, beam search. So let's all take a step back and think about, what are the overall advantages of NMT in comparison to SMT? Uh, so the first advantage is just better performance. Uh, NMT systems tend to give better output than SMT systems in several ways. One is that the output often tends to be more fluent. Uh, this is probably because, uh, RNNs are particularly good at learning language models, as you learned last week. Uh, another way that they're better is they often use, uh, the context better, that is, uh, they're better at conditioning on the source sentence and using that to change the output. Another way they're better is they often, uh, are more able to generalize what they learn about phrases and how to translate them. So for example, if it sees an example of how to translate a certain source phrase and then later it sees a slightly different version of that source phrase, it's, uh, more able to generalize what it learned about the first phrase than SMT systems would be. Another big advantage of NMT systems compared to SMT that we talked about before is that it's a single neural network that can be optimized end-to-end. And the- the advantage here I suppose is primarily simplicity and convenience. So there's no subcomponents that need to be individually optimized. Another big advantage is that it requires much less human engineering effort. When I told you earlier about all the different things that people had to do to build, uh, big, uh, powerful SMT systems, uh, there's relatively less engineering effort for NMT. And NMT is certainly not easy, but it is less complicated than SMT. In particular, there's no feature engineering. You don't have to define what features of, uh, linguistic phenomena you want to capture. You can mostly just view it as a sequence of words, although, uh, there are different views on that. Uh, lastly, a great thing about NMT is that you can use pretty much the same method for all language pairs.
So if you've, uh, you know, built your French-to-English translation system and now you want to build a Spanish-to-English one, [NOISE] uh, you can probably use basically the same architecture and the same method, as long as you can go find a big enough parallel corpus of Spanish-to-English. All right. So what are the disadvantages of NMT, uh, remaining? So compared to SMT, there are some disadvantages. One is that NMT is less interpretable. Uh, what I mean by this is you feed in your source sentence into the neural network and then it feeds out some target sentence, and you don't really have any way to figure out why that happened, right? So in particular, if the target sentence con- contains some kind of error, um, you can't really look at the neurons and understand what happened. It's pretty hard to attribute errors. So this means that, uh, NMT systems are pretty hard to debug. So by comparison, SMT systems were more interpretable in that you had all of these different sub-components that were doing different jobs. And, uh, you were more able to look at those. They weren't, you know, neurons; often they would be, uh, you know, probabilities of certain words given other words and so on. And, you know, that's by no means easy to interpret but it was at least more interpretable than NMT. Uh, another disadvantage is NMT is pretty difficult to control. So, uh, for example, if your NMT system is, uh, making a particular error, it's not very easy for you, the, uh, programmer, to specify some kind of rule or guideline that you want the NMT system to follow. So for example, if you want to say, I want to always translate this word in this way. Um, when- when this other thing is present, like that's not particularly easy to, uh, to impose as a rule on the NMT system, uh, because you can't, uh, easily control what it's doing on a step-by-step basis. So sometimes you have some kind of, uh, post-processing rules you might try to do, but overall you can't. It- it's- it's harder than you'd expect to try to, um, impose even a fairly simple rule. [NOISE] So this means it has some kind of safety concerns in fact. Because, uh, let's say, you know, you don't want your NMT system to say bad things, right? It's- it's pretty hard to actually put, um, these, uh, controls in place to stop it from saying these things that you don't want it to say. I mean, on the level of maybe just never saying particular bad words, then sure you can remove them from the vocabulary. But overall they're pretty hard to control, and we're actually gonna see some examples of NMT systems, you know, doing things that their, uh, designers certainly didn't intend. Okay. So, uh, how do we evaluate MT? Uh, every good NLP task needs to have an automatic metric so that we can, uh, measure our progress. So the, uh, most commonly used evaluation metric for MT is called BLEU and that stands for Bilingual Evaluation Understudy. So the main idea is that BLEU is gonna compare the translation that's produced by your machine translation system. It's gonna compare that to one or maybe several human-written translations of the same sentence. And then it's gonna compute a similarity score that's based on n-gram precision. So when I say n-gram precision, I mean you're gonna look at all the one, two, three, and four grams that appear in your, uh, machine-written translation and your human-written translation.
And then n-gram precision is basically saying, for all of the n-grams that appeared in the machine-written translation, how many of those appeared in, you know, at least one of the human-written translations? Another thing that you need to add to BLEU is a brevity penalty. Uh, so you're saying that you get a lower BLEU score if your system translation is significantly shorter than all of the human-written translations. And the reason why you need to add this is because n-gram precision alone doesn't [NOISE] really punish using, uh, fewer words. So you might try to maximize n-gram precision by being very conservative and writing, uh, short sentences that only contain words that you're really sure about, and then you get a good precision score. But this doesn't make a good translation because you're probably missing a bunch of information that you needed to translate from the source sentence. So that's why you need to add the brevity, uh, penalty. So overall, um, BLEU is very useful because, uh, we need an automatic metric in order to, uh, measure progress; you can't measure progress on human evaluation alone because it takes too long [NOISE] to compute. Um, but of course it's pretty imperfect. So for example, you can think about how there are many ways- many valid ways to translate a sentence. At the very beginning of this lecture, I asked how do we translate that sentence, uh, by Rousseau and there were at least a few different options that came up. Uh, so if there's many valid ways to translate a sentence, how does BLEU recognize that? BLEU is [NOISE] rewarding sentences that have a high n-gram overlap with, uh, one or some of the human-written translations. But, if, uh, you write one, if your model writes one valid translation and the humans wrote a different valid translation and they don't have high n-gram overlap, then BLEU is going to, uh, give you a low score. So, um, you're going to learn about BLEU in detail in Assignment 4, and in fact Assignment 4 has a full description- mathematical description of what the BLEU score is. So I'm not gonna tell you about that now, uh, yes, so you're gonna think about BLEU and the- the ways in which it's imperfect but useful. Yeah. So would one- one n-gram, be a one to one equivalency? What? Would a one n-gram be a one to one equivalency? The question is, would a one n-gram be a one-to-one equivalency? I'm not sure I understand the question. You're asking about alignment or something else? Uh, just trying to get an idea about how they're doing n-gram checks, is it doing all n-gram permutations or is it doing like window size of one? Well, I guess for a one-gram it doesn't make a difference because you can't permute a one-gram. Okay. So you're asking, for example, for four-grams, are they checking, uh, whether this exact sequence of four appeared or any permutation of it? It's exact sequences. So by definition, n-grams are sequences where the order matters. Okay. All right. So, uh, that's how you evaluate machine translation. So now that you understand this metric of how we evaluate our progress on machine translation, um, I can show you this graph and you might understand what it means. So this is a, uh, bar plot which shows in a nutshell how NMT changed the machine translation, uh, landscape in just a few years. So in this plot, we've got BLEU score on the Y-axis. Uh, and you have two different types of SMT, which are the red and the dark blue, uh, bars.
And what's happening is, uh, in 2015, uh, Neural MT enters the scene for the first time and it isn't doi- doing as well as SMT, and then the next year it's suddenly outperforming SMT. And here these are BLEU scores on some particular fixed dataset, like, uh, a shared task that many people were, um, submitting systems for. [NOISE] So the main thing to notice here is that the progress that was being made by SMT systems was, you know, a fairly gentle increase in BLEU year-by-year. And then in just one year, NMT arrives and is suddenly making, uh, much more rapid progress. So I think this justifies why the picture of the meteor maybe isn't too Jurassic here. So you could in fact call NMT the biggest success story of NLP deep learning. Uh, because if you think about the history of this, NMT went from being a fringe research activity in 2014 to being actually the leading standard method for machine translation in the world in 2016. In particular, in 2014, the first seq2seq paper was published. And in 2016, Google Translate switched from SMT to NMT. This is a pretty remarkable turnaround for just two years. So this is amazing, not just because it was a quick turnaround, but also if you think about the level of human effort involved. Uh, these SMT systems, for example the Google Translate SMT system, was built by doubtless hundreds of engineers over many years. And this, uh, this SMT system was outperformed by an NMT system that was trained by, uh, you know, relatively few, like a handful of engineers in a few months. So I'm not- I'm not diminishing how difficult it is to, um, build NMT systems, and certainly I'm sure Google's NMT system today is built by more than a handful of engineers in a few months. I'm sure it's a very big operation now. Uh, but when NMT, uh, began to outperform SMT, it was pretty remarkable how it was able to do that, uh, based on the amount of effort involved. Yeah. Given the [inaudible] cons of NMT has there been research on combining the two and if there is, what does that look like? Yeah, great. The question is, given that we know that there are some disadvantages of NMT even in comparison to SMT, is there any work on combining the two? So, yes. I think there is. Uh, there's a lot of NMT research ongoing and in particular, people sometimes focus on these particular shortcomings. And, uh, there's a lot of work in kind of taking techniques and ideas and wisdom from the many decades of SMT research, and then integrating them into the new NMT paradigm. So yes. [NOISE]. Okay. So is machine translation solved? Can we all go home? I think the answer is clearly no. Uh, NMT definitely is not doing machine translation perfectly. So, um, just to highlight some of the difficulties that remain with NMT. Uh, one is out-of-vocabulary words. Uh, this is a kind of basic problem but it is pretty tricky. You know, what do you do if you're trying to translate a sentence that contains a word that is not in your source vocabulary, or what if you're trying to produce a word that's not in your target vocabulary? Um, there's certainly been lots of work on doing this, and you're going to hear later in the class how you might try to attack this; for example, uh, sub-word modeling can make it easier. Uh, but this is a significant problem. Another one is domain mismatch. So let's suppose that you train your machine translation system on a bunch of fairly, uh, formal text, like let's say, uh, Wikipedia or something like that.
Uh, but then you try to deploy it to translate informal text, like people chatting on Twitter or something. Then often, you'll find that it doesn't perform very well on this different domain because you've got a domain mismatch. Uh, so that's quite a big problem. Another one is maintaining context over longer text. So everything we've talked about so far has assumed that you were just translating a single sentence to a single sentence, and there's no other wider context. Uh, but, you know, if you want to use a machine translation system to translate a whole news article and maybe even a book, then you're probably going to want to use the context that came in previous sentences in order to translate things correctly in the current sentence. So, uh, this is an active area of research: how can you get an NMT system to condition on larger pieces of context without it becoming too expensive and so on? Another difficulty is low-resource language pairs. Um, everything we've talked about so far has assumed that you have access to a very large parallel corpus, but what if you don't? What if you are trying to translate to or from a language that has relatively little text available, um, online for example? So that can be pretty difficult. Here are a few examples of machine translation screwing up, uh, with specific errors. So, here's an example of how common sense is really difficult for NMT systems. On the left, we have the English phrase paper jam, which means when your printer gets jammed up with paper and it's all, uh, tangled inside. And then on the right, we have a very literal translation of that into Spanish, and it's essentially saying jam, edible jam made of paper, which clearly isn't the right interpretation. So here, we have an NMT system that's just doing very literal translation and clearly doesn't have any notion of common sense. You can't make jam from paper. Uh, here's another example. NMT can pick up biases in the training data. We already talked about this at the, uh, the word embedding level, the representation of words. Uh, but it can also be a problem at the, you know, the sentence level when you're translating things. So here in this example, uh, on the left, we have two sentences in Malay that roughly mean, uh, they work as a nurse, and they work as a programmer. The point is on the left, there is no information about gender in the pronouns. But when it gets translated to English, then we've suddenly got gender coming out of nowhere: she works as a nurse, and he works as a programmer. This is likely happening because in our training data, we had more examples of female nurses and male programmers. So you can understand why, from a machine learning, uh, maximizing-the-objective point of view, the, uh, English language model has learned to do that. But the problem here is this isn't good machine translation. Uh, here the system is making up information that was not present in the source sentence. So this is certainly an error that the machine translation system shouldn't be making because it's just simply inaccurate. And even worse, it's propagating, uh, gender roles. Here's another pretty weird example. [LAUGHTER] What is happening here? Uh, on the left, we have a nonsense sentence, this is just kind of a syllable repeated. And we're supposedly translating from Somali. Uh, and then we're asking it to translate this into English and then we're getting this out of nowhere.
Um, as the name of the Lord was written in the Hebrew language, it was written in the language of the Hebrew nation, and you might be thinking "Where on earth did that come from?" And in fact, this got reported in the media as, you know, Google Translate wants to convert you to its religion or whatever. [LAUGHTER] Um, so for sure, it is very startling. But the thing is there's actually quite a reasonable explanation. So what's going on here is that, um, often for low-resource languages, such as, for example, Somali, um, one of the best resources of parallel text is the Bible. So you train, for example, Somali to English using the Bible as a training text, maybe among other texts. Okay, that's the first puzzle piece. But the other puzzle piece is the nonsensical input. So when the input isn't really Somali or any kind of text, right? It's just the same syllable over and over. Then the NMT system doesn't really have anything sensible to condition on. It's basically nonsense, it's just noise. So what does the NMT system do? Right? It can't really use, it can't really condition on the source sentence. So what it does, is it just uses the English language model, right? You can think of it as like the English language model of the decoder RNN just kind of goes into autopilot and starts generating random text, kind of like we saw last week when we saw, uh, a language model trained on Obama's speeches or Harry Potter would just generate text in that style. That's kind of what's happening here with the Bible, because we don't have any useful information, um, from the sentence on the left. Um, so, this is an example of why, uh, neural machine translation in particular makes these kinds of errors, uh, because the system is uninterpretable. So you don't know that this is going to happen until it happens, and perhaps Google didn't know this was going to happen until it happened and it got reported. Um, so this is one downside of uninterpretability: really weird effects can happen and you don't see them coming and it's not always even easy to explain why they happened. Yeah? [inaudible]. Ah, the question is, what happens if you did translate from Irish? I suppose that's the part where Google tries to autodetect the language; maybe it thinks that ag ag ag is more like Irish than Somali. [LAUGHTER] I imagine if you did put Irish to English, there's probably more, uh, training data for Irish to English. So maybe it wouldn't be so Bible-focused. Um, yeah, and there are a lot of examples of these online where you do different kinds of nonsense syllables in different languages. So there are a lot of, uh, challenges remaining in NMT. And, uh, the research continues. So NMT, I think, remains one of the flagship tasks for NLP Deep Learning. In fact, NMT research has pioneered many of the successful innovations of NLP Deep Learning in general. Uh, so today in 2019, uh, NMT research continues to thrive, there are still many, many papers, uh, published all the time on NMT. And in fact, uh, researchers have found lots of improvements to the fairly vanilla seq2seq models that I've shown you today. Uh, but in fact, there is one improvement that is so integral to seq2seq that you could regard it as the new vanilla. And that's the improvement we're going to learn about today, and it's called attention. Okay. So Section 3 is on attention. What is attention? First, I'm going to motivate why we need this thing called attention. So let's look at this diagram that we saw before of sequence-to-sequence.
And remember when we assumed that this, uh, encoding of the source sentence, the, the one in the orange box, is going to represent the whole sentence. Uh, can anyone volunteer a problem you can see with this architecture? In particular perhaps, a problem with this idea that that single vector is the encoding of the source sentence. Yeah? [inaudible] Okay, so the answer is something like, um, you're only looking at one word, you mean like the last word in the source sentence? And you're not seeing more information. Yeah some- it's, it's something like that. Any other ideas? Yep. We might have lost information in the beginning of the sentence by the time you get to the end. Yeah. You might have lost information from [NOISE] the beginning of the sentence by, by the time you get to the end, especially if it was longer than four words. Right. I think these are different ways of saying a similar idea [NOISE] which is that we have a kind of informational bottleneck. Uh, we're forcing all of the information about the source sentence to be captured in this single vector because that's the only thing that gets given to the decoder. If some information about the source sentence isn't in our vector, then there's no way the decoder is gonna be able to translate it correctly. So this is the, yeah, this is an informational bottleneck. [NOISE] It's putting kind of too much pressure on this single vector to be a good representation [NOISE] of the encoder. So this is the motivation for attention. Attention is a neural technique and it provides a solution to the bottleneck problem. The core idea is that on each step [NOISE] of the decoder, you're gonna use a direct connection to the encoder to focus on a particular part of the source sequence. So first I'm gonna show you what attention is via a diagram, so that's kind of an intuitive explanation. And then I'm gonna show you the equations later. So here's how seq- sequence-to-sequence with attention works. So on the first step of our decoder, uh, we have our first decoder hidden state. So what we do is we take the dot-product between that decoder hidden state and the first [NOISE] encoder hidden state. And then we get something called an attention score, which I'm representing by a dot. So that's a scalar. [NOISE] And in fact, we take the dot-product between the decoder hidden state and all of the encoder hidden states. So this means that we get one attention score or one scalar for each of these, uh, source words effectively. So next what we do is we take those four scores and we apply the softmax, uh, distribution, uh, the softmax function to them and then we get a probability distribution. So here, I'm going to represent that probability distribution as a bar chart. Um, and we call this the attention distribution and this one sums up to 1. So here, you can see that most of the probability mass is on the first word. And that kinda makes sense because our first word essentially means "he" and, uh, we're gonna be producing the word "he" first in our target sentence. So once we've got this attention distribution, uh, we're going to use it to produce something called the attention output. So the idea is that the attention output is a weighted sum of the encoder hidden states and the weighting is the attention distribution. So I've got these dotted arrows that go from the attention distribution to the attention output; probably there should be dotted arrows also from the encoder RNN but that's hard to depict.
[NOISE] But the idea is that you're summing up these encoder RNN, uh, hidden states, [NOISE] but you're gonna weight each one according to how much of the attention distribution is on it. So this means that your attention output, which is a single vector, is going to be mostly containing information from the hidden states that had high attention. In this case, it's gonna be mostly information from the first hidden state. So after you do this, you're going to use the attention output to influence your prediction of the next word. So what you usually do is you concatenate the attention output with your decoder hidden state and then, uh, use that kind of concatenated pair in the way you would have used the decoder [NOISE] hidden state alone before. So that way you can get your probability distribution, uh, y hat 1, of what's coming next. So as before, we can use that to sample the next word. [NOISE] So on the next step, you just do the same thing again. You've got your second decoder hidden state. Again, you take the dot-product with all of the encoder hidden states. You take softmax over that to get the attention distribution. And here, you can see the attention distribution is different. We're putting more attention on, uh, the, the word entarté because we're about to produce the word hit. Uh, but we're also attending a little bit to the second word a because that's telling us that hit is in the past tense. So a cool thing that's happening here is we're getting [NOISE] a soft alignment. If you remember when we looked at alignment in SMT systems, it was mostly this, uh, hard binary thing, on or off: either these words are aligned or they're not. Here, you have a much more flexible soft notion of alignment where, uh, each word kind of has a distribution over the corresponding words in the source sentence. So another thing to note, kind of a side note, is that sometimes, uh, we take the attention output from the previous step, uh, and we kind of feed it into the decoder again along with the usual word. So that would mean you take the attention output from the first step and kind of concatenate it to the word vector for he and then use it in the decoder. Uh, the reason for this is sometimes it's useful to have this, uh, information from the, the attention on the previous step on the next step. So I'm telling you this because this is something we do in Assignment 4 and it's a fairly common technique, but also sometimes people don't do it. Okay. So, um, the theory is that you just do this attention, uh, computation on every step. And on each step, you're going to be attending to different things. So in our example, on this third step, we look at m', which means me, when we produce me, and then on the last three steps [NOISE] we're probably mostly just gonna be looking at this, uh, fertile word entarté to produce hit me with a pie. [NOISE] I'm gonna keep going because we don't have a lot of time. Uh, so here are the equations to describe attention. Uh, I think it's probably easier to look at these in your own time later rather than look at them in the lecture now. But these are the equations that essentially say the same thing as what the diagram just said. So you have your encoder hidden states h_1 up to h_N. And then on timestep t of the decoder, we also have a decoder hidden state, uh, s_t. So we're gonna get the attention scores, which we're gonna call e^t, by taking the dot-product of your decoder hidden state with each of the encoder hidden states.
[NOISE] And that gives you, uh, a vector of the same length as the, uh, source [NOISE] sentence, because you've got one score per source word. Next you take softmax over these scores to get the attention distribution that sums up to 1, and we call that alpha. And then you use alpha to take a weighted sum of the encoder hidden states and that gives you your attention output. So [NOISE] the attention output, which we call a, is a vector that's the same size as your encoder hidden states. Lastly, you take your attention output a and then you, [NOISE] uh, concatenate it with your decoder hidden state and then proceed with that as you were taught before in the no-attention model. So attention, if it's not clear, it's pretty cool. It has a number of advantages. So one advantage is that attention just significantly improves NMT performance. And the main reason why it improves it, is because it turns out it's super useful to allow the decoder [NOISE] to focus on certain parts of the source sentence when it's translating. And you can see why this makes sense, right? Because there's a very natural notion of alignment, and if you can focus on the specific word or words you're translating, you can probably do a better job. Another reason why attention is cool is that [NOISE] it solves the bottleneck problem. Uh, we were noting that the problem with having a single vector that has to represent the entire source sentence, [NOISE] and that's the only way information can pass from encoder to decoder, is that if that encoding isn't very good then, uh, you're not gonna do well. So by contrast, uh, with attention, the decoder can look directly at the encoder and the source sentence and translate without the bottleneck. [NOISE] Another great thing about attention is that it helps with the vanishing gradient problem, especially if your sentences are quite long. Uh, the reason why attention helps is because you have these direct connections between the decoder and the encoder, kind of over many time steps. So it's like a shortcut connection. And just as we learned last time about, uh, skip connections being [NOISE] useful for reducing vanishing gradients, here it's the same notion. We have these, uh, long-distance [NOISE] direct connections that help the gradients flow better. Another great thing about attention is it provides some interpretability. Uh, if you look at the attention distribution once you've produced your translation, uh, you can see what the decoder was focusing on on each step. So for example if we run our system and we translate our, our running example here, then we can produce a plot, kind of like this, that shows the attention distribution. So here, dark means high attention and white means low attention. So you might see something like this where, um, it was, it was focusing on the different words at different steps. And this is basically the same kind of plot that we had earlier with a hard notion of alignment, uh, in SMT, except that we, uh, we have more flexibility to have a more soft version of alignment, like for example when we produce the English word hit, perhaps we were mostly looking at entarté, but we're also looking a little bit at a. So this, uh, means that we're getting, uh, alignment for free. And the reason I say for free is because when you remember the SMT systems, the whole point there is that you had to learn an alignment system deliberately and separately.
You had to define the notion of alignment, you had to define the model for calculating what the probabilities of different alignments were, and train it. Whereas here, we never told the NMT system about alignments. We never explicitly trained an alignment system. We never had a loss function that tells you how good your alignment was. We just gave the NMT system the apparatus to do something like alignments and told it to minimize the, uh, the cross-entropy loss for doing machine translation. And then the network just learned alignment by itself. I think this is the coolest thing about attention, is that it's learned some structure in a somewhat unsupervised way. Okay, so in the last few minutes, I'm just going to, uh, generalize the notion of attention. Because it turns out that attention is actually a very general, uh, deep learning technique that you can apply in lots of different circumstances. So you've seen that attention is a great way to improve the sequence-to-sequence model for MT, but you can actually use attention for other architectures that aren't seq2seq and also tasks that aren't MT. So to understand this, I'm going to somewhat redefine attention to a more general definition. So here's our more general definition. Suppose you have a set of values, each of which is a vector, and you also have a single vector, which we're calling the query. Then attention is a way, uh, to compute a weighted sum of the values. But the way you weight it is dependent on the query. [NOISE] So we often phrase this, uh, as saying that the query is attending to the values. The idea being that you have all this information that's in the values and the query is somehow determining how it's going to pay attention to the values. So for example in seq2seq, uh, the decoder hidden state is the query. Uh, the decoder hidden state on a particular time step is the query and is attending to all the encoder hidden states, which are the values. All right, here's our definition again. So here's a way to kind of understand this intuitively, two alternative ways. One is to think of it like this. You could think of it as the weighted sum is like a selective summary of the information in the values. And I say selective because your choice of how much you choose to draw from each value depends on the attention distribution. Uh, so the distribution, uh, depends on the query. So the query is determining how much you're going to select from different, uh, values. And this is kind of similar to the LSTMs that we learned about earlier this week. LSTMs are built on the idea of a gate that, uh, [NOISE] that defines how much information shou- should [NOISE] come from different elements. And the gate depends on the context. So the strength of LSTMs came from the idea that based on the context, you decide where you're going to draw information from. And this is kind of like the same idea. The second way to think about attention is you could say that it's a way to obtain a fixed-size representation from an arbitrary set of representations. So when I say an arbitrary set, I'm saying we have this set of vectors called the values, right? And you could have 10 values. You could have 100 values. You can have, uh, any [NOISE] arbitrary number of these vectors. But attention gives you a way to get a single vector, um, summary of that, which is the attention output, uh, using your query. Okay, uh, so the last thing, uh, is that there are actually several variants of attention and this is something we are going to look at a little in Assignment 4.
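As a supplement to that general definition, here is a minimal NumPy sketch of the attention operation. The function name, toy dimensions, and example values are mine, and it uses the basic dot-product scoring from the seq2seq example above; it's an illustrative sketch, not Assignment 4's implementation.

```python
import numpy as np

def attend(query, values):
    """The query (shape (d,)) attends to the values (shape (n, d)).

    Returns a weighted sum of the values, shape (d,), however many
    values there are.
    """
    scores = values @ query                        # one scalar score per value
    scores = scores - scores.max()                 # subtract max for numerical stability
    dist = np.exp(scores) / np.exp(scores).sum()   # attention distribution (sums to 1)
    return dist @ values                           # weighted sum of the values

# In seq2seq MT: query = decoder hidden state s_t, values = encoder hidden
# states h_1..h_N, and the output is the attention output a_t.
s_t = np.random.randn(6)
print(attend(s_t, np.random.randn(4, 6)).shape)    # (6,) with 4 values
print(attend(s_t, np.random.randn(50, 6)).shape)   # (6,) with 50 values
```

Note that the output has the same size whether there are 4 values or 50, which is the "fixed-size representation of an arbitrary set" point made above.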
So in our more general setting, we've seen that we have some values and a query. Doing attention always involves computing the attention [NOISE] scores, and then you apply softmax to get the attention distribution. And then you use that attention distribution to take a weighted sum. So this is, uh, always the outline of how attention works. The part that can be different is this, uh, number one. There are multiple ways you can compute the scores. So, uh, last slide: here are all the different ways you can compute the scores. So the first one, which you've already seen today, is basic dot-product attention. And the idea here is that [NOISE] the score for a particular, a particular value, h_i, is just the dot-product of the query and that particular value. [NOISE] And, uh, in particular this assumes that the size of your query vector and the size of your value vector have to be the same because you're taking a dot-product. [NOISE] Another, uh, version of, uh, attention is called multiplicative attention. And here, the idea is that the score of your, uh, value h_i is going to be this, uh, bilinear function of your query and that value. So in particular, we're putting this weight matrix in the middle and that's a learnable parameter. You're learning the best, uh, weight matrix in order to get the scores, the attention scores, that are useful. The last one is called additive attention. So what's happening here is that the score of the value h_i is, uh, you get it by applying a linear transformation to both the value and the query and then you add them together. And then you put them through a non-linearity like tanh. And then lastly, uh, you take that vector and you take the dot-product with a weight vector to give you a single number that is the score. [NOISE] So here, you've got two different weight matrices and [NOISE] also a weight vector which are the learnable parameters. One thing that's different here is that there's kind of an additional hyperparameter, which is the attention dimensionality. So [NOISE] that's kind of, uh, the, I think it's the heights of W1 and W2 and the length of v, right? You can choose what size that dimension is. It's kind of like a hidden layer in the computation. So, um, you can decide how big you want that intermediate representation to be. Okay, so I'm not going to tell you any more about that because that's actually one of the questions in the assignment, uh, Assignment 4: to think about the relative advantages and disadvantages of these models. [NOISE] Okay. So here's a summary of today. [NOISE] It really is the last slide. [BACKGROUND] I said second last, last time, but this is the last slide. [BACKGROUND] So we learned about the history of MT. [NOISE] We learned about how in 2014 [NOISE] Neural MT revolutionized MT. [NOISE] We learned about how sequence-to-sequence is the right architecture for NMT and it uses two RNNs. And lastly, we learned about how attention [NOISE] is a way to focus on particular parts of the input. All right, thanks.
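As a follow-up to the scoring variants described above, here is a compact sketch of the three score functions (dot-product, multiplicative, and additive) in the same NumPy style. The weight names and shapes are illustrative assumptions, not the assignment's actual code.

```python
import numpy as np

# Score e_i between a query s (e.g. a decoder hidden state, in R^{d2}) and a
# value h_i (e.g. an encoder hidden state, in R^{d1}).

def dot_product_score(s, h_i):
    # Basic dot-product attention: requires d1 == d2, no parameters to learn.
    return s @ h_i

def multiplicative_score(s, h_i, W):
    # Multiplicative (bilinear) attention: W, shape (d2, d1), is learned.
    return s @ W @ h_i

def additive_score(s, h_i, W1, W2, v):
    # Additive attention: W1 (d3, d1), W2 (d3, d2) and v (d3,) are learned;
    # d3 is the extra attention-dimensionality hyperparameter.
    return v @ np.tanh(W1 @ h_i + W2 @ s)

# Tiny usage example with made-up sizes.
d1, d2, d3 = 4, 6, 5
s, h_i = np.random.randn(d2), np.random.randn(d1)
print(multiplicative_score(s, h_i, np.random.randn(d2, d1)))
print(additive_score(s, h_i, np.random.randn(d3, d1),
                     np.random.randn(d3, d2), np.random.randn(d3)))
```

Whichever score function is used, the rest of attention is the same as in the sketch earlier: softmax the scores to get the attention distribution, then take the weighted sum of the values.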
Info
Channel: stanfordonline
Views: 90,228
Rating: 4.9477553 out of 5
Id: XXtpJxZBa2c
Length: 76min 56sec (4616 seconds)
Published: Thu May 02 2019