Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 6 – Language Models and RNNs

Hi, everyone. I'm Abby, I'm the head TA for this class and I'm also a PhD student in the Stanford NLP group. And today I'm gonna be telling you about language models and recurrent neural networks. So, here's an overview of what we're gonna do today. Today, first, we're going to introduce a new NLP task, that's language modeling, and that's going to motivate us to learn about a new family of neural networks, that is recurrent neural networks or RNNs. So, I'd say that these are two of the most important ideas you're going to learn for the rest of the course. So, we're going to be covering some fairly cool material today. So, let's start off with language modeling. Language modeling is the task of predicting what word comes next. So, given this piece of text "the students opened their ___", could anyone shout out a word which you think might be coming next? Purpose. [NOISE]. [OVERLAPPING] Mind, what else? I didn't quite hear them, but, uh, yeah, these are all likely things, right? So, these are some things which I thought students might be opening, uh, "students opened their books" seems likely. Uh, students opened their laptops, students opened their exams, students opened their minds, and incredibly, someone came up with that one just now, uh, it's kind of a metaphorical meaning of opening. So, you are all performing language modeling right now. In thinking about what word comes next, you are being a language model. So, here's a more formal definition of what a language model is. Given a sequence of words X1 up to Xt, a language model is something that computes the probability distribution of the next word, Xt plus 1. So, a language model comes up with the probability distribution, the conditional probability, of what Xt plus 1 is given the words so far. And here we're assuming that Xt plus 1 can be any word w from a fixed vocabulary V. So we are assuming that there is a pre-defined list of words that we're considering. In this way, you can view language modeling as a type of classification task, because there's a predefined number of possibilities. Um, we call a system that does this a language model. There's an alternative way of thinking about a language model as well. You can think of a language model as a system which assigns probability to a piece of text. So, for example, if we have some piece of text, X1 up to X capital T, then the probability of this text according to the language model can be broken down. So, just by definition, you can say that the probability is equal to the product of all of these conditional probabilities. And, uh, the form inside the product is exactly what a language model provides. So, you can think of these things as somewhat equivalent. Predicting next words gives you a system that can give the probability of a given piece of text. So, in fact, you use language models every day. For example, when you're texting on your phone and you're writing a message, then most likely, if you have a smartphone, it will be predicting what word you might be about to say. So, if you say, um, "I'll meet you at the-", your phone might suggest perhaps you mean airport or cafe, or office, for example. Another situation in which you use language models every day is when you search for something on the internet, for example, Google, and you start typing your query, then Google tries to complete your query for you, and that's language modeling. It's predicting what word or words might come next.
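As a quick reference, here are the two views of a language model that were just described, written out in symbols (this restates the definitions above rather than quoting a slide verbatim):

```latex
% Next-word prediction view: a distribution over the next word given the words so far,
% where x^{(t+1)} can be any word w in the fixed vocabulary V.
P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)

% Probability-of-text view: by the chain rule, the probability of a whole text
% decomposes into a product of exactly these next-word conditionals.
P\left(x^{(1)}, \ldots, x^{(T)}\right) = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\right)
```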
So, that's what a language model is, and the question is, how would you learn a language model? So, if I was to ask that question in the pre-deep-learning era, which was really only a few years ago, the answer would be, you would learn an n-gram language model. So, today first we're going to learn about n-gram language models. So, before I can tell you what an n-gram language model is, you need to know what an n-gram is. So, by definition an n-gram is a chunk of n consecutive words. So, for example, a one-gram or unigram is just all of the individual words in the sequence, that would be "the", "students", "opened", "their". A two-gram or bigram would be all of the consecutive chunks of pairs of words, "the students", "students opened", "opened their", and so on for trigrams and four-grams, etc. So, the core idea of an n-gram language model is that in order to predict what word comes next, you're going to collect a bunch of statistics about how frequent different n-grams are from some kind of training data, and then you can use those statistics to predict what next words might be likely. Here is some more detail. So, to make an n-gram language model, first you need to make a simplifying assumption, and this is your assumption. You say that the next word Xt plus 1 depends only on the preceding N-1 words. So, what we're assuming is that the probability distribution, the conditional probability of Xt plus 1 given all of the words that came before, we're just going to simplify that and say it only depends on the last N-1 words, and that's our assumption. So, by the definition of conditional probability, we can say that this probability is just the ratio of two different probabilities. So, on the top you've got the probability of a particular n-gram, and on the bottom we've got the probability of a particular N-1 gram. This is a little hard to read because of all the superscripts, but I'm gonna give an example with words on the next slide. Okay. So, that's the definition of the probability of the next word, but the question remains, how do we get all of these n-gram and N-1 gram probabilities? So, the answer is, we're going to get them by counting them in some large corpus of text. So, we're going to approximate these probabilities just by the count of the number of times that these particular n-grams and N-1 grams appeared in our training corpus. Okay. So, here's an example with some words. Suppose we are trying to learn a 4-gram language model, and suppose that we have a piece of text that says, "As the proctor started the clock, the students opened their ___", and we're trying to predict what word is coming next. So, because we're learning a 4-gram language model, our simplifying assumption is that the next word depends only on the last three words, the last N-1 words. So, we're going to discard all of the context so far except for the last few words, which is "students opened their." So, as a reminder, an n-gram language model says that the probability of the next word being some particular word W in the vocabulary is equal to the number of times we saw "students opened their W" divided by the number of times we saw "students opened their" in the training corpus. So, let's suppose that in our training corpus, we saw the phrase "students opened their" 1,000 times. And suppose that we saw "students opened their books" 400 times. This means that the probability of the next word being books is 0.4.
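To make the counting recipe concrete, here is a minimal Python sketch; the function names and the toy token list are made up for illustration, and a real training corpus would have millions of words:

```python
from collections import Counter

def train_ngram_lm(tokens, n=4):
    # Count every n-gram and every (n-1)-gram (the "context") in the token list.
    ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    context_counts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngram_counts, context_counts

def next_word_prob(word, context, ngram_counts, context_counts):
    # P(word | context) = count(context followed by word) / count(context)
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0  # sparsity problem: we never saw this context at all
    return ngram_counts[context + (word,)] / context_counts[context]

# Toy usage with the example from the lecture.
tokens = "as the proctor started the clock the students opened their books".split()
ngram_counts, context_counts = train_ngram_lm(tokens, n=4)
print(next_word_prob("books", ("students", "opened", "their"), ngram_counts, context_counts))
```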
And uh, similarly, let's suppose that we saw "students opened their exams" 100 times; this means that the probability of exams given "students opened their" is 0.1. Is there a question? [inaudible]. The question is, does the order of the words matter? And the answer is yes, the order of "students opened their" does matter. It's different to "the students opened." So, the question I want to raise now is, was it a good idea for us to discard the proctor context? If you look at the actual example that we had, the example was "as the proctor started the clock, the students opened their ___". So, do we think that books or exams is more likely given the actual context, the full context? Yep. Exams. Right. Exams is more likely because the proctor and the clock heavily imply that it's an exam scenario, so they're more likely to be opening the exams than the books, unless it's an open-book exam. Uh, but I think, overall, it should be exams. So, the problem that we're seeing here is that in the training corpus, the fact that students were opening something means that it's more likely to be books than exams, because overall, books are more common than exams. But if we know that the context is the proctor and the clock, then it should be exams. So, what I'm highlighting here is a problem with our simplifying assumption. If we throw away too much context, then we are not as good at predicting the words as we would be if we kept the context. Okay. So, that's one problem with n-gram, uh, language models. Uh, there are some other problems as well. So, uh, here again is the equation that you saw before. One problem, which we're gonna call the sparsity problem, is what happens if the number on top, the numerator, what if that count is equal to zero? So, what if for some particular word W, the phrase "students opened their W" never occurred in the data? So, for example, let's suppose "students opened their petri dishes" is fairly uncommon and it never appears in the data; then that means our probability of the next word being petri dishes will be zero. And this is bad, because it might be uncommon but it is a valid scenario, right? If you're a biology student, for example. So, this is a problem and we call it the sparsity problem, because the problem is that if we'd never seen an event happen in the training data, then our model assigns zero probability to that event. So, one partial solution to this problem is that maybe we should add a small delta, a small number delta, to the count for every word in the vocabulary. And then this way, every possible word that could come next has at least some small probability. So, petri dishes will have some small probability, but then so will all of the other words, which are possibly bad choices. So, this, uh, technique is called smoothing, because the idea is, you're going from a very, uh, sparse probability distribution, which is zero almost everywhere with a few spikes where there are n-grams that we've seen, to a more smooth probability distribution where everything has at least a small probability on it. So, the second sparsity problem, which is possibly worse than the first one, is, what happens if the number in the denominator is zero? So, in our example, that would mean, what if we never even saw the trigram "students opened their" in the training data? If that happens, then we can't even calculate this probability distribution at all for any word W, because we never even saw this context before.
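Here is a hedged sketch of that add-delta smoothing idea, written to work with count dictionaries like the ones in the previous sketch (the names and the default delta are just for illustration):

```python
def smoothed_prob(word, context, ngram_counts, context_counts, vocab, delta=0.01):
    # Add-delta smoothing: pretend every (context, word) pair was seen delta extra times,
    # so no word ever gets exactly zero probability; the denominator adds delta once
    # per vocabulary word so the distribution still sums to one.
    context = tuple(context)
    numerator = ngram_counts.get(context + (word,), 0) + delta
    denominator = context_counts.get(context, 0) + delta * len(vocab)
    return numerator / denominator
```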
So, a possible solution to this second sparsity problem is that if you can't find "students opened their" in the corpus, then you should back off to just conditioning on the last two words rather than the last three words. So, now you'd be looking at times when you'd seen, uh, "opened their" and seeing what's come next. So, this is called back-off, because in this failure case, when you have no data for your 4-gram language model, you're backing off to a trigram language model. Are there any questions at this point? Okay. So, um, another thing to note is that these sparsity problems get worse if you increase N. If you make N larger in your n-gram language model, and you might want to do this, for example, you might think, uh, I want to have a larger context so I can pay attention to words that happened longer ago, and that's gonna make it a better predictor. So, you might think making N bigger is a good idea. But the problem is, if you do that then the sparsity problems get worse. Because, let's suppose you say, I want a 10-gram language model. Then the problem is that you're going to be counting how often particular 9-grams and 10-grams occurred. But 9-grams and 10-grams, there are so many of them that the ones you are interested in probably never occurred in your training data, which means that the whole thing becomes dysfunctional. So, in practice, we usually can't have N much bigger than five. Okay. So, those were, uh, two sparsity problems with n-gram language models. Here is a problem with storage. So, if we look at this equation, uh, you have to think about what you need to store in order to use your n-gram language model. You need to store this count number for all of the n-grams that you observed in the corpus when you were going through the training corpus counting them. And the problem is that as you increase N, the number of n-grams that you have to store and count increases. So, another problem with increasing N is that the size of your model, or your n-gram model, uh, gets bigger. Okay, so, n-gram language models in practice. Let's look at an example. You can actually build a simple trigram language model over a 1.7 million word corpus, uh, in a few seconds on your laptop. And in fact, the corpus that I used to do this was the same one that you met in assignment one. It's the Reuters corpus, which is, uh, business and financial news. So, if you want to do this yourself, you can follow that link at the bottom of the slide later. So, uh, this is, uh, something which I ran on my laptop in a few seconds. So I gave it the context of the bigram "today the", and then I asked the trigram language model what word is likely to come next. So, the language model said that the top next most likely words are company, bank, price, Italian, emirate, et cetera. So already, just looking at these probabilities that are assigned to these different words, uh, you can see that there is a sparsity problem. For example, the top two most likely words have the exact same probability, and the reason for that is that this number is 4 over 26. So these are quite small integers, uh, meaning that we only saw, uh, "today the company" and "today the bank" four times each. So, uh, this is an example of the sparsity problem, because overall these are quite low counts; we haven't seen that many different, uh, versions of this event, so we don't have a very granular probability distribution. But in any case, ignoring the sparsity problem, I would say that overall these, uh, top suggestions look pretty reasonable.
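Tying together the back-off idea from the start of this passage, here is a rough Python sketch; it assumes counts_by_order maps each order k to a dictionary from k-tuples of words to counts, and it ignores the discounting that a real back-off model needs in order to keep the probabilities summing to one:

```python
def backoff_prob(word, context, counts_by_order):
    # Try the longest available context first; if it was never seen,
    # drop the earliest context word and try again with a shorter n-gram.
    context = tuple(context)
    while context:
        denom = counts_by_order[len(context)].get(context, 0)
        if denom > 0:
            return counts_by_order[len(context) + 1].get(context + (word,), 0) / denom
        context = context[1:]  # back off: condition on one fewer word
    # Last resort: a unigram estimate.
    total = sum(counts_by_order[1].values())
    return counts_by_order[1].get((word,), 0) / max(total, 1)
```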
So you can actually use a language model to generate text, and this is how you would do it. So let's suppose you have your first two words already, uh, you condition on this and you ask your language model what's likely to come next. So then, given this probability distribution over the words, you can sample from it, that is, select some word with, you know, its associated probability. So let's suppose that gives us the word price. So then price is your next word, and then you just condition on the last two words, which in this ex- example is now "the price". So now you get a new probability distribution and you can continue this process, uh, sampling and then conditioning again and sampling. So if you do this long enough, you will get a piece of text, so this is the actual text that I got when I ran this generation process with this trigram language model. So it says, "Today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an IMF demand to rebuild depleted European stocks, September, 30th end primary 76 counts a share.'' Okay. So, uh, what do we think about this text? Do we think it's good? Are we, uh, surprised? Um, I would say that in some ways it is good; it's kind of surprisingly grammatical, you know, it mostly, uh, kind of parses, uh, but you would definitely say that it, it doesn't really make any sense. It's pretty incoherent. And we shouldn't be surprised that it's incoherent, I think, because if you remember, this is a trigram language model; it has a memory of just the last, well, three or two words, depending on how you look at it. So clearly we need to consider more than three words at a time if we want to model language well. But as we already know, increasing n makes the sparsity problem worse for n-gram language models, and it also increases model size. Is that a question? How does it [inaudible] [NOISE] So the question is, how does the n-gram language model know when to put commas? Uh, so you can [NOISE] decide that commas and other punctuation are just another kind of word, that is, another token, and then, to the language model, it doesn't really make much difference. It just uses that as another possible word that can be, um, predicted. That's why we've got the weird spacing around the, the commas: it's because each comma was essentially viewed as a separate word. [NOISE] Okay. So this course is called NLP with Deep Learning. So you're probably thinking, how do we build a neural language model? So let's just recap, uh, in case you forgot. Remember that a language model is something that takes as input a sequence of words X1 up to Xt, and then it outputs a probability distribution of what the next word might be, Xt plus 1. Okay, so let's think about what kind of neural models we've met in this course so far. Uh, we've already met window-based neural models. And in lecture three, we saw how you could apply a window-based neural model to named entity recognition. So in that scenario, you take some kind of window around the word that you care about, which in this example is Paris, and then, uh, you get the word embeddings for those, concatenate them, put them through some layers, and then you get your decision, which is that Paris is a location, not, you know, a person or organization. So that's a recap of what we saw in lecture three. How would we apply a model like this to language modeling? So here's how you would do it. Here's an example of a fixed-window neural language model.
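Before moving on to the fixed-window model, here is a minimal sketch of that sample-then-condition generation loop for the count-based model; it reuses the hypothetical next_word_prob helper from the earlier sketch:

```python
import random

def generate(context, ngram_counts, context_counts, vocab, num_words=20):
    # Repeatedly sample the next word from the n-gram distribution,
    # then slide the conditioning window forward by one word.
    vocab = list(vocab)
    context = list(context)
    output = list(context)
    for _ in range(num_words):
        probs = [next_word_prob(w, context, ngram_counts, context_counts) for w in vocab]
        if sum(probs) == 0:
            next_word = random.choice(vocab)  # unseen context: fall back to a uniform choice
        else:
            next_word = random.choices(vocab, weights=probs, k=1)[0]
        output.append(next_word)
        context = output[-len(context):]      # keep only the last n-1 words
    return " ".join(output)
```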
So, again, we have some kind of context, which is "as the proctor started the clock the students opened their", um, and we're trying to guess what word might come next. So we have to make a similar simplifying assumption to before. Uh, because it's a fixed-size window, uh, we have to discard the context except for the window that we're conditioning on. So let's suppose that our fixed window is of size four. So what we'll do is similar to the, ah, NER model. We're going to represent these words with one-hot vectors, and then we'll use those to look up the word embeddings for these words using the, uh, embedding lookup matrix. So then we get all of our word embeddings e1, e2, e3, e4, and then we concatenate them together to get e. We put this through a linear layer and a nonlinearity function f to get some kind of hidden layer, and then we put it through another linear layer and the softmax function, and now we have an output probability distribution y hat. And in our case, because we're trying to predict what word comes next, ah, the vector y hat will be of length V, where V is the vocabulary size, and it will contain the probabilities of all the different words in the vocabulary. So here I've represented that as a bar chart where, if you suppose you've got all of the words listed alphabetically from a to z, then there's the different probabilities of the words. So if everything goes well, then this language model should tell us that some likely next words are books and laptops, for example. So none of this should be, um, unfamiliar to you, because you saw it all last week. We're just applying a window-based model to a different task, which is language modeling. Okay, so what are some good things about this model compared to n-gram language models? So one, ah, advantage, I'd say, is that there's no sparsity problem. If you remember, an n-gram language model has a sparsity problem, which is that if you've never seen a particular n-gram in training, then you can't assign any probability to it. You don't have any data on it. Whereas at least here you can take any, you know, for example, 4-gram you want and you can feed it into the, ah, the neural net, and it will give you an output distribution of what it thinks the next word would be. It might not be a good prediction, but at least it will, it will run. Another advantage is you don't need to store all of the observed n-grams that you ever saw. So, uh, this is an advantage; by, uh, comparison, you just have to store all of the word vectors for all the words in your vocabulary. Uh, but there are quite a lot of problems with this fixed-window language model. So here are some remaining problems: Uh, one is that your fixed window is probably too small. No matter how big you make your fixed window, uh, you're probably going to be losing some kind of useful context that you would want to use sometimes. And in fact, if you try to enlarge the window size, then you also have to enlarge the size of your, uh, weight vector, sorry, your weight matrix W. Uh, so the width of W, because you're multiplying it by e, which is the concatenation of your word embeddings, the width of W grows as you increase the size of your window. So in conclusion, really, your window can never be large enough. Another problem with this model, which is more of a subtle point, is that X1 and X2, and really all of the words in the window, are, uh, multiplied by completely different weights in W. So to demonstrate this, you could draw a picture.
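Here is one way the fixed-window model just described might look in code, before the lecture turns to that picture; this is a PyTorch sketch with arbitrary sizes, and the choice of tanh for the nonlinearity f is an assumption rather than something specified in the lecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedWindowLM(nn.Module):
    """Fixed-window neural LM sketch: embed, concatenate, hidden layer, softmax."""
    def __init__(self, vocab_size, embed_dim=100, window=4, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # embedding lookup matrix
        self.W = nn.Linear(window * embed_dim, hidden_dim)  # maps concatenated embeddings e to the hidden layer
        self.U = nn.Linear(hidden_dim, vocab_size)          # maps the hidden layer to scores over the vocabulary

    def forward(self, window_ids):            # window_ids: (batch, window) word indices
        e = self.embed(window_ids)            # (batch, window, embed_dim)
        e = e.reshape(e.shape[0], -1)         # concatenate the window's embeddings
        h = torch.tanh(self.W(e))             # hidden layer with a nonlinearity
        return F.softmax(self.U(h), dim=-1)   # y hat: distribution over the next word

model = FixedWindowLM(vocab_size=10000)
probs = model(torch.randint(0, 10000, (1, 4)))  # one window of 4 word ids
print(probs.shape)                              # torch.Size([1, 10000])
```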
So the problem is that if you have your weight matrix W and then you have your concatenation of embeddings e and we have, uh, four embeddings. So we have e_1, e_2, e_3, e_4, and you multiply, uh, the concatenated embeddings by the weight matrix. So really you can see that there are essentially kind of four sections of the weight matrix, and the first word embedding e_1 is only ever multiplied by the weights for it in this section, and that's completely separate to the weights that multiply by e_2 and so forth. So the problem with this is that what you learn in the weight matrix in one section is not shared with the others. You're kind of learning a lot of similar functions four times. So the reason why we think this is a problem is because there should be a lot of commonalities in how you process the incoming word embeddings. So what you learn about how to process, you know, the third embedding, some of it at least should be shared with all of the embeddings. So what I'm saying is it's kind of inefficient that we're learning, uh, all of these separate weights for these different words when there's a lot of commonalities between them. Is there a question? So that's why [inaudible] [NOISE]. Okay- Yeah, hopefully- hopefully the verbal description is on. So, in conclusion, I'd say that the biggest problem that we've got with this fixed-size neural model is that clearly we need some kind of neural architecture that can process any length input, because most of the problems here come from the fact that we had to make this simplifying assumption that there was a fixed window. Okay. So this motivates, uh, us to introduce this new family of neural architecture, it's called recurrent neural networks or RNNs. So, this is a simplified diagram that shows you the most important, um, features of an RNN. So we have again an input sequence of X1, X2, et cetera, but you can assume that this sequence is of any arbitrary length you like. The idea is that you have a sequence of hidden states instead of just having, for example, one hidden state as we did in the previous model. We have a sequence of hidden states and we have as many of them as we have inputs. And the important thing is that each hidden state ht is computed based on the previous hidden state and also the input on that step. So the reason why they're called hidden states is because you could think of this as a single state that's mutating over time. It's kind of like several versions of the same thing. And for this reason, we often call these time-steps, right? So these steps that go left to right, we often call them time-steps. So the really important thing is that the same weight matrix W is applied on every time-step of this RNN. That's what makes us able to process any length input we want. Is because we don't have to have different weights on every step, because we just apply the exact same transformation on every step. So additionally, you can also have some outputs from the RNN. So these y hats, these are the outputs on each step. And they're optional because you don't have to compute them or you can compute them on just some steps and not others. It depends on where you want to use your RNN to do. Okay. So that's a simple diagram of an RNN. Uh, here I'm going to give you a bit more detail. So here's how you would apply an RNN to do language modeling. So, uh, again, let's suppose that we have some kind of text so far. My text is only four words long, but you can assume that it could be any length, right? 
It's just short because we can't fit more on the slide. So you have some sequence of text, which could be kind of long. And again, we're going to represent these by some kind of one-hot vectors and use those to look up the word embeddings from our embedding matrix. So then to compute the first hidden state H1, we need to compute it based on the previous hidden state and the current input. We already have the current input, that's E1, but the question is, where do we get this first hidden state from? All right, what comes before H1? So we often call the initial hidden state H0, uh, yes, we call it the initial hidden state, and it can either be something that you learn, like it's a parameter of the network and you learn how to initialize it, or you can assume something like maybe it's the zero vector. So the formula we use to compute the new hidden state, based on the previous one and also the current input, is written on the left. So you do a linear transformation on the previous hidden state and on the current input, and then you add some kind of bias, and then put it through a non-linearity, like, for example, the sigmoid function. And that gives you a new hidden state. Okay. So, once you've done that, then you can compute the next hidden state, and you can keep unrolling the network like this. And that's, uh, yeah, that's called unrolling, because you're kind of computing each step given the previous one. All right. So finally, if you remember, we're trying to do language modeling. So we're trying to predict which words should come next after "the students opened their". So on this fourth step over here, we can use, uh, the current hidden state, H4, and put it through a linear layer and put it through a softmax function, and then we get our output distribution Y-hat 4, which is a distribution over the vocabulary. And again, hopefully, we'll get some kind of sensible estimates for what the next word might be. Any questions at this point? Yep? Is the- the number of hidden states gonna be the number of words in your input? The question is, is the number of hidden states the number of words in your input? Yeah, in this setting here, uh, yes, or you could say more generally the number of hidden states is the number of inputs. Yep. And just as with the n-gram model, could we use the output as the input on the next step? Yeah, so the question is, as with the n-gram language model, could we use the output as the input on the next step? And the answer is yes, and I'll show you that in a minute. Any other questions? Yeah. Are you learning the embedding? The question is, are you learning the embeddings? Um, that's a choice. You could have the embeddings be, for example, pre-generated embeddings that you download and you use those and they're frozen, or maybe you could download them, but then you could fine-tune them, that is, allow them to be changed as parameters of the network, or you could initialize them to, you know, small, uh, random values and learn them from scratch. Any other questions? Yeah. So you said you use the same weight matrix; when you do back propagation, do you only update, like, WE, or do you update both WH and WE? So the question is, you say we reuse the matrix, do we update WE and WH, or just one? So you certainly learn both WE and WH. I suppose I was emphasizing WH more, but yeah, they're both matrices that are applied repeatedly. There was also a question about back-prop, but we're going to cover that later in this lecture.
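To summarize the forward pass just described, here is a hand-written PyTorch sketch; the sizes, the zero-initialized h0, and the choice of sigmoid are illustrative assumptions, and in practice you would more likely use a built-in module such as nn.RNN:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLanguageModel(nn.Module):
    """RNN LM sketch: h_t = sigmoid(W_h h_{t-1} + W_e e_t + b1), y_hat_t = softmax(U h_t + b2)."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.W_e = nn.Linear(embed_dim, hidden_dim, bias=False)  # input-to-hidden weights
        self.W_h = nn.Linear(hidden_dim, hidden_dim)             # hidden-to-hidden weights (plus bias b1)
        self.U = nn.Linear(hidden_dim, vocab_size)               # hidden-to-output weights (plus bias b2)

    def forward(self, word_ids, h=None):           # word_ids: (batch, seq_len)
        batch, seq_len = word_ids.shape
        if h is None:
            h = torch.zeros(batch, self.W_h.in_features)   # initial hidden state h0 (here: zeros)
        outputs = []
        for t in range(seq_len):                   # the same W_h and W_e are applied on every step
            e_t = self.embed(word_ids[:, t])
            h = torch.sigmoid(self.W_h(h) + self.W_e(e_t))
            outputs.append(F.softmax(self.U(h), dim=-1))   # y-hat for step t
        return torch.stack(outputs, dim=1), h      # (batch, seq_len, vocab_size), final hidden state

model = RNNLanguageModel(vocab_size=10000)
y_hat, h_final = model(torch.randint(0, 10000, (2, 4)))
print(y_hat.shape)   # torch.Size([2, 4, 10000])
```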
Okay, moving on for now. Um, so, what are some advantages and disadvantages of this RNN language model? So here are some advantages that we can see in comparison to the fixed-window one. So an obvious advantage is that this RNN can process any length of input. Another advantage is that the computation for step t can, in theory, use information from many steps back. So in our motivating example, which was "as the proctor started the clock, the students opened their", we think that proctor and maybe clock are both pretty important hints for what might be coming up next. So, at least in theory, the hidden state at the end can have access to the information from the input from many steps ago. Another advantage is that the model size doesn't increase for longer inputs. So, uh, the size of the model is actually fixed. It's just WH and WE, and then also the biases, and also the embedding matrix, if you're counting that. None of those get bigger if you want to apply it to, uh, longer inputs, because you just apply the same weights repeatedly. And another advantage is that you have the same weights applied on every time-step. So I said this thing before about how the fixed-size window neural model was less efficient, because it was applying different weights of the weight matrix to the different, uh, words in the window. And the advantage of this RNN is that it's applying the exact same transformation to each of the inputs. So this means that if it learns a good way to process one input, that is applied to every input in the sequence. So you can see it as more efficient in that way. Okay, so what are the disadvantages of this model? One is that recurrent computation is pretty slow. Uh, as you saw before, you have to compute the hidden state based on the previous hidden state. So this means that you can't compute all of the hidden states in parallel. You have to compute them in sequence. So, especially if you're trying to compute an RNN over a pretty long sequence of inputs, this means that the RNN can be pretty slow to compute. Another disadvantage of RNNs is that it turns out, in practice, it's quite difficult to access information from many steps back. So even though I said we should be able to remember about the proctor and the clock and use that to predict exams rather than books, it turns out that RNNs, at least the ones that I've presented in this lecture, are not as good at that as you would think. Um, we're gonna learn more about both of these disadvantages later in the course, and we're going to learn something about how you can try to fix them. Have we gotten any questions at this point? Yep. Why do we assume that WH are the same? Sorry, can you speak up? Why do we assume that the WH should be the same? So the question is, why should you assume that the WH are the same? I suppose it's not exactly an assumption, it's more a deliberate decision in the design of an RNN. So, an RNN is, by definition, a network where you apply the exact same weights on every step. So, I suppose the question "why do you assume" maybe should be, "why is that a good idea?" Um, so I spoke a little bit about why it's a good idea, and this list of advantages, I suppose, are the reasons why you'd want to do that. Does that answer your question? Open their books, right? If you assume that WH are the same, you mean that, like, uh, it's like a Markov chain. Uh, the trans- transmit, uh, trans- transfer probability for the, the human moods open, they are the same, but actually the Markov chain.
The model, [inaudible] the transfer probability for that is the same, so [inaudible] probability, it- it's just an approximation but it's another test. Okay. So I think that [OVERLAPPING] If you assume WH could be the same, it's good because you used a number of parameters, but this is just an, this is just an approximation. The underlying transfer, uh, probability, it shouldn't be the same. Especially [OVERLAPPING] Okay. Um, so I think the question is saying that given the- these words the students opened their are all different and they're happening in different context, then why should we be applying the same transformation each time? So that's a- that's a good question. I think, uh, the idea is that you are learning a general function, not just, you know, how to deal with students, the one-word students in this one context. We're trying to learn a general function of how you should deal with a word given the word so far. You're trying to learn a general representation of language and context so far, which is indeed a very difficult problem. Um, I think you also mentioned that something about an approximation. Uh, another thing to note is that all of the hidden states are vectors, they're not just single numbers, right? They are vectors of lengths, I don't know, 500 or something? So they have quite a large capacity to hold lots of information about different things in all of their different, um, positions. So, I think the idea is that you can store a lot of different information in different contexts, in different parts of the hidden state, but it is indeed an approximation and there is some kind of limit to how much information you can store. Okay, any other questions? Yes. Since you kinda process any single length frame, what length do you use during your training? And does the length you use for training affect WH? Okay, so, the question is, given that you can have any length input, what length is the input during training? So, I suppose in practice, you choose how long the inputs are in training either based on what your data is or maybe based on, uh, your efficiency concerns so maybe you make it artificially shorter by chopping it up. Um, what was the other question? Uh, does WH depend on that? Okay. So the question was, does WH depend on the length you used? So, no, and that's one of the good things in the advantages list. Is that the model size doesn't increase for longer input, because we just unroll the RNN applying the same weights again and again for as long as we'd like. There's no need to have more weights just because you have a longer input. [NOISE] Yeah. So how the ratios that you mentioned are [inaudible] the number of words. [NOISE] Are you asking about capital E or the lowercase E? Uh, lowercase E. Okay. So, the question is, how do we choose the dimension of the lowercase Es? Uh, so, you could, for example, assume that those are just pre-trained word vectors like the ones that you, uh, used in assignment one. More like word2vec. Yeah. For example, word2vec, and you just download them and use them, or maybe you learn them from scratch, in which case, you decide at the beginning of training how big you want those vectors to be. [NOISE] Okay. I'm gonna move on for now. [NOISE] So, we've learned what an RNN language model is and we've learned how you would, uh, run one forward, but the question remains, how would you train an RNN language model? How would you learn it? 
[NOISE] So, as always, in machine learning, our answer starts with, you're going to get a big corpus of text, and we're gonna call that just a sequence of words X1 up to X capital T. So, you feed the sequence of words into the RNN language model, and then, the idea is that you compute the output distribution Y-hat T for every step T. So, I know that the picture I showed on the previous, uh, slide [NOISE] only showed us doing on the last step, but the idea is, you would actually compute this on every step. So, this means that you're actually predicting the probability of the next word on every step. [NOISE] Okay. So, once you've done that, then you can define the loss function, and this should be familiar to you by now. Uh, this is the cross-entropy between [NOISE] our predicted probability distribution Y-hat T and the true, uh, distribution, which is Y-hat- sorry, just YT, which is a one-hot vector, uh, representing the true next [NOISE] words, which is XT plus one. So, as you've seen before, this, uh, cross-entropy [NOISE] between those two vectors can be written also as a negative log probability. And then, lastly, if you average this cross-entropy loss across every step, uh, every T in the corpus time step T, then, uh, this gives you your overall loss for the entire training set. [NOISE] Okay. So, just to make that even more clear with a picture, uh, suppose that our corpus is, the students open their exams, et cetera, and it goes on for a long time. Then, what we'd be doing is, we'd be running our RNN over this text, and then, on every step, we would be predicting the probability [NOISE] distribution Y-hats, and then, from each of those, you can calculate what your loss is, which is the JT, and then, uh, on the first step, the loss would be the negative log probability of the next word, which is, in this example, students, [NOISE] and so on. Each of those is the negative log probability of the next word. [NOISE] And then, once you've computed all of those, you can add them [NOISE] all up and average them, and then, this gives you your final loss. [NOISE] Okay. So, there's a caveat here. Um, computing the loss and gradients across the entire corpus, all of those words X1 up to X capital T is too expensive [NOISE] because your corpus is probably really big. [NOISE] So, um, as a student asked earlier, uh, in practice, what do you actually regard as your sequence? So, in practice, you might regard your sequence as, uh, something like a sentence or a document, some shorter unit of text. So, uh, another thing you'll do [NOISE] is, if you remember, stochastic gradient descent allows you to compute gradients for small chunks of data rather than the whole corpus at a time. So, in practice, if you're training a language model, what you're actually likely to be doing is computing the loss for a sentence, but that's actually a batch of sentences, and then, you compute the gradients with respect to that batch of sentences, update your weights, and repeat. Any questions at this point? [NOISE] Okay. So, uh, moving onto backprop. Don't worry, there won't be as much backprop as there was last week, but, uh, there's an interesting question here, right? So, the, uh, characteristic thing about RNNs is that they apply the same weight matrix repeatedly. So, the question is, [NOISE] what's the derivative of our loss function, let's say, on step T? What's the derivative of that loss with respect to the repeated weight matrix WH? 
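For reference, the per-step and overall losses just described, and the gradient equation that the next paragraph walks through, can be written as (this restates the slides' formulas in the notation used above):

```latex
% Cross-entropy on step t: the negative log probability assigned to the true next word x^{(t+1)}.
J^{(t)}(\theta) = \mathrm{CE}\left(y^{(t)}, \hat{y}^{(t)}\right) = -\log \hat{y}^{(t)}_{x_{t+1}}

% Overall loss: the average over all T steps of the corpus (or, in practice, the batch).
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)

% Gradient of the step-t loss with respect to the repeated weight matrix W_h:
% the sum, over each time-step i <= t at which W_h appears, of the gradient
% with respect to that appearance.
\frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left.\frac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}
```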
So, the answer is that the derivative of the loss, uh, the gradient with respect to the repeated weight is the sum of the gradient with respect to each time it appears, and that's what that equation says. So, on the right, the notation with the vertical line and the I is saying, uh, the derivative of the loss with respect to WH when it appears on the Ith step. Okay. So, so, why is that true? [NOISE] Uh, to sketch why this is true, uh, [NOISE] I'm gonna remind you of the multivariable chain rule. So, uh, this is a screenshot from a Khan Academy article on the multivariable chain rule, and, uh, I advise you check it out if you want to learn more because it's very easy to understand. Uh, and what it says is, given a function F [NOISE] which depends on X and Y, which are both themselves functions of some variable T, then, if you want to get the derivative of F with respect to T, then you need to do the chain ru- rule across X and Y separately and then add them up. [NOISE] So, that's the multivariable chain rule, [NOISE] and if we apply this to our scenario with trying to take the derivative of the loss JT with respect to our weight matrix WH, then you could view it as this kind of diagram [NOISE] where WH has, uh, a relationship with all of these individual appearances of WH, but it's a [NOISE] simple relationship, it's just equality, and then, each of those appearances of WH affect the loss in different ways. So, then, if we apply the multivariable chain rule, then it says that the derivative of the loss with respect to WH is the sum of those chain rule things, but the expression on the right is just one because it's an equality relation, [NOISE] and then, that gives us the equation that I wrote on the previous slide. So, this is a proof sketch for why the derivative of the loss with respect to our recurrent matrix is the sum of the derivatives each time it appears. Okay. So, suppose you believe me on that, that is, how you compute the, uh, gradient with respect to the recurrent weight. So, a remaining question is, well, how [NOISE] do we actually calculate this in practice? [NOISE] So, the answer is that you're going to calculate this sum by doing backprop, uh, backwards, kind of right to left, um, through the RNN, and you're going to accumulate this sum as you go. So, the important thing is, you shouldn't compute each of those things separately, uh, you should compute them by accumulating, like, each one can be computed in form- in terms of the previous one. [NOISE] So, this algorithm of computing each of these, uh, each of these gradients with respect to the previous one is called backpropagation through time. And, um, I always think that this sounds way more sci-fi than it is. It sounds like it's time travel or something, but it's actually pretty simple. Uh, it's just the name you give to applying the backprop algorithm to a recurrent neural network. Any questions at this point? Yep. [NOISE] So, it seems that how you break up the batches matter your end result. [inaudible]. So, if you break it into much more [inaudible]. Okay. So the question is, um, surely, how you decide to break up your batches affects how you learn, right? Because if you choose, uh, one set of data to be your batch, right, then, you will make your update based on that, and then, you only update the next one based on [NOISE] where you go from there. So, if you decided to put different data in the batch, then you would have made a different step. 
So, that's true, [NOISE] and that is why stochastic gradient descent is only an approximation of true gradient descent because the gradient that you compute with respect to one batch is just an approximation of the true gradient with respect to the, uh, the loss over the whole corpus. So, yes, it's true that it's an approximation and how [NOISE] you choose to batch up your data can matter, and that's why, for example, shuffling your data is a good idea, and shuffling it differently, each epoch, is a good idea. Uh, but the, the core idea of SGD is [NOISE] that, um, it should be a good enough approximation that over many steps, you will, uh, minimize your loss. [NOISE] Any other questions? [NOISE] Yeah. [NOISE] So, is, uh, is the question, as you compute forward prop, do you start computing backprop before you've even, like, got to the loss? Is that the question? [NOISE] Yes. I didn't think so, right? Because you need to know what the loss is in order to compute the derivative of the loss with respect to something. So, I think you need to get to the end. So, if we assume simplicity, that there is only one loss which you get at the end of several steps, then you need to get to the end, compute the loss before you can compute the derivatives. But I suppose you, you, you could compute the derivative of two, kind of, adjacent things of one with respect to the other. [OVERLAPPING] But, yeah. [NOISE] As you're going forward, do- you need to sort of keep a track of what, what you would have [inaudible] the one you eventually get the loss. [inaudible] Yes. So, when you forward prop, you certainly have to hang on to all of the intervening factors. [NOISE] Okay. I'm gonna move on for now. Uh, so, that was a maths-heavy bit but, um, now, we're getting on to text generation, which someone asked about earlier. So, um, just as we use the n-gram language model to generate text, you can also use an RNN language model to generate text, uh, via the same repeated sampling technique. Um, so, here's a picture of how that would work. How you start off with your initial hidden state H0, uh, which, uh, we have either as a parameter of the model or we initialize it to zero, or something like that. So, let's suppose that we have the first word my, and Iet's suppose I, um, supply that to the model. So, then, using the inputs and the initial hidden state, you can get our first hidden state H1. And then from there, we can compute the, er, probability distribution Y hat one of what's coming next, and then we can use that distribution to sample some word. So let's suppose that we sampled the word favorite. So, the idea is that we use the outputted word as the input on the next step. So, we feed favorite into the second step of the RNN, we get a new hidden state, and again we get a new probability distribution, and from that we can sample a new word. So, we can just continue doing this process again and again, and in this way we can generate some text. So, uh, here we've generated the text, My favorite season is Spring, and we can keep going for as long as we'd like. Okay, so, uh, let's have some fun with this. Uh, you can generate, uh, text using an RNN language model. If you train the RNN language model on any kind of text, then you can use it to generate text in that style. And in fact, this has become a whole kind of genre of internet humor that you might've seen. So, uh, for example, here is an RNN language model trained on Obama speeches, and I found this in a blog post online. 
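Before the fun examples, here is a minimal sketch of that feed-the-sample-back-in loop, assuming a model with the same interface as the hypothetical RNNLanguageModel sketch above (per-step probability distributions plus the updated hidden state):

```python
import torch

def sample_text(model, first_word_id, id_to_word, num_words=10):
    # Start from a given first word, then repeatedly: run one RNN step, sample the
    # next word from the output distribution, and feed that sample back in as input.
    words = [first_word_id]
    h = None
    current = torch.tensor([[first_word_id]])     # shape (batch=1, seq_len=1)
    for _ in range(num_words):
        probs, h = model(current, h)              # distribution over the next word
        next_id = torch.multinomial(probs[0, -1], num_samples=1).item()  # sample from it
        words.append(next_id)
        current = torch.tensor([[next_id]])       # the sampled word becomes the next input
    return " ".join(id_to_word[i] for i in words)
```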
So, here's the text that the RNN language model generated. "The United States will step up to the cost of a new challenges of the American people that will share the fact that we created the problem. They were attacked and so that they have to say that all the task of the final days of war that I will not be able to get this done." [LAUGHTER] Okay. So, if we look at this and especially think about what did that text look like that we got from the n-gram language model, the one about the, the price of gold. Um, I'd say that this is kind of recognizably better than that. It seems more fluent overall. Uh, I'd say it has a more of a sustained context in that it kind of makes sense for longer stretches at a time, and I'd say it does sound totally like Obama as well. So, all of that's pretty good, but you can see that it's still pretty incoherent overall, like i- it was quite difficult to read it because it didn't really make sense, right? So I had to read the words carefully. Um, so, yeah, I think this shows some of the progress you can get from using RNNs to generate text but still, um, very far from human level. Here are some more examples. Uh, here's an RNN language model that was trained on the Harry Potter books. And here's what it said. "Sorry." Harry shouted, panicking. "I'll leave those brooms in London." Are they? "No idea." said Nearly Headless Nick, casting low close by Cedric, carrying the last bit of treacle Charms from Harry's shoulder. And to answer him the common room perched upon it, four arms held a shining knob from when the Spider hadn't felt it seemed. He reached the teams too." So, again, I'd say that this is fairly fluent. It sounds totally like the Harry Potter books. In fact, I'm pretty impressed by how much it does sound like in the voice of the Harry Potter books. You even got some character attributes, I'd say that Harry the character does often panic in the book so that seems right. Um, [LAUGHTER] but some bad things are that we have, for example, a pretty long run-on sentence in the second paragraph that's hard to read. Uh, you have some nonsensical things that really make no sense. Like, I don't know what a treacle charm is. It sounds delicious but I don't think it's real, uh, and overall it's just pretty nonsensical. Here's another example. Here is an RNN language model that was trained on recipes. So, uh, [LAUGHTER] this one's pretty bizarre, the title is 'chocolate ranch barbecue', It contains Parmesan cheese, coconut milk, eggs, and the recipe says place each pasta over layers of lumps, shape mixture into the moderate oven and simmer until firm. Serve hot in bodied fresh, mustard orange and cheese. Combine the cheese and salt together the dough in a large skillet; add the ingredients and stir in the chocolate and pepper. [LAUGHTER] Um, so, one thing that I think is even more clear here in the recipes example than the prose example, is the inability to remember what's [NOISE] what's happening overall, right? Cuz a recipe you could say is pretty challenging because you need to remember the title of what you're trying to make which in this case is chocolate ranch barbecue, and you need to actually, you know, make that thing by the end. Uh, you also need to remember what were the ingredients in the beginning and did you use them. And in a recipe, if you make something and put it in the oven, you need to take it out later, a- and stuff like that, right? 
So, clearly it's not really remembering what's happening overall or what it's trying to do, it seems to be just generating kind of generic recipe sentences and putting them in a random order. Uh, but again, I mean, we can see that it's fairly fluent, it's grammatically right, it kind of sounds like a recipe. Uh, but the problem is it's just nonsensical. Like for example, shape mixture into the moderate oven is grammatical but it doesn't make any sense. Okay, last example. So, here's an RNN language model that's trained on paint-color names. And this is an example of a character-level language model because it's predicting what character comes next not what word comes next. And this is why it's able to come up with new words. Another thing to note is that this language model was trained to be conditioned on some kind of input. So here, the input is the color itself I think represented by the three numbers, that's probably RGB numbers. And it generated some names for the colors. And I think these are pretty funny. My favorite one is Stanky Bean, which is in the bottom right. [LAUGHTER] Um, so, it's pretty creative, [LAUGHTER] and I think these do sound kind of like paint colors but often they're quite bizarre. [LAUGHTER] Light of Blast is pretty good too. So, uh, you're gonna learn more about character-level language models in a future lecture, and you're also going to learn more about how to condition a language model based on some kind of input such as the color, um, code. So, these are pretty funny, uh, but I do want to say a warning. Um, you'll find a lot of these kinds of articles online, uh, often with headlines like, "We forced a bot to watch, you know, 1000 hours of sci-fi movies and it wrote a script," something like that. Um, so, my advice is you have to take these with a big pinch of salt, because often, uh, the examples that people put online were hand selected by humans to be the funniest examples. Like I think all of the examples I've shown today were definitely hand selected by humans as the funniest examples that the RNN came up with. And in some cases they might even have been edited by a human. So, uh, yeah, you do need to be a little bit skeptical when you look at these examples. [OVERLAPPING] Yep. So, uh, in the Harry Potter one, there was a opening quote and then there was a closing quote. So, like do you expect the RNN, like when it puts that opening quote and keeps putting more words, do you expect the probability of a closing quote to like increase as you're going or decrease? That's a great question. So, uh, the question was, uh, we noticed that in the Harry Potter example, there was some open quotes and some closed quotes. And it looks like the model didn't screw up, right? All of these open quotes and closed quotes, uh, are in the correct places. So, the question is, do we expect the model to put a higher probability on closing the quote given that is inside a quo- quote passage? So, I should say definitely yes and that's most- mostly the explanation for why this works. Um, there's been some really interesting work in trying to look inside the hidden states of, uh, language models to see whether it's tracking things like, are we inside an open quote or a close quote? And there has been some limited evidence to show that maybe there are certain neuron or neurons inside the hidden state, which are tracking things like, are we currently inside a quote or not? [NOISE]. Yeah. 
So, so, like do you think the probability would increase as you go more to the right [OVERLAPPING]? So, the question is, as the quote passage goes on for longer, do you think the probability of outputting a closed quote should increase? Um, I don't know. Maybe. Um, that would be good, I suppose, because you don't want an infinite quote, uh, but I wouldn't be surprised if that didn't happen. Like, I wouldn't be surprised if maybe some other worse-trained language models just opened quotes and never closed them. Uh, any other questions? Yeah. What are the dimensions of the W matrix? Okay. So, the question is, what are the dimensions of the W matrix? So we're going back to the earlier stuff. Uh, okay. You're asking me about W_h or W_e or something else? Yeah. So, W_h will be, uh, if we say that the hidden state has size n, then W_h will be n by n. And if we suppose that the embeddings have size d, then W_e will be, uh, d by n, n by d, maybe. Does that answer your question? [NOISE] Uh, any other questions about generating or anything? Yep. So, you said that there was a long sentence in the Harry Potter-related text? Yeah. Is it ever sort of practical to combine RNNs with, like, a list of hand-written rules? Sorry. Is it ever practical to combine RNNs with a list of hand-written rules? [OVERLAPPING] Okay. Yeah. That's a great question. So the question was, is it ever practical to combine RNNs with a list of hand-written rules? For example, don't let your sentence be longer than this many words. Um, so yeah. I'd say it probably is practical, maybe especially if you're interested in, uh, making sure that certain bad things don't happen; you might apply some hacky rules, like, yeah, forcing it to end, uh, early. I mean, okay. So there's this thing called beam search, which we're going to learn about in a later lecture, which essentially doesn't just, um, choose one word on each step and continue. It explores many different options for words you could generate. And you can apply some kinds of rules on that, where if you have lots of different things to choose from, then you can maybe get rid of some options if you don't like them because they break some of your rules. But, um, it can be difficult to do. Any other questions? Okay. Um, so we've talked about generating from language models. Uh, so unfortunately, you can't just use generation as your evaluation metric for language models. You do need some kind of, um, measurable metric. So, the standard evaluation metric for language models is called perplexity. And, uh, perplexity is defined as the inverse probability of the corpus according to the language model. So, if you look at it, you can see that that's what this formula is saying. It's saying that for every, uh, word xt, lowercase t, in the corpus, uh, we're computing the probability of that word given everything that came so far, but inverted, that is, one over that. And then lastly, we're normalizing this big, uh, product by the number of words, which is capital T. And the reason why we're doing that is because if we didn't do that, then perplexity would just get smaller and smaller as your corpus got bigger. So we need to normalize by that factor. So, you can actually show that this, uh, perplexity is equal to the exponential of the cross-entropy loss J Theta. So, if you remember, the cross-entropy loss J Theta is, uh, the training objective that we're using to train the language model.
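For reference, the perplexity formula being described, and its relationship to the cross-entropy loss that the next paragraph spells out, can be written as:

```latex
% Perplexity: the inverse probability of the corpus under the language model,
% normalized by the number of words T; lower is better.
\text{perplexity}
  = \prod_{t=1}^{T} \left( \frac{1}{P_{\mathrm{LM}}\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)} \right)^{1/T}
  = \exp\left(J(\theta)\right)
```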
And, uh, by rearranging things a little bit, you can see that perplexity is actually the exponential of the cross-entropy. And this is a good thing, uh, because if we're training the language model to, uh, minimize the cross-entropy loss, then you are training it to optimize the perplexity as well. So you should remember that lower perplexity is better, uh, because perplexity is the inverse probability of the corpus. So, uh, you want your language model to assign high probability to the corpus, right? And that means you want to get low perplexity. Uh, any questions? [NOISE] Okay. Uh, so RNNs have been pretty successful in recent years in improving perplexity. So, uh, this is a results table from a recent, uh, Facebook research paper about RNN language models. And, uh, you don't have to understand all of the details of this table, but what it's telling you is that on the, uh, top we have an n-gram language model, and then in the subsequent rows we have some increasingly complex and large RNNs. And you can see that the perplexity numbers are decreasing, because lower is better. So RNNs have been really great for making more effective language models in the last few years. Okay. So, to zoom out a little bit, you might be thinking, uh, why should I care about language modeling? Why is it important? I'd say there are two main reasons why language modeling is important. Uh, so the first one is that language modeling is a benchmark task that helps us measure our progress on understanding language. So, you could view language modeling as a pretty general language understanding task, right? Because predicting what word comes next, given any, any kind of, uh, generic text, um, that's quite a difficult and general problem. And in order to be good at language modeling, you have to understand a lot of things, right? You have to understand grammar, you have to understand syntax, and you have to understand, uh, logic and reasoning. And you have to understand something about, you know, real-world knowledge. You have to understand a lot of things in order to be able to do language modeling properly. So, the reason why we care about it as a benchmark task is because if you're able to build a model which is a better language model than the ones that came before it, then you must have made some kind of progress on at least some of those sub-components of natural language understanding. So, another, more tangible reason why you might care about language modeling is that it's a sub-component of many, many NLP tasks, especially those which involve generating text or estimating the probability of text. So, here's a bunch of examples. Uh, one is predictive typing. That's the example that we showed at the beginning of the lecture with typing on your phone or searching on Google. Uh, this is also very useful for people who have movement disabilities, uh, because there are these systems that help people communicate using fewer movements. Uh, another example is speech recognition. So, in speech recognition, you have some kind of audio recording of a person saying something, and often it's kind of noisy and hard to make out what they're saying, and you need to, uh, figure out what words they said. So this is an example where you have to estimate the probability of different, uh, different options of what, what it is they could have said. And in the same way, handwriting recognition is an example where there's a lot of noise and you have to figure out what the person intended to say.
Uh, spelling and grammar correction is yet another example where it's all about trying to figure out what someone meant, and that means you have to understand how likely it is that they were saying different things. Uh, an interesting, an interesting application is authorship identification. So suppose that you have a piece of text and you're trying to figure out who likely wrote it, and maybe you have, uh, several different candidate authors and you have text written by those different authors. So you could, for example, train a separate language model on each of the different authors' texts. And then, because, remember, a language model can tell you the probability of a given piece of text, you could ask all the different language models, um, how likely the text in question is, and then, if a certain author's language model says that it's likely, then that means the text in question is more likely to have been written by that author. Um, other examples include machine translation. This is a huge, uh, application of language models, uh, because it's all about generating text. Uh, similarly, summarization is a task where we need to generate some text given some input text. Uh, dialogue as well; not all dialogue agents are necessarily RNN language models, but you can build a dialogue agent that generates text using an RNN language model. And there are more examples as well. Any questions on this? [LAUGHTER] Yep. So, I know that [inaudible] Great question. So, the question was, uh, for some of these examples, uh, such as speech recognition or maybe [NOISE] image captioning, the input is audio or an image or something that is not text, right? So, you can't represent it in the way that we've talked about so far. Um, so, [NOISE] in those examples, you will have some way of representing the input, some way of encoding the audio or the image or whatever. Uh, the reason I brought it up now in terms of language models is that that's the input, but you use the language model to get the output, right? So, the language model, [NOISE] uh, generates the output in the way that we saw earlier, uh, but we're gonna learn more about those conditional language [NOISE] models later. [NOISE] Anyone else? [NOISE] Okay. [NOISE] So, uh, here's a recap. If I've lost you somewhere in this lecture, uh, or you got tired, um, now's a great time to jump back in because things are gonna get a little bit more accessible. Okay. So, here's a recap of what we've done today. Uh, a language model is a system that predicts the next word, [NOISE] and a recurrent neural network is a family of neural networks, uh, new to us, that takes sequential input of any length, applies the same weights on every step, and can optionally produce some kind of output on each step, or some of the steps, or none of the steps. [NOISE] So, don't be confused. A recurrent neural network is not [NOISE] the same thing as a language model. Uh, we've seen today that an RNN is a great way to build a language model, but actually, it turns out that you can use RNNs for, uh, a lot of other different things that are not language modeling. [NOISE] So, here's a few examples of that. [NOISE] Uh, you can use an RNN to do a tagging task. So, some examples of tagging tasks are part-of-speech tagging and named entity recognition. So, pictured here is part-of-speech tagging, and this is the task.
We have some kind of input text, such as, uh, "the startled cat knocked over the vase", and your job is to, uh, label or tag each word with its part of speech. So, for example, "cat" is a noun and "knocked" is a verb. So, you can use an RNN to do this task in the way that we've pictured, which is that you, uh, feed the text into the RNN, [NOISE] and then, on each step of the RNN, you, uh, have an output, probably a distribution over what, uh, tag you think it is, and then, uh, you can tag it in that way. And then, also for named entity recognition, that's all about, um, tagging each of the words with what named entity type they are. So, you do it in the same way. [NOISE] Okay. Here's another thing you can use RNNs for: uh, you can use them for sentence classification. So, sentence classification is just a general term to mean any kind of task where you want to take a sentence, or other piece of text, and then, you want to classify it into one of several classes. So, an example of that is sentiment classification. Uh, sentiment classification is when you have some kind of input text, such as, let's say, "overall, I enjoyed the movie a lot", and then, you're trying to classify that as being positive or negative or [NOISE] neutral sentiment. So, in this example, this is positive sentiment. [NOISE] So, one way you might use an RNN to tackle this task is, uh, you might encode the text using the RNN, and then, really what you want is some kind of sentence encoding so that you can output your label for the sentence, right? And it would be useful to have a single vector to represent the sentence, rather than all of these separate vectors. So, how would you do this? How would you get the sentence encoding from the RNN? [NOISE] Uh, one thing you could do [NOISE] is, you could use the final hidden state as your sentence encoding. So, um, the reason why you might think this is a good idea is because, for example, in the RNN language model, we regard the final hidden state as the thing you use to predict what's coming next, right? So, we're assuming that the final hidden state contains information about all of the text that has come so far. So, for that reason, you might suppose that this is a good sentence encoding, and we could use that [NOISE] to predict, you know, what sentiment this sentence is. And it turns out that usually a better way to do this, usually a more effective way, is to do something like take an element-wise max or an element-wise mean of all these hidden states to get your sentence encoding, um, [NOISE] and, uh, this tends to work better than just using the final hidden state (see the short sketch below). [NOISE] Uh, there are some other more advanced things you can do as well. Okay. [NOISE] Another thing that you can use RNNs for is as a general-purpose encoder module. Uh, so, here's an example, that's question answering, but really this idea of RNNs as a general-purpose encoder module is very common, [NOISE] and it's used in lots of different, um, deep learning [NOISE] architectures for NLP. [NOISE] So, here's an example, which is question answering. Uh, so, let's suppose that the task is: you've got some kind of context, which, in this, uh, situation, is the Wikipedia article on Beethoven, and then, you have a question which is asking, what nationality was Beethoven? Uh, and this is actually taken from the SQuAD Challenge, which is the subject of the Default Final Project.
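Going back to the sentence classification example for a moment, here is a minimal sketch of that pooling idea, written in PyTorch purely for concreteness; the class, names, and sizes are illustrative, not something from the lecture: encode the text with an RNN, take an element-wise mean (or max) over the hidden states to get one sentence vector, and classify that vector.

```python
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    # Sketch: encode a sentence with an RNN, pool the hidden states into a
    # single sentence vector (element-wise mean here; max works similarly),
    # then classify it. All names and sizes are illustrative.
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):                            # tokens: (batch, T) word ids
        hidden_states, _ = self.rnn(self.embed(tokens))   # (batch, T, hidden_dim)
        sentence_vec = hidden_states.mean(dim=1)          # element-wise mean over time steps
        # alternative: sentence_vec, _ = hidden_states.max(dim=1)   # element-wise max
        return self.classifier(sentence_vec)              # logits over the sentiment classes
```

One intuition for the design choice, consistent with what the lecturer says above: a mean- or max-pooled encoding draws on every position's hidden state, rather than relying on the final state alone to have remembered everything.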
So, um, if you choose to do the Default Final Project, you're going to be building systems that solve this problem. So, what you might do is, you might use an RNN to process the question, "what nationality was [NOISE] Beethoven?" And then, you might use the hidden states that you get from this, uh, RNN over the question as a representation of the question. And I'm being intentionally vague here [NOISE] about what might happen next, uh, but the idea is that [NOISE] both the context and the question are going to be fed in some way, and maybe you'll use an RNN on the context as well, and you're going to have lots more neural architecture in order to get your answer, which is, uh, German. So, the point here is that the RNN is acting as an encoder for the question; that is, the hidden states that you get from running the RNN over the question represent the question. [NOISE] Uh, so, the encoder is part of a larger neural system, [NOISE] and it's the hidden states themselves that you're interested in, because they contain the information. So, you could have, um, taken, uh, an element-wise max or mean, like we showed on the previous slide, to get a single vector for the question, but often, you don't do that. Often, you'll, uh, do something else which uses the hidden states directly. So, the general point here is that RNNs are quite powerful as a way to represent, uh, a sequence of text, uh, for further computation. Okay. Last example. So, going back to RNN language models again, [NOISE] uh, they can be used to generate text, and there are lots of different, uh, applications for this. So, for example, speech recognition: uh, you will have your input, which is the audio, and, as a student asked earlier, this will be, uh, represented in some way, and then, uh, maybe you'll do a neural encoding of that, [NOISE] and then, you use your RNN language model to generate the output, which, in this case, is going to be a transcription of what the audio recording is saying. So, you will have some way of conditioning, and we're gonna talk more about how this works, uh, in a later lecture, but you have some way of conditioning your RNN language model on the input. So, you'll use that to generate your text, [NOISE] and in this case, the utterance might be something like, "what's the weather?" [OVERLAPPING] [NOISE] Yeah. [NOISE] In speech recognition, [inaudible]. Okay. So, the question is, in speech recognition, we often use word error rate to evaluate, but would you use perplexity to evaluate? [NOISE] Um, I don't actually know much about that. Do you know, Chris, what they use in, uh, speech recognition as an eval metric? [NOISE] [inaudible] word error rate [inaudible]. The answer is, you often use WER, uh, for eval, but you might also use perplexity. Yeah. Any other questions? [NOISE] Okay. So, um, this is an example of a conditional language model, and it's called a conditional language model because we have the language model component, but, crucially, we're conditioning it on some kind of input. So, unlike the, uh, fun examples with the Harry Potter text, where we were just, uh, generating text basically unconditionally, you know, we trained it on the training data and then we just started [NOISE] with some kind of random seed, and then it generates unconditionally. This is called a conditional language model because there's some kind of input that we need to condition on.
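To make the idea of conditioning slightly more concrete, here is one minimal sketch of a conditional RNN language model, again in PyTorch just for illustration: the input we condition on (say, encoded audio or a source sentence) is summarized as a vector, and one simple, common option is to use that vector to initialize the decoder RNN's hidden state. This is only an assumption-laden sketch with hypothetical names; the real architectures are covered in the later lectures the speaker mentions.

```python
import torch
import torch.nn as nn

class ConditionalRNNLM(nn.Module):
    # Sketch: an RNN language model whose initial hidden state comes from an
    # encoding of whatever we condition on (audio features, a source sentence,
    # an image vector, ...). Names and sizes are illustrative.
    def __init__(self, vocab_size, embed_dim, hidden_dim, cond_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cond_to_h0 = nn.Linear(cond_dim, hidden_dim)      # condition vector -> h_0
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, cond_vec):
        # tokens: (batch, T) word ids; cond_vec: (batch, cond_dim) encoding of the input
        h0 = torch.tanh(self.cond_to_h0(cond_vec)).unsqueeze(0)  # (1, batch, hidden_dim)
        hidden_states, _ = self.rnn(self.embed(tokens), h0)      # (batch, T, hidden_dim)
        return self.out(hidden_states)                           # next-word logits at each step
```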
Uh, machine translation is an example [NOISE] also of a conditional language model, and we're going to see that in much more detail in the lecture next week on machine translation. [NOISE] All right. Are there any more questions? We have a bit of extra time, I think. [NOISE] Yeah. I have a question about RNNs in general. [NOISE] Do people ever combine the RNN, uh, pattern of architecture, um, with other neural networks? Say, [NOISE] you have, um, you know, N previous layers that could be doing anything, and at the end of your network, you wanna run them through, uh, five recurrent layers. Do people mix and match like that, or are these, uh, [inaudible]? [NOISE] Uh, the question is, do you ever combine RNNs with other types of architecture? So, I think the answer is yes. [NOISE] Uh, you might, [NOISE] you know, uh, have other types of architectures, uh, to produce the vectors that are going to be the input to the RNN, or you might use the output of your RNN [NOISE] and feed that into a different type of neural network. So, yes. [NOISE] Any other questions? [NOISE] Okay. Uh, so, before we finish, uh, I have a note on terminology. Uh, when you're reading papers, you might often find the phrase "vanilla RNN", and when you see the phrase "vanilla RNN", it usually means, uh, the RNNs that are described in this lecture. So, the reason why those are called vanilla RNNs is because there are actually other, more complex kinds of RNN flavors. So, for example, there's the GRU and the LSTM, and we're gonna learn about both of those next week. And another thing we're going to learn about next week [NOISE] is that you can actually have multi-layer RNNs, which is when you stack multiple RNNs on top of each other. [NOISE] So, uh, you're gonna learn about those, but we hope that by the time you reach the end of this course, you're going to be able to read a research paper and see a phrase like "stacked bidirectional LSTM with residual connections and self-attention", and you'll know exactly what that is. [NOISE] That's just an RNN with all of the toppings. [LAUGHTER] All right. Thank you. That's it for today. [NOISE] Uh, next time- [APPLAUSE] next time, we're learning about problems with RNNs, and fancy RNNs. [NOISE]
Info
Channel: stanfordonline
Views: 86,079
Rating: 4.9830365 out of 5
Id: iWea12EAu6U
Length: 68min 25sec (4105 seconds)
Published: Fri Apr 19 2019