Lecture 8: Recurrent Neural Networks and Language Models

[MUSIC] Stanford University. >> All right, hello everybody. Welcome to Lecture seven, or maybe it's eight. Today is definitely the beginning of where we talk about models that really matter in practice. We'll talk today about the simplest recurrent neural network model one can think of. But in general, this model family is what most people now use in real production settings. So it's really exciting. We only have a little bit of math in between, and a lot of it is quite applied and should be quite fun. Just one organizational item before we get started. I'll have an extra office hour today right after class. I'll be again on QueueStatus, 68 or so. Last week we had to end at 8:30 and there were still a lot of people who had a question, so I'll be here after class for probably another two hours or so to try to get through everybody's questions. Are there any questions around projects? >> [LAUGH] >> And organizational stuff? All right, then let's take a look at the overview for today. So to really appreciate the power of recurrent neural networks, it makes sense to get a little bit of background on traditional language models, which have huge RAM requirements and aren't quite feasible in the settings where they obtain their highest accuracies. And then we'll motivate recurrent neural networks with language modeling. It's a very important, kind of fundamental task in NLP that tries to predict the next word. Something that sounds quite simple but is really powerful. And then we'll dive a little bit into the problems, which you can actually quite easily understand once you have figured out how to take gradients and you actually understand what backpropagation does. And then we can go and see how to extend these models and apply them to real sequence tasks that people really run in practice. All right, so let's dive right in. Language models. So basically, we want to just compute the probability of an entire sequence of words. And you might say, well, why is that useful? Why should we be able to compute how likely a sequence is? And it actually comes up for a lot of different kinds of problems. One, for instance, is machine translation, where you might have a bunch of potential translations that a system gives you, and then you might wanna understand which order of words is the best. So "the cat is small" should get a higher probability than "small the is cat". But based on another language that you translate from, it might not be as obvious; the other language might have a reversed word order and whatnot. Another one is when you do speech recognition, for instance. Well, this particular example is clearly more a machine translation example, but it comes up also in speech recognition, where you might wanna understand which word might be the better choice given the rest of the sequence. So "walking home after school" sounds a lot more natural than "walking house after school". But home and house have the same translation, or the same word, in German, which is Haus, H-A-U-S, and you want to know which one is the better one for that translation. So it comes up in a lot of different areas. Now, basically, it's hard to compute the perfect probabilities for all potential sequences 'cause there are a lot of them. And so what we usually end up doing is we basically condition on just a window: we try to predict the next word based on just the previous n words before the one that we're trying to predict.
So this is, of course, an incorrect assumption. The next word that I will utter will depend on many words in the past. But it's something that had to be done to use traditional count-based machine learning models. So basically we'll approximate this overall sequence probability here with just a simpler version. In the perfect sense, this would basically be the product here over each word, given all preceding words from the first one all the way to the one just before the i-th one. But in practice, we couldn't really compute this probability with traditional machine learning models, so we actually approximate it with just some number of n words before each word. So this is a simple Markov assumption, just assuming the next action, or the next word that is uttered, depends only on the n previous words. And now, if we wanted to use traditional methods that are just basically based on the counts of words, and not using our fancy word vectors and so on, then the way we would estimate these probabilities is essentially just by counting. If you want to get the probability of the second word given the first word, you would just basically count up how often these two words co-occur in this order, divided by how often the first word appears in the whole corpus. Let's say we have a very large corpus and we just collect all these counts. And now, if we wanted to condition not just on the previous word but on the two previous words, then we'd have to compute all these counts. And now you can kind of sense that, well, if we want to ideally condition on as many n-grams as possible, but we have a large vocabulary of say 100,000 words, then we'll have a lot of counts. Essentially 100,000 cubed numbers we would have to store to estimate all these probabilities. Does that make sense? Are there any questions for these traditional methods? All right, now, the problem with that is that the performance usually improves as we have more and more of these counts, but you also now increase your RAM requirements. And so one of the best models of this traditional type actually required 140 gigs of RAM just for computing all these counts, when they wanted to compute them for a 126 billion token corpus. So it's very, very inefficient in terms of RAM. And you would never be able to put a model that basically stores all these different n-gram counts in a phone or any small machine. And now, of course, once computer scientists struggle with a problem like that, they'll find ways to deal with it, and so there are a lot of different ways you can back off. You say, well, if I don't find the 4-gram, or I didn't store it because it was not frequent enough, then maybe I'll try the 3-gram. And if I can't find that, or I don't have many counts for that, then I can back off and estimate my probabilities with fewer and fewer words in the context size. But in general you want to have at least trigrams or 4-grams that you store, and the RAM requirements for those are very large. So that is actually something that you'll observe in a lot of comparisons between deep learning models and traditional NLP models that are based on just counting words for specific classes. The more powerful your models are, sometimes the RAM requirements can get very large very quickly, and there are a lot of different ways people have tried to combat these issues. Now, our way will be to use recurrent neural networks.
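Before moving on to RNNs, here is a minimal sketch of the count-based estimation just described. The tiny corpus and the add-one smoothing are made up purely for illustration; the real systems mentioned above use billions of tokens and much more careful back-off and smoothing.

```python
from collections import Counter

# Toy corpus, purely illustrative; a real count-based model uses billions of tokens.
corpus = "the cat is small . the cat sat . the dog is small .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    # P(w2 | w1) ~ count(w1, w2) / count(w1), with add-one smoothing so that
    # unseen pairs don't get probability zero (one of many possible smoothing choices).
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(unigrams))

print(p_bigram("cat", "the"))    # "the cat" was seen, so this is relatively high
print(p_bigram("small", "the"))  # "the small" was never seen, so this is lower
```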
Basically, they're similar to the normal neural networks that we've seen already, but they will actually tie the weights between different time steps. And as you go over the sequence, you keep re-using essentially the same linear plus non-linearity layer. And that will, at least in theory, allow us to actually condition what we're trying to predict on all the previous words. And now the RAM requirements will only scale with the number of words, not with the length of the sequence that we might want to condition on. So now, how is this really defined? Again, you'll see different kinds of visualizations, and I'm introducing you to a couple. I like sort of this unfolded one, where we have here an abstract hidden state at time step t; it's conditioned on h_{t-1}, and then here you compute h_{t+1}. But in general, the equations here are quite intuitive. We assume we have a list of word vectors. For now, let's assume the word vectors are fixed. Later on we can actually loosen that assumption and get rid of it. And now, at each time step, to compute the hidden state at that time step, we essentially just have these two matrices, these two linear layers, matrix-vector products, and we sum them up. And that's essentially similar to saying we concatenate h_{t-1} and the word vector at time step t, and we also concatenate these two matrices. And then we apply an element-wise non-linearity. So this is essentially just a standard single-layer neural network. And then, on top of that, we can use this as a feature vector, or as the input to our standard softmax classification layer, to get an output probability, for instance, over all the words. So now, the way we would write this out in this formulation is basically that the probability that the next word is at this specific index j, conditioned on all the previous words, is essentially the j-th element of this large output vector. Yes? What is S? So here you can have different ways to define your matrices. Some people just use U, V, and W or something like that. But here we basically use the superscript just to identify which matrix we have. And these are all different matrices. So W^(hh): the reason we call it hh is that it's the W that computes the hidden layer h given the input h_{t-1}. And then you have W^(hx) here, which essentially maps x into the same vector space that we have our hidden states in. And then S is just our softmax W, the weights of the softmax classifier. And so let's look at the dimensions here; it's again very important. You have another question? So the question is why we concatenate and not add. They're the same. When you write W^(hh) times h, using the same notation, plus W^(hx) times x, then this is actually the same thing. And so this will now basically be a vector, and we feed it into the non-linearity, but that doesn't really change things, so let's just look at this inside part here. Now, if we concatenate h and x together — let's say x here has a certain dimensionality, which we'll call d, so x is in R^d, and our h we'll define to have dimensionality D_h, so it's in R^(D_h). Now, what would the dimensionality be if we concatenated these two matrices? The output here has to be, again, a D_h dimensional vector. And now this vector here — what dimensionality does this vector have when we concatenate the two? That's right. So this is a (d + D_h) by 1 vector, and here we have a D_h by what matrix?
It has to be the same dimensionality, so d + D_h columns, and that's why we can essentially concatenate W^(hh) and W^(hx) here in this way. And now we can basically multiply these. And again, if this is confusing, you can write out all the indices, and you realize that these two are exactly the same. Does that make sense? Right, so as you sum up all the values here, it'll essentially just get summed up either way; it doesn't matter if you do it in one go or not. It's just a single-layer neural network where you concatenate the two inputs, but in many cases for recurrent neural networks it's written this way. All right. So now, here are two other ways you'll often see these visualized. This is kind of a not-unrolled version of a recurrent neural network, and sometimes you'll also see sort of this self loop here. I actually find these kinds of unrolled versions the most intuitive. All right. Now, when you start and you — yup? Good question. So what is x[t]? It's essentially the word vector for the word that appears at the t-th time step, as opposed to x_t. Intuitively, you could define x_t in any way — as you go through the lectures you'll actually observe different versions — but intuitively x_t is just a vector at time step t, whereas here x[t] is already the actual input, and what it means in practice is that you now have to go to that time step t, find the word identity, pull that word vector from your GloVe or word2vec vectors, and feed that in. So x_t we used in previous lectures as the t-th element, for instance, in the whole embedding matrix, all our word vectors. So this is just to make it very explicit that we look up the identity of the word at the t-th time step and then get the word vector for that identity, the vector among all our word vectors. Yep. So I'm showing here a single-layer neural network at each time step, and the question is whether that is standard or just for simplicity. It is actually the simplest and still somewhat useful variant of a recurrent neural network, though we'll see a lot of extensions even in this class, and then in the lecture next week we'll go to even better versions of these kinds of recurrent neural networks. But this is actually a somewhat practical neural network, though we can improve it in many ways. Now, you might be curious what you would do when you just start your sequence — this is h_0 here — and there aren't any previous words. The simplest thing is to just initialize the vector for the hidden layer at the first, or the zeroth, time step as a vector of all 0s. Right, and this is the x[t] definition we just described: the column vector of L, which is our embedding matrix, at the index of the word at time step t. All right, so it's very important to keep proper track of all our dimensionalities. Here, W^(S), the softmax matrix, actually has dimensions the size of our vocabulary, V, times the hidden state size. So the output here is a vector whose length is the number of words that we might want to be able to predict. All right, any questions on the feedforward definition of a recurrent neural network? All right, so how do we train this? Well, fortunately, we can use all the same machinery we've already introduced and carefully derived. So basically here we have a probability distribution over the vocabulary, and we're going to use the same exact cross-entropy loss function that we had before, but now the classes are essentially just the next word.
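Here is a minimal numpy sketch of the forward step and loss just defined, with made-up toy dimensions (d = 4 for word vectors, D_h = 3 for the hidden state, |V| = 10); it also checks that the two matrix-vector products are the same thing as one product with the concatenated inputs, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, Dh, V = 4, 3, 10                               # toy sizes: word dim, hidden dim, vocab size

W_hh = rng.normal(scale=0.1, size=(Dh, Dh))       # W^(hh): hidden -> hidden
W_hx = rng.normal(scale=0.1, size=(Dh, d))        # W^(hx): input  -> hidden
W_s  = rng.normal(scale=0.1, size=(V, Dh))        # W^(S):  hidden -> vocabulary scores

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max())                       # shift for numerical stability
    return e / e.sum()

h_prev = np.zeros(Dh)                             # h_0 initialized to all zeros
x_t = rng.normal(size=d)                          # the word vector looked up for time step t

h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)         # h_t = f(W^(hh) h_{t-1} + W^(hx) x[t])
y_hat = softmax(W_s @ h_t)                        # distribution over the next word, sums to 1

target = 7                                        # index of the word that actually came next
loss_t = -np.log(y_hat[target])                   # cross-entropy at this time step

# Same hidden state, written as one matrix times the concatenated vector [h_{t-1}; x_t]:
h_concat = sigmoid(np.concatenate([W_hh, W_hx], axis=1) @ np.concatenate([h_prev, x_t]))
assert np.allclose(h_t, h_concat)
```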
So this actually sometimes creates a little confusion in the nomenclature that we have, 'cause now technically this is unsupervised, in the sense that you just give it raw text. But this is the same kind of objective function we use when we have supervised training, where we have a specific class that we're trying to predict. So the class at each time step is just the word index of the next word. And you're already familiar with that; here we're just summing over the entire vocabulary for each of the elements of y. And now, to evaluate how well you can predict the next word over many different words in longer sequences, you could in theory just take the negative of the average log probabilities over this entire dataset. But for maybe historical reasons, and also reasons like information theory and so on that we don't need to get into, what's more common is actually to use perplexity. So that's just 2 to the power of this value and, hence, we want to basically be less perplexed. So the lower our perplexity is, the less the model is perplexed or confused about what the next word is. And ideally we'll assign a higher probability to the word that actually appears in the sequence at each time step. Yes? Any reason why it's 2 to the power of J? Yes, but it's sort of a rat hole we can go down, maybe after class — information theory, bits and so on; it's not necessary. All right. >> [LAUGH] >> All right, so now you would think, well, this is pretty simple, we have a single set of W matrices, and training should be relatively straightforward. Sadly — and this is really the main drawback of this, and a reason why we introduce all these other more powerful recurrent neural network models — training these kinds of models is actually incredibly hard. And we can now analyze, using the tools of backpropagation and the chain rule and all of that, and understand why that is. So basically we're multiplying here the same matrix at each time step, right? So you can kind of think of this matrix multiplication as amplifying certain patterns over and over again at every single time step. And so, in a perfect world, we would want the inputs from many time steps ago to actually be able to still modify what we're trying to predict at a later, much later, time step. And so one thing I would like to encourage you to do is to try to take the derivatives with respect to these Ws if you just had a two or three word sequence. It's a great exercise, great preparation for the midterm, and it'll give you some interesting insights. Now, as we multiply the same matrix at each time step during forward propagation, we have to do the same thing during backpropagation. We have, remember, our deltas, our error signals, and sort of the global elements of the gradients. They will essentially, at each time step, flow through this network backwards. So when we take our cross-entropy loss here, we take derivatives, we backpropagate, we compute our deltas. Now, the time step here that happened close to that output will get a very good update, and will probably also make a good update to the word vector here if we wanted to update those; we'll talk about that later. But then, as you go backwards in time, what will actually happen is your signal might get either too weak or too strong. And that is essentially called the vanishing gradient problem.
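As a quick aside before digging into that problem: the relationship between the average cross-entropy and the perplexity just mentioned is only a couple of lines of code (the probabilities below are made-up numbers, and base 2 follows the convention on the slide).

```python
import numpy as np

# Probabilities the model assigned to the word that actually occurred at each time step
# (made-up numbers, purely for illustration).
p_correct = np.array([0.20, 0.05, 0.50, 0.10])

avg_neg_log2_prob = -np.mean(np.log2(p_correct))  # average cross-entropy, in bits
perplexity = 2 ** avg_neg_log2_prob               # lower = less "perplexed" about the next word
print(perplexity)
```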
As you go backwards through time, and you try to send the error signal at time step t many time steps into the past, you'll have the vanishing gradient problem. So, what does that mean and how does it happen? Let's define here a simpler but similar recurrent neural network that will allow us to give you an intuition and simplify the math downstream. So here we essentially just say, all right, instead of our original definition where we had some kind of f, some kind of non-linearity — here we use the sigmoid function, but you could use other ones, like the rectified linear units we introduced before — instead of applying it out here, we'll apply it right in the definition here. So it's the same thing. And then let's assume, for now, we don't have the softmax; we just have here a bunch of un-normalized scores, which really doesn't matter for the math, but it'll simplify it. Now, if you want to compute the total error of an entire sequence with respect to your W, then you basically have to sum up all the errors at all the time steps. At each time step, we have an error of how incorrect we were about predicting the next word. And that's basically the sum here, and now we're going to look at the element at time step t of that sum. So let's just look at a single time step, a single error at a single time step. And now, even computing that will require us to have a very large chain rule application, because essentially this error at time step t will depend on all the previous time steps too. So you have here the partial dE_t over dy_t — with respect to the hidden state, sorry, the softmax output, or here these un-normalized score outputs y_t. But then you have to multiply that with the partial derivative of y_t with respect to the hidden state. That's just this guy right here, or this guy, for h_t. But now, that one depends on, of course, the previous one, right? This one here. But it also depends on that one, and that one, and the one before that, and so on. And so that's why you have to sum over all the time steps, from the first one all the way to the current one, where you're trying to predict the next word. And now, each of these was also computed with a W, so you have to multiply in the partial of that as well. Now, let's dig into this a little bit more. And you don't have to worry too much if this is a little fast. You won't have to really go through all of this, but it's very similar to a lot of the math that we've done before, so you can kind of feel comfortable, for the most part, going over it at this speed. So now, remember here, our definition of h_t. We basically have all these partials of all the h_t's with respect to the previous time steps, the h's of the previous time steps. Now, to compute each of these, we'll have to use the chain rule again. And now, what this means is essentially a partial derivative of a vector with respect to another vector — something that, if we're clever with our backprop definitions, we never actually have to do in practice, right? 'Cause this is a very large matrix, and we combine the computation with the flow graph and our delta messages such that we don't actually have to explicitly compute these Jacobians. But for the analysis of the math here, we'll basically look at all the derivatives.
So, just because we haven't defined it: the partial for each of these is essentially called the Jacobian, where you have all the partial derivatives of each element of the top here, h_t, with respect to the bottom. And so, in general, if you have a vector-valued function output and a vector-valued input and you take the partials here, you get this large matrix of all the partial derivatives of all the outputs with respect to all the inputs. Any questions? All right, so basically here, a lot of chain rule. And now we've got this beast, which is essentially a matrix. And for each partial here, we actually have to multiply all of these, right? So this is a large product of a lot of these Jacobians. Now, we can try to simplify this and just say, all right, let's say there is an upper bound. And the derivative of each h with respect to the previous h — with this simple definition, each h can actually be computed this way. And now we can essentially upper bound the norm of this matrix with the product of basically these two norms right here, where we have W transposed. And if you remember our backprop equations, you'll see some common terms here, but we'll actually write this out not as an element-wise product. We can write the same thing with a diagonal matrix: instead of the element-wise product, we basically just put the elements into the diagonal of a larger matrix and zero-pad everything that is off-diagonal. Now, we multiply these two norms here. And now, we just define beta_W and beta_h as essentially the upper bounds — some number, a single scalar for each, like how large they could maximally be, right? We have W; we can easily compute any kind of norm for our W, right? It's just a matrix, we compute a matrix norm, we get a single number out. And now, basically, when we put all this together, we see that an upper bound for this product of Jacobians — with each one of these elements bounded — is this product. And if we bound each of the elements here in terms of their upper bounds beta, then we basically have this product of betas here taken to the (t - k) power. And so as the sequence gets longer and longer, and t gets larger and larger, it really depends on the value of beta whether this either blows up or gets very, very small, right? Now, the norm of this matrix, for instance — you have control over that norm, right? You initialize your weight matrix W with some small random values initially, before you start training. If you initialize this to a matrix that has a norm that is larger than one, then at each backpropagation step, and the longer the time sequence goes, you will basically get a gradient that is going to explode, 'cause you take some value that's larger than one to a large power here. Say you have 100 time steps or something, and your norm is just two: then you have two to the 100th as an upper bound for that gradient. And vice versa: if you initialize your matrix W in the beginning to a bunch of small random values such that the norm of your W is actually smaller than one, then the final gradient that will be sent from h_t to h_k could become a very, very small number, right — a half to the power of 100. Basically, none of the error signal will arrive; it gets smaller and smaller as you go further and further backwards in time. Yeah? So if the gradient here is exploding, does that mean a word that is further away has a bigger impact on a word that's closer?
And the answer is, when it's exploding like that, you'll get to NaN, not-a-number, in no time. And so that doesn't even become a practical issue, because the numbers will literally become not-a-number; it's too large a value to compute. And we'll have to think of ways to combat that. It turns out the exploding gradient problem has some really great hacks that make it easier to deal with than the vanishing gradient problem, and we'll get to those in a second. All right, so now you might say, this could be a problem — now, why is the vanishing gradient problem an actual problem in common practice? And again, it basically prevents us from allowing a word that appears far in the past to have any influence on what we're trying to predict as the next word. And so here are a couple of examples from language modeling where that is a real problem. So let's say, for instance, you have "Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ___." Now, you can put almost a probability mass of one on the next word in this blank being John, right? But now each of these words has its word vector, you feed it into the hidden state, you compute this. And now you want the model to pick up the pattern that if somebody met somebody else, and all this complex stuff, and then they said hi to someone, then the next thing is the name. You wanna put a very high probability on it, but you can't get your model to actually send that error signal way back over here, to modify the hidden state in a way that would allow you to give John a high probability. And really, this is a large problem in any kind of time sequence that you have. And many people might intuitively say, well, language is mostly a sequence problem, right? You have words that appear from left to right, or in some temporal order as we speak. And so this is a huge problem. And now we'll have a little bit of code that we can look into. But before that, we'll have the awesome Shayne give us a little bit of an intermission. >> Hi, so let's take a short break from recurrent neural networks to talk about transition-based dependency parsing, which is exactly what you guys saw this time last week in lecture. So just as a recap, a transition-based dependency parser is a method of taking a sentence and turning it into a dependency parse tree. And you do this by looking at the state of the sentence and then predicting a transition. And you do this over and over again in a greedy fashion until you have a full transition sequence, which itself encodes the dependency parse tree for that sentence. So I wanna show you how to get from the model that you'll be implementing in your assignment two, question two — which you're hopefully working on right now — to SyntaxNet. So what is SyntaxNet? SyntaxNet is a model that Google came out with, and they claim it's the world's most accurate parser. And its new, fast, performant TensorFlow framework for syntactic parsing is available for over 40 languages. The one for English is called Parsey McParseface. >> [LAUGH] >> So my slide seems to have been jumbled a little bit here, but hopefully you can read through it. So basically the baseline we're gonna begin with is the Chen and Manning model, which came out in 2014. And Chen and Manning are respectively your head TA and instructor. And the models that produced SyntaxNet, in just two stages of improvements, directly modified Chen and Manning's model, which is exactly what you guys will be doing in assignment two.
And so we're going to focus today on the main bulk of these changes, the modifications which were introduced in 2015 by Weiss et al. So, without further ado, I'm gonna look at their three main contributions. The first one is that they leverage unlabeled data using something called tri-training. The second is that they tuned their neural network and made some slight modifications. And the last, and probably most important, is that they added a final layer on top of the model involving a structured perceptron with beam search. So each of these seeks to solve a problem. So the first one is tri-training. As you know, most supervised models perform better the more data they have. And this is especially the case for dependency parsing, where, as you can imagine, there are an infinite number of possible sentences with a ton of complexity, and you're never gonna see all of them, and some of them you're gonna see only very, very rarely. So the more data you have, the better. So what they did is they took a ton of unlabeled data and two highly performing dependency parsers that were very different from each other. And when they independently agreed on a dependency parse tree for a given sentence, then that would be added to the labeled data set. And so now you have ten million new tokens of data that you can use in addition to what you already have. And this by itself improved a highly performing network's performance by 1% on the unlabeled attachment score. So the problem here was not having enough data for the task, and they improved it using this. The second augmentation they made was by taking the existing model — which is the one you guys are implementing, which has an input layer consisting of the word vectors, the vectors for the part-of-speech tags, and the arc labels, with one hidden layer and one softmax layer predicting which transition — and they changed it to this. Now, this is actually pretty much the same thing, except for three small changes. The first is that there are two hidden layers instead of one hidden layer. The second is that they used a ReLU nonlinearity function instead of the cube nonlinearity function. And the third, and most important, is that they added a perceptron layer on top of the softmax layer. And notice from the arrows that it takes in as input the outputs from all the previous layers in the network. So this perceptron layer wants to solve one particular problem, and this problem is that greedy algorithms aren't able to really look ahead. They make short-term decisions, and as a result they can't really recover from one incorrect decision. So what they said is, let's allow the network to look ahead, and so we're going to have a tree which we can search over, and this tree is the tree of all the possible partial transition sequences. So each edge is a possible transition from the state that you're at. As you can imagine, even with three transitions your tree is gonna blossom very, very quickly, and you can't look that far ahead and explore all of the possible branches. So what you have to do is prune some branches, and for that they use beam search. Now, beam search is only gonna keep track of the top K partial transition sequences up to a depth of M. Now, how do you decide which K to keep? You're going to use a score computed using the perceptron weights. You guys probably have a decent idea at this point of how a perceptron works.
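As a rough illustration of that pruning idea — not the exact scoring function from the paper — here is a generic beam search sketch over partial transition sequences. The `transitions`, `next_state_fn`, and `score_fn` arguments are hypothetical placeholders standing in for the parser's shift/arc logic and the perceptron-weighted score.

```python
import heapq

def beam_search(initial_state, transitions, next_state_fn, score_fn, K, M):
    """Keep only the top-K partial transition sequences up to depth M."""
    beam = [(0.0, initial_state, [])]        # (cumulative score, parser state, transitions so far)
    for _ in range(M):
        candidates = []
        for score, state, seq in beam:
            for t in transitions(state):     # legal transitions from this state
                candidates.append((score + score_fn(state, t),
                                   next_state_fn(state, t),
                                   seq + [t]))
        if not candidates:                   # no legal transitions left anywhere on the beam
            break
        # prune: keep only the K highest-scoring partial sequences
        beam = heapq.nlargest(K, candidates, key=lambda c: c[0])
    best_score, best_state, best_sequence = max(beam, key=lambda c: c[0])
    return best_sequence
```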
The exact function they used is shown here, and I'm gonna leave up the annotations so you can take a look at it later if you're interested. But basically, those are the three things that they did to solve the problems with the previous Chen & Manning model. So, in summary, Chen & Manning had an unlabeled attachment score of 92%, already phenomenal performance. And with those three changes, they boosted it to 94%, and then there's only 0.6% left to get you to SyntaxNet, which is Google's 2016 state-of-the-art model. And if you're curious what they did to get that 0.6%, take a look at the Andor et al. paper, which uses global normalization instead of local normalization. So the main takeaway — and it's pretty straightforward, but I can't stress it enough — is that when you're trying to improve upon an existing model, you need to identify the specific flaws that are in the model, in this case the greedy algorithm, and solve those problems specifically. In this case they did that with a semi-supervised method using unlabeled data, they tuned the model better, and they used the structured perceptron with beam search. Thank you very much. >> [APPLAUSE] >> Kind of awesome. You can now look at these kinds of pictures and you totally know what's going on — in, like, state-of-the-art stuff that the largest companies in the world publish. Exciting times. All right, so we're gonna go through a little bit of a practical Python notebook sort of implementation that shows you a simple version of the vanishing gradient problem, where we don't even have a full recurrent neural network — we just have a simple two-layer neural network — and even in those kinds of networks you will see that the error starts at the top, and the norm of the gradients, as you go down through your network, is already getting smaller. And if you remember, these were the two equations where I said if you get to the end of those two equations you know all the things that you need to know, and you'll actually see these equations in the code as well. So let's jump into this. I don't see it; let me get out of the presentation. All right, better, all right. Now, zoom in. So here we're going to define a super simple problem. This is code that we started in CS231n (with Andrej), and we just modified it to make it even simpler. So let's say our data set, to keep it also very simple, is just this kind of classification data set, where we have basically three classes — the blue, yellow, and red — and they're basically in this spiral cluster form. We're going to define our simple nonlinearities. You can kind of see it almost as a solution to parts of the problem set, which is why we're only showing it now. And we'll put this on the website too, so no worries, you can revisit it later. But basically, you define here f, our different nonlinearities, element-wise, and the gradients for them. So this is f and f prime if f is a sigmoid function. We'll also look at the ReLU, the other nonlinearity that's very popular. And here we just have the maximum between 0 and x — a very simple function. Now, this is a relatively straightforward definition and implementation of this simple three-layer neural network. It has this input, here our nonlinearity, our data x — just these points in two-dimensional space — and the class, which is one of those three classes. We'll have this model here, we have our step size for SGD, and our regularization value. Now, these are all our parameters, W1, W2 and W3, for all the outputs and the variables of the hidden states.
Two sizes bigger, all right. >> [LAUGH] >> All right, now, if our nonlinearity is the ReLU, then we have here relu, and we just take the input x and multiply it. And in this case, your x can be the entirety of the dataset, 'cause the dataset's so small; instead of mini-batches, we can essentially do a full batch. Again, if you have realistic datasets, you wouldn't wanna do full-batch training, but we can get away with it here; it's a very tiny dataset. We multiply W1 times x, plus our bias terms, and then we have our element-wise rectified linear units, or ReLU. Then we compute layer two, same idea, but now its input, instead of x, is the previous hidden layer. And then we compute our scores this way. And then here we'll normalize our scores with the softmax: just exponentiate our scores and normalize by their sum. So very similar to the equations that we walked through. And now it's just basically an if statement: either we used ReLU as our activation, or we used a sigmoid, but the math inside is the same. All right, now we're going to compute our loss — our good friend, the simple average cross-entropy loss, plus the regularization. So here we have the negative log of the probabilities, and we sum them up over all the elements. And then here we have our regularization, the standard L2 regularization, and we just basically sum up the squares of all the elements in all our parameters — and I guess it does cut off a little bit, let me zoom in. All three have the same amount of regularization, and we add that to our final loss. And now, every 1,000 iterations, we'll just print our loss and see what's happening. And this is something you always want to do too: you always wanna visualize, see what's going on. And hopefully a lot of this now looks very familiar. Maybe you implemented it not quite as efficiently in problem set one, but maybe you have, and then it's very, very straightforward. Now, that was the forward propagation; we can compute our error. Now we're going to go backwards, and we compute our delta messages first from the scores. Then we have here backpropagation. And now we have the hidden layer activations, transposed, times the delta messages, to compute the gradient for W. Again, remember, for each W here we always have this outer product, and that's the outer product we see right here. And now, the softmax part was the same regardless of whether we used a ReLU or a sigmoid. Let's walk through the sigmoid here. We now basically have our delta scores, and we have here the product. So this is exactly computing delta for the next layer. And that's exactly this equation here, just in Python code. And then again, we'll have our updates dW, which is, again, this outer product right there. So there's a very nice sort of equations-to-code, almost one-to-one mapping between the two. All right, now we're going to go through the network from the top down to the first layer. Again, here, our outer product. And now we add the derivatives for our regularization. In this case, it's very simple, just the matrices themselves times the regularization constant. And we combine all our gradients in this data structure, and then we update all our parameters with our step_size and SGD. All right, then we can evaluate how well we do on the training set, so that we can basically print out the training accuracy as we train. All right, now we're going to initialize all the dimensionalities. So we have there just our two-dimensional inputs, three classes, and we set the sizes of the hidden vectors.
Let's say they're 50; it's pretty large. And now we can run this. All right, we'll train it with both sigmoids and rectified linear units. And now, once we wanna analyze what's going on, we can essentially plot some of the magnitudes of the gradients. So those are essentially the updates as we do backpropagation through the network. And what we'll see here are the sums of the gradients for the first and the second layer when we use sigmoid non-linearities. And basically here, the main takeaway message is that blue is the first layer and green is the second layer. So the second layer is closer to the softmax, closer to what we're trying to predict, and hence its gradient is usually larger in magnitude than the one that arrives at the first layer. And now imagine you do this 100 times, and you intuitively have your vanishing gradient problem in recurrent neural networks — they'll essentially be zero. They're already almost halved in size over the iterations when you just have two layers. And the problem is a little less strong when you use rectified linear units, but even there you're going to have some decrease as you continue to train. All right, any questions around this code snippet and vanishing gradient problems? No — sure. [LAUGH] That's a good question. The question is why the gradients are flatlining. And it's essentially because the dataset is so simple that you actually just perfectly fit your training data. And then there's not much else to do; you're basically in a local optimum, and then not much else is happening. So yeah, these are the outputs where, if you visualize the decision boundaries, here with the ReLU you see a little bit more sort of edges, because you have sort of linear parts of your decision boundary, and the sigmoid is a little smoother, a little rounder. All right, so now you can implement very quick versions of this to get an intuition for the vanishing gradient problem. Now, the exploding gradient problem is, in theory, just as bad. But in practice, it turns out we can actually have a hack that was first introduced by Tomas Mikolov, and it's very unmathematical in some ways: say all you have is a large gradient of 100 — let's just cap it to five. That's it; you just define the threshold, and you say whenever the value is larger than a certain value, just cut it. Totally not the right mathematical direction anymore, but it turns out to work very well in practice. Yep? So for the vanishing gradient problem, how would you cap it? It gets smaller and smaller — do you just multiply it up? But then it might overshoot; it might go in the completely wrong direction. And you don't want the hundredth word to matter unless it really matters. You can't just make all the hundred words or thousand words of the past all matter the same amount, right? Intuitively, that doesn't make that much sense either. So this gradient clipping solution is actually really powerful. And then, a couple years after it was introduced, Yoshua Bengio and one of his students actually gained a little bit of intuition about it, and it's something I encourage you to always do too. Not just in the equations, where you can write out a recurrent neural network where everything's one-dimensional, and the math comes out easy and you gain intuition about it. But you can also — and this is what they did here — implement a very simple recurrent neural network which just has a single hidden unit. Not very useful for anything in practice, but now, with the single-unit W
and still the bias term, they can actually visualize exactly what the error surface looks like. Oftentimes we call this the error surface, or the energy landscape, or so on — the landscape of our objective function. And on this error surface, basically, you can see here that the z axis is the error that you have when you train this on a very simple problem. I forgot what the problem here was, but it's something very simple, like keep this unit around, remember the value, and then just return that value 50 time steps later — something simple like that. And what they essentially observe is that in this error surface, or error landscape, you have these high-curvature walls. And so, as you do an update, each little line here you can interpret as what happens at an SGD update step: you update your parameters. And you say, in order to minimize my objective function right now, I'm going to change the value of my one hidden unit and my bias term just by this amount, to go over here, go over here. And all of a sudden you hit these high-curvature walls, and then your gradient basically blows up and it moves you somewhere way different. And so intuitively, what happens here is, if you rescale the gradient to a fixed size with this clipping method, then essentially you're not going to jump to some crazy, faraway place; you're just going to stay in this general area that seemed useful before you hit that curvature wall. Yeah? So the question is, intuitively, why wouldn't such a trick work for the vanishing gradient problem, when it does work for the exploding gradient problem — why does the reasoning for the vanishing case not apply to the exploding case? So intuitively, this is exactly the issue here. With the exploding gradient, you move way too far away — you basically jump out of the area where, in this case here for instance, we were getting closer and closer to a local optimum, but the local optimum was very close to a high-curvature wall. And without the clipping trick, you would go way far away. Right. Now, with the vanishing gradient problem, it gets smaller and smaller, so in general clipping doesn't make sense — that's the obvious answer. When something gets smaller and smaller, it doesn't help to have a maximum and then cut it to that maximum, 'cause that's not the problem; it goes in the opposite direction. So that's the most obvious intuitive answer. Now, you could say, why couldn't you blow it up if it gets below a certain threshold? But then that would mean — let's say you wanted to predict a word, and now you're 50 time steps away, and really, that 51st word back doesn't actually impact the word you're trying to predict at time step t, right? It's 50-some time steps away and it doesn't really modify that word, and now you're artificially going to blow up its gradient and make it more important. So that's less intuitive than saying, I don't wanna jump into some completely different part of my error surface. The wall just comes from the fact that this is what the error surface looks like for a very, very simple recurrent neural network with a very simple kind of problem that it tries to solve. And you can actually take most of the networks that you have, try to make them have just two parameters, and then you can visualize something like this too. In fact, it's sometimes very intuitive to do that when you try different optimizers — we'll get to those in a later lecture, like Adam or SGD with momentum; we'll talk about those soon.
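As a small numerical aside — not the lecture's actual demo — both behaviours and the clipping fix are easy to see in a few lines of numpy. The toy sizes below are made up, and repeatedly multiplying by W transposed is only a crude stand-in for backpropagation through many time steps (the real Jacobians also include the derivative of the nonlinearity).

```python
import numpy as np

rng = np.random.default_rng(1)
Dh, T = 50, 100                                    # toy hidden size and number of time steps

for scale in (0.5, 1.5):                           # spectral norm below vs. above one
    W = scale * np.linalg.qr(rng.normal(size=(Dh, Dh)))[0]   # orthogonal matrix times a scale
    grad = rng.normal(size=Dh)
    for _ in range(T):                             # crude stand-in for sending the error back T steps
        grad = W.T @ grad
    print(scale, np.linalg.norm(grad))             # ~0.5**100 vanishes, ~1.5**100 explodes

def clip_gradient(grad, threshold=5.0):
    """The clipping hack: if the gradient norm exceeds the threshold, rescale it down to the threshold."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad
```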
You can actually always try to visualize those in some simple kind of landscape; this just happens to be the landscape that this particular recurrent neural network problem has, with one hidden unit and just a bias term. So the question is, how could we know for sure that this also happens with non-linear activations and multiple weights? Well, you also have a non-linearity here in this example, so that intuitively wouldn't prevent us from transferring that knowledge. Now, in general, it's very hard; we can't really visualize very high-dimensional spaces. There is actually an interesting new idea that was introduced, I think by Ian Goodfellow, where — let's say you have your parameter space, and over your parameter space you have some kind of cost function. So you say, my W matrices are at this value and so on, and I have some error when all my values are here, and then I start to optimize and I end up somewhere here. Now, the problem is we can't visualize it, because usually, in realistic settings, you have at least a million or so parameters, sometimes 100 million. And so something crazy might be going on as you optimize between those points. And because we can't visualize it, and we can't even subsample it because it's such a high-dimensional space, what they do is they actually draw a line from the point where they started, with their random initialization before optimization, all the way to the point where the optimization actually finished. And then you can evaluate along this line, at certain intervals, how big your error is. And if that error changes a lot between two such intervals, then that means we have very high curvature in that region. So that's one trick for how you might use this idea and gain some intuition about the curvature of the space. But yeah, only in two dimensions can we get such nice intuitive visualizations. Yeah? So the question is, why don't we just have fewer dependencies? It's of course a legit question, but ideally we'll let the model figure this out. Ideally we're better at optimizing the model, and the model has, in theory, these long-range dependencies. In practice, they rarely ever do. In fact, when you implement these — and you can start playing around with this, and this is something I encourage you all to do too — as you implement your models, you can try to make it a little bit more interactive. Have some IPython notebook, give it a sequence, and look at the probability of the next word. And then give it a different sequence where you change words, like, ten time steps away, and look again at the probabilities. And what you'll often observe is that after seven words or so, the words before actually don't matter, especially not for these simple recurrent neural networks. But because this is a big problem, there are actually a lot of different kinds of solutions. And so the biggest and best one is one we'll introduce next week. But a simpler one is to use rectified linear units and to also initialize both of your W's — the one from hidden to hidden and the one from the input to the hidden state — with the identity matrix. And this is a trick that I introduced a couple years ago, and then it was sort of combined with rectified linear units and applied to recurrent neural networks by Quoc Le. And so the main idea here is, as you move around in your space — let's say you have your h, and usually we have here our W^(hh) times h, plus W^(hx) times x.
And let's assume for now that h and x have the same dimensionality. So then all of these are essentially square matrices, and we have here our different vectors. Now, in the standard initialization, what you would do is you'd just have a bunch of small random values in all the different elements of W. And what that means is, as you start optimizing, whatever x is, you have some random projection into the hidden space. Instead, the idea here is we actually have identity initialization. Maybe you can scale it, so instead you have a half times the identity — and what does that do, intuitively, when you combine the hidden state and the word vector? That's exactly right. If this is an identity-initialized matrix — so it's just 1, 1, 1, 1 on the diagonal — and you multiply all of these by one half, that's the same as just having a half, a half, a half, and so on, on the diagonal. And you multiply this with this vector, and you do the same thing here. What that essentially means is that you have a half times that vector, plus a half times that other vector. And intuitively, that means in the beginning, if you don't know anything, let's not do a crazy random projection into the middle of nowhere in our parameter space, but just average. And say, well, as I move through the space, my hidden state is just a moving average of the word vectors, and then I start making some updates. And it turns out, when you look here and you apply this to the toy problem of MNIST — which we don't really have to go into, but it's a bunch of small digit images, and they're trying to basically predict which digit it is by going over all the pixels as a sequence, instead of using other kinds of neural networks like convolutional neural networks — and we basically look at the test accuracy: these are very long time sequences, and the test accuracy for these is much, much higher when you use this identity initialization instead of random initialization, and also use rectified linear units. Now, more importantly for real language modeling, we can compare recurrent neural networks in this simple form. So we had the question before, like, do these actually matter, or did I just kind of describe single-layer recurrent neural networks to the class to describe the concept? And here we actually have these simple recurrent neural networks, and we basically compare. This one is called Kneser-Ney with 5-grams — so a lot of counts, and some clever back-off and smoothing techniques which we won't need to get into for the class. And we compare these on two different corpora, and we basically look at the perplexity. So these are all perplexity numbers, and we look at the neural network, or the neural network that's combined with the Kneser-Ney probability estimates. And of course, when you combine the two, then you don't really get the advantage of having less RAM. So ideally this by itself would do best, but in general, combining the two used to still work better. These are results from five years ago, and the field moves very quickly; I think the best results now are pure neural network language models. But basically we can see that, compared to Kneser-Ney, even back then the neural network actually works very well and has much lower perplexity than just the Kneser-Ney, or just a count-based model. Now, one problem that you'll observe in a lot of cases is that the softmax is really, really large. So your word vectors are one set of parameters, but your softmax is another set of parameters. And if your hidden state is 1000-dimensional, and let's say you have 100,000 different words,
then that's a 100,000 by 1,000 dimensional matrix that you'd have to multiply with the hidden state at every single time step. So that's not very efficient, and one way to improve this is with class-based word prediction, where we first try to predict some class that we come up with — and there are different kinds of things we can do. In many cases you can just sort the words by how frequent they are, and say the thousand most frequent words are in the first class, the next thousand most frequent words are in the second class, and so on. And so you first basically classify — try to predict the class based on the history — and then you predict the word inside that class, based on that class. And so this one is only a thousand-dimensional, and so you can basically do this. And now, the more classes, the better the perplexity, but also the slower the speed, and the less you gain from this. And especially at training time, which is what we see here, this makes a huge difference. So if you have just very few classes, you can actually reduce the number of seconds that each epoch takes by almost 10x compared to having more classes, or even more than 10x compared to the full softmax. And even test time is faster, 'cause now you essentially only evaluate the word probabilities for the classes that have a very high probability here. All right, one last trick, and this is maybe obvious to some, but it wasn't obvious to others, even in the past when people published on this: you essentially only need to do a single backwards pass through the sequence, in which you accumulate all the deltas from each error at each time step. So, looking at this figure really quickly again: here, essentially, you have one forward pass where you compute all the hidden states and all your errors, and then you only have a single backwards pass, and as you go backwards in time you keep accumulating all the deltas of each time step. And so originally people said, for this time step I'm gonna go all the way back, and then I go to the next time step, and then I go all the way back, and then the next step, and all the way back — which is really inefficient, and essentially the same as combining all the deltas in one clean backpropagation pass. And again, it's kind of an intuitive sort of implementation trick, but people gave it the name backpropagation through time. All right, now that we have these simple recurrent neural networks, we can use them for a lot of fun applications. In fact, take the named entity recognition example that we used with the window model. In the window model, you could only condition the probability of this being a location, a person, or an organization on the words in that window. With a recurrent neural network, you can in theory condition these probabilities on much larger context sizes. And so you can do Named Entity Recognition (NER), you can do entity-level sentiment in context. So for instance, take "I liked the acting, but the plot was a little thin." You can say, I now want to predict the positive class for the word "acting", predict the null class — no sentiment — for all the other words, and then "plot" should get the negative class label.
Or you can classify opinionated expressions, and this is what researchers at Cornell did: they essentially used RNNs for opinion mining and wanted to classify whether each word, in a relatively small corpus here, is either a direct subjective expression or an expressive subjective expression — so either direct or expressive. So basically, direct subjective expressions explicitly mention some private state or speech event, whereas the ESEs just indicate the sentiment or emotion without explicitly stating or conveying it. So here's an example, like "the committee, as usual, has refused to make any statements." And so you want to classify "as usual" as an ESE, and basically give each of these words here a certain label. And this is something you'll actually observe a lot in sequence tagging tasks. Again, it's all the same model, the recurrent neural network; you have the softmax at every time step. But now the softmax actually has a set of classes that indicate whether a certain expression begins or continues. And so here you would basically have this BIO notation scheme, where you have the beginning, the inside, or an O token — it's not any of the expressions that I care about. So here you would say, for instance, "as usual" is overall an ESE expression, so it begins here and it's in the middle right here. And then these are neither ESEs nor DSEs. All right, now, they started with the standard recurrent neural network, and I want you, at some point, to be able to glance over these equations and just say, I've seen this before. It doesn't have to be W superscript hh, and so on — the summation order, of course, doesn't matter either. But here they use W, V, and U, and then, instead of writing out softmax, they write g here. But once you look at these equations, I hope that eventually you're just like, it's just a recurrent neural network, right? You have here your hidden-to-hidden matrix, you have your input-to-hidden matrix, and here you have your softmax weights. So, same idea, but these are the actual equations from this real paper that you can now kind of read and immediately sort of have the intuition of what happens. All right, now, with a unidirectional recurrent neural network, if we try to make the prediction here — of whether this is an ESE, or whatever: named entity recognition, any kind of sequence labelling task — what's the problem with this kind of model? What do you think, as we go from left to right only? What do you think could be a problem for making the most accurate predictions? That's right. Words that come after the current word can't help us make accurate predictions at that time step, right? 'Cause we only went from left to right. And so one of the most common extensions of recurrent neural networks is actually to do bidirectional recurrent neural networks, where instead of just going from left to right, we also go from right to left. And it's essentially the exact same model. In fact, you could implement it by changing your input and just reversing all the words of your input, and then it's exactly the same thing. And now, here's the reason why they don't have superscripts like W^(hh): now they have these arrows that indicate whether you're going from left to right or from right to left. And now they basically have this concatenation here, and in order to make a prediction at a certain time step t, they essentially concatenate the hidden states from both the left direction and the right direction.
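A minimal numpy sketch of that bidirectional setup, with made-up toy dimensions: each direction reuses the simple recurrent step from earlier in the lecture (with its own parameters), and the per-time-step feature vector is the concatenation of the left-to-right and right-to-left hidden states.

```python
import numpy as np

rng = np.random.default_rng(2)
d, Dh, T = 4, 3, 6                          # toy sizes: word dim, hidden dim, sequence length
X = rng.normal(size=(T, d))                 # word vectors for the sequence

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def run_rnn(X, W_hh, W_hx):
    """One direction: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t), starting from h_0 = 0."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in X:
        h = sigmoid(W_hh @ h + W_hx @ x_t)
        hs.append(h)
    return np.stack(hs)

new_params = lambda: (rng.normal(scale=0.1, size=(Dh, Dh)),
                      rng.normal(scale=0.1, size=(Dh, d)))
H_fwd = run_rnn(X, *new_params())               # reads the sequence left to right
H_bwd = run_rnn(X[::-1], *new_params())[::-1]   # reads right to left, then re-align to time order

features = np.concatenate([H_fwd, H_bwd], axis=1)   # [h_t(->); h_t(<-)] at every time step
print(features.shape)                       # (T, 2 * Dh): input to the per-time-step classifier
```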
And those are now the feature vectors. This vector ht coming from the left has all the context, roughly again seven-plus words depending on how well you train your RNN, from all the words on the left; ht from the right has all the context from the words on the right, and that is now your feature vector to make an accurate prediction at a certain time step. Any questions around bidirectional recurrent neural networks? You'll see these a lot in all the recent papers you'll be learning about, in various modifications. Yeah. Have people tried convolutional neural networks? They have, and we have a special lecture where we'll talk a little bit about convolutional neural networks. So you don't necessarily have a cycle, right? Basically as you implement this, you go once all the way from the left, and you don't have any interactions with the pass that goes from the right. You can compute your feedforward ht's for that direction so they only come from the left, and the ht's from the other direction you can compute separately; in fact, you could parallelize this if you want to be super efficient and have one core implement the left direction and one core implement the right direction. So in that sense it doesn't make the vanishing gradient problem any worse. But of course, just like any recurrent neural network, it does have the vanishing gradient and exploding gradient problems, and you have to be clever about clipping it, and so on. Yeah, we call those standard feedforward neural networks or window-based feedforward neural networks, and now we have recurrent neural networks, and this is really one of the most powerful families, and we'll see lots of extensions.

In fact, if there's no other question, we can go even deeper; it is, after all, deep learning. And so now you'll observe [LAUGH] we definitely had to skip that superscript, and we have different characters here for each of our matrices, because instead of just going from left to right, you can also have a deep neural network at each time step. And so now, to compute the ith layer at a given time step, you essentially again have only the things coming from the left that modify it, but you don't just take in the vector from the left, you also take the vector from below. So in the simplest definition that is just your x, your input vector, right? But as you go deeper, you now also have the previous hidden layer's output as input. So the question is, why do we feed the hidden layer into another hidden layer instead of the y? In fact, you can actually have so-called short-circuit connections too, where each of these h's can go directly to the y as well. And so here in this figure you see that only the top ones go into the y, but you can actually have short-circuit connections where y has as input not just ht from the top layer, noted here as capital L, but the concatenation of all the h's. It's just another way to make this monster even more monstrous. And in fact there are a lot of modifications; Shayne has a paper on arXiv right now, a search-space-odyssey type thing, where you have so many different kinds of knobs that you can tune for even more sophisticated recurrent neural networks of the type that we'll introduce next week that it gets a little unwieldy, and it turns out a lot of the things don't matter that much, but each can kind of give you a little bit of a boost in many cases.
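A minimal NumPy sketch of the stacked (deep) RNN computation described above, where layer i at time t combines that layer's own state from the left with the state of the layer below at the same time step, layer 0's "below" being the input x itself; the function name, matrix names, and sizes are illustrative assumptions, and no short-circuit connections are included.

```python
import numpy as np

def deep_rnn_states(xs, W, V):
    """Hidden states of a stacked (deep) Elman-style RNN (illustrative).

    xs: list of input vectors x_1..x_T.
    W[i]: hidden-to-hidden matrix for layer i (connection from the left).
    V[i]: bottom-up matrix for layer i (connection from the layer below;
          for layer 0 the "layer below" is the input x).
    Returns hs[t][i], the layer-i hidden state at time t.
    """
    L, H = len(W), W[0].shape[0]
    hs = []
    prev = [np.zeros(H) for _ in range(L)]   # h_{t-1} for every layer
    for x in xs:
        below = x                            # input from the layer below
        cur = []
        for i in range(L):
            h = np.tanh(W[i] @ prev[i] + V[i] @ below)
            cur.append(h)
            below = h                        # the next layer up sees this layer's state
        hs.append(cur)
        prev = cur
    return hs

# Tiny usage example with random weights; sizes are purely illustrative.
rng = np.random.default_rng(0)
D, H, L, T = 8, 16, 3, 5
W = [rng.normal(scale=0.1, size=(H, H)) for _ in range(L)]
V = [rng.normal(scale=0.1, size=(H, D if i == 0 else H)) for i in range(L)]
xs = [rng.normal(size=D) for _ in range(T)]
states = deep_rnn_states(xs, W, V)
print(len(states), len(states[0]), states[-1][-1].shape)   # T, L, (H,)
```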
So whether you have three layers or four layers, what's the dimensionality of all the layers, and what are the various different kinds of connections and short-circuit connections? We'll introduce some of these, but in general this is a pretty decent model, and we'll eventually abstract away from how we compute that hidden state; that will be a more complex kind of cell type that we'll introduce next Tuesday.

Do we have one more question? So now, how do we evaluate this? It's very important to evaluate your problems correctly, and we actually talked about this before. When you have a very imbalanced data set, where some of the classes appear very frequently and others are not very frequent, you don't wanna use accuracy. In fact, in these kinds of sentences you often observe, and this is an extreme one where you have a lot of ESEs and DSEs, but in many cases it's just content, standard sort of non-sentiment context and words, and so a lot of these are actually O, they have no label. And so it's very important to use F1, and we basically had this question also after class, but it's important for all of you to know because the F1 metric is really one of the most commonly used metrics. It's essentially just the harmonic mean of precision and recall. Precision is just the true positives divided by true positives plus false positives, and recall is just true positives divided by true positives plus false negatives, and then you take the harmonic mean of these two numbers, as in the small sketch below. So intuitively, you can be very accurate by always saying the same thing, or have a very high recall for a certain class by always predicting it, but if you always miss another class, that would hurt you a lot. And now here's an evaluation that you should also be familiar with, and this is something I would like to see in a lot of your project reports too, as you analyze the various hyperparameters that you have. And so one thing they found here is that they have two different data set sizes that they train on, and in many cases if you train with more data, you basically do better. But it's not always the case that more layers, so this is the depth that we had here, the number L for all these different layers, it's not always the case that more layers are better. In fact here, the highest performance they get is with three layers, instead of four or five.

All right, so let's recap. Recurrent neural networks: the best deep learning model family that you'll learn about in this class. Training them can be very hard; fortunately, you understand backpropagation now, so you can gain an intuition of why that might be the case. In the next lecture we'll extend them to some much more powerful models, the Gated Recurrent Units or LSTMs, and those are the models you'll see all over the place in all the state-of-the-art models these days. All right. Thank you.
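For completeness, a tiny sketch of the F1 computation from the evaluation discussion above; the counts in the usage example are made up purely for illustration.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, as described in the lecture."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy counts (illustrative): 40 true positives, 10 false positives, 30 false negatives.
print(f1_score(tp=40, fp=10, fn=30))   # precision 0.8, recall ~0.571 -> F1 ~0.667
```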
Info
Channel: Stanford University School of Engineering
Views: 114,623
Rating: 4.9082322 out of 5
Id: Keqep_PKrY8
Length: 78min 3sec (4683 seconds)
Published: Mon Apr 03 2017