Lecture 4: Word Window Classification and Neural Networks

Stanford University. It's getting real today. So let's talk a little bit about the overview for today. We'll really get you into the background for classification. And then we'll do some interesting things with updating these word vectors that we've so far learned in an unsupervised way: we'll update them with real supervision signals such as sentiment and other things. Then we'll look at the first model that is actually useful and that you might want to use in practice (well, other than the word vectors, of course), for one downstream task, which is window classification. And we'll also clear up some of the confusion around the cross-entropy error and how it connects with the softmax. Then we'll introduce the famous neural network, our most basic Lego block, which we may start to call deep, to get to the actual title of this class: deep learning for NLP. And then we'll introduce another loss function, the max-margin loss, and take our first steps in the direction of backprop. So this lecture will be, I think, very helpful for problem set one. We'll go into a lot of the math that you'll probably need for number two in the problem set. So I hope it'll be very useful, and I'm excited for you, cuz at the end of this lecture you'll hopefully feel a lot better about the magic of deep learning. All right, are there any organizational questions around problem sets or programming sessions with the TAs? No, we're all good? Awesome, thanks to the TAs for clearing everything up. Cool, so let's be very careful about our notation today, because that is one of the main things that a lot of people trip up over as we go through very complex chain rules and so on. So let's start at the beginning and say we usually have a training dataset of some inputs x and some outputs y. x could, in the simplest case, be words in isolation, just a single word vector. That's not something you would usually do in practice, but it'll be easy for us to learn that way, so we'll start with it. Then we'll move to context windows today, and eventually we'll use the same basic building blocks that we introduce today for sentences and documents, and then complex interactions between everything. Now, the output in the simplest case is just a single label: a positive or a negative kind of sentence, say. It could be the named entity tags of certain words in their context. It can also be other words: in machine translation, for instance, you might eventually want to output a sequence of other words as your y_i, and we'll get to that in a couple of weeks. So basically, we can have multi-word sequences as potential outputs. All right, so what is the intuition for classification? In the standard machine learning case, so not yet the deep learning world, for something as simple as logistic regression we basically want to define and learn a simple decision boundary, where we say everything on one side is in one class and everything else is in the other class. And in general machine learning, we assume our inputs, the x's, are fixed; they're just given, and we'll only train the W parameters, which are our softmax weights. So we'll compute the probability of y, given the input x, with those weights. And one notational comment here: for the whole dataset, we often subscript with i, but when I drop the i, we're just looking at a single example of x and y.
Eventually, we're going to overload the subscript a little bit and look at the indices of certain vectors, so if you get confused, just raise your hand and ask; I'll try to make it clear which one is which. Now, let's dive into the softmax. We mentioned it before, but we want to really carefully define and recall the notation here, cuz we'll go and take derivatives with respect to all of these parameters. So we can tease apart two steps for computing this probability of y given x. The first step is, we'll take the y-th row of W and multiply that row with x. So again, with this notation: when we write W_y, that means we're taking the y-th row of this matrix and then multiplying it with x. Now we do that for every c from 1 to C, our number of classes. So let's say we take the 1st, 2nd, 3rd, and 4th rows and multiply each of them with x; then we get four numbers, and these are unnormalized scores. And then, in the second step, we basically pipe this vector of scores through the softmax to compute a probability distribution that sums to one. All right, that's our setup. Any questions around that? Cuz it's just gonna keep on going from here. All right, great. And I get, from previous surveys, that usually about 15% of the class are bored when we go through all of these derivatives, 15% are super overwhelmed, and the majority of people are like, okay, it's a good speed, I'm learning something, I'm making progress. So, sorry to the 30% for whom this is too slow or too fast. If you're super familiar with taking complex derivatives, you can probably just skim through the lecture slides, or speed it up if you're watching online. And if it's a little overwhelming, then definitely come to the office hours; we have an awesome set of TAs who will help you. All right, now let's look at a single example of an x and y that we want to predict. In general, we want our model to essentially maximize the probability of the correct class; we want it to output the right class at the end by taking the argmax of that output. And maximizing probability is the same as maximizing log probability, which is the same as minimizing the negative of that log probability, and that is often our objective function. So why do we call this the cross-entropy error? Well, we can define the cross-entropy in the abstract as follows. Let's assume we have the ground truth, or gold, or target probability distribution (we use those three terms interchangeably): basically, the ideal target in our training dataset, the y. And we'll assume that it is one at the right class and zero everywhere else. So if we have, for instance, five classes here and the correct one is the third class, then the third entry would be one and all the other numbers would be zero. So if we call this target distribution p, and call the probability distribution that our softmax outputs q, then the cross-entropy is basically the sum over all the classes of -p(c) log q(c). And in our case, p is just a one-hot vector that's really only 1 in one location and 0 everywhere else, so all the other terms basically vanish, and we end up with just -log q at the correct class, and that's exactly the negative log of what our softmax outputs, all right?
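To make those two steps and the resulting loss concrete, here's a minimal numpy sketch; the dimensions, the initialization, and the class index are just assumptions for illustration.

```python
import numpy as np

def softmax(f):
    # Subtracting the max doesn't change the result (the softmax is invariant
    # to adding a constant to all scores) but keeps exponents from overflowing.
    f = f - np.max(f)
    return np.exp(f) / np.sum(np.exp(f))

C, d = 5, 10                      # assumed: 5 classes, 10-dimensional inputs
W = 0.01 * np.random.randn(C, d)  # softmax weights, one row per class
x = np.random.randn(d)            # a single, fixed input vector
y = 2                             # index of the correct (third) class

f = W.dot(x)           # step 1: unnormalized scores, f_c = W_c . x
q = softmax(f)         # step 2: normalized probability distribution
loss = -np.log(q[y])   # cross-entropy against a one-hot p is just -log q(y)
```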
And then there are some nice connections to the Kullback-Leibler divergence and so on; I used to talk about that, but we don't have that much time today. If you're familiar with this from stats, you can see this as trying to minimize the Kullback-Leibler divergence between these two distributions. But really, this is all you need to know for the purposes of this class. So this is for one element of your training dataset. Now, of course, in general you have lots of training examples, so we have our overall objective function, which we often denote with J, over all our parameters theta. And we basically sum these negative log probabilities of the correct classes, which we index here with the sub-index y_i; so J(theta) is the sum over i of -log p(y_i | x_i), and we want to minimize this whole sum. So that's our cross-entropy error that we're trying to minimize, and we'll take lots of derivatives of it over the next couple of hours. All right, any questions so far? So this is the general ML case, where we assume our inputs are fixed. Yes, it's a single number. So we are not multiplying a vector here; p(c) is the probability for that class, so that's one single number. Great question. So the cross-entropy is a single number, our main objective that we're trying to minimize, or our error that we're trying to minimize. Now, whenever we write this f subscript y here, we don't want to forget that f is really also a function of x, our inputs, right? It's sort of an intermediate step, and it's very important for us to play around with this notation. So we can also rewrite f_y as W_y, that row, times x, and we can write out that whole inner product as a sum. And that can often be helpful as you are trying to take derivatives one element at a time, to eventually see the bigger picture of the whole matrix notation. All right, so often we'll write f in terms of this matrix notation: this is our f, this is our W, and this is our x, just a standard matrix multiplication with a vector. All right, now most of the time we'll just talk about this first part of the objective function, but that's a bit of a simplification, because in all your real applications you will also have a regularization term as part of your overall objective function: J(theta) plus lambda times the sum over all the squared parameters theta_k. And in many cases, if this theta is for instance the W matrix of our standard logistic regression, this part of the objective function will essentially encourage the model to keep all the weights as small as possible and as close as possible to zero. You can assume, if you want to be a Bayesian about it, that you have a prior, a Gaussian-distributed prior, that says ideally all these are small numbers. Often, if you don't have this regularization term, your numbers will blow up and the model will start to overfit more and more. And in fact, this kind of plot is something that you will very often see in your projects and even in the problem sets. When I took my very first statistical learning class, the professor said this is the number one plot to remember. I don't know if it's quite that important, but it is very, very important for all our applications. And it's a pretty abstract plot: you can think of the x-axis as a variety of different things. For instance, how powerful your model is, how many deep layers you have, how many parameters you have, how many dimensions each word vector has, or how long you trained the model for. You'll see the same kind of pattern with a lot of different x-axes. And then the y-axis here is essentially your error, or the objective function that you're trying to optimize and minimize.
And what you often observe is: the more powerful your model gets, the better you are at lowering your training error, the better you can fit these (x_i, y_i) pairs. But at some point you'll actually start to overfit, and then your test error, or your validation or development set error, will go up again. We'll go into a little bit more detail on how to avoid all of that throughout this course and in the project advice and so on. But this is a pretty fundamental thing, and just keep in mind that for a lot of the implementations and your projects, you will want this regularization term. But really, it's the same one for almost all the objective functions, so we're going to drop it and mostly focus on actually fitting our dataset. All right, any questions around regularization? So basically, you can think of it like this: if you really care about one specific point, you can adjust all your parameters such that the function goes exactly through those different points. And if you force it to not do that, it will be a little smoother, be less likely to fit exactly those points, and hence often generalize slightly better. And we'll go through a couple of examples of what this will look like soon. All right, now as I mentioned, in general machine learning we only optimize the W here, the parameters of our softmax classifier, and hence our updates and gradients are pretty small. In many cases we only have a handful of classes: if we have three classes and 100-dimensional word vectors we're trying to classify, we'd only have 300 parameters. Now, in deep learning, we have these amazing word vectors, and we'll actually want to learn not just the softmax but also the word vectors themselves. We can backpropagate into them, and we'll talk about how to do that today. Hint: it's going to be taking derivatives. But the problem is, when we update word vectors, conceptually, as you're thinking through this, you have to realize this is a very, very large set of parameters. Let's say your word vectors are 300-dimensional and you have 10,000 words in your vocabulary: all of a sudden you have three million parameters, so on this kind of plot you're going to be very likely to overfit. And so before we dive into all this optimization, I want you to get a little bit of an intuition of what it means to update word vectors. So let's go through a very simple example where we might want to classify single words. Again, it's not something we'll do very often, but let's say you want to classify single words as positive or negative. And let's say this is movie reviews, and in our training dataset we have the words TV and telly. If you say this movie is better suited for TV, that's not a very positive thing to say about a movie that's just coming out in movie theaters. And we would assume that in the beginning, telly, TV, and television are actually all close by in the vector space: we learned something with word2vec or GloVe vectors, we trained these word vectors on a very, very large corpus, and it learned that all three words appear often in similar contexts, so they are close by in the vector space. But now our smaller sentiment dataset only includes TV and telly among its training examples (x_i, y_i), and not television. So what happens as we train these word vectors? Well, they will start to move around.
We'll project sentiment into them, and so you might now see telly and TV (it's a British dataset) move somewhere else in the vector space. But television actually stays where it was in the beginning. And now when we test, we would actually misclassify this word, because it's never been moved. So what does that mean? The take-home message here is that if you have only a very small training dataset, one that would allow you, especially with these deep models, to overfit very quickly, you do not want to train your word vectors. You want to keep them fixed: you pre-train them with nice GloVe or word2vec models on a very large corpus, or you just download them from the GloVe website, and you keep them fixed, cuz otherwise you will not generalize as well. However, if you have a very large dataset, it may be better to train them, in the way we're going to describe in the next couple of slides. An example of where you do that is, for instance, machine translation, where you might have many hundreds of megabytes or gigabytes of training data, and you don't really need to do much with the word vectors other than initialize them randomly and then train them as part of your overall objective. All right, any questions around the generalization capabilities of word vectors? All right, it might still be magical how we're training this, so that's what we're gonna describe now. So, we rarely ever really classify single words. Really, what we want to do is classify words in their context. And there are a lot of fun and interesting issues that arise in context; that's really where language begins, and grammar, and the connection to meaning, and so on. So here are a couple of fun examples of where context is really necessary. For instance, we have some words that are actually auto-antonyms: they mean their own opposite. To sanction can mean to permit or to punish, and it really depends on the context for you to understand which one is meant. Or, to seed can mean to place seeds or to remove seeds. So without the context, we wouldn't really understand the meaning of these words. And in one of the examples that you'll see a lot, which is named entity recognition, where let's say we want to find locations or people's names, we want to identify: is this a location or not? You may have things like Paris, which could be Paris in France or Paris Hilton, and you might have 'Paris staying in Paris' and still want to understand which one is which. Or, if you want to use deep learning for financial trading and you see Hathaway, you want to make sure that if it's just a positive movie review mentioning Anne Hathaway, you're not all of a sudden buying stock in Berkshire Hathaway, right? So there are a lot of issues that are fun, interesting, and complex that arise in context. And so let's now carefully walk through this first useful model, which is window classification. We'll use as our first motivating example 4-class named entity recognition, where we basically want to identify person, location, organization, or none of the above, for every single word in a large corpus. There are lots of different possibilities, but we'll look at the following model, which is actually quite a reasonable one, and also one that started in 2008 with Collobert and Weston, a great paper that was the first to do this kind of useful, state-of-the-art word classification in context.
So, what we want to do is basically train a softmax classifier to assign a label to the center word, by concatenating all the words in a window around that word. So let's take, for example, this subphrase from a longer sentence. We basically want to classify the center word here, which is Paris, in the context of this window. And we'll define the window length as 2, meaning 2 words to the left and 2 words to the right of the current center word that we're trying to classify. All right, so what we'll do is define our new x for this whole window as the concatenation of these five word vectors. And just in general, throughout this whole lecture, all my vectors are going to be column vectors. Sadly, in number two of the problem set, they're row vectors; sorry for that. All these programming frameworks are actually row-major, so it's faster in the low-level optimization to use row vectors, but for a lot of the math I find it simpler to think of them as column vectors. We're very clear about it in the problem set, so don't get tripped up on that. So basically, we'll define this here as one 5d-dimensional column vector: we have d-dimensional word vectors, we have five of them, and we stack them up into one column, all right. Now, the simplest window classifier that we could think of is to just put the softmax on top of this concatenation of five word vectors. We'll define our input x as the x of the entire window, this concatenation, and we have the softmax on top of that. And so this is the same notation that we used before. We're introducing here y hat, with, sadly, the subscript y for the correct current class. It's tough, I went through [LAUGH] several iterations; it's tough to have perfect notation that works through the entire lecture. But you'll see why soon. So our overall objective here is, again, this whole sum over the negative logs of the probabilities of the correct classes. So now the question is, how do we update these word vectors x? Our x is now a whole window, and it's deep inside the softmax. All right, well, the short answer is: we'll take a lot of derivatives. But the long answer is, you're gonna have to do that a lot in problem set one, and maybe in the midterm. So let's be a little more helpful and actually go through some of the steps and give you some hints. Some of this you'll actually have to do in your problem set, so I'm not gonna go through all the details, but I'll give you a couple of hints along the way, so you'll know when you hit those whether you're on the right track. So, step one: always very carefully define your variables, their dimensionality, and everything. y hat we define as the softmax probability output vector, so the normalized scores, the probabilities for all the different classes that we have; in our case we have four. Then we have the target distribution t; again, that will be a one-hot vector where it's all zeros, except at the ground-truth index of the class y, where it's one. And we'll define our f here as f(x) again, which is this matrix multiplication Wx, and which is going to be a C-dimensional vector, where capital C is the number of classes that we have, all right. So that was step one: carefully define all of your variables and keep track of their dimensionality.
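As a small sketch of that setup (every name, index, and dimension here is made up just for illustration), the window vector and the classifier on top of it look like this:

```python
import numpy as np

d, V, C = 4, 10000, 4                 # word vector size, vocabulary, classes
L = 0.01 * np.random.randn(d, V)      # word-vector matrix, one column per word
W = 0.01 * np.random.randn(C, 5 * d)  # softmax weights over the whole window

window = [21, 511, 7, 302, 89]        # vocabulary indices of the 5 window words
x = np.concatenate([L[:, j] for j in window])  # the 5d-dimensional column x

f = W.dot(x)                   # C unnormalized scores for this window
y_hat = np.exp(f - np.max(f))
y_hat = y_hat / y_hat.sum()    # softmax probabilities y_hat
```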
It's very easy, when you implement this and you multiply two things that have the wrong dimensionality so that you can't actually legally multiply them, to know you have a bug. And you can do this in a lot of your equations too. You'd be surprised: in the midterm you're nervous, but maybe at the end you have some time, and you could totally grade it yourself in a first pass by just making sure that the dimensionalities of all your matrix and vector multiplications are correct. All right, the second tip is the chain rule. We went over this before, but I heard there's still a little bit of confusion in the office hours, so let's define it carefully for a simple example, and then we'll give you a couple more hints for the more complex example. So if you have something very simple, such as a function y, which you define as f(u), with u defined as g(x), so that the whole function y of x can be described as f(g(x)), then you basically multiply dy/du times du/dx. And very concretely here, this is sort of high-school level, but we'll define it properly in order to show the chain rule. So here, you can define u as g(x), which is just the inside of the parentheses, x cubed plus 7, and you have y as a function f(u), which is just 5 times u, replacing the inside with its definition. So it's very simple, just replacing things. And now we can take the derivative of y with respect to u, which is 5, and the derivative of u with respect to x, which is 3x squared; then we just multiply these two terms and plug in u again. So in that sense, we all know the chain rule in theory. But now we're gonna have the softmax, and we're gonna have lots of matrices and so on, so we have to be very, very careful about our notation. And we also have to be careful about understanding which parameters appear inside what other higher-level elements. So f, for instance, is a function of x. So if you're trying to take the derivative of this overall softmax with respect to x, you're gonna have to sum over all the different classes c inside which x appears. And you'll see here the first application of the chain rule: not just the derivative with respect to f_y (again, this y is just a subscript, the y-th element of the f vector, which is a function of x), but for each class we also multiply by the derivative of f_c with respect to x. So when you write this out, another tip that can be helpful for the softmax part of the derivative is to think of two cases: one where c = y, the correct class, and one covering all the other, incorrect classes. And as you write this out, you will observe and come up with something like this. So don't just write that down as the thing you put in your problem set; you have to show the steps of how to get there. But basically, at some point you observe this kind of pattern when you look at all the derivatives with respect to all the elements of f. And when you have this, you realize: okay, at the correct class we're actually subtracting one, and at all the incorrect classes we don't do anything. Now, the problem is, when you implement this, it kind of looks like a bunch of if statements: if y equals the correct class for my training example, then subtract 1. That's not gonna be very efficient. Also, you're gonna go insane if you ever try to actually write down equations like that for more complex neural network architectures. And so instead, what we want to do is always try to vectorize our notation, as well as our implementation.
And so, what this means here, in this case, is you can actually observe that, well, this 1 is exactly 1 where t, our one-hot target distribution, also happens to be 1. And so what you're gonna want to do is basically describe this as y hat minus t; it's the same thing. And don't worry if you don't understand how we got there, cuz that's part of your problem set: you have to, at some point, see this equation while you're taking those derivatives. And now the very first baby step towards back-propagation is to define this term in terms of a simpler single variable, and we'll call it delta. We'll become good friends with deltas, because they are sort of our error signals. Now, the last couple of tips. Tip number six: when you start with this chain rule, you might sometimes want to first use explicit sums and look at all the partial derivatives. And if you do that a couple of times, at some point you see a pattern, and then you can try to extrapolate from those patterns of single partial derivatives into vector and matrix notation. So, for example, you'll see something like this at some point in your derivation: the overall derivative with respect to x of our objective function, for one element (x, y) from our training set, is this sum over classes. And it turns out, when you think about this for a while: you take this row vector, but then you transpose it, and it becomes an inner product; and if you do that multiple times for all the c's and want to get a whole vector out at the end, you can actually just rewrite the sum as W transpose times delta. So delta is the error signal that we got from our softmax, and we multiply the transpose of our softmax weights with it. And again, if this isn't clear and you're confused, write it out as the full sum, and then you'll see that it's really just rewriting that sum in vector notation. All right, now what is the dimensionality of the window vector gradient? In the end, we have this derivative of the overall cost, for one element of our training set, with respect to x. But x is a window. So let's say we have a window of five words, and each word is d-dimensional. Now, what should be the dimensionality of this gradient? That's right, it's 5d, five times the word vector dimensionality. And that's another really good check, and one of the reasons we make you implement this from scratch: if you have any kind of parameter, and you have a gradient for that parameter, and they're not the same dimensionality, you know you screwed up and there's some mistake or bug in either your code or your math. It's a very simple debugging skill, and a way to check your own equations. So the final derivative with respect to this window is this 5d-dimensional vector, because we had five d-dimensional vectors that we concatenated. Now, of course, the tricky bit is that you actually want to update your word vectors, not the whole window, right? The window is just an intermediate step. So really, what you want to do is update, and take derivatives with respect to, each of the elements of your word vectors. And it turns out, very simply, that this can be done by just splitting up the gradient you got for the whole window, because the window was just the concatenation of the different word vectors. So you get one d-dimensional chunk of the gradient per word vector, and those you can use to update your word vectors as you train the whole system.
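Continuing the little sketch from above (reusing C, W, x, y_hat, window, L, and d), here's roughly what that vectorized gradient computation and the per-word updates look like; the true class index and the learning rate are, again, just assumptions.

```python
y = 2                      # assumed index of the correct class
t = np.zeros(C)
t[y] = 1.0                 # one-hot target distribution

delta = y_hat - t              # error signal from the softmax
grad_W = np.outer(delta, x)    # gradient for the softmax weights (C x 5d)
grad_x = W.T.dot(delta)        # gradient for the whole window (5d,)

# Cut the window gradient into one d-dimensional chunk per word and update:
lr = 0.01                  # assumed learning rate
for j, g in zip(window, np.split(grad_x, 5)):
    L[:, j] -= lr * g      # SGD step on each word vector in this window
```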
All right, any questions? >> Is there a mathematical [INAUDIBLE]? Is there a mathematical notation for the target vector t, other than it just being the variable t? >> That seems like a fine notation. You can see it as a probability distribution that is very peaked. >> Yeah. >> That's all; there's nothing else to it. Just a single vector with all zeros, except in one location. >> So I'll just write that down? >> You can write that out, yeah. And it's also something very important: you always want to define everything, so that you make sure the TAs know that you're thinking about the right thing. As you're writing out your derivatives, you write out the dimensionality, you define things properly; you can use dot, dot, dot if it's a larger-dimensional vector. You can just define t as your target distribution [INAUDIBLE]. >> The question is, do we still have two vectors for each word? Great question, no. When we did GloVe and word2vec, we had these two vectors, u and v; for all subsequent lectures from now on, we'll just assume we have the sum of u and v, and that's our single vector x for each word. So the next question is, does this gradient appear in lots of other windows? The answer is yes, it does. If you have the word 'in', that vector and its gradients will appear in all the windows that have the word 'in' inside of them, and same with 'museums' and so on. And as you do stochastic gradient descent, you look at one window at a time, you update it, then you go to the next window, you update that one, and so on. Great questions. All right, now let's look at what happens as we update these concatenated word vectors. Basically, if we train this with sentiment, we'll push all the positive words in one direction and the other words in another direction. If we train it for named entity recognition, our model can eventually learn that seeing something like 'in' as the word just before the center word is indicative of that center word being a location. So, what's missing for training this full window model? Well, mainly the gradient of J with respect to the softmax weights W. And we'll basically take similar steps: we'll write down all the partial derivatives with respect to W_ij first, and so on, and then we have our full gradient for this entire model. And again, this will be very sparse, and you're gonna want some clever ways of implementing these word vector updates, so you don't send a bunch of zeros around for every single window, cuz each window will only have a few words. In fact, it's so important for your code in the problem set to think carefully through your matrix implementations that it's worth spending two or three slides on this. There are essentially two very expensive operations in the softmax: the matrix multiplication and the exponent. Later in the lecture, we'll find a way to deal with the exponent, but the matrix multiplication can also be implemented much more efficiently. So you might be tempted in the beginning to think: this is the probability for this class, and this is the probability for that class, so I'll implement a for loop over all my different classes and do my derivatives or matrix multiplications one row at a time. And that is going to be very, very inefficient. So let's go through some very simple Python code to show you what I mean. Essentially, instead of looping over these word vectors one at a time, concatenating everything into one large matrix
and then doing one big multiplication is always going to be more efficient. So let's assume we have 500 windows that we want to classify, let's assume each window vector has a dimensionality of 300 (these are reasonable numbers), and let's assume we have five classes in our softmax. So at some point during the computation, we have two options. W here holds the weights for the softmax; it's gonna have C many rows and d many columns. Now, for the word vectors that you concatenated for each window, we can either have a list of a bunch of separate window vectors, or we can have one large matrix that's going to be d by N: d many rows and N many windows. We have 500 windows, so we have 500 columns in this one matrix. And now we can either multiply W with each vector separately, or we can do one matrix-matrix multiplication for everything. And you literally get a 12x speed difference. And sadly, with these larger models, one iteration might eventually take a day, for more complex models and large datasets. So the difference is literally between 12 days and 1 day of you iterating and making your deadlines. So it's super important. Now, sometimes people are tripped up by what it means to multiply and do this. Essentially, it's the same thing that we've done for one softmax, but we've concatenated a lot of different input vectors x, so we get a lot of different unnormalized scores out at the end, and then we can tease them apart again afterwards. So, using the same notation, you have a C by d matrix (d being the dimensionality of each window) times a d by N matrix, to get a C by N matrix. And these are all the scores for your N many training samples. Any questions around that? So, it's super important; all your code will be way too slow if you don't do this. This is very much an implementation trick, and in most of the equations we're not actually gonna go there, cuz it makes everything more complicated; the equations look at only a single example at a time. But in the end, you're gonna want to vectorize all your code. Yeah, matrices are your friends; use them as much as you can. Also, in many cases, especially for this problem set, where you really get into the nuts and bolts of how to train and optimize your models, you will come across a lot of different choices: I could implement it this way or that way. You can go to your TA and ask, should I implement it this way or that way? But you can also just use timeit as your magic Python tool, make a very informed decision, and gain the intuition yourself; you basically want to speed-test a lot of the different options that you have in your code.
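The comparison from the slide looks roughly like this; a sketch, with the exact timings obviously depending on your machine and BLAS library.

```python
import numpy as np
from timeit import timeit

N, d, C = 500, 300, 5                 # 500 windows, 300 dims, 5 classes
W = np.random.rand(C, d)              # softmax weights
vectors_list = [np.random.rand(d, 1) for _ in range(N)]  # one vector per window
vectors_matrix = np.random.rand(d, N)                    # all windows at once

# Option 1: loop over windows, one small matrix-vector product each.
loop = lambda: [W.dot(v) for v in vectors_list]
# Option 2: a single (C x d)(d x N) matrix-matrix product for all windows.
batch = lambda: W.dot(vectors_matrix)

print(timeit(loop, number=100))       # many small multiplications: slow
print(timeit(batch, number=100))      # one big multiplication: much faster
```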
All right, so that was just a pure softmax, and the softmax alone is not that powerful, because it really only gives you linear decision boundaries in your original space. If you have very, very little training data, that can be okay; you're using a not-so-powerful model almost as an abstract regularizer. But with more data, it's actually quite limiting. So if we have a bunch of words here and we don't want to update our word vectors, the softmax would only give us this linear decision boundary, which is kind of lame; it would be way better if we could correctly classify these points over here as well. And so this is one of the many motivations for using neural networks, cuz neural networks will give us much more complex decision boundaries and allow us to fit much more complex functions to our training data. And you could be snarky and rename neural networks, which sounds really cool, as just general function approximators. It wouldn't have quite the same ring to it, but that's essentially what they are. So let's define how we get from simple logistic regression to a neural network and beyond, to deep neural nets. And let's demystify the whole thing by starting with defining some of the terminology again; then we can have more fun with the math, and one and a half lectures from now we can just use all of these Lego blocks. So bear with me, this is going to be tough; try to concentrate and ask questions if you have any, cuz we'll now keep building up to a pretty awesome large model that's really useful. So we'll have inputs, we'll have a bias unit, we'll have an activation function and an output, for each single neuron in our larger neural network. So let's define a single neuron first. Basically, you can see it as a binary logistic regression unit. Inside, we're again going to have a set of weights, which we take in an inner product with our input x, and in the end we add a bias term. So we have an always-on feature, and that kind of defines how likely this neuron should be to fire; and by firing, I mean outputting a value that's very close to one, for being on. And f here is, from now on, always going to be an element-wise function, in our case the sigmoid, that takes whatever this inner product plus the bias term gives us and just squashes it to be between 0 and 1. All right, so that's the definition of a single neuron. Now, if we feed a vector of inputs through several of these different little logistic regression units, we get a vector of outputs. And the main difference between directly predicting with a softmax, as in standard machine learning, and deep learning, is that we will not force these units to directly produce the output: they will themselves be inputs to yet another neuron. And it's the loss function on top of that final neuron, such as the cross-entropy, that will govern what these intermediate hidden neurons in the hidden layer will actually try to achieve. The model can decide itself what it should represent, and how it should transform the input inside these hidden units, in order to get a lower error at the final output. And it's really just this composition of these hidden neurons, these little binary logistic regression units, that allows us to build very deep neural network architectures. Now again, for sanity's sake, we're going to use matrix notation, cuz all of this can be very simply described in terms of matrix multiplication. So a1 here is going to be the final activation of the first neuron, a2 of the second neuron, and so on. Instead of writing each of these out as an inner product plus a bias term, we're going to use matrix notation. And it's very important now to pay attention to the intermediate variables that we define, because we'll see them over and over again as we use the chain rule to take derivatives. So we'll define z as Wx plus the bias vector b. We basically have one bias term per neuron: this vector has the same dimensionality as the number of neurons that we have in this layer.
And W will have as many rows as we have neurons, and as many columns as the input dimensionality of x. And then, whenever we write a = f(z), what that means is that we apply f element-wise: f(z), when z is a vector, is just f(z1), f(z2), and f(z3). And now you might ask, well, why do we have all this added complexity with this sigmoid function? (Later on we can actually have other kinds of so-called non-linearities for this f function.) It turns out that if we didn't have the non-linearities in between and just stacked a couple of these linear layers together, it wouldn't give us anything new: the composition would still just be a single linear function. And intuitively, as you have more hidden neurons, you can fit more and more complex functions. So this is like a decision boundary in a three-dimensional space; you can also think of it in terms of simple regression. If you had just a single hidden neuron, you kinda see here almost an inverted sigmoid. With three hidden neurons, you could fit these more complex functions, and with ten neurons, each neuron can essentially start to overfit and try to be very good at fitting exactly one point. All right, now let's revisit our single window classifier, and instead of slapping a softmax directly onto the word vectors, we're now going to have an intermediate hidden layer between the word vectors and the output. And that's where we really start to gain accuracy and expressive power. So let's define a single-layer neural network. We have our input x, which will again be our window, the concatenation of multiple word vectors. We'll define z as Wx + b, and we'll define a element-wise as f applied to the entries of z. And now we can use this neural activation vector a as the input to our final classification layer. The default that we've had so far was the softmax, but let's not rederive the softmax; we've done it multiple times now, and you'll do it again in the problem set. Instead, we'll introduce an even simpler output and walk through all the gory details of that simple classifier. And that will be a simple, unnormalized score. In this case, that will essentially be the right mechanism for various simple binary classification problems, where you don't even care that much about the probability being, say, 0.8; you really just care: is it in this class, or is it not? We'll define the objective function for this new output layer in a second, but let's first understand the feed-forward process. The feed-forward process is what you'll end up using at test time, and also for each example in training, before you can take derivatives: it's always feed forward first, then backward to take the derivatives. So what we want to do here is, for example, take each window and score it, and train the model such that it assigns high scores to windows where the center word is a named entity location, such as Paris, or London, or Germany, or Stanford, or something like that. Now, we will often use this kind of graph, and you'll see it in a lot of papers, so it's good to get used to it. There are various other kinds, and we'll try to introduce them slowly throughout the lectures, but this is the most common one. So we'll define, bottom-up, what each of these layers does, and then we'll take the derivatives and learn how to optimize it. The x window here is the concatenation of all our word vectors.
So let's try to figure out the dimensionality of all our parameters here, and I'll ask you a question in a second, so I know you're with me. Let's say each of our word vectors is four-dimensional, and we have five of these word vectors concatenated in each window, so x is a 20-dimensional vector; again, we define it as a column vector. And then let's say in our first hidden layer we have eight units, so we want an eight-unit hidden layer as our intermediate representation. And our final score is again just a single number. Now, what's the dimensionality of our W, given what I just said? 20-dimensional input, eight hidden units. >> 20 rows and eight columns. >> We need one more transpose, [LAUGH] that's right: it's going to be eight rows and 20 columns, right? And whenever you're unsure, you can always check it like this: if this matrix is n by d and we multiply it with this vector, then the vector always has to be d-dimensional; the inner dimensions always have to match. All right, now what's the main intuition behind this extra layer, especially for NLP? Well, it allows us to learn non-linear interactions between the different input words. Whereas before, we could only say, if 'in' appears in this position, always increase the probability that the next word is a location, now we can learn patterns like: if 'in' is in the second position, increase the probability of the center word being a location only if 'museums' is also the first vector. So we can learn interactions between these different inputs, and that will eventually make our model more accurate. Great question: do I have a second W there? So the second layer here produces the unnormalized score, so it'll just be U; and because we just have a single U, this will just be a single column vector, and we transpose it to get an inner product that gives us a single number out for the score. Sorry, yeah, so the question was whether we have a second W matrix. In some sense, U is our second matrix, but because we only have one neuron in that final layer, we only need a single vector. Wonderful.
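Here's a tiny sketch of that feed-forward pass, with exactly those assumed dimensions (4-dimensional word vectors, a 5-word window, 8 hidden units):

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic non-linearity
    return 1.0 / (1.0 + np.exp(-z))

D, H = 20, 8                       # 5 * 4 = 20 dim window, 8 hidden units
W = 0.01 * np.random.randn(H, D)   # 8 rows, 20 columns
b = np.zeros(H)                    # one bias term per hidden neuron
U = 0.01 * np.random.randn(H)      # second-layer weights, a single vector

def score(x):
    z = W.dot(x) + b               # pre-activations z = Wx + b
    a = sigmoid(z)                 # hidden activations a = f(z)
    return U.dot(a)                # s = U^T a, one unnormalized score

x = np.random.randn(D)             # stand-in for a concatenated window
s = score(x)
```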
All right, so now let's define the max-margin loss. It's actually a super powerful loss function, often even more robust than the cross-entropy error with the softmax, and quite powerful and useful. So let's look at two example windows. Basically, we want to give a high score to windows where the center word is a location, and we want to give low scores to corrupt or incorrect windows where the center word is not a named entity location. So 'museums' is technically a location, but it's not a named entity location. And the idea of this max-margin training objective is to try to make the score of the true windows larger and the score of the corrupt windows lower, until they're good enough; and we define good enough as being different by a value of one. That one is a margin. You can also treat it as a hyperparameter, set it to m, and try different values, but in many cases one works fine. This function is continuous, so we'll be able to use SGD. So now, what's the intuition behind the softmax, sorry, the max-margin loss here? Say you have a very simple dataset with a couple of training samples from one class here, and the other class c over there. What a standard softmax may give you is a decision boundary that perfectly separates the two; it's a very simple training example, and most standard softmax classifiers will be able to perfectly separate these two classes. And again, this is just an illustration in two dimensions; these are much higher-dimensional problems, but a lot of the intuition carries through. So here we have our decision boundary from the softmax. Now, the problem is, that was your training dataset, but your test set might include some other points that are quite similar to the ones you saw at training, but a little different, and this kind of decision boundary is not very robust. In contrast, what the max-margin loss will attempt to do is increase the margin between the closest points of your training dataset: if you have a couple of points here and different points there, we'll try to maximize the distance between the closest points and essentially be more robust. So then, if at test time you have some things that are kinda similar, but not quite the same, you're more likely to classify them correctly as well. So it's a really great loss, or objective, function. Now, in our case here, we write s_c for one corrupt window. In practice, we're actually going to have a sum over multiple of these in many cases, and you can think of this as similar to the skip-gram model, where we randomly sample a couple of corrupt examples. So for this kind of training, you really only need a bunch of true examples of 'this is a location in this context', and then all the other windows, where you don't have that in your training data, are essentially part of your negative class. All right, any questions around the max-margin objective function? We're gonna take a lot of derivatives of it now. That's right: is the corrupt window just a negative class? Yes, that's exactly right. You can think of any other window that doesn't have a location as its center word as just the other class.
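For reference, written out in one line, the loss for one pair of a true window (score s) and a corrupt window (score s_c), with the margin set to 1, is:

```latex
J = \max\left(0,\; 1 - s + s_c\right)
```

If you treat the margin as a hyperparameter m instead, the 1 is replaced by m; whenever 1 - s + s_c is negative, the pair contributes zero loss.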
All right, now how do we optimize this? We're going to take very similar steps to what we did with the cross-entropy, but now we actually have this hidden layer, and we'll take our second-to-last step towards the full back-propagation algorithm, which we'll cover in the next lecture. So let's assume our cost J here is larger than 0. What does that mean? In the very beginning, you will again initialize all your parameters, either randomly, or maybe you'll initialize your word vectors to something reasonable. But they're not gonna be great yet at learning, in the context of the window, what is a location and what isn't. So in the beginning, all your scores are likely going to be low, cuz all our parameters U, W, and b have been initialized to small random numbers, and the model is unlikely to be great at distinguishing the window with a correct location at the center from one that is corrupt. So we will basically be in the regime where J is larger than 0. After a while of training, you're gonna get better and better. And then, intuitively, if your score for the good window is five and the one for the corrupt window is just two, then 1 - 5 + 2 = -2 is less than 0, so you basically have 0 loss on those examples. And that's another great property of this objective function: over time, you can start ignoring more and more of your training set, cuz it's good enough. The objective assigns 0 cost, as in 0 error, to those examples, so your objective function can focus only on the things that the model still has trouble distinguishing. So, let's assume that in the very beginning, J will be larger than 0 for most of our examples. And what we're gonna have to do now is take derivatives with respect to all the parameters of our model. And what are those? They're U, W, b, and our word vectors x. We always start from the top and then go down, because we'll start to reuse different elements; and just this simple combination of taking derivatives and reusing variables is what's going to lead us to back-propagation. So: the derivative of s with respect to U. Well, what was s? s was just U transpose times a, and we all know that the derivative of that is just a. So that was easy; first derivative, super straightforward. Now, it's important when we take the next derivative to also be aware of all our definitions, of how we defined the functions that we're taking derivatives of. So s was U transpose times a, a was f(z), and z was just Wx + b. It's very important to just keep track; that's like almost 80% of the work. Now, like I said, let's first take the partial derivative with respect to only one element of W, to gain intuition, and then we can put it back together into the more complex matrix notation. So we'll observe that W_ij only appears in the i-th activation of our hidden layer. For example, let's say we have a very simple setup with a three-dimensional x, two hidden units, and this one final score via U. Then we'll observe that if we take the derivative with respect to W_23, the second row and third column of W, that is actually only needed in a2; you can compute a1 without using W_23. So what does that mean? It means that if we take the derivative with respect to the weight W_ij, we really only need to look at the i-th element of the vector a, and hence we don't need to look at the whole inner product with U. So what's the next step? Well, as we're taking derivatives with respect to W, we need to again be aware of where W appears; all the other parameters are essentially constants. So U here is not something we're taking a derivative of, so we can just pull U_i out, as a single number, and move the derivative inside. And now we just need to very carefully define our a_i, the a subscript i, cuz that's where W_ij appears. Now, a_i was defined as f(z_i). So why don't we carefully write this out: this is the first application of the chain rule, with the derivative of a_i with respect to z_i, and then z_i with respect to W_ij. So this is a single application of the chain rule. At the end it can look kind of overwhelming, but each step is very clear, and each step is simple; we're really writing out all the gory details. So: application of the chain rule. Now we're going to expand a_i. Well, a_i is just f(z_i), and f is just an element-wise function applied to the single number z_i. So we can rewrite a_i with its definition f(z_i), and we keep the other part intact, all right? And the derivative of f we can for now just call f prime; it's also just a single number, so no harm done. Now we're still in this part here, where we basically want to take the derivative of z_i with respect to W_ij. Well, let's recall what z_i was: z_i was just the i-th row of W times x, plus the i-th element of b. So let's just replace z_i with its definition. Any questions so far? All right, good or not?
So we have our f prime, and we now have the derivative with respect to W_ij of just this inner product. And we can again very carefully write it out: the inner product is just this row times this column vector, which is just a sum. And when we take the derivative with respect to W_ij, all the other W's are constants and fall out; of all the x_k's in the sum, the only one that actually appears together with W_ij is x_j, and so this derivative is just x_j. All right, so now we have this whole expression, from just carefully applying the chain rule and the definitions of all our terms. And now, basically, what we want to do is simplify this a little bit, cuz we might want to reuse different parts. And we observe that this first term here happens to only use the subindex i, and no other subindex. So we'll just define U_i f'(z_i), for all the different i's, as delta_i. At first this is just notational simplicity; and x_j is our local input signal. And one thing that's very helpful for you to do is to also look at the derivative of the logistic function, which can be very conveniently computed in terms of the original value: f'(z) = f(z)(1 - f(z)). And remember, f(z_i) for each element is always just a single number, and we've already computed it during forward propagation. So ideally, we want to use hidden activation functions that are very fast to compute, and here we don't need to compute another exponent or anything; we're not gonna recompute f(z_i), cuz we already did that in the forward propagation step. All right, now we have the partial derivative with respect to one element of W. But of course, we want the whole gradient for the whole matrix. So now the question is: with the definitions of this delta_i for all the different i's, and x_j for all the different elements of the input, what would be a good way of combining these two vectors to get a single gradient for the whole matrix W? That's right: we can use delta times x transpose, namely the outer product, to get all the combinations of all elements i and all elements j. And this again might seem a little bit like magic, but if you just think of the definition of the outer product and write it out in terms of all the indices, you'll see it turns out to be exactly what we want, in one very nice, very simple equation. So we can think of this delta term as the responsibility, the error signal, that's arriving from our overall loss into this layer of W. And that will eventually lead us to flow graphs, and that will eventually lead to you not having to go through all this misery of taking all these derivatives, being able to abstract it away with software packages. But this is really the nuts and bolts of how it works, yeah? Yeah, the question is: this outer product gets all the elements i and j? That's right. So when we have delta times x transposed, let's now get the notation exactly right. We want the derivative with respect to W. W was, for example, a 2x3 matrix; we should be very careful with our notation here, 2x3. So the derivative of J with respect to our W has to, in the end, also be a 2x3 matrix.
And if we have delta times x transposed, that means we have the two-dimensional delta, which is exactly the dimensionality of the error signal coming in, one element per hidden unit, times x transpose, which is a 1x3 row vector after we transpose it. And so, what does that mean? Well, that's just standard matrix multiplication: a 2x1 column vector times a 1x3 row vector gives us a 2x3 matrix. You should write that out. So now, the last term that we haven't taken derivatives of, [INAUDIBLE], is our b_i, and it's going to be very similar; we'll go through it. We can pull U_i out, we take f prime, and we assume that's the same, so this is our delta_i again. We'll observe something very similar; these are very similar steps for b_i. But in the end, the remaining derivative of z_i with respect to b_i is just going to be one. And so the derivative with respect to our b_i element is just delta_i, and we can again use all the elements of delta to get the entire gradient for the update of b. Any questions? Excellent, so this is essentially almost back-propagation. So far, we've only taken derivatives and used the chain rule. And the first time I went through this, a lot of the magic of deep learning became a lot clearer: we've just taken derivatives; we have an objective function, and then we update all the parameters of these large functions based on our derivatives. Now, the main remaining trick is to reuse derivatives that we computed for the higher layers when computing derivatives for the lower layers. It's very much an efficiency trick; you could avoid it, but then everything would just be very, very inefficient. But this is the main insight behind why we renamed taking derivatives as back-propagation. So what are the last derivatives that we need to take for this model? Well, again, it's the ones in terms of our word vectors. So let's go through those. Basically, we'll have to take the derivative of the score with respect to every single element of our word vectors, where, again, we concatenated all of them into a single window. And now the problem is that each element of the word vectors actually appears in both of the hidden activations: both hidden units use all of the elements of the input. So we can't just look at a single activation; we'll really have to sum over both of the activation units, in the simple case here where we just have two hidden units and three-dimensional inputs (that keeps it a little simpler, and there's less notation). So we basically start with this: we have to take derivatives with respect to both of the activations. And now we just go through similar kinds of steps. We have s; we defined s as U transpose times our activations, which is just the sum of U_i times a_i, and a_i was just f(z_i), and so on. Now, what we'll observe as we go through all these similar steps again is that we actually see the same term reused from before: U_i times f'(z_i). This is exactly the same thing that we've seen here, and what that means is that we can reuse the same delta. And that's really one of the big insights; fairly trivial, but very exciting, cuz it makes everything a lot faster. But what's still different now is that we of course have to take the derivative of this inner product with respect to each x_j, where we basically drop the bias term, cuz that's just a constant when we're taking this derivative.
And so, for this one here again: for x_j, it's the jth column of this matrix W that's the relevant one for this inner product when we take the derivative. So now we have this sum, and now comes again the tricky bit of trying to simplify that sum into something simpler in terms of matrix products. And again, the reason we're getting toward backpropagation is that we're reusing these previous error signals and elements of the derivative. The first thing we observe as we do this sum is that it's actually also a simple inner product, where we now take the jth column of W. In this dot notation, when the dot comes after the first index we take the row; when the dot comes first, we take the column. So it's a column vector, but we transpose it, giving a simple inner product and a single number: just the derivative with respect to this one element of the word vectors in the word window.

Yes. Great question. So once we have the derivatives for all these different variables, in what sequence do we update them? There's really no sequence; we update them all in parallel. We just take one step in all the parameters that we've seen in this window. The subtlety is that in standard machine learning, in models like standard logistic regression, every example touches all your parameters W. Here it's a little more complex, because most words won't appear in a specific window, so you only update the word vectors you actually see in that window. If you updated all the other ones too, you'd have very, very large, quite sparse updates, and that's not very RAM-efficient. Great question.

So now we have this simple multiplication, and the sum is just an inner product. So far so simple. We have our d-dimensional vector, which as I mentioned is two-dimensional here, so the sum is over two elements. So far so good. Now we'd really like the full gradient with respect to all the x_j, for j equals 1 to 3 in this simple case, or 1 to 5d if we have a five-word window of d-dimensional vectors. So the question is, how do we combine this single element into a vector that gives us all the different gradients for all the x_j? Is anybody following along this closely? That's right: W transpose times delta. Well done. So our final gradient for the score s with respect to the entire window is just W transpose times delta. Super simple, very fast to implement, and you can easily think about how to vectorize this again by concatenating multiple deltas from multiple windows and so on. It can be implemented and derived very efficiently. All right, now the error signal delta that arrives at this hidden layer has, of course, the same dimensionality as the hidden layer, because we're updating all the hidden units. And from the previous slides we also know that when we update a window, we cut up that final gradient into the different chunks for each specific word in that window, and that's how we update our first large neural network. There's a minimal sketch of this below.
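Here is a sketch of that word-vector step, again continuing the same toy variables; the embedding matrix `L`, the word indices, the 1-dimensional "word vectors" (to match the toy 3-dimensional window), and the learning rate are all hypothetical, just to show the shape of the sparse update.

```python
# gradient of the score s with respect to the entire window x:
# each x_j feeds every hidden unit, so contributions sum via W^T
grad_x = W.T @ delta                      # shape (3,), same as x

# in the window model, x is a concatenation of word vectors; assume
# a 3-word window of 1-d vectors and a tiny hypothetical embedding
# matrix L with 10 words, purely for illustration
word_dim, lr = 1, 0.01
L = rng.standard_normal((10, word_dim))   # hypothetical embeddings
window_ids = [4, 7, 2]                    # hypothetical word indices

# cut the window gradient into one chunk per word and update only
# the words that actually occur in this window (sparse update)
for idx, g in zip(window_ids, grad_x.reshape(-1, word_dim)):
    L[idx] -= lr * g
```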
So let's put all of this together again. Our full objective function was this max, and I started out by assuming it's larger than zero, so you have this identity here. This is a simple indicator function: if the condition is true it's one, and if not it's zero, and then you can essentially ignore that pair of correct and corrupt windows, x and x_c respectively. So our final gradient, when we have these kinds of max-margin functions, is essentially implemented this way, and we can multiply all of this very efficiently. All right, so this is just that; you still have to take the derivative here, but this indicator function is really the main novelty that we haven't seen yet. All right. Yeah.

>> [INAUDIBLE] >> Yeah, it's a long question. The gist of the question is: how do we make sure we don't get stuck in local optima? And you've kind of answered it a little bit already, which is that, because of the stochasticity, you keep making updates anyway, so it's very hard to get stuck. In fact, the more stochastic you are, as in the fewer windows you look at each time you make an update, the less likely you are to get stuck. If you tried to go through all the windows and then make one gigantic update, that would actually be very inefficient and much more likely to get you stuck. The other observation, which is slowly coming through in some of the theory that we can't get into in this class, is that a lot of the local optima turn out to be pretty good, and in many cases not even that far from what you might think the global optimum would be. You'll also observe a lot of the time, and we'll go through this in some of the project advice, that with a powerful enough neural network model you can often perfectly fit your training dataset, and you'll eventually spend most of your time thinking about how to regularize your models better, often by adding even more stochasticity. We'll get through some of those. But yeah, good question. In the end, we just have all these updates, and it's all very simple.

All right, so let's summarize; there's one last end-to-end sketch below as a recap. This was a pretty epic lecture, well done for sticking through it. Congrats again, this was our super useful basic-components lecture, and this window model is really the first one that you might observe in practice and might actually want to implement in a real-life setting. So to recap: we've learned word vector training, we learned how to combine windows, we have the softmax and the cross-entropy error and went through some of the details there, we have the scores and the max-margin loss, and we have the neural network. It's really these two steps that you have to combine differently for problem set number one, and especially number two. So we have just one more math-heavy lecture, and after that we can have fun combining all these things together. Thanks.
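As a closing recap, here is one hedged end-to-end sketch tying the pieces together, continuing the toy variables above. The corrupt window `x_c`, the margin of 1, the learning rate, and the single SGD step are assumptions in the spirit of the lecture, not its exact code.

```python
# one SGD step on the max-margin objective J = max(0, 1 - s + s_c),
# where s scores the correct window and s_c an assumed corrupt one
x_c = rng.standard_normal(3)

def score(x_in):
    # forward pass through the single hidden layer: s = u^T f(Wx + b)
    return u @ f(W @ x_in + b)

s, s_c = score(x), score(x_c)

if 1.0 - s + s_c > 0:     # the indicator: only update inside the margin
    a, a_c = f(W @ x + b), f(W @ x_c + b)
    # signs come from J = 1 - s + s_c: the correct window's score is
    # pushed up, the corrupt window's score is pushed down
    delta   = -u * a * (1 - a)          # dJ/dz   (correct window)
    delta_c =  u * a_c * (1 - a_c)      # dJ/dz_c (corrupt window)

    lr = 0.01
    W -= lr * (np.outer(delta, x) + np.outer(delta_c, x_c))
    b -= lr * (delta + delta_c)
    u -= lr * (-a + a_c)                # dJ/du = -a + a_c
    # word-vector updates would use W.T @ delta and W.T @ delta_c,
    # chunked per word as in the earlier sketch
```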
Info
Channel: Stanford University School of Engineering
Views: 103,215
Rating: 4.7867804 out of 5
Keywords: Neural networks, Forward computation, Backward propagation, Neuron Units, Max-margin Loss, Gradient checks, Xavier parameter initialization, Learning rates, Adagrad
Id: uc2_iwVqrRI
Length: 76min 43sec (4603 seconds)
Published: Mon Apr 03 2017