[MUSIC] Stanford University. It's getting real today. So, let's talk about a little
bit of the overview today. So, we'll really get you into
the background for classification. And then, we'll do some interesting things
with updating these word vectors that we so far have learned in
an unsupervised way. We'll update them with some real
supervision signals such as sentiment and other things. Then, we'll look at the first real
model that is actually useful and you might wanna use in practice. Well, other than, of course, the word
vectors, but one sort of downstream task which is window classification and we'll
really also clear up some of the confusion around the cross entropy error and
how it connects with the softmax. And then, we'll introduce the famous
neural network, our most basic LEGO block that we may start to call deep to get
to the actual title of this class. Deep learning in NLP.
And then, we'll actually introduce another loss
function, the max margin loss and take our first steps into
the direction of backprop. So, this lecture will be,
I think very helpful for problem set one. We'll go into a lot of the math
that you'll need probably for number two in the problem set. So, I hope it'll be very useful and
I'm excited for you cuz at the end of this lecture, you'll feel hopefully a lot
better about the magic of deep learning. All right, are there any
organizational questions around problem sets or
programming sessions with the TAs? No, we're all good? Awesome, thanks to the TAs for
clearing up everything. Cool, so let's be very careful about
our notation today because that is one of the main things that
a lot of people trip up over as we go through very complex
chain-rules and so on. So, let's start at the beginning and
say, all right, we have usually a training dataset
of some input X and some output Y. X could be in the simplest case, words
in isolation, just a single word vector. It's not something you would
usually do in practice. But it'll be easy for
us to learn that way. So we'll start with that but then,
we'll move to context windows today. And then eventually, we'll use the same
basic building blocks that we introduce today for sentences and documents and
then complex interactions for everything. Now, the output in the simplest
case it's just a single label. It's just a positive or
a negative kind of sentence. It could be the named entities of
certain words in their context. It can also be other words, so
in machine translation, for instance, you might wanna output eventually
a sequence of other words as our yi and we'll get to that in a couple weeks. And, yeah, basically they have multiword
sequences as potential outputs. All right, so what is the intuition for
classification? In the standard machine learning case,
so not yet the deep learning world, we usually just, for
something as simple as logistic regression, basically want to define and learn
a simple decision boundary where we say everything to the left of this or
in one direction is in one class and the other one,
all the other things in the other class. And so, in general machine learning,
we assume our inputs, the Xs are kinda fixed,
they're just set and we'll only train the W parameter,
which is our softmax weights. So, we'll compute the probability of Y,
given the input X with this kind of input. And so, one notational comment here is for the whole dataset,
we often subscript with i but then, when I drop the i we're just looking
at a single example of x and y. Eventually, we're going to overload
the subscript a little bit and look at the indices of certain vectors, so
if you get confused, just raise your hand and ask. I'll try to make it clear
which one is which. Now, let's dive into the softmax. We mentioned it before but we wanna really
carefully define and recall the notation here cuz we'll go and take derivatives
with respect to all of these parameters. So, we can tease apart two steps here for
computing this probability of y given x. The first thing is, we'll take the y'th
row of W and multiply that row with x. And so again, this notation here,
W sub y, means
we're taking the y'th row of this matrix and then multiplying it here with x. Now we do that multiple times, for
all c from one to C, our number of classes. So let's say this is the 1st, 2nd, 3rd, and
4th row, and we multiply each of these with x. Then we get four numbers here, and these are the unnormalized scores. And then, we'll basically
pipe this vector through the softmax to compute a probability
distribution that sums to one. All right, that's our step one. Any questions around that? Cuz it's just gonna keep
on going from here. All right, great. And, I get that sometimes
in general from previous sort of surveys, it seems to be that
15% of the class are usually bored when we go through all of these,
like all of these derivatives. 15% are super overwhelmed and then the
majority of people are like, okay, it's a good speed, I'm learning something, I'm
getting it, and you're making progress. So, sorry for the 30% for
whom this is too slow or too fast. If you're super familiar with taking
complex derivatives, you can probably just skim through the lecture slides,
or speed it up if you're watching online. And if it's a little overwhelming,
then definitely come to all the office hours. We have an awesome set of
TAs who will help you.
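
The two softmax steps described above (one unnormalized score per row of W, then normalization into a distribution that sums to one) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the name `softmax_probs` and the toy numbers are made up here, and subtracting the largest score before exponentiating is a common numerical-stability habit rather than something the lecture describes.

```python
import math

def softmax_probs(W, x):
    """Return p(y | x) for every class, given weights W (C rows, d columns)."""
    # Step 1: unnormalized scores, one per class: f_c = W_c . x
    scores = [sum(w_cj * x_j for w_cj, x_j in zip(row, x)) for row in W]
    # Step 2: normalize. Subtracting the max score leaves the result
    # unchanged but avoids overflow in exp() (an addition, not from the lecture).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy numbers: 4 classes, 3-dimensional input.
W = [[0.2, -0.1, 0.0],
     [0.5, 0.3, -0.2],
     [-0.4, 0.1, 0.6],
     [0.0, 0.0, 0.1]]
x = [1.0, 2.0, -1.0]
probs = softmax_probs(W, x)
```

Taking the argmax of `probs` gives the predicted class; the entries themselves are the probabilities the objective function works with.
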
Let's look at a single example of an x and y that we wanna predict. In general, we want our model to
essentially maximize the probability of the correct class. We want it to output the right class at the
end by taking the argmax of that output. And maximizing probability is the same
as maximizing log probability, it's the same as minimizing the negative
of that log probability and that is often our objective function. So, why do we call this
the cross-entropy error? Well, we can define the cross-entropy
in the abstract in general as follows. So let's assume we have
the ground truth or gold or target probability distribution,
we use those three terms interchangeably. Basically, what the ideal target
in our training dataset, the y and we'll assume that, that is one at
the right class and zero everywhere else. So if we have for instance, five
classes here and it's the center class. It's the third class, so this entry would be one
and all the other numbers would be zero. So, if we define this as p
and our computed probability, the one that our softmax outputs, as q, then we
define the cross-entropy as basically this sum
over all the classes. And in our case, p here is just
a one-hot vector that's really only 1 in one location and 0 everywhere else. So, all these other terms
are basically gone. And we end up with just the negative log of q at the correct class, and that's exactly the negative log of what
our softmax outputs, all right? And then, there are some nice connections
to Kullback-Leibler divergence and so on. I used to talk about it but
we don't have that much time today. If you're
familiar with this from stats, you can also see this as trying to minimize the
Kullback-Leibler divergence between these two distributions. But really, this is all you need to
know for the purpose of this class. So this is for
one element of your training data set.
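
Written out, the argument above for a single training example looks roughly like this. The symbols follow the spoken description (p is the one-hot target, q the softmax output, f the vector of unnormalized scores, C the number of classes); the exact typesetting is an assumption, not a copy of the slide.

```latex
H(p, q) = -\sum_{c=1}^{C} p(c) \log q(c)
\quad\xrightarrow{\;p \text{ one-hot at } y\;}\quad
-\log q(y) = -\log \frac{\exp(f_y)}{\sum_{c=1}^{C} \exp(f_c)}
```

Every term with p(c) = 0 drops out, which is why only the negative log-probability of the correct class survives.
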
You have lots of training examples. So we have our overall objective
function, which we often denote with J, over all our parameters theta. We basically sum these negative log
probabilities of the correct classes, which we index here with the subscript yi. And basically we want to
minimize this whole sum. So that's our cross-entropy error
that we're trying to minimize, and we'll take lots of derivatives of it
over the next couple of hours. All right, any questions so far? So this is the general ML case where
we assume our inputs here are fixed. Yes, it's a single number. So we are not multiplying a vector here,
so p(c) is the probability for that class, so that's one single number. Great question. So the cross entropy, a single number,
our main objective that we're trying to minimize, or
our error that we're trying to minimize. Now, whenever you write
this F subscript Y here, we don't want to forget that F is really
also a function of X, our inputs, right? It's sort of an intermediate step and
it's very important for us to play around with this notation. So we can also rewrite this as W y,
that row, times x, and
we can write out that whole sum. And that can often be helpful as you are
trying to take derivatives of one element at a time to eventually see the bigger
picture of the whole matrix notation. All right, so often we'll write f here
in terms of this matrix notation. So this is our f, this is our W,
and this is our x. So just standard matrix
multiplication with a vector. All right, now most of the time we'll
just talk about this first part of the objective function, but
that's a bit of a simplification, because in all your real applications you will
also have this regularization term here as part of your overall
objective function.
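
As a rough sketch of what "data term plus regularization term" can look like in code: the function below averages the negative log-probabilities of the correct classes and adds a penalty on the squared weights. The averaging over examples, the strength `lam`, and the name `regularized_objective` are assumptions made for this illustration; the transcript does not spell out the exact form.

```python
import math

def regularized_objective(W, data, lam):
    """Mean negative log-probability of the correct class plus lam * sum of squared weights."""
    total = 0.0
    for x, y in data:
        scores = [sum(w * v for w, v in zip(row, x)) for row in W]
        top = max(scores)
        # log of the softmax normalizer, computed stably
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[y]      # equals -log softmax(scores)[y]
    data_term = total / len(data)
    penalty = lam * sum(w * w for row in W for w in row)
    return data_term + penalty

# Hypothetical toy setup: 2 classes, 2-dimensional inputs, 2 labeled examples.
W = [[0.5, -0.5], [-0.5, 0.5]]
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
loss_plain = regularized_objective(W, data, lam=0.0)
loss_reg = regularized_objective(W, data, lam=0.1)
```

With `lam` above zero, making the weights larger always costs something, which is the pressure toward small weights that the lecture describes.
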
This theta here is, for instance, the W matrix of our
standard logistic regression. With this
part of the objective function, we try to encourage the model to keep
all the weights as small as possible and as close as possible to zero. If you want, you can think of this as
a Bayesian: you have a prior, a Gaussian-distributed prior, that says
ideally all these are small numbers. Often times if you don't have
this regularization term your numbers will blow up and
it will start to overfit more and more. And in fact, this kind of plot is
something that you will very often see in your projects and
even in the problem sets. And when I took my very first statistical
learning class, the professor said, this is the number one plot to remember. So, I don't know if it's that important,
but it is very, very important for all our applications. And it's basically a pretty abstract plot. You can think of the x-axis as
a variety of different things. For instance, how powerful your model is. How many deep layers you'll have or
how many parameters you'll have. Or how many dimensions
each word vector has. Or how long you trained a model for. You'll see the same kind of pattern
with a lot of different x-axes, and then the y-axis here is
essentially your error. Or your objective function that you're
trying to optimize and minimize. And what you often observe is,
the more powerful your model gets, the better you are on
lowering your training error, the better you can fit these x-i,
y-i pairs. But at some point you'll actually start
to over-fit, and then your test error, or your validation or
development set error, will go up again. We'll go into a little bit more details
on how to avoid all of that throughout this course and
in the project advice and so on. But this is a pretty fundamental thing and
just keep in mind that for a lot of the implementations, and your projects you
will want this regularization parameter. But really it's the same one for
almost all the objective functions so we're going to chop it and mostly
focus on actually fitting our dataset. All right,
any questions around regularization? So basically, you can think of
this in terms of if you really care about one specific number,
then you can adjust all your parameters such that it will exactly
go to those different points. And if you force it to not do that,
it will kind of be a little smoother. And be less likely to fit
exactly those points and hence often generalize slightly better. And we'll go through a couple of examples
of what this will look like soon. All right, now as I mentioned
in general machine learning, we'll only optimize the W here,
the parameters of our Softmax classifier. And hence our updates and
gradients will be pretty small. In many cases we only have, you
know, a handful of classes, and maybe our word vectors are 100-dimensional, so if
we have three classes and 100-dimensional word vectors that we're trying to classify,
we'd only have 300 parameters. Now, in deep learning,
we have these amazing word vectors. And we actually will want to
learn not just the Softmax but also the word vectors. We can back propagate into them and
we'll talk about how to do that today. Hint, it's going to be taking derivatives. But the problem is when we update
word vectors, conceptually as you are thinking through this, you
have to realize this is very, very large. All of a sudden you have a very
large set of parameters, right? Let's say your word vectors
are 300-dimensional and you have, you know, 10,000 words in your vocabulary. That's 3 million word-vector parameters, so all of a sudden you have an immensely
large set of parameters, and on this kind of plot you're going
to be very likely to overfit. And so before we dive into all this
optimization, I want you to get a little bit of an intuition of what
it means to update word vectors. So let's go through a very simple example where we might want to
classify single words. Again, it's not something
we'll do very often, but let's say you want to classify single
words as positive or negative. And let's say in our training data set we
have the word TV and telly and say you know this is movie reviews and if you
say this movie is better suited for TV. It's not a very positive thing to say
about a movie that's just coming out into movie theaters. And so we would assume that
in the beginning telly, TV, and television are actually all
close by in the vector space. We learned them with word2vec or
GloVe, training these word vectors on a very, very large corpus, and
the model learned that all three of these words often appear in similar contexts, so
they are close by in the vector space. And now we're going to train, but
our smaller sentiment dataset only includes, in its training set of X-i,
Y-i pairs, TV and telly and not television. So now what happens as we
train these word vectors? Well, they will start to move around. We'll project sentiment into them and
so you now might see telly and TV (it's a British dataset)
move somewhere else in the vector space. But television actually stays
where it was in the beginning. And now when we want to test it, we would actually now misclassify this
word because it's never been moved. And so what does that mean? The take home message here will be that if you have only a very
small training dataset, one that will let you, especially with these
deep models, overfit very quickly, you do not want to train
your word vectors. You want to keep them fixed.
You pre-trained them with nice GloVe or word2vec models on a very large corpus, or you just downloaded them from the GloVe website, and you want to keep them fixed,
cuz otherwise you will
not generalize as well. However, if you have a very large dataset
it may be better to train them in a way we're going to describe in
the next couple of slides. So, an example for
where you do that is, for instance, machine translation where you might have
many hundreds of Megabytes or Gigabytes of training data and you don't really need to
do much with the word vectors other than initialize them randomly, and then train
them as part of your overall objective. All right, any questions around generalization
capabilities of word vectors? All right, it might still be
magical how we're training this, so that's what we're gonna describe now. So, we rarely ever really
classify single words. Really what we wanna do is
classify words in their context. And there are a lot of fun and
interesting. Issues that arise in context really
that's where language begins and grammar and
the connection to meaning and so on. So here, a couple of fun examples of
where context is really necessary. So for instance, we have some words
that actually auto-antonyms, so they mean their own opposite. So for instance to sanction can
mean to permit or to punish. And it really depends on the context for
you to understand which one is meant, or to seed can mean to place seeds or
to remove seeds. So without the context, we wouldn't really
understand the meaning of these words. And in one of the examples that you'll see
a lot, which is named entity recognition, let's say we wanna find locations or
people names, we wanna identify is this the location or
not. You may also have things like Paris, which
could be Paris in France or Paris Hilton. And you might have Paris
staying in Paris and you still wanna understand
which one is which. Or if you wanna use deep learning for
financial trading and you see Hathaway, you wanna make sure that if it's just a
positive movie review about Anne Hathaway, you're not all of a sudden buying
stocks from Berkshire Hathaway, right? And so,
there are a lot of issues that are fun and interesting and
complex that arise in context. And so, let's now carefully walk
through this first useful model, which is Window classification. So, we'll use as our first motivating
example here 4-class named entity recognition, where we basically
wanna identify a person or location or organization or none of the above for
every single word in a large corpus. And there are lots of different
possibilities that exist. But we'll basically look
at the following model. Which is actually quite
a reasonable model, and one that goes back to 2008. It began with a great paper by Collobert and
Weston, which did the first kind of useful, state-of-the-art
text classification and word classification in context. So, what we wanna do is basically train a
softmax classifier by assigning a label to the center word and then concatenating all
the words in a window around that word. So, let's take for example this
subphrase here from a longer sentence. We basically wanna classify
the center word here which is Paris, in the context of this window. And we'll define the window length as 2. 2 being 2 words to the left and 2 words to the right of the current center
word that we're trying to classify. All right, so what we will do
is we'll define our new x for this whole window as the concatenation
of these five word vectors. And just in general throughout all of this lecture all my
vectors are going to be column vectors. Sadly in number two of the problem set,
they're row vectors. Sorry for that. Eventually, all these programming
frameworks are actually row-major, and so it's faster in the low-level
optimization to use row vectors. For a lot of the math, though, I find
it simpler to think of them as column vectors. We're very clear in the problem set, but
don't get tripped up on that. So basically, we'll define this here as
one 5d-dimensional column vector. So, we have d-dimensional word vectors,
we have five of them and we stack them up in one column, all right. Now, the simplest window classifier that
we could think of is to just put the softmax on top of this
concatenation of five word vectors. So our x here, our input, is just the x of the entire
window, this concatenation. And we have the softmax on top of that. And so, this is the same
notation that we used before. We're introducing here y hat,
with sadly the subscript y for the correct current class. It's tough, I went through [LAUGH] several
iterations, it's tough to have, like, perfect notation that works
through the entire lecture always. But you'll see why soon. So, our overall objective here is,
again, this whole sum over all these probabilities that we have,
or negative log of those. So now, the question is, how do we
update these word vectors x here? One x is a window, and
x is now deep inside the softmax. All right, well, the short answer
is we'll take a lot of derivatives. But the long answer is, you're gonna have
to do that a lot in problem set one and maybe in the midterm. So, let's be a little more helpful, and
actually go through some of the steps and give you some hints. So some of this, you'll actually
have to do in your problem set, so I'm not gonna go through all the details. But I'll give you a couple of hints
along the way and then you can know if you're hitting those and then you'll
see if you're on the right track. So, step one, always very
carefully define your variables, their dimensionality and everything. So, y hat we'll define as the softmax
probability vector, that is, the normalized scores or
the probabilities for all the different classes that we have. So, in our case we have four. Then we have the target distribution. Again, that will be a one hot
vector where it's all zeroes except at the ground truth index of the class y,
where it's one. And we'll define our f
here as f of x again, which is this matrix multiplication. Which is going to be a C dimensional
vector where capital C is the number of classes that we have, all right. So, that was step one. Carefully define all of your variables and
keep track of their dimensionality. It's very easy when you implement this and
you multiply two things, and they have wrong dimensionality, and
you can't actually legally multiply them, you know you have a bug. And you can do this also
in a lot of your equations. You'd be surprised. In the midterm, you're nervous. But maybe at the end you have some time. And you could totally grade it
by yourself in the first pass, by just making sure that all your
dimensionality of your matrix and vector multiplications are correct. All right, the second tip is the chain
rule, we went over this before, but I heard there's a little bit of
confusion still in the office hours. So, let's define this carefully for
a simple example and then we'll go and give you a couple more hints also for
more complex example. So again, if you have something
very simple, such as a function y, which you can define here as f of u, and
u can be defined as g of x, so that the whole function, y of x,
can be described as f of g of x, then you would basically multiply dy/du
times du/dx. And so, very concretely here,
this is sort of high school level, but
we'll define it properly in order to show the chain rule.
You can basically define u as g(x),
which is just the inside of the parentheses here, so x cubed + 7. You can then have y as a function f(u), where we use 5 times u,
just replacing the inside definition here. So it's very simple,
just replacing things. And now, we can take the derivative of y
with respect to u, and we can take the derivative of u
with respect to x. And then we just multiply these two terms,
and we plug in u again. So in that sense, we all know,
in theory, the chain rule. But, now we're gonna have the softmax, and we're gonna have lots of matrices and
so on. So, we have to be very,
very careful about our notation. And we also have to be
careful about understanding, which parameters appear inside
what other higher level elements. So, f for instance is a function of x. So, if you're trying to take
a derivative with respect to x of this overall softmax, you're gonna have
to sum over all of the different classes inside which x appears. And you'll see here
this first application of the chain rule, not just for f_y (again, the subscript
just picks out the y-th element of the vector f, which is a function of x), but for every f_c,
each then multiplied by its own derivative with respect to x. So, when you write this out,
another tip that can be helpful for this softmax part of the derivative
is to actually think of two cases. One where c = y, the correct class, and one where it's basically all
the other incorrect classes. And as you write this out,
you will observe and come up with something like this. So, don't just write that down as your answer;
in your problem set you have to show the steps on how to get there. But basically, at some point you
observe this kinda pattern when you now try to look at all the derivatives
with respect to all the elements of f.
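
For reference, the pattern being pointed at is the standard two-case result for the softmax with cross-entropy error, stated here without the intermediate steps (which are the problem-set part). Here y hat is the softmax output and y the correct class; the layout is an assumption, not a copy of the slide.

```latex
\frac{\partial}{\partial f_c}\bigl(-\log \operatorname{softmax}(f)_y\bigr)
=
\begin{cases}
\hat{y}_c - 1 & \text{if } c = y \\
\hat{y}_c     & \text{if } c \neq y
\end{cases}
```
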
When you have this, you realize, okay, at the correct class we're
actually subtracting one here, and at all the incorrect classes,
you will not do anything. Now, the problem is when
you implement this, it kind of looks like
a bunch of if statements. If y equals the correct class for my training set, then, subtract 1,
that's not gonna be very efficient. Also, you're gonna go insane if you try
to actually write down equations for more complex neural network
architectures ever. And so, instead, what we wanna do is
always try to vectorize a lot of our notation, as well as our implementation. And so, what this means here,
in this case, is you can actually observe that,
well, this 1 appears exactly where t, our one-hot target distribution,
also happens to be 1. And so, what you're gonna wanna do
is basically describe this as y hat minus t, so
it's the same thing as this. And don't worry if you don't
understand how we got there, cuz that's part of your problem set. You have to, at some point, see this equation while you're
taking those derivatives. And now, the very first baby step towards
back-propagation is actually to define this term, in terms of a simpler single
variable and we'll call this delta. We'll become good friends
with deltas because they are sort of our error signals. Now, the last couple of tips. Tip number six. When you start with this chain rule, you
might want to sometimes use explicit sums first and
look at all the partial derivatives. And if you do that a couple of times,
at some point you see a pattern, and then you try to think of how to
extrapolate from those patterns of single partial derivatives
into vector and matrix notation.
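
A small illustration of going from the per-element pattern to vector notation, using the error signal from earlier: the first function spells out the two cases with an if statement, the second writes the same thing as the elementwise difference y hat minus t. The function names and the toy probabilities are hypothetical, and plain lists stand in for what would be a single array subtraction in a numerical library.

```python
def delta_with_ifs(y_hat, y):
    """Per-class error signal, written with an explicit if (one case per class)."""
    delta = []
    for c, prob in enumerate(y_hat):
        if c == y:
            delta.append(prob - 1.0)   # correct class: subtract one
        else:
            delta.append(prob)         # incorrect class: leave as is
    return delta

def delta_vectorized(y_hat, t):
    """Same error signal as an elementwise difference: delta = y_hat - t."""
    return [p - t_c for p, t_c in zip(y_hat, t)]

y_hat = [0.1, 0.2, 0.6, 0.1]   # hypothetical softmax output for 4 classes
y = 2                          # index of the correct class
t = [1.0 if c == y else 0.0 for c in range(len(y_hat))]   # one-hot target

d1 = delta_with_ifs(y_hat, y)
d2 = delta_vectorized(y_hat, t)
```

Both versions agree entry by entry; the second is the one that generalizes cleanly once everything is a matrix.
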
You'll see something like this at some point in your derivation. So, the overall derivative with respect to
x of our overall objective function for one element from our
training set x and y is this sum. And it turns out when you
think about this for a while, you take this row vector here but
then you transpose it, and it becomes an inner product. Well, if you
do that multiple times for all the C classes and you wanna get a whole vector
out in the end, it turns out you can actually just rewrite the sum as W
transpose times delta. So, this is the one error signal here
that we got from our softmax, and we multiply the transpose of
our softmax weights with this.
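
In symbols, the rewrite just described is the following. The dimensions assume the five-word window of d-dimensional vectors used in this lecture, with W_c denoting the c-th row of W; the notation is a reconstruction from the spoken description rather than the slide itself.

```latex
\nabla_x J \;=\; \sum_{c=1}^{C} \delta_c \, W_{c\cdot}^{\top} \;=\; W^{\top}\delta,
\qquad W \in \mathbb{R}^{C \times 5d},\quad \delta \in \mathbb{R}^{C},\quad \nabla_x J \in \mathbb{R}^{5d}
```
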
If some of these are not clear and you're confused,
write them out as full sums, and then you'll see that it's really
just a rewrite of this in vector notation. All right, now what is the dimensionality
of the window vector gradient? So in the end, we have this derivative
of the overall cost here for one element of our training
set with respect to x. But x is a window. All right, so
say we have a window of five words, and each word is d-dimensional. Now, what should be the dimensionality
of this derivative, of this gradient? That's right,
it's five times d. And that's another really good check, and
one of the reasons we make you implement this from scratch: if you have any kinda
parameter, and you have a gradient for that parameter, and they're not the same
dimensionality, you'll also know you screwed up and there's some mistake or
bug in either your code or your math. So, it's a very simple debugging skill and a way to check your own equations. So, the final derivative with respect
to this window is now this 5d-dimensional vector, because we had five d-dimensional
vectors that we concatenated. Now, of course the tricky bit is, you actually wanna update your word
vectors and not the whole window, right? The window is just this
intermediate step. So really, what you wanna do is update and take derivatives with respect to each
of the elements of your word vectors.
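
One way to picture getting from the window gradient to per-word gradients is to compute W transpose times delta for the whole window and then cut the result into consecutive d-sized pieces, one per word position. The sketch below is a minimal illustration under stated assumptions: the five-word window, the tiny d, the weights, and the function names are all hypothetical.

```python
def window_gradient(W, delta):
    """Gradient w.r.t. the concatenated window x: (W^T delta), length 5 * d."""
    n_inputs = len(W[0])
    return [sum(W[c][j] * delta[c] for c in range(len(W))) for j in range(n_inputs)]

def split_per_word(grad, d):
    """Cut the window gradient into consecutive d-sized chunks, one per word position."""
    return [grad[i:i + d] for i in range(0, len(grad), d)]

d = 2                                   # toy word-vector size (real ones are far larger)
window = ["museums", "in", "Paris", "are", "amazing"]   # hypothetical 5-word window
C = 4                                   # number of classes
# Hypothetical softmax weights, C rows by 5*d columns.
W = [[0.01 * (c + 1) * (j + 1) for j in range(5 * d)] for c in range(C)]
delta = [0.1, 0.2, -0.4, 0.1]           # error signal y_hat - t for one window

grad_x = window_gradient(W, delta)
per_word = dict(zip(window, split_per_word(grad_x, d)))
```

The length check (5 times d for the window, d for each word) is exactly the dimensionality sanity check recommended above. A dictionary keyed by word only works here because no word repeats in the window; with repeats, the pieces would need to be summed per word.
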
That can be done very simply by just splitting the gradient that you've got
for the whole window, because that's basically just the concatenation of the
gradients of all the different word vectors. And those you can use to update your word
vectors, as you train the whole system. All right, any questions? Is there a mathematical what? Is there a mathematical notation for
the word vector t, other than it's just variable t? Or that seems like a fine notation. You can see this as a probability
distribution, that is very peaked. >> Yeah.
>> That's all, there's nothing else to it. Just a single vector with all zeroes,
except in one location. >> So I'll just write that down? >> You can write that up, yeah. You can always just write out and
it's also something very important. You always wanna define everything, so
that you make sure that the TAs know that you're thinking about the right thing,
as you're writing out your derivatives, you write out the dimensionality,
you define them properly, you can use dot, dot,
dot if it's a larger dimensional vector. You can just define t as your
target distribution [INAUDIBLE] >> The question is, do we still have two vectors for
each word? Great question, no. Essentially, when we did GloVe and
word2vec, we had these two sets of vectors, the u's and v's. For all subsequent lectures from now on,
we'll just assume we have the sum of u and v, and that's our single vector x
for each word. So, the question is does this gradient
appear in lots of other windows, and the answer is yes. If you have the word "in," that vector
here and the gradients will appear in all the windows that have
the word "in" inside of them. And same with museums and so on. And so as you do stochastic gradient
descent you look at one window at a time, you update it, then you go to the next
window, you update it and so on. Great questions. All right. Now, let's look at how we update
these concatenated word vectors. So basically, as we're training this,
if we train it for instance with sentiment we'll push all
the positive words in one direction and the other words in another direction. If we train it for
named entity recognition, eventually our model can learn that seeing
something like "in" as the word just before the center word would be indicative of
that center word to be a location. So now what's missing for
training this full window model? Well mainly the gradient of J with
respect to the softmax weights W.
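
For orientation, the standard result for this gradient, with the same error signal delta as before, is shown below. It is stated without derivation and is not taken from the lecture, which leaves the steps as an exercise; x here is the concatenated 5d-dimensional window.

```latex
\frac{\partial J}{\partial W_{ij}} = \delta_i \, x_j
\qquad\Longrightarrow\qquad
\nabla_W J = \delta \, x^{\top} \in \mathbb{R}^{C \times 5d},
\qquad \delta = \hat{y} - t
```

The gradient has the same shape as W itself, which is the dimensionality check from earlier applied to the weights.
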
We basically take similar steps. We'll write down all the partial
derivatives with respect to Wij first and so on. And then we have our full gradient for
this entire model. And again, the word vector gradient will be very sparse, and you're gonna wanna have some clever ways
of implementing these word vector updates, so you don't send a bunch of zeros
around at every single window, cuz each window will
only have a few words. So in fact, it's so important for
your code in the problem set to think carefully through your
matrix implementations that it's worth spending two or
three slides on this. So there are essentially two very
expensive operations in the softmax. The matrix multiplication and
the exponent. Actually later in the lecture, we'll
find a way to deal with the exponent. But the matrix multiplication can also
be implemented much more efficiently. So you might be tempted in the beginning
to think: this is the probability for this class and
this is the probability for that class, so I'll implement a for
loop over all my different classes, and then I'll take derivatives or
do matrix multiplications one row at a time. And that is going to be very,
very inefficient. So let's go through some very simple
Python code here to show you what I mean. Essentially,
instead of always looping over these word vectors, concatenating
everything into one large matrix and then multiplying is
always going to be more efficient. So let's assume we have 500
windows that we want to classify, and let's assume each window
has a dimensionality of 300. These are reasonable numbers, and let's assume we have five
classes in our softmax. And so at some point during
the computation, we now have two options. So W here are weights for the softmax. It's gonna be C many rows and
d many columns. Now, for the word vectors here that
you concatenated for each window, we can either have a list of
a bunch of separate word vectors, or we can have one large matrix
that's going to be d times n: d many rows and n many columns, one per window. So we have 500 windows, so
we have 500 columns here in this one matrix. And now essentially, we can multiply
W here with each vector separately, or we can do it all in one matrix
multiplication. And you literally have
a 12x speed difference. And sadly with these larger models,
one iteration or something might eventually take a day for
more complex models and large data sets. So the difference is literally between
12 days and 1 day of you iterating and
making your deadlines and everything. So it's super important,
and now sometimes people are tripped up by what it
means to multiply and do this here. Essentially, it's the same
thing that we've done here for one softmax, but
what we did is we actually concatenated a lot of different input vectors x, and so we'll get a lot of different
unnormalized scores out at the end. And then we can tease them apart again for
each window. So you have here a C times d dimensional
matrix, where d, using the same notation, is the dimensionality of each window, times a d times
n matrix, to get a C times n matrix. So, once you normalize each column, these are all
the probabilities here for your N many training samples. Any questions around that? So it's super important, all your code
will be way too slow if you don't do this. And so
this is very much an implementation trick. And so in most of the equations, we're not gonna actually go there cuz
that makes everything more complicated. And the equations look at only
a single example at a time, but in the end you're gonna wanna
vectorize all your code. Yeah, matrices are your friend,
use them as much as you can. Also in many cases, especially for
this problem set where you really understand the nuts and bolts of how
to train and optimize your models, you will come across a lot
of different choices. It's like,
I could implement it this way or that way. And you can go to your TA and ask,
should I implement it this way or that way? But you can also just use %timeit,
the IPython magic command, and just make a very informed decision and
gain intuition yourself. You basically just wanna
speed test a lot of different options that you have in
your code a lot of the time. All right, so
this was just a pure softmax, and now the softmax alone
is not very powerful, because it really only gives you
linear decision boundaries in your original space. If you have very, very little
training data that could be okay, and you kind of use a not so powerful model
almost as an abstract regularizer. But with more data,
it's actually quite limiting. So if we have here a bunch of words and
we don't wanna update our word vectors, softmax would only give us this linear
decision boundary which is kind of lame. And it would be way better if we could correctly classify these
points here as well. And so basically, this is one of the many
motivations for using neural networks. Cuz neural networks will give us much
more complex decision boundaries and allow us to fit much more complex
functions to our training data. And you could be snarky and actually rename neural networks,
which sounds really cool, as just general function approximators. It just wouldn't have quite the same ring to
it, but it's essentially what they are. So let's define how we get from
simple logistic regression to a neural network and beyond,
to deep neural nets. So let's demystify the whole
thing by starting with defining, again, some of the terminology. And we can have more fun with the math,
and then one and a half lectures from now, we can just basically use
all of these Lego blocks. So bear with me,
this is going to be tough. And try to concentrate and
ask questions if you have any, cuz we'll keep building now a pretty
awesome large model that's really useful. So we'll have inputs, we'll have
a bias unit, we'll have an activation function and an output for each single
neuron in our larger neural network. So let's define a single neuron first. Basically, you can see it as
a binary logistic regression unit. We're going to have inside, again, a set of weights that we
take in an inner product with our input. So we have the input x
here to this neuron. And in the end,
we're going to add a bias term. So we have an always-on feature, and that kind of defines how likely
this neuron is to fire. And by firing, I mean have a very
high probability, close to one, of being on. And f here is always, from now on,
going to be this element-wise function, in our case here the sigmoid, that just
takes whatever this sum gives us, the inner product plus the bias term, and basically
just squashes it to be between 0 and 1. All right, so this is the definition
of the single neuron. Now if we feed a vector of inputs through
all these different little logistic regression functions and
neurons, we get this output. And now the main difference between
just directly predicting a softmax, as in standard machine learning, and deep learning is that we'll actually not
force these to directly give the output; they will themselves be inputs to yet
another neuron. And it's a loss function on top of that
neuron, such as cross entropy, that will now govern what these
intermediate hidden neurons in the hidden layer
will actually try to achieve. And the model can decide itself
what it should represent, how it should transform this input
inside these hidden units here in order to give us a lower
error at the final output. And it's really just this
concatenation of these hidden neurons, these little binary
logistic regression units that will allow us to build very
deep neural network architectures. Now again, for sanity's sake, we're
going to have to use matrix notation cuz all of this can be very simply described
in terms of matrix multiplication. So a1 here is going to be the final activation of the first neuron,
a2 that of the second neuron and so on. So instead of writing out the inner
product here, or writing even this as an inner product plus the bias term
we're going to use matrix notation. And it's very important now to pay
attention to these intermediate variables that we'll define because
we'll see these over and over again as we use a chain
rule to take derivatives. So we'll define z here as W
times x plus the bias vector b. So we'll basically have here as
many bias terms as neurons: this vector has the same dimensionality as the number
of neurons that we have in this layer. And W will have as many rows as
the number of neurons that we have, and as many columns as
the input dimensionality of x. And then, whenever we write a = f(z), what that means here is that we'll
actually apply f element-wise. So f(z) when z is a vector is just f(z1),
f(z2) and f(z3). And now you might ask, well, why do we
have all this added complexity here with this sigmoid function? Later on we can actually have other
kinds of so-called non-linearities for this f function, and
it turns out that if we don't have the non-linearities in between and
we just stack a couple of these linear layers together, it wouldn't
give us a very different function. In fact it would continue to
just be a single linear function, since W2 times (W1 times x) is just the matrix (W2 W1) times x. And intuitively as you
have more hidden neurons, you can fit more and
more complex functions. So this is like a decision boundary
in a three dimensional space, you can think of it also in
terms of simple regression. If you had just a single hidden neuron, you kinda see here almost
an inverted sigmoid. If you have three hidden neurons,
you could fit these kinds of more complex functions, and with ten neurons,
each neuron can start to essentially overfit and try to be very good
at fitting exactly one point. All right, now let's revisit our
single window classifier and instead of slapping a softmax directly
onto the word vectors we're now going to have an intermediate hidden layer
between the word vectors and the output. And that's when we really start to
gain accuracy and expressive power. So let's define a single
layer neural network. We have our input x that will be again, our window, the concatenation
of multiple word vectors. We'll define z = Wx + b and we'll define a as
f applied element-wise to z, so a = f(z). And now, we can use this
neural activation vector a as input to our final classification layer. The default that we've had so
far was the softmax, but let's not rederive the softmax. We've done it multiple times now, and
you'll do it again in a problem set. Let's instead introduce an even simpler one and walk through all the gory details
of that simple classifier. And that will be a simple,
unnormalized score. And in this case here, this will
essentially be the right mechanism for very simple binary
classification problems, where you don't even care that much
about whether this probability is 0.8. You really just care: is it one,
is it in this class, or is it not? And so we'll define the objective function
for this new output layer in a second. But let's first understand
the feed-forward process. The feed-forward process is what you
will end up using at test time, and for each element also in training
before you can take derivatives. It's always feed-forward first and
then backward to take the derivatives. So what we wanna do here is for example, take basically each window and
then score it. And we want to
train the model such that it would assign high scores to windows where the center
word is a named entity location, such as Paris, or London, or Germany,
or Stanford, or something like that. Now we will often use, and you'll see in a lot of papers, this kind
of graph, so it's good to get used to it. There are various other kinds,
and we'll try to introduce them slowly throughout the lecture but
this is the most common one. So we'll define bottom up,
what each of these layers will do and then we'll take the derivatives and
learn how to optimize it. Now x window here is the concatenation
of all our word vectors. So now, and
I'll ask you a question in a second, let's try to figure out the dimensionality
here of all our parameters so that I know you're with me. So let's say each of our word
vectors here is four dimensional and we have five of these word vectors in
each window that are concatenated. So x is a 20 dimensional vector. And again,
we'll define it as a column vector. And then let's say we have,
in our first hidden layer, eight units here. So we want an eight unit hidden layer
as our intermediate representation. And then our final score is just
again a simple single number. Now what's the dimensionality
of our W given what I just said? 20 dimensional input, eight hidden units. 20 rows and eight columns? We need one more transpose,
[LAUGH] that's right. So it's going to be eight rows and
20 columns, right? And you can always check this
whenever you're unsure: if you have something like this, then
this will have some n times d, and then you multiply it with this, and
this will always have d rows, and so these two always
have to be the same, right? So all right, now what's the main intuition behind this
extra layer, especially for NLP? Well, that will allow
us to learn non-linear interactions between these
different input words. Whereas before, we could only say
well, if "in" appears in this location, always increase the probability
that the next word is a location. Now we can learn things and patterns like,
if "in" is in the second position, increase the probability of this being a location
only if "museum" is also the first vector. So we can learn interactions
between these different inputs. And that will eventually make
our model more accurate. Great question. So do I have a second W there? So in the second layer here the scores
are unnormalized, so it'll just be U, and because we just have a single output, this will
just be a single column vector and we'll transpose that to get our inner product
to get a single number out for the score. Sorry, yeah, so the question was
do we have a second W matrix. So yeah, that's in some
sense our second matrix, but because we only have one output unit in
that layer, we only need a single vector. Wonderful. All right, so,
now let's define the max-margin loss. It's actually a super powerful loss
function that is often even more robust than the cross entropy error in softmax,
and is quite powerful and useful. So let's define here two examples. Basically, you want to give
a high score to windows, where the center word is a location. And we wanna give low scores to corrupt or incorrect windows where the center
word is not a named entity location. So museum is technically a location,
but it's not a named entity location. And so the idea for this training objective of max-margin is
to essentially try to make the score of the true windows larger and the ones of
the corrupt windows smaller or lower, until they're good enough. And we define good enough as being
different by the value of one, so the objective is J = max(0, 1 - s + s_c), where s is the score of the true window and s_c the score of the corrupt one. And this one here is a margin. You can often see it as
a hyperparameter too and set it to m and try different ones, but
in many cases one works fine. This is continuous and
we'll be able to use SGD. So now what's the intuition behind the
softmax, sorry the max-margin loss here? If you have for
instance a very simple data set and you have here a couple
of training samples, and here you have the other class c,
what a standard softmax may give you is a decision
boundary that looks like this. It perfectly separates the two. It's a very simple training example. Most standard softmax
classifiers will be able to perfectly separate these two classes. And again, this is just for
illustration in two dimensions. These are much higher
dimensional problems and so on. But a lot of the intuition
carries through. So now here we have our decision
boundary and this is the softmax. Now, the problem is maybe that
was your training data set. But your test set, actually,
might include some other ones that are quite similar to the stuff you saw
at training, but a little different. And now this kind of decision
boundary is not very robust. In contrast to this, what the max margin loss will attempt to do is to
try to increase the margin between the closest points
of your training data set. So if you have a couple of points here and
you have different points here, we'll try to maximize the distance between the closest points here, and
essentially be more robust. So then if at test time you have some
things that are kinda similar, but not quite there, you're more likely
to also correctly classify them. So it's a really great loss or
objective function. Now in our case here, we said s_c for
one corrupt window, but in many cases in practice we're
actually going to have a sum over multiple of these. And you can think of this as similar to the
skip-gram model where we sample randomly a couple of corrupt examples. So you really only need for
this kind of training a bunch of true examples of "this
is a location in this context". And then all the other windows
where you don't have that in your training data are essentially
part of your negative class. All right, any questions around
the max-margin objective function? We're gonna take a lot of
derivatives of it now. That's right, is the corrupt
window just a negative class? Yes, that's exactly right. So you can think of any other window that
doesn't have a location at its center as just the other class. All right, now how do we optimize this? We're going to take very similar steps to
what we've done with cross entropy, but now we actually have this hidden layer and
we'll take our second to last step towards the full back-propagation algorithm
which we'll cover in the next lecture. So let's assume our cost
J here is larger than 0. So what does that mean? In the very beginning you will initialize
all your parameters here again, either randomly or maybe you'll initialize
your word vectors to be reasonable. But they're not gonna be quite perfect at
telling, in this context in the window, what is a location and what isn't. And so in the beginning all your scores
are likely going to be low cuz all our parameters, U and W and b have been
initialized to small, random numbers. And so the model is unlikely to be great
at distinguishing the window with a correct location at the center
from one that is corrupt. And so basically,
we will be in this regime. After a while of training, eventually
you're gonna get better and better. And then intuitively
if your score here, for instance, of the good window is five and
the one of the corrupt window is just two, then you'll see 1 - 5 + 2 = -2 is less than 0, so the max gives you 0 and you just basically have 0
loss on those elements. And that's another great property of
this objective function, which is that over time you can start ignoring more and more
of your training set cuz it's good enough. It will assign 0 cost, as in 0 error, to these examples and so
you can start to focus your objective function only on the things that the model
still has trouble distinguishing. All right, so let's assume that in the very
beginning, for most of our examples, J will be larger than 0. And so what we're gonna have to do now
is take derivatives with respect to all the parameters of our model. And so what are those? Those are U, W, b and our word vectors x. So we always start from the top and then
we go down because we'll start to reuse different elements and just the simple
combination of taking derivatives and reusing variables is going to
lead us to back propagation. So, the derivative of s with respect to U. Well, what was s? s was just U transpose times a, and so we all know that the derivative
of that is just a. So that was easy, first element,
first derivative, super straightforward. Now it's important when we
take the next derivative to also be aware of all our definitions, how we defined these functions that
we're taking derivatives of. So s is basically U transpose a,
a was f(z) and z was just Wx + b. All right,
it's very important to just keep track. That's like almost 80% of the work. Now, let's take
the derivative, like I said, first the partial with respect to only one
element of W to gain intuitions. And then we can put it back together and
have a more complex matrix notation. So we'll observe for
Wij that it will actually only appear in the ith activation of our hidden layer. So for example, let's say we have a very
simple input with a three dimensional x. And we have two hidden units,
and this one final score s. Then we'll observe that if we take
the derivative with respect to W23, so the second row and
the third column of W, well, that actually is only needed in a2. You can compute a1 without using W23. So what does that mean? That means if we take
the derivative with respect to weight Wij, we really only need to look at
the ith element of the vector a. And hence, we don't need to look
at this whole inner product. So what's the next step? Well, as we're taking derivatives with respect to W,
we need to be again aware of where W appears, and all the other parameters
are essentially constant. So U here is not something
we're taking a derivative of. So what we can do is just take it out,
just as like a single number, right. We'll just get it outside,
put the derivative inside here. And now, we just need to very
carefully define our ai. So a subscript i, so
that's where Wij appears. Now, ai was this function,
and we defined it as f of zi. So why don't we just
write this carefully out, and now this is the first application
of the chain rule, with the derivative of ai with respect to zi,
and then zi with respect to Wij. So this is a single application
of the chain rule. And in the end it looks kind of
overwhelming, but each step is very clear. And each step is simple, we're really
writing out all the gory details. So application of the chain rule,
now we're going to define ai. Well ai is just f of zi, and f was just an
element-wise function on a single number zi. So we can just rewrite ai with
its definition, f of zi, and we keep this one intact, all right? And now the derivative of f,
we can just for now assume is f prime. Just a single number, take derivative. We'll just define this as f prime for now. It's also just a single number,
so no harm done. Now we're still in this part here, where we basically wanna take
the derivative of zi with respect to Wij. Well, let's recall what zi was:
zi was just the ith row of W times x
plus the ith element of b. So let's just replace zi
with its definition. Any questions so far? All right, good or not? So we have our f prime and
we have now the derivative with respect to Wij of just
this inner product here. And we can again,
very carefully write out well, the inner product is just this
row times this column vector. That's just the sum over k of Wik times xk, and now when we
take the derivative with respect to Wij, all the other Ws are constants. They fall out, and so
basically, of all the xk, the only one that actually appears
in the sum with Wij is xj, and so basically this derivative is just xj. All right, so now we have this
whole expression, just from carefully taking chain rule multiplications,
definitions of all our terms and so on. And now basically, what we're gonna want
to do is simplify this a little bit, cuz we might want to
reuse different parts. And so we can observe, this first term here
actually happens to only use subindex i. And it doesn't use any other subindex. So we'll just define Ui times f prime of zi, for all the different i's, as delta i, for notational simplicity, and
xj is our local input signal. And one thing that's very helpful for you to do is actually look at also the
derivative of the logistic function here, which can be very conveniently computed
in terms of the original values: f prime of z is just f of z times 1 minus f of z. And remember f of z here, or f of zi for each element, is
always just a single number. And we've already computed it
during forward propagation. So we wanna ideally use hidden activation
functions that are very fast to compute. And here, we don't need to compute
another exponent or anything. We're not gonna recompute f of zi cuz
we already did that in the forward propagation step. All right, now we have the partial derivative
here with respect to one element of W. But of course, we wanna have the whole
gradient for the whole matrix. So now the question is,
with the definitions of this delta i for all the different elements of
i of this matrix and xj for all the different elements of the input. What would be a good way of trying to
combine all of these different elements to get a single gradient for the whole
matrix W, if we have two vectors? That's right. So essentially, we can use delta
times x transpose, namely the outer product, to get all the combinations
of all elements i and all elements j. And so this again might seem
a little bit like magic. But if you just think again of
the definition of the outer product here. And you write it out in terms of all
the indices, you'll see that turns out to be exactly what we would want in
one very nice, very simple equation. So we can kind of think of this delta
term actually as the responsibility, or the error signal, that's now arriving from
our overall loss into this layer of W. And that will eventually
lead us to flow graphs. And that will eventually lead us to you
not having to actually go through all this misery of taking all these derivatives, and being able to abstract it
away with software packages. But this is really the nuts and
bolts of how this works, yeah? Yeah, the question is, this outer product
will get all the elements of i and j? And that's right. So when we have delta times x transposed, then now we have basically here,
x is usually this column vector. So now let's get the notation right. So we wanna have the derivative
with respect to W. W was a 2x3 dimensional matrix, for
example. We should be very careful with our notation. So now,
the derivative of J with respect to our W has to, in the end, also be a 2x3 matrix. And if we have delta times x transposed,
then that means we'll have to have a two-dimensional delta, which is
exactly the dimensions that are coming in: the error signal has the same
dimensions as the number of hidden units that we have. So that's a 2 x 1 dimensional
column vector, times x transposed, which is a 1 x 3 dimensional
row vector once we transpose x. And so, what does that mean? Well, that's basically now just
standard matrix multiplication, and it gives us a 2x3 matrix. You should write that out. So now the last term that we haven't
taken derivatives of [INAUDIBLE] is our bi, and
it'll eventually be very similar. We're going to go through it. We can pull Ui out, we're going to
take f prime, assume that's the same. So now, this is our delta i. We'll observe something very similar. These are very similar steps for bi. But in the end, we're going to
just end up with this term and that's just going to be one. And so,
the derivative with respect to our bi element here is just delta i, and we can again
use all the elements of delta to have the entire gradient for
the update of b. Any questions? Excellent, so this is essentially,
almost back-propagation. We've so far only taken derivatives and
used the chain rule. And the first time I went through this, a lot of the magic of deep
learning just became a lot clearer. We've just taken derivatives, we have
an objective function, and then we update, based on our derivatives, all
the parameters of these large functions. Now the main remaining trick is to re-use
derivatives that we've computed for the higher layers in computing
derivatives for the lower layers. It's very much an efficiency trick. You could not use it, and it would
just be very, very inefficient. But this is the main insight of why we re-named taking
derivatives as back propagation. So what are the last derivatives
that we need to take? For this model, well again,
it's in terms of our word vectors. So let's go through all of those. Basically, we'll have to take the
derivative of the score with respect to every single element of our word vectors. Where again, we concatenated all
of them into a single window. And now, the problem here is that each word vector actually
appears in both of these terms. And both hidden units use all of
the elements of the input here. So we can't just look at a single element. We'll really have to sum over both of the
activation units in the simple case here, where we just have two hidden units and
three dimensional inputs. Keeps it a little simpler,
and there's less notation. So then, we basically start with this. We have to take derivatives with
respect to both of the activations. And now, we're just going to go
through similar kinds of steps. We have s. We defined s as U transpose
times our activation. That gives us Ui times ai, then ai
was just f of zi, and so on. Now, what we'll observe as we're going
through all these similar steps again is that we'll actually see the same
term here reused from before. It's Ui times f prime of zi. This is exactly the same term that we've seen here. And what that means is,
we can reuse that same delta. And that's really one of the big insights. Fairly trivial but very exciting,
cuz it makes it a lot faster. But what's still different now is that of course we have to take
the derivative of this inner product with respect
to xj, where we basically dropped the bias term, cuz that's just a constant
when we were taking this derivative. And so, for this one here again, the derivative of the inner product with respect to
xj is just the Wij element of this matrix
W. That's the relevant one for this inner product
when we take the derivative. So now we have this sum here, and
now comes again this tricky bit of trying to simplify this sum into something
simpler in terms of matrix products. And again, the reason we're getting
towards back propagation is that we're reusing here these previous error signals,
and elements of the derivative. Now, the first thing we'll
observe here as we're doing this sum is that the sum is actually also a simple inner
product, where we now take the jth column of W. So again, with this dot notation,
when the dot is after the first index we take the row, and
here, with the dot before it, we take the column. So it's a column vector. But then of course we transpose it, so it's a simple inner product with delta,
getting us a single number, just the derivative with respect to this element of
the word vectors in the word window. Yes. Great question. So once we have the derivatives for all these different variables, what's
the sequence in which we update them? And there's really no sequence, we
update them all in parallel. We just take one step in all the elements
for which we have now seen that parameter. And the complexity there
is that in standard machine learning you'll see in many models, just like
standard logistic regression, you see all your parameters like
your W in all the examples. In ours, it's a little more complex,
because most words you won't see in a specific window and so, you only update
the words that you see in that window. And if you included all the other ones,
you'd just have very, very large, quite sparse updates, and that's not
very RAM efficient, great question. So now we have this simple
multiplication here, and the sum is just an inner product. So far so simple, and we have our delta
vector which, as I mentioned, has two dimensions. We have the sum over two elements. So, so far so good. Now, really, we would like to get the full
gradient here with respect to all xj, for j equals one to three in
this simple case, or to 5d if we have a five
word large window. So now the question is, how do we
combine this single element here into a vector that eventually gives us all
the different gradients for all the xj, for j equals 1 to however long our window
is? Has anybody followed along this closely? That's right. W transposed delta. Well done. So basically our final derivative and
final gradient here for our score s with respect to the entire
window is just W transpose times delta. Super simple, very fast to implement,
I can easily think about how to vectorize this again by concatenating multiple
deltas from multiple Windows and so on. And it can be very efficiently,
like implemented and derived. All right, now the error message is
delta that arrives at this hidden layer, has of course the same dimensionality as
its hidden layer because we're updating all the windows. And now from the previous slides we
also know that when we update a window, it really means we now cut up that final gradient here into the different chunks
for each specific word in that window, and that's how we update our
first large neural network. So let's put all of this together again. So, our full objective function
here was this max, and I started out with saying let's assume it's larger
than zero, so you have this identity here. So this is a simple indicator function: if the condition is true,
then it's one, and if not, it's zero, and then you can essentially
ignore that pair of correct and corrupt windows x and
xc, respectively. So our final gradient when we have these kinds of max margin functions
is essentially implemented this way. And we can very efficiently
multiply all of this stuff. All right. So this is just that, this is not right. This is our [INAUDIBLE] But you still
have to take the derivative here, but basically this indicator function is
the main novelty that we haven't seen yet. All right. Yeah. >> [INAUDIBLE] >> Yeah, it's a long question. The gist of the question is how do we make
sure we don't get stuck in local optima. And you've kinda answered it a little
bit already, which is indeed, because of the stochasticity, you keep making updates
anyway, so it's very hard to get stuck. In fact,
the more stochastic you are, as in the fewer windows you look at
each time you want to make an update, the less likely you are to get stuck. If you tried to go through all the
windows and then make one gigantic update, that's actually very inefficient and
much more likely to get you stuck. And then the other observation
that is just slowly coming out of some of the theory that
we can't get into in this class is that it turns out a lot of the local
optima are actually pretty good. And in many cases, not even that far away from what you
might think the global optimum would be. Also, you'll observe a lot of times,
and we'll go through this in some of the project advice, that in many cases,
if you have a powerful enough
neural network model, you can often perfectly fit your input,
your training dataset. And you'll actually eventually spend
most of your time thinking about how to regularize your models better and often,
at least, even more stochasticity. We'll get through some of those. But yeah, good question. Yeah, in the end, we just have all
these updates and it's all very simple. All right, so let's summarize. This was a pretty epic lecture. Well done for sticking through it. Congrats again, this was our super
useful basic components lecture. And now this window model is actually
really the first one that you might observe in practice and
you might actually want to implement in a real life setting. So to recap, we've learned word vector training,
we learned how to combine windows. We have the softmax and
the cross entropy error and we went through some of the details there. We have the scores and the max margin loss,
and we have the neural network, and it's really these two steps here that you have
to combine differently for problem set
number one, and especially number two in that. So, we just have one more
math heavy lecture and after that we can have fun and
combine all these things together. Thanks.
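
The pieces recapped above can be put together in a small sketch. It is written in plain Python on purpose, so that every index from the derivation (Wij, delta i, xj) stays visible; for real workloads you would vectorize it with a matrix library, as the lecture advises. The function names (hinge, forward, score_grads, loss, loss_grad_W) and all the toy numbers are made up for illustration and are not taken from the lecture or the problem set. The sketch assumes the sigmoid as f and a margin of one, and it checks the analytic derivative for W23 against a finite-difference estimate.

```python
import copy
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def hinge(s, s_c, margin=1.0):
    """Max-margin loss J = max(0, margin - s + s_c)."""
    return max(0.0, margin - s + s_c)


def forward(x, W, b, U):
    """Feed-forward pass for one window: z = Wx + b, a = f(z), s = U^T a."""
    z = [sum(W_i[k] * x[k] for k in range(len(x))) + b_i
         for W_i, b_i in zip(W, b)]
    a = [sigmoid(z_i) for z_i in z]
    s = sum(U_i * a_i for U_i, a_i in zip(U, a))
    return s, a


def score_grads(x, W, b, U):
    """Gradients of the score s with respect to U, W, b and the window x."""
    s, a = forward(x, W, b, U)
    # delta_i = U_i * f'(z_i); for the sigmoid f'(z_i) = a_i * (1 - a_i),
    # so the activations from the forward pass are simply reused.
    delta = [U_i * a_i * (1.0 - a_i) for U_i, a_i in zip(U, a)]
    dU = list(a)                                       # ds/dU = a
    dW = [[d_i * x_j for x_j in x] for d_i in delta]   # outer product delta x^T
    db = list(delta)                                   # ds/db = delta
    dx = [sum(W[i][j] * delta[i] for i in range(len(delta)))
          for j in range(len(x))]                      # W^T delta
    return s, dU, dW, db, dx


def loss(x, x_c, W, b, U):
    return hinge(forward(x, W, b, U)[0], forward(x_c, W, b, U)[0])


def loss_grad_W(x, x_c, W, b, U):
    """dJ/dW = 1{1 - s + s_c > 0} * (ds_c/dW - ds/dW)."""
    s, _, dW, _, _ = score_grads(x, W, b, U)
    s_c, _, dW_c, _, _ = score_grads(x_c, W, b, U)
    active = 1.0 if 1.0 - s + s_c > 0 else 0.0
    return [[active * (g_c - g) for g, g_c in zip(row, row_c)]
            for row, row_c in zip(dW, dW_c)]


# Made-up toy numbers: 3-dimensional window, 2 hidden units.
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
b = [0.05, -0.05]
U = [0.3, -0.2]
x = [1.0, 0.5, -1.0]      # "true" window
x_c = [-0.5, 1.0, 0.2]    # "corrupt" window

# Finite-difference check of dJ/dW23 (second row, third column).
eps = 1e-6
W_plus, W_minus = copy.deepcopy(W), copy.deepcopy(W)
W_plus[1][2] += eps
W_minus[1][2] -= eps
numeric = (loss(x, x_c, W_plus, b, U) - loss(x, x_c, W_minus, b, U)) / (2 * eps)
analytic = loss_grad_W(x, x_c, W, b, U)[1][2]
print("dJ/dW23 analytic:", analytic, "numeric:", numeric)
```

With parameters this small both scores are close to zero, so the margin is violated and the indicator is one; that is the regime the derivation assumed. Calling hinge(5.0, 2.0) reproduces the lecture's example of a pair that already satisfies the margin and therefore contributes zero loss and zero gradient.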