[MUSIC] Stanford University. >> All right, hello everybody. Welcome to Lecture seven or
maybe it's eight. Definitely today is the beginning of where we talk about models that
really matter in practice. We'll talk today about the simplest
recurrent neural network model one can think of. But in general, this model family is what most people
now use in real production settings. So it's really exciting. We only have a little bit
of math in between and a lot of it is quite applied and
should be quite fun. Just one organizational
item before we get started. I'll have an extra office
hour today right after class. I'll be again on Queuestatus 68 or so. Last week we had to end at 8:30. And there's still a lot of
people who had a question, so I'll be here after class for
probably another two hours or so. Try to get through everybody's questions. Are there any questions around projects? >> [LAUGH]
>> And organizational stuff? All right, then let's take a look
at the overview for today. So to really appreciate the power of
recurrent neural networks it makes sense to get a little bit of background
on traditional language models, which have huge RAM requirements and aren't quite feasible in the settings where they obtain their highest accuracies. And then we'll motivate recurrent
neural networks with language modeling. It's a very important
kind of fundamental task in NLP that tries to
predict the next word. Something that sounds quite simple but
is really powerful. And then we'll dive a little bit into
the problems that you can actually quite easily understand once
you have figured out how to take gradients and you actually
understand what backpropagation does. And then we can go and
see how to extend these models and apply them to real sequence tasks
that people really run in practice. All right, so let's dive right in. Language models. So basically, we want to just compute the probability
of an entire sequence of words. And you might say,
well why is that useful? Why should we be able to compute
how likely a sequence is? And it actually comes up in
a lot of different kinds of problems. So one, for instance,
in machine translation, you might have a bunch of potential
translations that a system gives you. And then you might wanna understand
which order of words is the best. So "the cat is small" should get a higher
probability than "small the is cat". But based on another language
that you translate from, it might not be as obvious. And the other language might have
a reversed word order and whatnot. Another one is when you do speech
recognition, for instance. It also comes up in the machine
translation a little bit, where you might have, well this particular example is clearly
more a machine translation example. But comes up also in speech
recognition where you might wanna understand which word might be the better
choice given the rest of the sequence. So "walking home after school" sounds
a lot more natural than "walking house after school". But home and
house have the same translation or same word in German which is haus,
H A U S. And you want to know which one is
the better one for that translation. So comes up in a lot of
different kinds of areas. Now basically it's hard to compute
the perfect probabilities for all potential sequences 'cause
there are a lot of them. And so what we usually end up doing is
we basically condition on just a window; we try to predict the next word based on just the previous n words before the one that
we're trying to predict. So this is, of course,
an incorrect assumption. The next word that I will utter will
depend on many words in the past. But it's something that had to be done to use traditional count based
machine learning models. So basically we'll approximate this
overall sequence probability here with just a simpler version. In the exact case, this would basically be the product, over each word, of its probability given all the preceding words, from the first one all the way to the one just before the i-th one. But in practice, this probability we couldn't really compute with traditional machine learning models, so we actually approximate it with just some number n of words just before each word. So this is a simple Markov assumption, just assuming that the next word that is uttered depends only on the n previous words. And now if we wanted to use traditional
methods that are just basically based on the counts of words and not
using our fancy word vectors and so on, then the way we would compute and estimate these probabilities is essentially just by counting. If you want the probability of the second word given the first word, you would just count up how often these two words co-occur in this order, divided by how often the first word appears in the whole corpus. Let's say we have a very large corpus and we just collect all these counts.
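As a minimal Python sketch of those count-based estimates (my own toy example, not code from the lecture):

```python
# Count-based bigram / trigram probability estimates on a toy corpus.
# A real system would use a huge corpus plus smoothing; this is illustrative only.
from collections import Counter

corpus = "the cat is small . the cat sat .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_bigram(w2, w1):
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

def p_trigram(w3, w1, w2):
    # P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_bigram("cat", "the"))         # 1.0 on this tiny corpus
print(p_trigram("is", "the", "cat"))  # 0.5
```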
And now, if we wanted to condition not just on the previous word but on the two previous words, then we'd have to compute all these trigram counts. And now you can kind of sense that if we want to ideally condition on as many previous words as possible, but we have a large vocabulary of say 100,000 words, then we'll have a lot of counts: essentially 100,000 cubed many numbers we would have to store
to estimate all these probabilities. Does that make sense? Are there any questions for
these traditional methods? All right, now, the problem with
that is that the performance usually improves as we have more and
more of these counts. But, also,
you now increase your RAM requirements. And so,
one of the best models of this traditional type actually required 140 gigs of RAM for
just computing all these counts when they wanted to compute them for
a 126-billion-token corpus. So it's very,
very inefficient in terms of RAM. And you would never be able
to put a model that basically stores all these different n-gram counts. You could never store it in a phone or
any small machine. And now, of course, once computer
scientists struggle with a problem like that, they'll find ways to deal with it,
and so, there are a lot of different
ways you can back off. You say, well, if I don't find the 4-gram,
or I didn't store it, because it was not frequent enough,
then maybe I'll try the 3-gram. And if I can't find that or I don't have
many counts for that, then I can back off and estimate my probabilities with fewer
and fewer words in the context size. But in general you want
to have at least tri or 4-grams that you store and the RAM
requirements for those are very large. So that is actually something
that you'll observe in a lot of comparisons between deep
learning models and traditional NLP models that are based on
just counting words for specific classes. The more powerful your models are, sometimes the RAM requirements can
get very large very quickly, and there are a lot of different ways
people tried to combat these issues. Now our way will be to use
recurrent neural networks. Where basically, they're similar to
the normal neural networks that we've seen already, but they will actually tie
the weights between different time steps. And as you go over it, you keep using, re-using essentially the same
linear plus non-linearity layer. And that will at least in theory,
allow us to actually condition what we're trying to predict
on all the previous words. And now the RAM requirements will only scale with the number of words in the vocabulary, not with the length of the context that we might want to condition on. So now, how is this really defined? Again, you'll see different
kinds of visualizations, and I'm introducing you to a couple. I like sort of this unfolded one, where we have here an abstract hidden state at time step t that is conditioned on h_{t-1}, and then here you compute h_{t+1}. But in general,
the equations here are quite intuitive. We assume we have a list of word vectors. For now,
let's assume the word vectors are fixed. Later on we can actually loosen
that assumption and get rid of it. And now, at each time step, to compute the hidden state at that time step, we essentially just have these two matrices, these two linear layers, two matrix-vector products, and we sum them up. And that's essentially similar to saying we concatenate h_{t-1} and the word vector at time step t, and we also concatenate these two matrices. And then we apply
an element-wise non-linearity. So this is essentially just a standard
single layer neural network. And then on top of that we can
use this as a feature vector, or as our input to our standard
softmax classification layer. To get an output probability for
instance, over all the words. So now, the way we would write this out in this formulation is basically: the probability that the next word is the word at a specific index j, conditioned on all the previous words, is essentially the j-th element of this large output vector.
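As a sketch of the forward step in the notation described here (the slide itself isn't reproduced in the transcript, so take the exact symbols as an assumption):

$$h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]}\right), \qquad \hat{y}_t = \operatorname{softmax}\!\left(W^{(S)} h_t\right), \qquad \hat{P}(x_{t+1} = v_j \mid x_t, \ldots, x_1) = \hat{y}_{t,j}.$$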
Yes? What is S? So here you can have different ways to define your matrices. Some people just use U, V, and W or something like that, but here we basically use the superscript just to identify which matrix we have. And these are all different matrices: W^(hh), the reason we call it hh, is the W that computes the hidden layer h given the input h_{t-1}. And then you have an hx here, which essentially maps x into the same vector space that our hidden states are in. And then S is just our softmax W, the weights of the softmax classifier. And so let's look at the dimensions here. It's again very important. You have another question? So why do we concatenate and not add, is the question. So they're the same. When you write W^(hh) times h_{t-1} plus W^(hx) times x_t, this is actually the same thing. And so this will now basically be a vector, and we feed it into the non-linearity, but it doesn't really change things, so
let's just look at this inside part here. Now, if we concatenated h and x together: let's say x here has a certain dimensionality which we'll call d, so x is in R^d, and our h we'll define to have dimensionality D_h, so h is in R^(D_h). Now, what would the dimensionality be if we concatenated these two matrices? The output has to be, again, a D_h-dimensional vector. And what dimensionality does the concatenated vector have? That's right: it is (d + D_h) by 1, so the concatenated matrix here has to be D_h by (d + D_h). That's why we can essentially concatenate W^(hh) and W^(hx) in this way, and then we can basically multiply these. And again, if this is confusing, you can write out all the indices, and you realize that these
two are exactly the same. Does that make sense? Right, as you sum up all the values here, it'll essentially just get summed up also; it doesn't matter if you do it in one go or not. It's just a single-layer neural network where you concatenate the two inputs, but in many cases for recurrent neural networks it's written this way. All right.
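A quick numerical check of that equivalence, as a small numpy sketch of my own (not lecture code):

```python
import numpy as np

d, Dh = 4, 3                              # word-vector and hidden dimensionalities
W_hh = np.random.randn(Dh, Dh)            # hidden-to-hidden matrix
W_hx = np.random.randn(Dh, d)             # input-to-hidden matrix
h_prev = np.random.randn(Dh)
x_t = np.random.randn(d)

summed = W_hh @ h_prev + W_hx @ x_t                               # two products, summed
concat = np.hstack([W_hh, W_hx]) @ np.concatenate([h_prev, x_t])  # one product on the concatenations

print(np.allclose(summed, concat))  # True
```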
So now, here are two other ways you'll often see these visualized. This is kind of a rolled-up, not-unrolled version of a recurrent neural network. And sometimes you'll also see
sort of this self loop here. I actually find these kinds of
unrolled versions the most intuitive. All right. Now when you start and you, yup? Good question. So what is x[t]? It's essentially the word vector for the word that appears at the t-th time step, as opposed to x_t. Intuitively you could define x_t in any way, and as you go through the lectures you'll actually observe different versions, but x[t] here makes the lookup explicit: what it means in practice is that you actually have to go, at that t-th time step, find the word's identity, pull that word vector from your GloVe or word2vec vectors, and feed it in. x_t we used in previous lectures as the t-th element, for instance, in the whole embedding matrix of all our word vectors. So this is just to make it very explicit that we look up the identity of the word at the t-th time step and then get the word vector for that identity, that column in all our word vectors. Yep.
neural network at each time step, and then the question is whether that
is standard or just for simplicity? It is actually the simplest and still somewhat useful variant of a recurrent neural network,
though we'll see a lot of extensions even in this class, and then in the lecture
next week we'll go to even better versions of these kinds of
recurrent neural networks. But this is actually a somewhat
practical neural network, though we can improve it in many ways. Now, you might be curious when
you just start your sequence, and this is h_0 here, and there aren't any previous words, what you would do. And the simplest thing
is you just initialize the vector for the first hidden layer at the first or the
0 time step as just a vector of all 0s. Right and this is the X[t] definition
you had just describe through the column vector of L which is our embedding matrix
at index [t] which the time step t. All right so it's very important to keep
track properly of all our dimensionality. Here, W(S) to Softmax actually goes
over the size of our vocabulary, V times the hidden state. So the output here is the same
as the vector of the length of the number of words that we
might want to be able to predict. All right, any questions for the feed-forward definition of
a recurrent neural network? All right, so how do we train this? Well fortunately, we can use all the same machinery we've
already introduced and carefully derived. So basically here we have probability
distribution over the vocabulary and we're going to use the same exact cross
entropy loss function that we had before, but now the classes are essentially
just the next word. So this actually sometimes
creates a little confusion on the nomenclature that we have
'cause now technically this is unsupervised in the sense that
you just give it raw text. But this is the same kind of objective
function we use when we have supervised training where we have a specific
class that we're trying to predict. So the class at each time step is just the word index of the next word. And you're already familiar with that; here we're just summing over the entire vocabulary for each of the elements of y.
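As a hedged sketch of that objective, together with the perplexity he defines next (my LaTeX reconstruction, not the slide itself):

$$J^{(t)}(\theta) = -\sum_{j=1}^{|V|} y_{t,j}\,\log \hat{y}_{t,j}, \qquad J = \frac{1}{T}\sum_{t=1}^{T} J^{(t)}(\theta), \qquad \text{Perplexity} = 2^{J}.$$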
the next word over many different words in longer sequences, you could in theory just
take this negative of the average log probability is over this entire dataset. But for maybe historical reasons, and also
reasons like information theory and so on that we don't need to get into, what's
more common is actually to use perplexity. So that's just 2 to
the power of this value and, hence, we want to basically
be less perplexed. So the lower our perplexity is,
the less the model is perplexed or confused about what the next word is. And we essentially, ideally we'll assign
a higher probability to the word that actually appears in the longer
sequence at each time step. Yes? Any reason why 2 to the J? Yes, but it's sort of a rat hole
we can go down, maybe after class. Information theory bits and
so on, it's not necessary. All right. >> [LAUGH]
>> All right, so now you would think, well this is pretty
simple, we have a single set of W matrices, and training should
be relatively straightforward. Sadly, and this is really the main
drawback of this and a reason of why we introduce all these other more powerful
recurrent neural network models, training these kinds of models
is actually incredibly hard. And we can now analyze, using the tools of back propagation and
chain rule and all of that. Now we can analyze and
understand why that is. So basically we're multiplying here,
the same matrix at each time step, right? So you can kind of think of
this matrix multiplication as amplifying certain patterns over and
over again at every single time step. And so, in a perfect world, we would want the inputs from many time
steps ago to actually be able to still modify what we're trying to predict
at a later, much later, time step. And so, one thing I would like
to encourage you to do is to try to take the derivatives
with respect to these Ws, if you just had a two or
three word sequence. It's a great exercise,
great preparation for the midterm. And it'll give you some
interesting insights. Now, as we multiply the same matrix
at each time step during forward propagation, we have to do the same thing during backpropagation. We have, remember, our deltas, our error signals, sort of the global elements of the gradients. They will essentially, at each time step,
flow through this network backwards. So when we take our cross-entropy
loss here, we take derivatives, we back propagate we compute our deltas. Now the first time step here that just
happened close to that output would make a very good update and
will probably also make a good update to the word vector here if
we wanted to update those. We'll talk about that later. But then as you go backwards in
time what actually will happen is your signal might get either too weak,
or too strong. And that is essentially called
the vanishing gradient problem. As you go backwards through time, and you
try to send the error signal at time step t many time steps into the past, you'll
have the vanishing gradient problem. So, what does that mean and
how does it happen? Let's define here a simpler, but
similar recurrent neural network that will allow us to give you an intuition and
simplify the math downstream. So here we essentially just say, all
right, instead of our original definition where we had some kind of f
some kind of non-linearity, here we use the sigma function,
you could use other one. First introduce the rectified linear units
and so on instead of applying it here, we'll apply it in the definition
just right in here. So it's the same thing. And then let's assume, for now,
we don't have the softmax. We just have here, a standard,
a bunch of un-normalized scores. Which really doesn't matter for
the math, but it'll simplify the math. Now if you want to compute the total
error with respect to an entire sequence, with respect to your W then
you basically have to sum up all the errors at all the time steps. At each time step, we have an error of how incorrect we
were about predicting the next word. And that's basically the sum here, and now we're going to look at the element at the t-th time step of that sum. So let's just look at a single time step,
a single error at a single time step. And now even computing that will
require us to have a very large chain rule application,
because essentially this error at time step t will depend on all
the previous time steps too. So you have here the delta, dE_t over dy_t, with respect to the, sorry, the softmax output, or here these unnormalized score outputs y_t. But then you have to multiply that with the partial derivative of y_t with respect to the hidden state. So that's just this guy
right here, or this guy for ht. But now, that one depends on,
of course, the previous one, right? This one here, but it also depends
on that one, and that one, and the one before that, and so on. And so that's why you have to sum over
all the time step from the first one, all the way to the current one, where
you're trying to predict the next word. And now, each of these was
also computed with a W, so you have to multiply partial of that,
as well. Now, let's dig into
this a little bit more. And you don't have to worry too
much if this is a little fast. You won't have to really
go through all of this, but it's very similar to a lot of
the math that we've done before. So you can kind of feel comfortable for
the most part going over it at this speed. So now, remember here,
our definition of h_t. We basically have all these partials
of all the h_t's with respect to the previous time steps,
the h's of the previous time steps. Now, to compute each of these,
we'll have to use the chain rule again. And now, what this means is essentially a partial derivative of a vector
with respect to another vector. Something that if we're clever with
our backprop definitions before, we never actually have to do in practice,
right? 'cause this is a very large matrix, and we're combining the computation with the
flow graph, and our delta messages before such that we don't actually have to
compute explicitly, these Jacobians. But for the analysis of the math here, we'll basically look at
all the derivatives. So, just because we haven't defined it yet: the partial for each of these is essentially called the Jacobian, where you have all the partial derivatives of each element of the top here, h_t, with respect to each element of the bottom. And so in general, if you have
a vector valued function output and a vector valued input, and you take
the partials here, you get this large matrix of all the partial derivatives
with respect to all outputs. Any questions? All right, so basically here,
a lot of chain rule. And now, we got this beast
which is essentially a matrix. And we multiply, for each partial here, we actually have to multiply all of these,
right? So this is a large product
of a lot of these Jacobians. Now, we can try to simplify this,
and just say, all right. Let's say, there is an upper bound. And we also,
the derivative of h with respect to h_j. Actually, with this simple definition of
each h actually can be computed this way. And now,
we can essentially upper bound the norm of this matrix with
the multiplication of basically these equation right here,
where we have W_t. And if you remember our
backprop equations, you'll see some common terms here, but we'll actually write this out as
not just an element wise product. But we can write the same thing as
a diagonal where we have instead of the element wise. Elements we basically just put them into
the diagonal of a larger matrix, and with zero path,
everything that is off diagonal. Now, we multiply these two norms here. And we just define beta_W and beta_h as essentially the upper bounds: some number, a single scalar for each, for how large they
could maximally be, right? We have W, we could compute easily
any kind of norm for our W, right? It's just a matrix, computed matrix norm,
we get a single number out. And now, basically, when we write
this all, we put all this together, then we see that an upper bound for
this Jacobians is essentially for each one of these
elements as this product. And if we define each of the elements here in terms of their upper bounds beta, then we basically have this product of betas taken to the t minus k power. And so, as the sequence gets longer and longer and t gets larger and larger, it really depends on the value of beta whether this either blows up or gets very, very small, right?
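As a sketch of the bound being described (reconstructed from the discussion; the slide itself isn't in the transcript):

$$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial h_k}{\partial W}, \qquad \frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} W^{\top}\operatorname{diag}\!\left[f'(h_{j-1})\right],$$

$$\left\lVert \frac{\partial h_t}{\partial h_k} \right\rVert \;\le\; \prod_{j=k+1}^{t} \lVert W^{\top}\rVert\,\bigl\lVert \operatorname{diag}[f'(h_{j-1})]\bigr\rVert \;\le\; \left(\beta_W\,\beta_h\right)^{t-k}.$$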
Now, the norm of this matrix, for instance, is a norm that you have control over, right? You initialize your weight matrix W with some small random values initially
before you start training. If you initialize this to a matrix that
has a norm that is larger than one, then at each back propagation step and
the longer the time sequence goes. You basically will get a gradient
that is going to explode, cuz you take some value that's larger
than one to a large power here. Say, you have 100 or something,
and your norm is just two, then you have two to the 100th as an upper
bound for that gradient and vice-versa. If you initialize your matrix W in
the beginning to a bunch of small random values such that the norm of
your W is actually smaller than one, then the final gradient that will be
sent from h_t to h_k could become a very, very small number, right, a half to the power of 100. Basically, none of the errors will arrive; the error signal gets smaller and smaller as you go further and further backwards in time. Yeah. So if the gradient here is exploding, does
that mean a word that is further away has a bigger impact on a word that's closer? And the answer is when
it's exploding like that, you'll get to not a number in no time. And that doesn't even become a practical
issue because the numbers will literally become not a number,
cuz it's too large a value to compute. And we'll have to think
of ways to come back. It turns out the exploding gradient
problem has some really great hacks that make them easier to deal with than
the vanishing gradient problem. And we'll get to those in a second. All right, so now,
you might say this could be a problem. Now, why is the vanishing gradient problem an actual problem in practice? Again, it basically prevents us from allowing a word that appears far in the past to have any influence on what we're trying to predict in terms of the next word. And so here are a couple of examples from just
language modeling where that is a real problem. So let's say, for instance,
you have Jane walked into the room. John walked in too. It was late in the day. Jane said hi to. Now, you can put an almost
probability mass of one, that the next word in this blank is John,
right? But if now,
each of these words have the word vector, you type it in to the hidden state,
you compute this. And now, you want the model to pick up
the pattern that if somebody met somebody else, and your all this complex stuff. And then they said hi too, and
the next thing is the name. You wanna put a very high probability
on it, but you can't get your model to actually send that error signal way
back over here, to now modify the hidden state in a way that would allow you
to give John a high probability. And really, this is a large problem in
any kind of time sequence that you have. And many people might
intuitively say well, language is mostly a sequence problem,
right? You have words that appear
from left to right or in some temporal order as we speak. And so this is a huge problem. And now we'll have a little bit
of code that we can look into. But before that we'll have
the awesome Shayne give us a little bit of an intermission. >> Hi, so let's take a short break
from recurrent neural networks to talk about transition-based
dependency parsing, which is exactly what you guys saw
this time last week in lecture. So just as a recap, a transition-based
dependency parser is a method of taking a sentence and
turning it into a dependency parse tree. And you do this by looking at
the state of the sentence and then predicting a transition. And you do this over and over again in a greedy fashion until
you have a full transition sequence which itself encodes, the dependency
parse tree for that sentence. So I wanna show you how to get from
the model that you'll be implementing in your assignment two question two, which you're hopefully working
on right now, to SyntaxNet. So what is SyntaxNet? SyntaxNet is a model that Google came out with and they claim
it's the world's most accurate parser. It's a new, fast, performant TensorFlow framework for syntactic parsing, available for over 40 languages. The one for English is called Parsey McParseface. >> [LAUGH]
>> So my slide seemed to have been jumbled a little bit here, but
hopefully you can read through it. So basically the baseline we're
gonna begin with is the Chen and Manning model which came out in 2014. And Chen and Manning are respectively
your head TA and instructor. And the models that produce SyntaxNet
in just two stages of improvements, those directly modified Chen and Manning's model, which is exactly what
you guys will be doing in assignment two. And so we're going to focus today
on the main bulk of these changes, modifications which were
introduced in 2015 by Weiss et al. So without further ado, I'm gonna look
at their three main contributions. So the first one is they leverage
unlabeled data using something called Tri-Training. The second is that they tuned
their neural network and made some slight modifications. And the last and probably most important
is that they added a final layer on top of the model involving a structured
perceptron with beam search. So each of these seeks to solve a problem. So the first one is tri-training. So as you know, in most supervised models, they perform better the more
data that they have. And this is especially the case for
dependency parsing, where as you can imagine there are an
infinite number of possible sentences with a ton of complexity and
you're never gonna see all of them, and you're gonna see even some
of them very, very rarely. So the more data you have, the better. So what they did is they took
a ton of unlabeled data and two highly performing dependency parsers
that were very different from each other. And when they agreed, independently
agreed on a dependency parse tree for a given sentence, then that would
be added to the labeled data set. And so now you have ten
million new tokens of data that you can use in addition
to what you already have. And this by itself improved
a highly performing network's performance by 1% using
the unlabeled attachment score. So the problem here was not having
enough data for the task and they improved it using this. The second augmentation they made
was by taking the existing model, which is the one you
guys are implementing, which has an input layer
consisting of the word vectors. The vectors for the part of speech tags
and the arc labels with one hidden layer and one soft max layer predicting which
transition and they changed it to this. Now this is actually pretty much the same
thing, except for three small changes. The first is that they added, there are two hidden layers
instead of one hidden layer. The second is that they used
a ReLU nonlinearity function instead of the cube nonlinearity function. And the third and most important is
that they added a perceptron layer on top of the soft max layer. And notice that the arrows,
that it takes in as input the outputs from all
the previous layers in the network. So this perceptron layer wants
to solve one particular problem, and this problem is that greedy algorithms
aren't able to really look ahead. They make short term decisions and as a result they can't really
recover from one incorrect decision. So what they said is, let's allow
the network then to look ahead and so we're going to have a tree
which we can search over and this tree is the tree of all the possible
partial transition sequences. So each edge is a possible transition
form the state that you're at. As you can imagine, even with three transitions your tree
is gonna blossom very, very quickly and you can't look that far ahead and
explore all of the possible branches. So what you have to do
is prune some branches. And for that they use beam search. Now beam search is only
gonna keep track of the top K partial transition
sequences up to a depth of M. Now how do you decide which K? You're going to use a score computed
using the perceptron weights. You guys probably have a decent idea
at this point of how perceptron works. The exact function they used
is shown here, and I'm gonna leave up the annotations so you can take
a look at it later if you're interested. But basically those are the three
things that they did solve, the problems with the previous
Chen & Manning model. So in summary, Chen & Manning had
an unlabeled attachment score of 92%, already phenomenal performance. And with those three changes,
they boosted it to 94%, and then there's only 0.6%
left to get you to SyntaxNet, which is Google's 2016
state of the art model. And if you're curious what the did to get
that 0.6%, take a look at Andor et al.'s paper, which uses global normalization
instead of local normalization. So the main takeaway, and
it's pretty straight forward but I can't stress it enough, is when you're
trying to improve upon an existing model, you need to identify the specific
flaws that are in this model, in this case the greedy algorithm, and solve those problems specifically. In this case they did that using a semi-supervised method with unlabeled data, they tuned the model better, and they used the structured perceptron with beam search. Thank you very much. >> [APPLAUSE]
>> Kind of awesome. You can now look at these
kinds of pictures and you totally know what's going on. And in like state of the art stuff
that the largest companies in the world publish. Exciting times. All right, so we're gonna go through a little bit of
like a practical Python notebook sort of implementation that shows you a simple
version of the vanishing gradient problem. Where we don't even have a full recurrent
neural network, we just have a simple two-layer neural network, and even in
those kinds of networks you will see that the error that you start at
the top and the norm of the gradients as you go down through your network,
the norm is already getting smaller. And if you remember these were the two
equations where I said if you get to the end of those two equations you know
all the things that you need to know, and you'll actually see these three
equations in the code as well. So let's jump into this. I don't see it. Let me get out of the presentation All right, better, all right. Now, zoom in. So here, we're going to define
a super simple problem. This is a code that we started,
and 231N (with Andrej), and we just modified it to
make it even simpler. So let's say our data set,
to keep it also very simple, is just this kind of
classification data set. Where we have basically three classes,
the blue, yellow, and red. And they're basically in
the spiral clusterform. We're going to define our
simple nonlinearities. You can kind of see it as a solution
almost to parts of the problem set, which is why we're only showing it now. And we'll put this on the website too,
so no worries. You can visit later. But basically, you could define here f,
our different nonlinearities, element-wise, and the gradients for them. So this is f and f prime if f is the sigmoid function. We'll also look at the relu, the other nonlinearity that's very popular, and there we just have the maximum between 0 and x, a very simple function.
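A minimal sketch of those definitions (plain numpy; the actual notebook code may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(f):
    # gradient written in terms of the output f = sigmoid(x)
    return f * (1.0 - f)

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)
```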
Now, this is a relatively straightforward definition and implementation of this simple three-layer neural network. It takes as input here our nonlinearity, our data x, just these points in two-dimensional space, and the class, one of those three classes. We'll have this model here,
we have our step size for SGD, and our regularization value. Now, these are all our parameters,
w1, w2 and w3 for all the outputs, and
variables of the hidden states. Two sets is bigger, all right. >> [LAUGH]
>> All right, now, if our nonlinearity is the relu, then we have here relu,
and we just input x, multiply it. And in this case,
your x can be the entirety of the dataset, cuz the dataset's so small, each
mini-batch, we can essentially do a batch. Again, if you have realistic datasets,
you wouldn't wanna do full batch training, but we can get away with it here. It's a very tiny dataset. We multiply w1 times x
plus our bias terms, and then we have our element-wise
rectified linear units or relu. Then we've computed in layer two,
same idea. But now, it's input instead of
x is the previous hidden layer. And then we compute our scores this way. And then here, we'll normalize
our scores with the softmax: just exponentiate our scores and sum them up to normalize. So, very similar to the equations
that we walk through. And now,
it's just basically an if statement. Either we have used relu
as our activations, or we use a sigmoid, but
the math inside is the same. All right, now,
we're going to compute our loss. Our good friend, the simple average cross
entropy loss plus the regularization. So here,
we have the negative log of the probabilities, and we sum them up over all the elements. And then here, we have our regularization
as the L2, standard L2 regularization. And we just basically sum up the squares
of all the elements in all our parameters, and I guess it does cut off a little bit. Let me zoom in. All three have the same of
amount of regularization, and we add that to our final loss. And now, every 1,000 iterations,
we'll just print our loss and see what's happening. And this is something you
always want to do too. You always wanna visualize,
see what's going on. And hopefully,
a lot of this now looks very familiar. Maybe you implemented it not quite as efficiently in problem set one, but maybe you have, and then it's very, very straightforward. Now, that was the forward propagation,
we can compute our error. Now, we're going to go backwards, and we're computing our delta
messages first from the scores. Then we have here, back propagation. And now,
we have the hidden layer activations, transposed times delta
messages to compute w. Again, remember, we have always for
each w here, we have this outer product. And that's the outer
product we see right here. And now, the softmax part was the same regardless of whether we used a relu or a sigmoid. Let's walk through the sigmoid here. We now, basically, have our delta scores,
and have here the product. So this is exactly computing delta for
the next layer. And that's exactly this equation here,
and just Python code. And then again,
we'll have our updates dw, which is, again, this outer product right there. So it's a very nice
sort of equations code, almost a nice one to one
mapping between the two. All right, now, we're going to go through the network
from the top down to the first layer. Again, here, our outer product. And now, we add the derivatives for
our regularization. In this case, it's very simple, just matrices themselves
times the regularization. And we combine all our gradients
in this data structure. And then we update all our parameters
with our step_size and SGD. All right, then we can evaluate how
well we do on the training set, so that we can basically print out
the training accuracy as we train. All right, now, we're going to
initialize all the dimensionality. So we have there just our two
dimensional inputs, three classes. We compute our hidden sizes
of the hidden vectors. Let's say, they're 50, it's pretty large. And now, we can run this. All right, we'll train it with both
sigmoids and rectify linear units. And now,
once we wanna analyze what's going on, we can essentially now plot some of
the magnitudes of the gradients. So those are essentially the updates as we
do back propagation through the network. And what we'll see here are the sums of the gradients for the first and the second layer when we use sigmoid non-linearities. And basically here, the main takeaway message is that blue is the first layer, and green is the second layer. So the second layer is closer to the softmax, closer to what we're trying to predict, and hence its gradient usually has larger magnitude than the one that arrives at the first layer. And now, imagine you do this 100 times, and you intuitively have your vanishing gradient problem in recurrent neural networks: the gradients will essentially be zero. They're already almost half the size over the iterations when you just have two layers. And the problem is a little less strong
when you use rectified linear units. But even there, you're going to have
some decrease as you continue to train. All right,
any questions around this code snippet and vanishing gradient problems? No, sure. [LAUGH] That's a good question. The question is why the gradients are flatlining. And it's essentially
because the dataset is so simple that you actually just
perfectly fitted your training data. And then there's not much else to do
you're basically in a local optimum and then not much else is happening. So yeah, so these are the outputs where
if you visualize the decision boundaries: here is the relu, and with the relu you see a little bit more sort of edges, because you have linear parts in your decision boundary, and the sigmoid is a little smoother, a little rounder. All right, so now you can implement a very
quick versions to get an intuition for the vanishing gradient problem. Now the exploding gradient problem is,
in theory, just as bad. But in practice, it turns out we can actually use a hack that was first introduced by Tomas Mikolov, and it's very unmathematical in some ways. Say all you have is a large gradient of 100: let's just cap it to five. That's it, you just define a threshold, and whenever the value is larger than that threshold, you just cut it. Totally not the right mathematical direction anymore, but it turns out to work very well in practice, yep.
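A rough sketch of that clipping step (my own minimal numpy version, not the lecture's code); the element-wise cap is what's described here, and rescaling by the gradient norm is a common variant:

```python
import numpy as np

def clip_elementwise(grad, threshold=5.0):
    # Cap every entry of the gradient to [-threshold, threshold].
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, threshold=5.0):
    # Common variant: rescale the whole gradient if its norm exceeds the threshold.
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad
```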
So for the vanishing gradient problem, how would you cap it? It gets smaller and smaller, and you'd just multiply it up? But then it might overshoot. It might go in the completely
wrong direction. And you don't want to have the hundredth
word unless it really matters. You can't just make all
the hundred words or thousand words of the past
all matter the same amount. Right?
Intuitively. That doesn't make that much sense either. So this gradient clipping solution
is actually really powerful. And then a couple years after it
was introduced, Yoshua Bengio and one of his students actually gained
a little bit of intuition and it's something I encourage
you always to do too. Not just in the equations, where you
can write out recurrent neural network, where everything's one dimensional,
and the math comes out easy and you gain intuition about it. But you can also, and this is what
they did here, implement a very simple recurrent neural network which
just had a single hidden unit. Not very useful for anything in practice, but now, with the single-unit W and still the bias term, they can actually visualize exactly what the error surface looks like. Oftentimes we call this the error surface, or the energy landscape, the landscape of our objective function. So on this error surface, basically, you can see the z axis is the error that you have when you train it on a very simple problem. I forgot what the problem here was, but it's something very simple, like: keep this input around, remember the value, and then just return that value 50 time steps later. Something simple like that. And what they essentially observe is that in this error surface, or error landscape, you have these high-curvature walls. And so as you do an update, each little line here you can interpret as what happens at an SGD update step. You update your parameters. And you say, in order to minimize
my objective function right now, I'm going to change the value
of my one hidden unit and my bias term just like by this amount
to go over here, go over here. And all of a sudden you hit
these large curvature walls. And then your gradient basically blows up,
and it moves you somewhere way different. And so intuitively what happens here is, if you rescale the step size with this clipping method, then essentially you're not going to jump to some crazy,
faraway place, but you're just going to stay in this general area that seemed
useful before you hit that curvature wall. Yeah? So the question is, intuitively,
why wouldn't such a trick work for the vanishing grading problem but it does
work for the exploding grading problem. Why does the reason for the vanishing does not apply to
the exploding grading problem. So intuitively,
this is exactly the issue here. So the exploding,
as you move way too far away, you basically jump out of the area
where you, in this case here for instance, we're getting closer and
closer to a local optimum, but the local optimum was very
close to a high-curvature wall. And without the clipping trick, you would go way far away. Right. Now, with the vanishing gradient problem, it gets smaller and smaller. So in general clipping doesn't make sense,
but let's say, so that's the obvious answer. You can't, something gets smaller and
smaller, it doesn't help to have a maximum and then make it, you know cut it to that
maximum 'cause that's not the problem. It goes in the opposite direction. And so. That's kind of most
obvious intuitive answers. Now, you could say. Why couldn't you, if it gets below
a certain threshold, blow it up? But then that would mean that. Let's say you had. You wanted to predict the word. And now you're 50 time steps away. And really,
the 51st doesn't actually impact the word you're trying to
predict at time step t, right? So that word is 50 time steps back and it doesn't really modify the current word. And now you're artificially going to
blow up and make it more important. So that's less intuitive than saying, I don't wanna jump into some completely
different part of my error surface. The wall just comes from this is what
the error surface looks like for a very, very simple recurrent neural network
with a very simple kind of problem that it tries to solve. And you can actually use most
of the networks that you have, you can try to make them
have just two parameters and then you can visualize
something like this too. In fact it's very intuitive
sometimes to do that. When you try different optimizers,
we'll get to those in a later lecture, like Adam or SGD with momentum,
we'll talk about those soon. You can actually always try to visualise
that in some simple kind of landscape. This just happens to be the landscape that
this particular recurrent neural network problem has with one-hidden unit and
just a bias term. So the question is, how could we know for sure that this happens with non-linear activations and multiple weights. So you also have some
non-linearity here in this. So that intuitively wouldn't prevent
us from transferring that knowledge. Now, in general, it's very hard. We can't really visualize
a very high dimensional spaces. There is actually now an interesting
new idea that was introduced, I think by Ian Goodfellow
where you can actually try to, let's say you have your parameter space,
inside your parameter space, you have some kind of cross function. So you say my w matrices are at this value
and so on, and I have some error when all my values are here, and then I start
to optimize and I end up somewhere here. Now the problem is, we can't
visualize it because it's usually in realistic settings,
you have the 100 million. Workflow. At least a million or so
parameters, sometimes 100 million. And so, something crazy might be going
on as you optimize between this. And so, because we can't visualize it and we can't even sub-sample it because
it's such a high-dimensional space. What they do is they actually
draw a line between the point from where they started with their random
initialization before optimization. And end the line all the way to the point where you actually
finished the optimization. And then you can evaluate along
this line at certain intervals: you can evaluate how big your error is. And if that error changes a lot between two such intervals, then that means we have very high curvature in that area. So that's one trick of how
you might use this idea and gain some intuition of
the curvature of the space. But yeah, only in two dimensions can we
get such nice intuitive visualizations. Yeah. So the question is why don't
we just have less dependence? And the question of course,
it's a legit question, but ideally we'll let
the model figure this out. Ideally we're better at
optimizing the model, and the model has in theory these
long range dependencies. In practice, they rarely ever do. In fact when you implement these, and
you can start playing around with this and this is something I
encourage you all to do too. As you implement your models you can try
to make it a little bit more interactive. Have some IPython Notebook,
give it a sequence and look at the probability of the next word. And then give it a different sequence
where you change words like ten time steps away, and
look again at the probabilities. And what you'll often observe is that
after seven words or so, the words before actually don't matter, especially not for
these simple recurrent neural networks. But because this is a big problem, there are actually a lot of
different kinds of solutions. And so the biggest and
best one is one we'll introduce next week. But a simpler one is to use
rectified linear units, and to also initialize both of your W's, the one from hidden to hidden and the one from the input to the hidden state, with the identity matrix. And this is a trick that I
introduced a couple years ago and then it was sort of combined
with rectified linear units. And applied to recurrent
neural networks by Quoc Le. And so the main idea here is if
you move around in your space. Let's say you have your h, and usually we have here our W_hh times h, plus W_hx times x. And let's assume for now that h and
x have the same dimensionality. So then all these
are essentially square matrices. And we have here our different vectors. Now, in the standard initialization,
what you would do is you'd just have a bunch of small random values
and all the different elements of w. And what that means is
as you start optimizing, whatever x is you have some random
projection into the hidden space. Instead, the idea here is we actually
have identity initialization. Maybe you can scale it, so instead
you have a half times the identity, and what does that do? Intuitively when you combine
the hidden state and the word vector? That's exactly right. If this is an identity initialized matrix. So it's just, 1, 1, 1,
1, 1, 1 on the diagonal. And you multiply all of these by one half. Same as just having a half,
a half, a half, and so on. And you multiply this with this vector and
you do the same thing here. What essentially that means is that
you have a half, times that vector, plus half times that other vector. And intuitively that means in
the beginning, if you don't know anything. Let's not do a crazy random projection
into the middle of nowhere in our parameter space, but just average. And say, well, as I move through the space, my hidden state is just a moving average of the word vectors. And then I start making some updates.
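Here is a rough numpy sketch of that identity initialization, under the stated assumption that h and x have the same dimensionality (my own illustration, not lecture code):

```python
import numpy as np

Dh = 100                                # hidden size; assume word vectors have the same size
W_hh = 0.5 * np.eye(Dh)                 # scaled identity instead of small random values
W_hx = 0.5 * np.eye(Dh)                 # only possible here because d == Dh
h_prev = np.zeros(Dh)
x_t = np.random.randn(Dh)               # stand-in for a word vector

# With this initialization, the pre-activation is just the average of h_prev and x_t,
# so early on the hidden state behaves like a moving average of the word vectors.
h_t = np.maximum(0, W_hh @ h_prev + W_hx @ x_t)   # rectified linear unit
```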
tight problem of MNIST. Which we don't really have to go into,
but its a bunch of small digits. And they're trying to basically predict what digit it is by going over
all the pixels in a sequence. Instead of using other kinds of neural networks like
convolutional neural networks. And basically we look
at the test accuracy. These are very long time sequences. And the test accuracy for
these is much, much higher. When you use this identity initialization
instead of random initialization, and also using rectified linear units. Now more importantly for
real language modeling, we can compare recurrent neural
networks in this simple form. So we had the question before like,
do these actually matter or did I just kind of describe single
layer recurrent neural networks for the class to describe the concept. And here we actually have these
simple recurrent neural networks, and we basically compare. This one is called Kneser-Ney with 5
grams, so a lot of counts, and some clever back off and smoothing techniques which
we won't need to get into for the class. And we compare these on
two different corpora and we basically look at the perplexity. So these are all perplexity numbers,
and we look at the neural network or the neural network that's
combined with Kneser-Ney, assuming probability estimates. And of course when you combine the two
then you don't really get the advantage of having less RAM. So ideally this by itself would do best,
but in general combining the two
used to still work better. These are results from five years ago,
and they failed most very quickly. I think the best results now are pure
neural network language models. But basically we can see
that compared to Kneser-Ney, even back then, the neural
network actually works very well. And has much lower perplexity than just
the Kneser-Ney or just count-based models. Now one problem that you'll
observe in a lot of cases, is that the softmax is really,
really large. So your word vectors are one
set of parameters, but your softmax is another set of parameters. And if your hidden state is 1000, and let's say you have
100,000 different words. Then that's 100,000 times 1000 dimensional
matrix that you'd have to multiply with the hidden state at
every single time step. So that's not very efficient, and so one way to improve this is with
a class-based word prediction. Where we first try to predict some
class that we come up with, and there are different kinds of things we can do. In many cases you can just sort the words by how frequent they are, and say the thousand most frequent
words are in the first class, the next thousand most frequent
words in the second class, and so on. So you first basically classify: try to predict the class based on the history, and then you predict the word inside that class, given that class. And this second softmax is only a thousand-dimensional, so you can basically do this efficiently.
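Roughly, the factorization being described is (my notation; c(w) is whatever frequency-based class assignment you chose):

$$P(w_t \mid \text{history}) \;=\; P\bigl(c(w_t) \mid \text{history}\bigr)\,\cdot\,P\bigl(w_t \mid c(w_t)\bigr).$$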
And now, the more classes, the better the perplexity, but also the slower the speed and the less you gain from this. And especially at training time,
which is what we see here, this makes a huge difference. So if you have just very few classes,
you can actually reduce the number here of seconds
that each epoch takes, by almost 10x compared to
having more classes or even more than 10x if you
have the full softmax. And even the test time, is faster cuz now
you only essentially evaluate the word probabilities for the classes that
have a very high probability here. All right, one last trick and
this is maybe obvious to some but it wasn't obvious to others even in
the past when people published on this. But you essentially only need
to do a single backward pass through the sequence, where you accumulate all the deltas from each error at each time step. So looking at this figure,
really quick again. Here, essentially you have
one forward pass where you compute all the hidden states and
all your errors, and then you only have a single
backwards pass, and as you go backwards in time you keep accumulating
all the deltas of each time step. And so originally people said, for this
time step I'm gonna go all the way back, and then I go to the next time step,
and then I go all the way back, and then the next step, and all the way back,
which is really inefficient. And is essentially same as combining all the deltas in one clean
back propagation step. And again, it's kind of is intuitive. An intuitive sort of
implementation trick but people gave that the term back
propagation through time. All right, now that we have these
simple recurrent neural networks, we can use them for
a lot of fun applications. In fact, the name entity recognition
that we're gonna use in example with the Window. In the Window model, you could only
condition the probability of this being a location, a person, or an organization
based on the words in that Window. The recurrent neural network
you can in theory take and condition these probabilities
on a lot larger context sizes. And so
you can do Named Entity Recognition (NER), you can do entity level sentiment in
context, so for instance you can say. I liked the acting, but
the plot was a little thin. And you can say I want to now for
acting say positive, and predict the positive class for that word. Predict the null class, and
all sentiment for all the other words, and then plot should get
negative class label. Or you can classify opinionated
expressions, and this is what researchers at Cornell where they
essentially used RNNs for opinion mining and essentially wanted
to classify whether each word in a relatively smaller purpose here is
either the direct subjective expression or the expressive subjective expression,
so either direct or expressive. So basically this is direct
subjective expressions, explicitly mention some private state or
speech event, whereas the ESEs just indicate the sentiment or emotion without
explicitly stating or conveying them. So here's an example, like the committee as usual has
refused to make any statements. And so you want to classify
as usual as an ESE, and basically give each of these
words here a certain label. And this is something you'll actually
observe a lot in sequence tagging tasks. Again, it's all the same model, the recurrent neural network; you have the softmax at every time step. But now the softmax actually has a set of classes that indicate whether a certain expression begins or continues. And so here you would basically have this BIO notation scheme, where you have the beginning, the inside, or a null token, meaning it's not any of the expressions
that I care about. So here you would say, for instance, as usual is an overall ESE expression, so it begins here, and it continues here. And then these other words are neither ESEs nor DSEs.
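As a hedged illustration of those labels on this sentence (only the "as usual" ESE span is stated explicitly above; the remaining tags are shown as O for simplicity, though the paper also marks DSE spans):

```python
tokens = ["The", "committee", ",", "as",    "usual", ",", "has", "refused", "to", "make", "any", "statements", "."]
labels = ["O",   "O",         "O", "B-ESE", "I-ESE", "O", "O",   "O",       "O",  "O",    "O",   "O",          "O"]
```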
All right, now they start with the standard recurrent neural network, and I want you to at some point be able to glance over these equations and just say, I've seen this before. It doesn't have to be W superscript HH and so on, and the summation order, of course, doesn't matter either. Here they use W, V, and U, and instead of writing out softmax, they write g. But once you look at these equations, I hope that eventually you're just like: it's just a recurrent neural network, right? You have here your hidden-to-hidden matrix, you have your input-to-hidden matrix, and here you have your softmax weights U. So it's the same idea, but these are the actual equations from this real paper that you can now read and immediately have the intuition of what happens.
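In other words, modulo the letter names, the equations have roughly this shape, with f a nonlinearity such as tanh and g the softmax (the exact placement of bias terms, and which letter plays which role, may differ in the paper itself):

```latex
h_t = f(W x_t + V h_{t-1} + b), \qquad y_t = g(U h_t + c)
```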
All right, now with a unidirectional recurrent neural network, if we try to make the prediction here of whether this is an ESE, or any named entity recognition or other kind of sequence labelling task, what's the problem with this kind of model? What do you think, as we go from left to right only, could be a problem for making the most accurate predictions? That's right: words that come after the current word can't help us make accurate predictions at that time step, right? Because we only went from left to right. And so one of the most common
extensions of recurrent neural networks is actually to do bidirectional
recurrent neural networks where instead of just going from left to
right, we also go from right to left. And it's essentially the exact same model. In fact, you could implement it by
changing your input and just reversing all the words of your input, and
then it's exactly the same thing. And now, here's the reason why they
don't have superscripts with WHH, cuz now they have these
arrows that indicate whether you're going from left to right,
or from right to left. And now, they basically have
this concatenation here, and in order to make a prediction at a certain
time step t they essentially concatenate the hidden states from both the left
direction and the right direction. And those are now the feature vectors. This vector h_t coming from the left has all the context from the words on the left, well, again, on the order of seven-plus words, depending on how well you train your RNN, and h_t from the right has all the context from the words on the right, and that is now your feature vector to make an accurate prediction at a certain time step.
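A minimal numpy sketch of that concatenation, with all names and shapes purely illustrative:

```python
import numpy as np

def bidirectional_features(xs, W_f, V_f, W_b, V_b):
    """Run one RNN left-to-right and one right-to-left, then concatenate.

    xs: list of input vectors x_t; the W's are the input-to-hidden and
    the V's the hidden-to-hidden matrices for the forward/backward directions.
    """
    d = V_f.shape[0]
    h_fwd, h = [], np.zeros(d)
    for x in xs:                               # left to right
        h = np.tanh(W_f @ x + V_f @ h)
        h_fwd.append(h)
    h_bwd, h = [None] * len(xs), np.zeros(d)
    for t in reversed(range(len(xs))):         # right to left, independently
        h = np.tanh(W_b @ xs[t] + V_b @ h)
        h_bwd[t] = h
    # Feature vector at time t: [h_t from the left ; h_t from the right].
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```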
Any questions around bidirectional recurrent neural networks? You'll see these a lot, in various modifications, in all the recent papers you'll be reading. Yeah? >> Have people tried Convolutional Neural Networks? >> They have, and we have a special lecture where we'll also talk a little bit about Convolutional Neural Networks. So you don't necessarily have a cycle,
right? You just go, basically, as you implement this: you go once all the way from the left, and you don't have any interactions with the pass that goes from the right. You can compute your feedforward h_t's for that direction, and they are only coming from the left; and the h_t's from the other direction you can compute separately. In fact, you could parallelize this if you want to be super efficient, and have one core implement the left direction and one core implement the right direction. So in that sense it doesn't make the vanishing gradient problem any worse. But of course, just like any recurrent neural network, it does have the vanishing gradient problem and the exploding gradient problem, and you have to be clever about clipping it and so on. Yeah. Those earlier models we call standard feedforward neural networks, or window-based feedforward neural networks, and now we have recurrent neural networks. And this is really one of the most powerful model families, and we'll see lots of extensions. In fact, if there's no other
question, we can go even deeper. It is, after all, deep learning. And so now you'll observe [LAUGH] we definitely had to skip the superscripts, and we have different characters here for each of our matrices, because instead of just going from left to right, you can also have a deep neural network at each time step. And so now, to compute the ith layer at a given time step, you again only have the things coming from the left that modify it, but you don't just take in the vector from the left, you also take in the vector from below. So, in the simplest definition, that is just your x, your input vector, right? But as you go deeper, you now also have the previous hidden layer's output as input.
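A rough sketch of that stacking, assuming a recurrence of the form h_t^(i) = f(W^(i) · (input from below) + V^(i) · h_{t-1}^(i)); again, all names are illustrative:

```python
import numpy as np

def deep_rnn_step(x_t, h_prev, Ws, Vs):
    """One time step of an L-layer recurrent network.

    x_t:    input vector at time t
    h_prev: list of the L hidden states from time t-1 (one per layer)
    Ws, Vs: per-layer "from below" and hidden-to-hidden matrices
    """
    h_new, below = [], x_t
    for W, V, h in zip(Ws, Vs, h_prev):
        h_i = np.tanh(W @ below + V @ h)   # vector from below + vector from the left
        h_new.append(h_i)
        below = h_i                        # this layer's output feeds the layer above
    return h_new                           # the top entry typically feeds the softmax
```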
So the question is, why do we feed the hidden layer into another hidden layer instead of into the y? In fact, you can actually have so-called short-circuit connections too, where each of these h's can go directly to the y as well. And so here in this figure you see that only the top ones go into the y, but you can actually have short-circuit connections where y here has as input not just h_t from the top layer, denoted here as capital L, but the concatenation of all the h's.
It's just another way to make this monster even more monstrous. And in fact, there are a lot of modifications; in fact, Shayne has a paper on arXiv right now, a 'search space odyssey' type of thing, where you have so many different kinds of knobs that you can tune for even more sophisticated recurrent neural networks of the type that we'll introduce next week, that it gets a little unwieldy. It turns out a lot of the things don't matter that much, but each can kind of give you a little bit of a boost in many cases: whether you have three layers or four layers, what the dimensionality of all the layers is, and the various different kinds of connections and short-circuit connections. We'll introduce some of these, but in general this is a pretty decent model, and we will eventually abstract away how we compute that hidden state; that will be a more complex kind of cell type that we'll introduce next Tuesday. Do we have one more question? So now, how do we evaluate this? It's very important to evaluate
your models correctly, and we actually talked about this before. When you have a very imbalanced data set, where some of the classes appear very frequently and others are not very frequent, you don't wanna use accuracy. In fact, in these kinds of sentences you often observe, and this is an extreme one where you have a lot of ESEs and DSEs, that in many cases it's just content, standard non-sentiment context words, and so a lot of these tokens are actually O, they have no label. And so it's very important to use F1, and we basically had this question also after
class, but it's important for all of you to know because the F1 metric is really
one of the most commonly used metrics. And it's essentially just the harmonic
mean of precision and recall. Precision is just the true positives divided by true positives plus false positives, and recall is just the true positives divided by true positives plus false negatives. And then you take the harmonic mean of these two numbers.
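Written out, with TP, FP, and FN the true positives, false positives, and false negatives for a given class:

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}
```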
So intuitively, you can be very accurate by almost always predicting one class, or have a very high recall for a certain class, but if you always miss another class, that would hurt you a lot. And now here's an evaluation
that you should also be familiar with; basically, this is something I would like to see in a lot of your project reports too, as you analyze the various hyperparameters that you have. And so one thing they found here is that they have two different data set sizes that they train on, and in many cases, if you train with more data, you basically do better. But it's not always the case that more layers, so this depth that we had here, the number L of all these different layers, are better. In fact, here the highest performance they get is with three layers, instead of four or five. All right, so let's recap. Recurrent neural networks: the best deep learning model family that
you'll learn about in this class. Training them can be very hard. Fortunately, you understand
back propagation now. You can gain an intuition of
why that might be the case. In the next lecture, we'll extend them to some much more powerful models, the Gated Recurrent Units or LSTMs, and those are the models you'll see all over the place in all the state-of-the-art models these days. All right. Thank you.