[students murmuring] - Okay, so good afternoon
everyone, let's get started. So hi, so for those of
you who I haven't met yet, my name is Serena Yeung and I'm the third and final instructor for this class, and I'm also a PhD student
in Fei-Fei's group. Okay, so today we're going
to talk about backpropagation and neural networks, and so
now we're really starting to get to some of the core
material in this class. Before we begin, let's see, oh. So a few administrative details, so assignment one is due
Thursday, April 20th, so a reminder, we shifted
the date back by a little bit and it's going to be due
11:59 p.m. on Canvas. So you should start thinking
about your projects, there are TA specialties
listed on the Piazza website so if you have questions
about a specific project topic you're thinking about, you
can go and try and find the TAs that might be most relevant. And then also for Google Cloud,
so all students are going to get $100 in credits
to use for Google Cloud for their assignments and project, so you should be receiving an email for that this week, I think. A lot of you may have already, and then for those of you who haven't,
they're going to come, should be by the end of this week. Okay so where we are, so
far we've talked about how to define a classifier
using a function f, parameterized by weights
W, and this function f is going to take data x as input,
and output a vector of scores for each of the classes
that you want to classify. And so from here we can also define a loss function, so for
example, the SVM loss function that we've talked about
which basically quantifies how happy or unhappy we are with the scores that we've produced, right, and then we can use that to
define a total loss, L here, which is a combination of this data term with a regularization
term that expresses how simple our model is,
and we have a preference for simpler models, for
better generalization. And so now we want to
find the parameters W that correspond to our lowest loss, right? We want to minimize the loss function, and so to do that we want to find the gradient of L with respect to W. So last lecture we talked
about how we can do this using optimization, and we're going to iteratively take steps in the direction of steepest descent, which is
the negative of the gradient, in order to walk down this loss landscape and get to the point
of lowest loss, right? And we saw how this gradient
descent can basically take this trajectory, looking
like this image on the right, getting to the bottom
of your loss landscape. Oh! Okay, and so we also
talked about different ways for computing a gradient, right? We can compute this numerically using finite difference approximation which is slow and approximate,
but at the same time it's really easy to write out, you know you can always
get the gradient this way. We also talked about how to
use the analytic gradient, and computing this is fast and exact once you've
gotten the expression for the analytic gradient, but
at the same time you have to do all the math and the
calculus to derive this, so it's also, you know, easy
to make mistakes, right? So in practice what we want
to do is we want to derive the analytic gradient and use this, but at the same time check
our implementation using the numerical gradient to make sure that we've gotten all of our math right.
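In code, that kind of gradient check is only a few lines. Here's a minimal sketch for a scalar function, using the centered finite-difference formula (the toy function and tolerance are illustrative choices, not from the slides):

```python
def numerical_gradient(f, x, h=1e-5):
    # centered finite difference: slow and approximate, but easy to get right
    return (f(x + h) - f(x - h)) / (2.0 * h)

# check an analytic gradient: for f(x) = x**2, the analytic df/dx is 2x
x = 3.0
analytic = 2.0 * x
numeric = numerical_gradient(lambda v: v ** 2, x)
print(abs(analytic - numeric) < 1e-6)   # True: our math checks out
```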
So today we're going to talk about how to compute the analytic gradient for
arbitrarily complex functions, using a framework that I'm going
to call computational graphs. And so basically what a
computational graph is, is that we can use this
kind of graph in order to represent any function,
where the nodes of the graph are steps of computation
that we go through. So for example, with the linear classifier
that we've talked about, the inputs here are x and W, right, and then this multiplication
node represents the matrix multiplication of the parameters W with
our data x that we have, outputting our vector of scores. And then we have another
computational node which represents our hinge loss, right, computing our data loss term, Li. And we also have this
regularization term at the bottom right, so this node which computes our regularization term, and then our total loss
here at the end, L, is the sum of the regularization
term and the data term. And the advantage is
that once we can express a function using a computational graph, then we can use a technique
that we call backpropagation which is going to recursively
use the chain rule in order to compute the gradient with respect to every variable
in the computational graph, and so we're going to
see how this is done. And this becomes very
useful when we start working with really complex functions, so for example,
convolutional neural networks that we're going to talk
about later in this class. We have here the input image at the top, we have our loss at the bottom, and the input has to
go through many layers of transformations in order to get all the way down to the loss function. And this can get even
crazier with things like, the, you know, like a
neural turing machine, which is another kind
of deep learning model, and in this case you can see
that the computational graph for this is really insane, especially once we end up, you know, unrolling this over time. It would be basically completely impractical to write out the gradients by hand for any of these intermediate variables. Okay, so how does backpropagation work? So we're going to start
off with a simple example, where again, our goal is
that we have a function. So in this case, f of x, y, z equals the quantity x plus y, times z, and we want to find the
gradients of the output of the function with respect
to any of the variables. So the first step, always, is we want to take our function f, and we want to represent it using
a computational graph. Right, so here our computational
graph is on the right, and you can see that we have our, first we have the plus node, so x plus y, and then we have this
multiplication node, right, for the second computation
that we're doing. And then, now we're going
to do a forward pass of this network, so given the values of the variables that we have, so here, x equals negative two, y equals five and z equals negative four,
I'm going to fill these all in in our computational graph,
and then here we can compute an intermediate value,
so x plus y gives three, and then finally we pass it through the last node, the multiplication, to get our final value
of f equals negative 12. So here we want to give every
intermediate variable a name. So here I've called this
intermediate variable after the plus node q, and we
have q equals x plus y, and then f equals q times z,
using this intermediate node. And I've also written
out here, the gradients of q with respect to x
and y, which are just one because of the addition,
and then the gradients of f with respect to q and z,
which is z and q respectively because of the multiplication rule. And so what we want to
find, is we want to find the gradients of f with
respect to x, y and z. So what backprop is, it's
a recursive application of the chain rule, so we're
going to start at the back, the very end of the computational graph, and then we're going to
work our way backwards and compute all the
gradients along the way. So here if we start at
the very end, right, we want to compute the
gradient of the output with respect to the last
variable, which is just f. And so this gradient is
just one, it's trivial. So now, moving backwards,
we want the gradient with respect to z, right, and we know that df over dz is equal to q. So the value of q is just three, and so we have here, df
over dz equals three. And so next if we want to do df over dq, what is the value of that? What is df over dq? So we have here, df over
dq is equal to z, right, and the value of z is negative four. So here we have df over dq
is equal to negative four. Okay, so now continuing to
move backwards to the graph, we want to find df over dy, right, but here in this case, the
gradient with respect to y, y is not connected directly to f, right? It's connected through an
intermediate node, q, and so the way we're going to do this is we can leverage the
chain rule which says that df over dy can be
written as df over dq, times dq over dy, and
so the intuition of this is that in order to find the effect of y on f, we can take the effect of q on f,
which we already know, right? df over dq is equal to negative four, and we compound it with the
effect of y on q, dq over dy. So what's dq over dy
equal to in this case? - [Student] One. - One, right. Exactly. So dq over dy is equal to
one, which means, you know, if we change y by a little bit, q is going to change by approximately the same amount right, this is the effect, and so what this is
doing is this is saying, well if I change y by a little bit, the effect of y on q is going to be one, and then the effect of q on f
is going to be approximately a factor of negative four, right? So then we multiply these together and we get that the effect of y on f is going to be negative four. Okay, so now if we want
to do the same thing for the gradient with respect to x, right, we can
follow the same procedure, and so what is this going to be? [students speaking away from microphone] - I heard the same. Yeah exactly, so in this
case we want to, again, apply the chain rule, right? We know the effect of q on
f is negative four, and here again, since we have
also the same addition node, dq over dx is equal to one, again, we have negative four times
one, right, and the gradient with respect to x is
going to be negative four.
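And in code, everything we just did is only a few lines of plain Python; here's a minimal sketch (the variable names are my own, not from the slides):

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: chain rule from the output back to the inputs
df_df = 1.0          # gradient of f with respect to itself is trivially 1
df_dz = q            # d(q*z)/dz = q                          -> 3
df_dq = z            # d(q*z)/dq = z                          -> -4
df_dx = df_dq * 1.0  # chain rule, dq/dx = 1 for the add node -> -4
df_dy = df_dq * 1.0  # chain rule, dq/dy = 1 for the add node -> -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```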
Okay, so what we're doing in backprop is that we basically have all of these nodes in our computational graph, but each node is only aware of its immediate surroundings, right? So we have, at each node,
we have the local inputs that are connected to this node, the values that are flowing into the node, and then we also have the output that is directly outputted from this node. So here our local inputs are
x and y, and the output is z. And at this node we also know
the local gradient, right, we can compute the gradient
of z with respect to x, and the gradient of z with respect to y, and these are usually really
simple operations, right? Each node is going to be something like the addition or the multiplication that we had in that earlier example, which is something where
we can just write down the gradient, and we
don't have to, you know, go through very complex
calculus in order to find this. - [Student] Can you go
back and explain why more in the last slide was
different than planning the first part of it using
just normal calculus? - Yeah, so basically if we go back, hold on, let me... So if we go back here, we
could exactly write out, find all of these using just calculus, so we could say, you know,
we want df over dx, right, and we can probably
expand out this expression and see that it's just going to be z, but we can do this for, in this case, because it's simple, but
we'll see examples later on where once this becomes a
really complicated expression, you don't want to have to use calculus to derive, right, the
gradient for something, for a super-complicated expression, and instead, if you use this formalism and you break it down into
these computational nodes, then you can only ever work with gradients of very simple computations, right, at the level of, you know,
additions, multiplications, exponentials, things as
simple as you want them, and then you just use the chain rule to multiply all these together, and get your, the value of your gradient without having to ever
derive the entire expression. Does that make sense? [student murmuring] Okay, so we'll see an
example of this later. And so, was there another question, yeah? [student speaking away from microphone] - [Student] What's the negative four next to the z representing? - Negative, okay yeah,
so the negative four, these were the, the green values on top were all the values of
the function as we passed it forward through the
computational graph, right? So we said up here that x
is equal to negative two, y is equal to five, and
z equals negative four, so we filled in all of these
values, and then we just wanted to compute the value of this function. Right, so we said this value
of q is going to be x plus y, it's going to be negative
two plus five, it is going to be three, and we have z
is equal to negative four so we fill that in here,
and then we multiplied q and z together, negative four times three in order to get the
final value of f, right? And then the red values underneath were the gradients that we filled in as we were working backwards. Okay. So, right, we said that we have these nodes, and each node basically gets
its local inputs coming in and the output that it
sees directly passing on to the next node, and we also
have these local gradients that we computed, right, the gradient of the immediate output of the node with respect to the inputs coming in. And so what happens during
backprop is we have these, we'll start from the
back of the graph, right, and then we work our way from the end all the way back to the beginning, and when we reach each
node, at each node we have the upstream gradients coming back, right, with respect to the
immediate output of the node. So by the time we reach
this node in backprop, we've already computed the gradient of our final loss l,
with respect to z, right? And so now what we want to find next is we want to find the
gradients with respect to just before the node,
to the values of x and y. And so as we saw earlier, we
do this using the chain rule, right, we have from the chain rule, that the gradient of this loss function with respect to x is going to be the gradient with respect
to z times, compounded by this gradient, local gradient
of z with respect to x. Right, so in the chain rule we always take this upstream gradient coming down, and we multiply it by the local gradient in order to get the gradient
with respect to the input. - [Student] So, sorry, is it different because this would never give you a general symbolic formula for the gradient? It only works with instantaneous values, where you're passing back a constant value rather than a symbolic one. - So the question is
whether this only works because we're working
with the current values of the function, and so it works, right, given the current values of
the function that we plug in, but we can write an expression for this, still in terms of the variables, right? So we'll see that gradient
of L with respect to z is going to be some
expression, and gradient of z with respect to x is going to
be another expression, right? But we plug in these,
we plug in the values of these numbers at the
time in order to get the value of the gradient
with respect to x. So what you could do is you
could recursively plug in all of these expressions, right? Gradient with respect, z with respect to x is going to be a simple,
simple expression, right? So in this case, if we
have a multiplication node, gradient of z with
respect to x is just going to be y, right, we know that, but the gradient of L with respect to z, this is probably a complex part of the graph in itself, right, so
here's where we want to just, in this case, have this numerical value, right? So as you said, basically
this is going to be just a number coming down, right, a value, and then we just multiply it with the expression that we have
for the local gradient. And I think this will be
more clear when we go through a more complicated
example in a few slides. Okay, so now the gradient
of L with respect to y, we have exactly the
same idea, where again, we use the chain rule,
we have gradient of L with respect to z, times the gradient of z with respect to y, right,
we use the chain rule, multiply these together
and get our gradient. And then once we have these,
we'll pass these on to the node directly before,
or connected to this node. And so the main thing
to take away from this is that at each node we just
want to have our local gradient that we compute, just keep track of this, and then during backprop as
we're receiving, you know, numerical values of gradients
coming from upstream, we just take what that is, multiply it by the local gradient, and then this is what we then send back
to the connected nodes, the next nodes going backwards,
without having to care about anything else besides
these immediate surroundings. So now we're going to go
through another example, this time a little bit more complex, so we can see more why
backprop is so useful. So in this case, our
function is f of w and x, which is equal to one over one plus e to the negative of w-zero times x-zero plus w-one x-one, plus w-two, right? So again, the first step always is we want to write this out as
a computational graph. So in this case we can see
that in this graph, right, first we multiply together the
w and x terms that we have, w-zero with x-zero, w-one with x-one, and w-two, then we add all
of these together, right? Then we do, scale it by negative one, we take the exponential, we add one, and then finally we do
one over this whole term. And then here I've also
filled in values of these, so let's say given values that we have for the ws and xs, right,
we can make a forward pass and basically compute what the value is at every stage of the computation. And here I've also written
down here at the bottom the values, the expressions
for some derivatives that are going to be helpful later on, so same as we did before
with the simple example. Okay, so now then we're going
to do backprop through here, right, so again, we're going to start at the very end of the
graph, and so here again the gradient of the output with
respect to the last variable is just one, it's just trivial, and so now moving
backwards one step, right? So what's the gradient with respect to the input just before one over x? Well, so in this case, we know
that the upstream gradient that we have coming down,
right, is this red one, right? This is the upstream gradient
that we have flowing down, and then now we need to find
the local gradient, right, and the local gradient of this node, this node is one over x, right, so we have f of x equals
one over x here in red, and the local gradient of this df over dx is equal to negative one
over x-squared, right? So here we're going to take
negative one over x-squared, and plug in the value
of x that we had during this forward pass, 1.37,
and so our final gradient with respect to this variable is going to be negative one over
1.37 squared times one equals negative 0.53. So moving back to the next node, we're going to go through the
exact same process, right? So here, the gradient
flowing from upstream is going to be negative 0.53, right, and here the local gradient,
the node here is a plus one, and so now looking at our
reference of derivatives at the bottom, we have that
for a constant plus x, the local gradient is just one, right? So what's the gradient with respect to this variable using the chain rule? So it's going to be the upstream gradient of negative 0.53 times
our local gradient of one, which is equal to negative 0.53. So let's keep moving
backwards one more step. So here we have the exponential, right? So what's the upstream
gradient coming down? [student speaking away from microphone] Right, so the upstream
gradient is negative 0.53, what's the local gradient here? It's going to be the local
gradient of e to the x, right? This is an exponential
node, and so our chain rule is going to tell us that our gradient is going to be negative 0.53
times e to the power of x, where x is negative one from our forward pass, and
this is going to give us our final gradient of negative 0.2. Okay, so now one more node here, the next node is, that
we reach, is going to be a multiplication with negative one, right? So here, what's the upstream
gradient coming down? - [Student] Negative 0.2? - [Serena] Negative 0.2,
right, and what's going to be the local gradient, can
look at the reference sheet. It's going to be, what was it? I think I heard it. - [Student] That's minus one? - It's going to be minus
one, exactly, yeah, because our local gradient
says that for f of x equals a times x, df over dx is a, right, and the value of a that we scaled x by is negative one here. So we have here that the gradient is negative one times negative 0.2, and so our gradient is 0.2. Okay, so now we've
reached an addition node, and so in this case we
have these two branches both connected to it, right? So what's the upstream gradient here? It's going to be 0.2, right,
just as everything else, and here now the gradient with respect to each of these branches,
it's an addition, right, and we saw from before
in our simple example that when we have an addition node, the gradient with respect
to each of the inputs to the addition is just
going to be one, right? So here, our local gradient
for our top branch is going to be one, times
the upstream gradient of 0.2, which is going to give
a total gradient of 0.2, right? And then we, for our bottom branch we'd do the same thing, right, our
upstream gradient is 0.2, our local gradient is one again, and the total gradient is 0.2. So is everything clear about this? Okay. So we have a few more
gradients to fill out, so moving back now we've
reached w-zero and x-zero, and so here we have a
multiplication node, right, so we saw the multiplication
node from before, it just, the gradient with respect to one of the inputs just is
the value of the other input. And so in this case, what's the gradient with respect to w-zero? - [Student] Minus 0.2. - Minus, I'm hearing minus 0.2, exactly. Yeah, so with respect to w-zero, we have our upstream gradient, 0.2, right, times the value of the other input, x-zero,
which is negative one, we get negative 0.2 and
we can do the same thing for our gradient with respect to x-zero. It's going to be 0.2
times the value of w-zero which is two, and we get 0.4. Okay, so here we've filled
out most of these gradients, and so there was the question earlier about why this is simpler
than just computing, deriving the analytic gradient,
the expression with respect to any of these variables, right? And so you can see here,
all we ever dealt with was expressions for local gradients that we had to write out, so
once we had these expressions for local gradients, all we did was plug in the values for
each of these that we have, and use the chain rule to
numerically multiply this all the way backwards and get the gradients with respect to all of the variables. And so, you know, we can also fill out the gradients with respect
to w-one and x-one here in exactly the same way, and so one thing that I want to note is that
when we're creating these computational graphs, we can define the computational nodes at any
granularity that we want to. So in this case, we broke it down into the absolute simplest
that we could, right, we broke it down into
additions and multiplications, you know, it basically can't
get any simpler than that, but in practice, right,
we can group some of these nodes together into
more complex nodes if we want. As long as we're able to write down the local gradient for that node, right? And so as an example, if we
look at a sigmoid function, so I've defined the sigmoid function in the upper-right here, of a sigmoid of x is equal to one over one
plus e to the negative x, and this is something that's
a really common function that you'll see a lot in
the rest of this class, and we can compute the gradient for this, we can write it out, and if
we do actually go through the math of doing this analytically, we can get a nice expression at the end. So in this case it's equal
to one minus sigma of x, so the output of this function
times sigma of x, right?
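Written out, that derivation is just a couple of algebra steps:

```latex
\sigma(x) = \frac{1}{1+e^{-x}}, \qquad
\frac{d\sigma(x)}{dx}
  = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
  = \left(\frac{1+e^{-x}-1}{1+e^{-x}}\right)\left(\frac{1}{1+e^{-x}}\right)
  = \bigl(1-\sigma(x)\bigr)\,\sigma(x)
```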
And so in cases where we have something like this, we could just take all the computations that we had in our graph
that made up this sigmoid, and we could just replace it with one big node that's a sigmoid, right, because we do know the local
gradient for this gate, it's this expression, d of the
sigmoid of x over dx, right? So basically the important thing here is that you can group nodes together however you want into a little bit more complex nodes, as long as you can write down
the local gradient for this. And so all this is is
basically a trade-off between, you know, how much math
that you want to do in order to get a more, kind
of concise and simpler graph, right, versus how simple you want each of your gradients to be, right? And then you can write out as complex of a computational graph that you want. Yeah, question? - [Student] This is a
question on the graph itself, is there a reason that the
first two multiplication nodes and the weights are not connected
to a single addition node? - So they could also be connected into a single addition node,
so the question was, is there a reason why w-zero and x-zero are not connected with w-two? All of these additions
just connected together, and yeah, so the reason, the answer is that you can do that if you want, and in practice, maybe you
would actually want to do that because this is still a
very simple node, right? So in this case I just wrote
this out as simply as possible, where each node
only had up to two inputs, but yeah, you could definitely do that. Any other questions about this? Okay, so the one thing that I really like about thinking about this
like a computational graph is that I feel very comforted, right, like anytime I have to take a gradient, find gradients of something,
even if the expression that I want to compute
gradients of is really hairy, and really scary, you know,
whether it's something like this sigmoid or something worse, I know that, you know, I could
derive this if I want to, but really, if I just
sit down and write it out in terms of a computational graph, I can go as simple as I need to to always be able to apply
backprop and the chain rule, and be able to compute all
the gradients that I need. And so this is something that
you guys should think about when you're doing your homeworks,
as basically, you know, anytime you're having trouble
finding gradients of something just think about it as
a computational graph, break it down into all of these parts, and then use the chain rule. Okay, and so, you know, so we talked about how we could group these
set of nodes together into a sigmoid gate, and
just to confirm, like, that this is actually exactly equivalent, we can plug this in, right? So we have that our input
here to the sigmoid gate is going to be one, in
green, and then we have that the output is going
to be here, 0.73, right, and this'll work out if you plug it in to the sigmoid function. And so now if we want to
do, if we want to take the gradient, and we want
to treat this entire sigmoid as one node, now what we should do is we need to use this local gradient that we've derived up here, right? One minus sigmoid of x
times the sigmoid of x. So if we plug this in, and here we know that the value of sigmoid of x was 0.73, so if we plug this value
in we'll see that this, the value of this gradient
is equal to 0.2, right, and so the value of this
local gradient is 0.2, we multiply it by the
upstream gradient which is one, and we're going to get
out exactly the same value of the gradient with respect
to before the sigmoid gate, as if we broke it down into all
of the smaller computations.
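As a sketch of what that one big node could look like in code (the class name and structure here are my own illustration, not from the slides):

```python
import math

class SigmoidGate:
    def forward(self, x):
        # cache the output: the local gradient is written in terms of it
        self.out = 1.0 / (1.0 + math.exp(-x))
        return self.out

    def backward(self, dout):
        # local gradient (1 - sigma(x)) * sigma(x), chained with upstream dout
        return (1.0 - self.out) * self.out * dout

gate = SigmoidGate()
print(gate.forward(1.0))    # ~0.73, the value we saw in green
print(gate.backward(1.0))   # ~0.20, same gradient as with the small nodes
```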
Okay, and so as we're looking at what's happening, right, as we're taking these
gradients going backwards through our computational graph, there's some patterns that you'll notice where there's some
intuitive interpretation that we can give these, right? So we saw that the add gate is
a gradient distributor right, when we passed through
this addition gate here, which had two branches coming out of it, it took the gradient,
the upstream gradient and it just distributed it,
passed the exact same thing to both of the branches
that were connected. So here's a couple more
that we can think about. So what's a max gate look like? So we have a max gate
here at the bottom, right, where the inputs coming in are z and w, z has a value of two, w has
a value of negative one, and then we took the max of
this, which is two, right, and so we pass this
down into the remainder of our computational graph. So now if we're taking the
gradients with respect to this, the upstream gradient is, let's
say two coming back, right, and what does this local
gradient look like? So anyone, yes? - [Student] It'll be zero for
one, and one for the other? - Right. [student speaking away from microphone] Exactly, so the answer that was given is that z will have a gradient of two, w will have a value, a gradient of zero, and so one of these is going to get the full value of the
gradient just passed back, and routed to that variable,
and then the other one will have a gradient of zero, and so, so we can think of this as kind
of a gradient router, right, so, whereas the addition node passed back the same gradient to
both branches coming in, the max gate will just take the gradient and route it to one of the branches, and this makes sense because
if we look at our forward pass, what's happening is that only the value that was the maximum got passed down to the rest of the
computational graph, right? So it's the only value
that actually affected our function computation at
the end, and so it makes sense that when we're passing
our gradients back, we just want to, you know, flow it through that
branch of the computation. Okay, and so another one,
what's a multiplication gate, which we saw earlier, is there
any interpretation of this? [student speaking away from microphone] Okay, so the answer that was given is that the local
gradient is basically just the value of the other variable. Yeah, so that's exactly right. So we can think of this as
a gradient switcher, right? A switcher, and I guess
a scaler, where we take the upstream gradient and we scale it by the value of the other branch.
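As a little summary sketch of these three patterns in code (my own toy notation, not the course's):

```python
def add_backward(dout):
    # gradient distributor: both inputs get the upstream gradient unchanged
    return dout, dout

def max_backward(x, y, dout):
    # gradient router: only the input that won the max gets the gradient
    return (dout, 0.0) if x > y else (0.0, dout)

def mul_backward(x, y, dout):
    # gradient switcher/scaler: scale upstream by the other input's value
    return y * dout, x * dout

print(max_backward(2.0, -1.0, 2.0))   # (2.0, 0.0), as in the z, w example
```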
Okay, and so one other thing to note is that when we have a place where one node is connected to multiple nodes, the gradients add up at this node, right? So at these branches, using the multivariate chain rule, we're just going to take the value of the upstream gradient coming
back from each of these nodes, and we'll add these
together to get the total upstream gradient that's
flowing back into this node, and you can see this from
the multivariate chain rule and also thinking about this,
you can think about this that if you're going to
change this node a little bit, it's going to affect both
of these connected nodes in the forward pass,
right, when you're making your forward pass through the graph. And so then when you're
doing backprop, right, then now the, both of
these gradients coming back are going to affect this node, right, and so that's how we're
going to sum these up to be the total upstream gradient
flowing back into this node. Okay, so any questions about backprop, going through these forward
and backward passes? - [Student] So we haven't done anything to actually update the weights. [speaking away from microphone] - Right, so the question is,
we haven't done anything yet to update the values of these weights, we've only found the
gradients with respect to the variables, that's exactly right. So what we've talked about
so far in this lecture is how to compute gradients with
respect to any variables in our function, right,
and then once we have these we can just apply everything we learned in the optimization lecture,
last lecture, right? So given the gradient,
we now take a step in the direction of the gradient in order to update our weight,
our parameters, right? So you can just take this entire framework that we learned about last
lecture for optimization, and what we've done here is
just learn how to compute the gradients we need for
arbitrarily complex functions, right, and so this is going
to be useful when we talk about complex functions like
neural networks later on. Yeah? - [Student] Do you mind writing out the, all the variate, so you could help explain this slide a little better? - Yeah, so I can write
this maybe on the board. Right, so basically if we're
going to have, let's see, if we're going to have the gradient of f with respect to some variable x, right, and let's say it's
connected through intermediate variables q-i, we can basically... Right, so this is basically saying that if x is connected to
these multiple elements, right, which in this case, different q-is, then the chain rule is taking all, it's going to take the effect of each of these intermediate variables, right, on our final output f, and
then compound each one with the local effect of our variable x on that intermediate value, right? So yeah, it's basically just
summing all these up together.
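Written out, what I sketched on the board is the multivariate chain rule: if x feeds into f through several intermediate variables q-i, then

```latex
\frac{\partial f}{\partial x}
  \;=\; \sum_{i} \frac{\partial f}{\partial q_i}\,\frac{\partial q_i}{\partial x}
```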
Okay, so now that we've done all these examples in the scalar case, we're going to look at what happens when we have vectors, right? So now if our variables x, y and z, instead of just being numbers,
we have vectors for these. And so everything stays exactly
the same, the entire flow, the only difference is
that now our gradients are going to be Jacobian matrices, right, so these are now going
to be matrices containing the derivative of each
element of, for example z with respect to each element of x. Okay, and so to, you
know, so give an example of something where this is
happening, right, let's say that we have our input is
going to now be a vector, so let's say we have a
4096-dimensional input vector, and this is kind of a common
size that you might see in convolutional neural networks later on, and our node is going to be an
element-wise maximum, right? So we have f of x is equal to the maximum of x compared with zero
element-wise, and then our output is going to be also a
4096-dimensional vector. Okay, so in this case, what's the size of our Jacobian matrix? Remember I said earlier,
the Jacobian matrix is a matrix of partial derivatives, where each row holds the partial derivatives of one dimension of the output with respect to
each dimension of the input. Okay, so the answer I
heard was 4,096 squared, and that's, yeah, that's correct. So this is pretty large,
right, 4,096 by 4,096 and in practice this is
going to be even larger because we're going to
work with many batches of, you know, of, for example, 100 inputs at the same time, right,
and we'll put all of these through our node at the same
time to be more efficient, and so this is going to scale this by 100, and in practice our Jacobian's
actually going to turn out to be something like
409,600 by 409,600, right, so this is really huge, and basically completely impractical to work with. So in practice though,
we don't actually need to compute this huge
Jacobian most of the time, and so why is that, like, what does this Jacobian matrix look like? If we think about what's happening here, where we're taking this
element-wise maximum, and we think about what are each of the partial derivatives, right, which dimension of the inputs affect which dimensions of the output? What sort of structure can we
see in our Jacobian matrix? [student speaking away from microphone] Okay, so I heard that it's
diagonal, right, exactly. So because this is element-wise,
right, each element of the input, say the first
dimension, only affects that corresponding element
in the output, right? And so because of that
our Jacobian matrix is just going to
be a diagonal matrix. And so in practice then,
we don't actually have to write out and formulate
this entire Jacobian, we can just know the effect
of x on the output, right, and then we can just
use these values, right, and fill it in as we're
computing the gradient.
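Here's a small sketch of what that looks like with numpy, using a 4-dimensional vector in place of the 4096-dimensional one; the upstream gradient here is a made-up example:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -0.5])
out = np.maximum(x, 0)                   # element-wise max node

dout = np.array([0.1, 0.2, 0.3, 0.4])    # some upstream gradient
dx = dout * (x > 0)                      # diagonal Jacobian applied implicitly
print(dx)                                # [0.1, 0., 0.3, 0.]
```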
Okay, so now we're going to go through a more concrete vectorized example of a computational graph. Right, so let's look at a case where we have the function f of x and W is equal to, basically the
squared L-two norm of W multiplied by x, and so in this case we're going to say x is n-dimensional and W is n by n. Right, so again our first step, writing out the
computational graph, right? We have W multiplied by
x, and then followed by, I'm just going to call this L-two. And so now let's also fill
out some values for this, so we can see that, you
know, let's say have W be this two by two matrix, and x is going to be this
two-dimensional vector, right? And so we can say, label
again our intermediate nodes. So our intermediate node
after the multiplication is going to be q, we
have q equals W times x, which we can write out
element-wise this way, where the first element is
just W-one-one times x-one plus W-one-two times x-two and so on, and then we can now express
f in relation to q, right? So looking at the second
node we have f of q is equal to the squared L-two norm of q, which is equal to q-one
squared plus q-two squared. Okay, so we filled this in, right, we get q and then we get our final output. Okay, so now let's do
backprop through this, right? So again, this is always the first step, we have the gradient with respect
to our output is just one. Okay, so now let's move back one node, so now we want to find the
gradient with respect to q, right, our intermediate
variable before the L-two. And so q is a two-dimensional vector, and what we want to do is we want to find how each element of q
affects our final value of f, right, and so if we
look at this expression that we've written out
for f here at the bottom, we can see that the gradient of f with respect to a specific
q-i, let's say q-one, is just going to be two times q-i, right? This is just taking this derivative here, and so we have this expression for, with respect to each element of q-i, we could also, you know, write this out in vector form if we want to, it's just going to be two
times our vector q, and so what we get is
that our gradient is 0.44, and 0.52, this vector, right? And so you can see that it just took q and it scaled it by two, right? Each element is just multiplied by two. So the gradient of a vector
is always going to be the same size as the original vector, and each element of this
gradient tells us how much
this particular element affects our final output of the function. Okay, so now let's move
one step backwards, right, what's the gradient with respect to W? And so here again we want
to use the same concept of trying to apply the chain rule, right, so we want to compute our local gradient of q with respect to W, and so let's look at this again element-wise,
and if we do that, let's see what's the
effect of each q, right, each element of q with
respect to each element of W, and so this is going to be the Jacobian that we talked about earlier,
and if we look at this in this multiplication, q is equal to W times x, right,
what's the derivative, or the gradient of the first element of q, so our first element up top,
with respect to W-one-one? So q-one with respect to W-one-one? What's that value? X-one, exactly. Yeah, so we know that this is x-one, and we can write this
out more generally: the gradient of q-k with respect to W-i,j is equal to x-j when k equals i, and zero otherwise. And then now if we want
to find the gradient of f
with respect to each W-i,j. So looking at these derivatives now, we can use this chain rule
that we talked about earlier, where we basically compound df over dq-k for each element of q with dq-k over dW-i,j, right? So we find the effect of each element of W on each element of q, and
sum this across all q. And so if you write this
out, this is going to give this expression of two
times q-i times x-j. Okay, and so filling this out then we get this gradient with respect to W, and so again we can compute
this each element-wise, or we can also look at this
expression that we've derived and write it out in
vectorized form, right? So okay, and remember, the important thing to always check: the gradient with respect to a variable should have the same shape as the variable, and this is something
really useful in practice to sanity check, right,
like once you've computed what your gradient should
be, check that this is the same shape as your variable, because again, the element,
each element of your gradient is quantifying how much that element is affecting your final output. Yeah? [student speaking away from microphone] Oh, that term is an indicator function, so this is saying that it's just one if k equals i. Okay, so let's see, we've done that, and now one more example. The last thing we need to find is the gradient with respect to x.
partial derivatives we can see that dq-k over dx-i is
equal to W-k,i, right, using the same way as we did it for W, and then again we can
just use the chain rule and get the total
expression for that, right? And so this is going to be the gradient with respect to x, again,
of the same shape as x, and we can also write this out in vectorized form if we want. Okay, so any questions about this, yeah? [student speaking away from microphone] So we are computing the Jacobian, so let me go back here, right, so we have these partial derivatives of q-k with respect to x-i, right, and these are forming the entries
of your Jacobian, right? And so in practice what we're going to do is we basically take that,
and you're going to see it up there in the chain rule,
so the vectorized expression of gradient with respect to x, right, this is going to have the Jacobian here which is this transposed value here, so you can write it
out in vectorized form. [student speaking away from microphone] So well, so in this case the matrix is going to be the same size as W right, so it's not actually a large
matrix in this case, right?
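And as a sketch, here's the whole vectorized example in numpy, with values of W and x chosen to reproduce the 0.44 and 0.52 gradient we computed (I'm assuming these are the slide's numbers):

```python
import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# forward pass
q = W.dot(x)            # intermediate vector, q = [0.22, 0.26]
f = np.sum(q ** 2)      # squared L2 norm

# backward pass, using the expressions we derived
df_dq = 2.0 * q               # [0.44, 0.52], same shape as q
df_dW = np.outer(df_dq, x)    # entries 2 * q_i * x_j, same shape as W
df_dx = W.T.dot(df_dq)        # 2 W^T q, same shape as x

print(df_dq, df_dW, df_dx, sep="\n")
```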
Okay, so the way that we've been thinking about this is like a really modularized
implementation, right, where in our computational graph, right, we look at each node
locally and we compute the local gradients and chain them with upstream gradients coming down, and so you can think of this as basically a forward and a backwards API, right? In the forward pass we
implement the, you know, a function computing
the output of this node, and then in the backwards
pass we compute the gradient. And so when we actually
implement this in code, we're going to do this
in exactly the same way. So we can basically think
about, for each gate, right, if we implement a forward
function and a backward function, where the backward function
is computing the chain rule, then if we have our entire
graph, we can just make a forward pass through the
entire graph by iterating through all the nodes in the graph, all the gates. Here I'm going to use
the word gate and node, kind of interchangeably,
we can iterate through all of these gates and just call forward on each of the gates, right? And we just want to do this
in topologically sorted order, so we process all of
the inputs coming in to a node before we process that node. And then going backwards,
we're just going to then go through all of the gates
in this reverse sorted order, and then call backwards
on each of these gates.
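In pseudocode-ish Python, driving the whole graph might look something like this (a sketch under the assumption that each gate object knows its own inputs; this is not the course's literal code):

```python
def graph_forward(gates):
    # gates assumed topologically sorted: every node's inputs come before it
    for gate in gates:
        gate.forward()

def graph_backward(gates):
    # reverse topological order: consumers pass gradients back to producers
    for gate in reversed(gates):
        gate.backward()
```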
Okay, and so if we look at the implementation for a particular gate, for example this MultiplyGate here, we want to implement
the forward pass, right, so it gets x and y as inputs,
and returns the value of z, and then when we go backwards, right, we get as input dz, which
is our upstream gradient, and we want to output the gradients on the inputs x and
y to pass down, right? So we're going to output dx and dy, and in this example everything is back to
the scalar case here, and so if we look at
this in the forward pass, one thing that's important
is that we should cache the values
of the forward pass, right, because we end up using this in the backward pass a lot of the time. So here in the forward pass,
we want to cache the values of x and y, right, and
in the backward pass, using the chain rule,
we're going to, remember, take the value of the upstream gradient and scale it by the value
of the other branch, right, and so we'll keep, for
dx we'll take our value of self.y that we kept, and multiply it by dz coming down, and same for dy.
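Putting that together, a MultiplyGate along these lines might look like this (a sketch consistent with what's on the slide; the exact details may differ):

```python
class MultiplyGate:
    def forward(self, x, y):
        z = x * y
        self.x = x   # cache the inputs: the backward pass
        self.y = y   # needs the value of the other branch
        return z

    def backward(self, dz):      # dz is the upstream gradient
        dx = self.y * dz         # local gradient dz/dx = y, chained with dz
        dy = self.x * dz         # local gradient dz/dy = x, chained with dz
        return [dx, dy]
```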
Okay, so if you look at a lot of deep-learning frameworks and libraries you'll see
that they exactly follow this kind of modularization, right? So for example, Caffe is a
popular deep learning framework, and you'll see, if you go look
through the Caffe source code you'll get to some
directory that says layers, and in layers, which are
basically computational nodes, usually layers might be some of these slightly more complex computational nodes, like the sigmoid that
we talked about earlier, you'll see, basically just a whole list of all different kinds of
computational nodes, right? So you might have the sigmoid, there's a convolution layer, there's an Argmax layer, you'll have all of these
layers and if you dig in to each of them, they're
just exactly implementing a forward pass and a backward pass, and then all of these are called when we do forward and backward pass through the entire network that we formed, and so our network is just basically going to be stacking up all of these, the different layers that we
choose to use in the network. So for example, if we
look at a specific one, in this case a sigmoid layer, you'll see that in the sigmoid layer, right, we've talked about the sigmoid function, you'll see that there's a forward pass which basically computes
exactly the sigmoid expression, and then a backward pass, right, where it is taking as input
something, basically a top_diff, which is our upstream
gradient in this case, and multiplying it by a local
gradient that we compute. So in assignment one you'll get practice with this kind of, this
computational graph way of thinking where, you know, you're
going to be writing your SVM and Softmax classes, and taking the gradients of these. And so again, remember, the first step is always to represent it as a
computational graph, right? Figure out what are all the computations that you did leading up to the output, and then when you, when it's time to do your backward pass,
just take the gradient with respect to each of
these intermediate variables that you've defined in
your computational graph, and use the chain rule to
link them all together. Okay, so summary of what
we've talked about so far. When we get down to, you know,
working with neural networks, these are going to be
really large and complex, so it's going to be
impractical to write down the gradient formula by hand
for all your parameters. So in order to get these gradients, right, we talked about how, what we
should use is backpropagation, right, and this is kind of
one of the core techniques of, you know, neural
networks, is basically using backpropagation to
get your gradients, right? And so this is a recursive application of the chain rule where we have
this computational graph, and we start at the back and
we go backwards through it to compute the gradients with respect to all of the intermediate variables, which are your inputs, your parameters, and everything else in the middle. And we've also talked about how really this implementation and
this graph structure, you can see each of these nodes as implementing a forward
and backwards API, right? And so in the forward
pass we want to compute the results of the operation, and we want to save any intermediate values that we might want to use later
in our gradient computation, and then in the backwards
pass we apply this chain rule and we take this upstream gradient, we chain it, multiply it
with our local gradient to compute the gradient with respect to the inputs of the node,
and we pass this down to the nodes that are connected next. Okay, so now finally we're going to talk about neural networks. All right, so really, you
know, neural networks, people draw a lot of analogies
between neural networks and the brain, and different types of biological inspirations,
and we'll get to that in a little bit, but first let's
talk about it, you know, just looking at it as a function, as a class of functions
without all of the brain stuff. So, so far we've talked about, you know, we've worked a lot with this
linear score function, right? f equals W times x, and
so we've been using this as a running example of a
function that we want to optimize. So instead of using the
single in your transformation, if we want a neural network where we can just, as the simplest form, just stack two of these together, right? Just a linear transformation
on top of another one in order to get a two-layer
neural network, right? And so what this looks like is
first we have our, you know, a matrix multiply of W-one with x, and then we get this intermediate variable and we have this non-linear
function, a max of zero with this
output of this linear layer, and it's really important to
have these non-linearities in place, which we'll
talk about more later, because otherwise if you just
stack linear layers on top of each other, they're
just going to collapse to, like a single linear function. Okay, so we have our first linear layer and then we have this
non-linearity, right, and then on top of this we'll
add another linear layer. And then from here, finally
we can get our score function, our output vector of scores.
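As a sketch, the forward pass of this two-layer network is just a few lines of numpy (the layer sizes here are my own example choices, not from the slide):

```python
import numpy as np

D, H, C = 3072, 100, 10              # input dim, hidden size, classes (assumed)
x = np.random.randn(D)               # input, e.g. a flattened image
W1 = np.random.randn(H, D) * 0.01    # first layer of templates
W2 = np.random.randn(C, H) * 0.01    # second layer recombines template scores

h = np.maximum(0, W1.dot(x))         # non-linearity between the linear layers
s = W2.dot(h)                        # class scores, f = W2 max(0, W1 x)
```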
So basically, more broadly speaking, neural networks are a class of functions where we have simpler functions stacked on top of each other in a hierarchical way, in order to make up a more complex non-linear function, and so this is the idea of
having, basically multiple stages of hierarchical computation, right? And so, you know, so this is kind of the main way that we do this is by taking something like this matrix
multiply, this linear layer, and we just stack multiple
of these on top of each other with non-linear functions
in-between, right? And so one thing that this
can help solve is if we look, if we remember back to
this linear score function that we were talking about, right, remember we discussed earlier how each row of our weight matrix W was
something like a template. It was a template that sort
of expressed, you know, what we're looking for in the input for a specific class, right,
so for example, you know, the car template looks something like this kind of fuzzy red car,
and we were looking for this in the input to compute the
score for the car class. And we talked about one
of the problems with this is that there's only one template, right? There's this red car, whereas in practice, we actually have multiple modes, right? We might want, we're looking
for, you know, a red car, there's also a yellow
car, like all of these are different kinds of cars, and so what this kind of multiple
layer network lets you do is now, you know, each of
this intermediate variable h, right, W-one can still be
these kinds of templates, but now you have all of these scores for these templates in h,
and we can have another layer on top that's combining
these together, right? So we can say that actually
my car class should be, you know, connected to, we're looking for both red cars as well
as yellow cars, right, because we have this matrix W-two which is now a weighting
of all the entries of our vector h. Okay, any questions about this? Yeah? [student speaking away from microphone] Yeah, so there's a lot of different
non-linear functions that you can choose from,
and we'll talk later on in a later lecture about
all the different kinds of non-linearities that
you might want to use. - [Student] For the pictures in the slide, so, on the bottom row you have images of your vector W-one weight, and so maybe you would have images
of another vector W-two? - So W-one, because it's
directly connected to the input x, this is what's
like, really interpretable, because you can visualize all of these templates. W-two, so h is going to be a score of how much of each template is present, for example,
like you have a, you know, like a, I don't know, two for the red car, and like, one for the yellow
car or something like that. - [Student] Oh, okay, so
instead of W-one being just 10, like, you would have a left-facing horse and a right-facing horse,
and they'd both be included-- - Exactly, so the question
is basically whether in W-one you could have
both left-facing horse and right-facing horse,
right, and so yeah, exactly. So now W-one can be many different
kinds of templates right? They're not, and then W-two,
now we can, like basically it's a weighted sum of
all of these templates. So now it allows you to weight
together multiple templates in order to get the final
score for a particular class. - [Student] So if you're
processing an image then it's actually left-facing horse. It'll get a really high score with the left-facing horse template, and a lower score with the
right-facing horse template, and then this will take
the maximum of the two? - Right, so okay, so the question is, if our image x is like a left-facing horse and in W-one we have a template of a left-facing horse and
a right-facing horse, then what's happening, right? So what happens is yeah,
so in h you might have a really high score for
your left-facing horse, kind of a lower score for
your right-facing horse, and W-two is, it's a weighted
sum, so it's not a maximum. It's a weighted sum of these templates, but if you have either a really high score for one of these templates,
or let's say you have, kind of a lower and medium score
for both of these templates, all of these kinds of combinations are going to give high scores, right? And so in the end what you're going to get is something that generally scores high when you have a horse of any kind. So let's say you had a front-facing horse, you might have medium values for both the left and the right templates. Yeah, question? - [Student] So is W-two
doing the weighting, or is h doing the weighting? - W-two is doing the
weighting, so the question is, "Is W-two doing the weighting
or is h doing the weighting?" h is the value, like in this example, h is the value of scores
for each of your templates that you have in W-one, right? So h is like the score function, right, it's how much of each
template in W-one is present, and then W-two is going
to weight all of these, weight all of these intermediate scores to get your final score for the class. - [Student] And which
is the non-linear thing? - So the question is, "which
is the non-linear thing?" So the non-linearity usually
happens right before h, so h is the value right
after the non-linearity. So we're talking about
this, like, you know, intuitively as this example of like, W-one is looking for, you know, has these same templates as before, and W-two is a weighting for these. In practice it's not
exactly like this, right, because as you said, there's all these non-linearities thrown in and so on, but it has this approximate
type of interpretation to it. - [Student] So h is just W-one-x then? - Yeah, yeah, so the
question is, is h just W-one-x? So h is just W-one times x,
with the max function on top. Oh, let me just, okay so, so we've talked about
this as an example of a two-layer neural network,
and we can stack more layers of these to get deeper networks
of arbitrary depth, right? So we can just do this one more time at another non-linearity and
matrix multiply now by W-three, and now we have a three-layer
neural network, right? And so this is where the
term deep neural networks is basically coming from, right? This idea that you can stack
multiple of these layers, you know, for very deep networks. And so in homework you'll get a practice of writing and you know, training one of these neural networks, I
think in assignment two, but basically a full
implementation of this using this idea of forward pass, right, and backward passes, and using chain rule to compute gradients
that we've already seen. The entire implementation of
a two-layer neural network is actually really simple, it
can just be done in 20 lines, and so you'll get some practice
with this in assignment two, writing out all of these parts.
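To give a flavor of what that compact implementation can look like, here's a rough sketch in numpy; the sizes, learning rate, sigmoid non-linearity, and squared-error loss are my assumptions for illustration, not necessarily the assignment's exact setup:

```python
import numpy as np
from numpy.random import randn

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = randn(N, D_in), randn(N, D_out)        # random data and targets
w1, w2 = randn(D_in, H), randn(H, D_out)

for t in range(2000):
    # forward pass: linear layer, sigmoid, linear layer, squared-error loss
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()

    # backward pass: chain rule through each node of the graph
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))   # sigmoid local gradient

    # gradient descent step, as in last lecture
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```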
Okay, so now that we've seen what a neural network is as a function: we hear people talk a lot about how there are biological inspirations for neural networks, and even though it's important to emphasize that these analogies are really loose, just very loose ties, it's still interesting to understand where some of these connections and inspirations come from. So now I'm going to talk briefly about that. If we think about a neuron in a very simple way, here's a diagram of one. We have impulses that are carried towards each neuron; a lot of neurons are connected together, and each neuron has dendrites, which are what receive the impulses coming into the neuron. Then we have a cell body, which integrates these incoming signals, and after integrating all of these signals, the impulse is carried away from the cell body, through axons, to the downstream neurons it's connected to.
Now, if we look at what we've been doing so far with each computational node, you can see it in a similar way. Nodes are connected to each other in the computational graph, and we have inputs, or signals, x coming into a neuron. All of these inputs, x-zero, x-one, x-two, are combined and integrated together using, for example, our weights W. So we do some sort of computation, in some of the computations we've seen so far something like W times x plus b, integrating all of these together; then we have an activation function that we apply on top, we get the value of the output, and we pass it down to the connecting neurons.
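As a loose sketch of that single-neuron picture in code (the class name, the sigmoid choice, and the sizes are illustrative, not code from the lecture):

```python
# One "neuron": integrate weighted inputs plus a bias, then apply an
# activation function (here a sigmoid) on top.
import numpy as np

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights          # one weight per incoming connection
        self.bias = bias

    def forward(self, x):
        # integrate the incoming signals in the "cell body": w . x + b
        cell_body_sum = np.dot(self.weights, x) + self.bias
        # sigmoid activation applied on top of the integrated signal
        return 1.0 / (1.0 + np.exp(-cell_body_sum))

neuron = Neuron(np.random.randn(3), 0.1)
print(neuron.forward(np.array([1.0, -2.0, 0.5])))
```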
So if you look at this, you can think about it in a very similar way: the signals coming in are connected at synapses, the synapses connecting the multiple neurons; the dendrites integrate all of this information together in the cell body; and then the output is carried onward to later neurons. That's the analogy you can draw between the two.
And if you look at these activation functions: this is what takes all the inputs coming in and outputs one number that goes out downstream. We've talked about examples like the sigmoid activation function and other kinds of non-linearities, and one loose analogy you can draw is that these non-linearities represent something like the firing rate, or spiking rate, of the neuron. Neurons transmit signals to connecting neurons using discrete spikes, and if a neuron is spiking very fast, there's a strong signal being passed downstream, so we can think of the value after our activation function as, in a sense, the firing rate that we're going to pass on.
And in practice, I think neuroscientists who actually study this say that the non-linearity most similar to the way neurons actually behave is the ReLU non-linearity, which is something we're going to look at more later on. It's a function that is zero for all negative values of its input, and then linear for everything in the positive regime. We'll talk more about this activation function later on, but in practice it's maybe the one most similar to how neurons actually behave.
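As a quick sketch, the ReLU just described is one line in NumPy:

```python
# ReLU: zero for all negative inputs, linear in the positive regime.
import numpy as np

def relu(z):
    return np.maximum(0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]
```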
But it's really important to be extremely careful with making any of these sorts of brain analogies, because in practice biological neurons are way more complex than this. There are many different kinds of biological neurons; dendrites can perform really complex non-linear computations; our synapses, the W-zeros from earlier where we drew this analogy, are not single weights like we had but actually really complex non-linear dynamical systems; and this idea of interpreting our activation function as a sort of rate code, or firing rate, is also insufficient in practice. A firing rate is probably not a sufficient model of how neurons actually communicate with downstream neurons: as a very simple example, neurons fire at a variable rate, and this variability probably should be taken into account. So it's all a much more complex thing than what we're dealing with here. There are references, for example on dendritic computation, that you can look at if you're interested in this topic. So in practice, we can see how a computational node may resemble a neuron at this very high level, but real neurons are much more complicated than that.
Okay, so we talked about how there are many different kinds of activation functions that could be used, like the ReLU I mentioned earlier, and we'll talk about these different choices of activation function in much more detail later on. And we'll also talk
about different kinds of neural network architectures. So we gave the example of these fully connected neural networks, where each layer is a matrix multiply. As for what we call these: we said "two-layer neural network" before, and that corresponded to the fact that we have two of these linear layers where we're doing a matrix multiply, that is, two fully connected layers. We could also call this a one-hidden-layer neural network, counting the number of hidden layers instead of the number of matrix multiplies we're doing. You can use either, though I think "two-layer neural network" is a little more commonly used. Likewise, the three-layer neural network that we had can also be called a two-hidden-layer neural network.
And so we saw that when we're doing this type of feed-forward pass through a neural network, each node in the network is basically doing the kind of operation of the neuron that I showed earlier. What's actually happening is that you can think of each hidden layer as a whole vector, a set of these neurons, and by writing it out this way, with matrix multiplies to compute our neuron values, we can efficiently evaluate an entire layer of neurons at once. With one matrix multiply we get the output values of a layer of, let's say, 10, or 50, or 100 neurons.
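Here is a small sketch of that vectorization point, with illustrative sizes (3072 inputs, 100 neurons) and a sigmoid non-linearity assumed; the loop and the single multiply compute the same thing:

```python
# One matrix multiply evaluates an entire layer of neurons at once,
# instead of looping neuron by neuron.
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))                    # sigmoid non-linearity

x = np.random.randn(3072)                              # input vector
W = np.random.randn(100, 3072)                         # row i = weights of neuron i

one_at_a_time = np.array([f(W[i].dot(x)) for i in range(100)])
whole_layer = f(W.dot(x))                              # one matrix multiply
assert np.allclose(one_at_a_time, whole_layer)
```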
All right, so looking at this again, writing it all out in matrix-vector form: we have our non-linearity f, in this case a sigmoid function; we take our data x, some input vector of values; we apply our first matrix multiply, W-one, on top of this, then our non-linearity, then a second matrix multiply to get a second hidden layer, h-two, and then we have our final output. And this is basically all you need to be able to write a neural network, plus, as we saw earlier, the backward pass: you then just use backprop to compute all of the gradients. That's basically all there is to the main idea of what a neural network is.
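As a minimal sketch of that matrix-vector forward pass (the layer sizes here are made up for illustration, and f is the sigmoid from the example):

```python
# Matrix-vector forward pass for a three-layer network.
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid non-linearity

x = np.random.randn(3072)             # input vector
W1 = np.random.randn(100, 3072)
W2 = np.random.randn(100, 100)
W3 = np.random.randn(10, 100)

h1 = f(W1.dot(x))                     # first hidden layer
h2 = f(W2.dot(h1))                    # second hidden layer
out = W3.dot(h2)                      # final output scores
```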
Okay, so just to summarize: we talked about how we can arrange neurons into these computations of fully-connected, or linear, layers. This abstraction of a layer has the nice property that we can use very efficient vectorized code to compute a whole layer at once. We also talked about how it's important to keep in mind that while neural networks have some analogy and loose inspiration from biology, they're not really neural; it's a pretty loose analogy that we're making. And next time we'll talk about convolutional neural networks. Okay, thanks.