Good morning. Today we discuss the second part of the lecture on Neural Networks, where we will talk about Multilayer Neural Networks.
We have already looked at individual neural units and discussed that they can represent linear functions, but the main excitement about neural networks is that they can represent non-linear functions. And we can represent non-linear functions by stacking layers of perceptron-like units into different architectures. So, today we will look at Feed Forward Multilayer Neural Networks, which are a particular type of connection structure in neural networks.

Now, first of all, we will look at the limitations of perceptrons. We have not really discussed this yet, but perceptrons have a monotonicity property. In a perceptron, we have multiple inputs feeding into a single output unit, and we have weights associated with these connections.
Because of this type of connection, perceptrons have a monotonicity property: if a link has a positive weight, the activation can only increase as that input value increases. So a perceptron cannot represent functions where input interactions can cancel each other; each input interacts with the neuron individually, and the unit cannot handle interactions between the inputs.

For example, a perceptron can be used to represent gates like AND. Suppose there are two inputs x1 and x2, and you want the output to be 1 if both x1 and x2 are true. Then you can set the weights and the threshold such that the result will be 1 only when both x1 and x2 are 1. However, a perceptron cannot represent the XOR function. In the XOR function, suppose we have two variables x1 and x2 which can take the values 0 and 1. XOR is true when exactly one of x1 and x2 is 1, and false when both are 0 or both are 1. So the XOR function is not linearly separable, it is not monotonic in the individual inputs, and it cannot be represented by a perceptron.
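To make this concrete, here is a minimal sketch in Python, assuming a step (threshold) activation; the particular weights and threshold are one illustrative choice, not values from the lecture.

```python
def perceptron(x1, x2, w1, w2, threshold):
    """A single perceptron with a step activation: outputs 1 if the weighted sum reaches the threshold."""
    return 1 if w1 * x1 + w2 * x2 >= threshold else 0

# AND gate: output 1 only when both inputs are 1 (weights 1, 1 and threshold 2 work).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2, w1=1, w2=1, threshold=2))

# No choice of w1, w2, threshold makes this single unit compute XOR,
# because XOR is not linearly separable.
```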
So, a solution to this is to have a multilayer neural network, where we have these units stacked on top of each other. So, we have the inputs x1, x2, x3, ..., xn, then a first layer which we call Hidden Layer 1 with units, let us say, z1, z2, z3, ..., zl, and then the output layer with units y1, y2. The inputs feed forward from the input layer into the hidden layer, and from the hidden layer into the output layer. So, we can have a multilayer network where we have the input layer and the output layer, and between the input and the output we can have other units, which are called Hidden Units. Why hidden? Because in the training examples they are not observed; we are given only the input and the output, so these intermediate units are called hidden units. And through these hidden units we can represent many non-linear functions.
For example, if you look at this picture, we have x1 and x2, and each of z1 and z2 computes a linear separation of x1 and x2. Then y is, in turn, a function of z1 and z2, and if we use suitable non-linear activation functions, this sort of connection can represent XOR or other non-linear functions.
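As a concrete sketch of this idea, the classic two-layer construction of XOR uses one hidden unit for OR, one for NAND, and an output unit that ANDs them together; the weights below are one possible choice and are only illustrative.

```python
def step_unit(inputs, weights, threshold):
    """Weighted sum followed by a step activation, as in a single perceptron."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def xor_net(x1, x2):
    z1 = step_unit((x1, x2), (1, 1), 1)     # hidden unit 1: OR(x1, x2)
    z2 = step_unit((x1, x2), (-1, -1), -1)  # hidden unit 2: NAND(x1, x2)
    return step_unit((z1, z2), (1, 1), 2)   # output unit: AND(z1, z2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # prints the XOR truth table
```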
So, let us look at what multilayer networks can express. We have seen that single layer networks can represent linearly separable functions. Multilayer networks can express interactions among the inputs. In particular, consider a 2-layer network, by which we mean a network with 1 hidden layer and 1 output layer: there is the input, then the hidden computing layer, then the output computing layer. Such a 2-layer neural network can represent any Boolean function, and any continuous function to within a given tolerance, provided of course you have the requisite number of hidden units and you use appropriate activation functions. If you have a 3-layer neural network, then you can represent all computable functions. All of these functions can be represented using 2-layer and 3-layer neural networks.
So, they have very good representation capacity. But the next question is, is it learnable?
Just because a representation exists for a function does not immediately mean that you can learn the function well. For neural networks like this, learning algorithms do exist, but they have weaker guarantees. For the perceptron learning rule we said that if a suitable weight setting exists, that is, if the function is linearly separable, then the procedure will converge. For multilayer neural networks we cannot give such strong guarantees, but algorithms exist, and people are working on different, very exciting types of algorithms.

So, let us look at the general structure of
a multilayer network. This is a 3-layer network, where there is the input, then the first hidden layer, the second hidden layer, and the output. This is an example of a layered feed forward neural network. It is a feed forward network because the connections that we have drawn are all unidirectional: input to first hidden layer, first hidden layer to second hidden layer, second hidden layer to output. All the edges are directed forward from the input to the output, and there is no backward link. So, this is why it is called a feed forward neural
network. This is called a layered network because we have organized the neurons into
layers and layer i is connected to layer i plus 1. Also this particular diagram shows
a fully connected layered feed forward network, where there are 2 hidden layers, 1 output
layer, and of course the input layer is there. In this type of feed forward neural network the input flows forward from the input to the output through the hidden layers.

Now, while talking about perceptron training we said that we change the weights based on the error in the output: if we observe that there is an error between what the ideal output should be and what is actually computed, then we change the weights of the connections so that this error becomes smaller. That is what we looked at in perceptron training. Here also we need to do the same thing. However,
here there is one difficulty. We know what the ideal output should be at each output unit, so based on that we can change the weights feeding into the output layer. But at a hidden node the ideal output is not told to us, it is not known to us directly, so on what basis do we update the weights feeding into the hidden layer? At the output we know the ideal value, we can compute the error directly and on that basis update the output weights, but we also need to know what the error is at the hidden units.

So, what is done is that the error observed at an output unit is propagated backwards. The assumption is that the error at an output unit is the result of errors at the hidden units that feed inputs into it. Therefore, we back-propagate the output errors to these hidden nodes, so that each hidden node gets a notional error, and based on that error we update the weights coming into it. If there were more layers, the error at the hidden units would again be back-propagated to the earlier layers.
So, the error signal flows backward. The computation in the network is forward, but the error signals flow backward, and based on this computation of the error signals we figure out how the weights have to be changed. That is why the method for updating weights in such multilayer neural networks is called back propagation: the input flows forward and the error signal flows backward.

So, these are the steps in the back propagation training algorithm. The first step is initialization: we fix the structure of the network and initialize the values of the weights, and usually we give small random values to the weights and the biases in the network.
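A minimal sketch of this initialization step might look as follows; the layer sizes and the scale of the random values are illustrative assumptions, not values given in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes for a fully connected feed forward network:
# n inputs, one hidden layer, and an output layer (illustrative choice).
layer_sizes = [4, 3, 2]

# Small random initial weights and zero biases for each pair of consecutive layers.
weights = [rng.normal(scale=0.1, size=(fan_in, fan_out))
           for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(fan_out) for fan_out in layer_sizes[1:]]
```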
computing. Now what we do is that we take our training
example which comprises x 1, y 1, x 2, y 2, x m, y m. These are our training examples
we take the examples one by one and apply them to the network. We apply x 1 to the network
we get some output y 1 hat. So, given x 1 we get y 1 hat, so this is the output that
we get. And the ideal output is y 1 hat, and the outlet let me draw it here. So, the output
that we get is y 1 hat, here we get y 2 hat, here we get y m hat for a particular configuration
of the network. And if this y 1 hat and y 1 are different this is different then we
look at the error and based on this we change the weights.
So, in the forward computing we apply the input to the first layer: at each unit we first take the weighted summation and then apply the activation function, and we get the outputs z1, z2, z3 of the hidden layer, each computed as the activation function applied to the sum of x_i times w_i. Then, based on that, we compute y1, y2, etcetera at the output layer, each computed as the activation function applied to the sum of w_i times z_i. So, we do the forward computing.
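A minimal sketch of the forward computing step, assuming a sigmoid activation function and the same illustrative layer sizes as above, could be:

```python
import numpy as np

def sigmoid(a):
    """Sigmoid activation applied elementwise to the weighted sums."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """Propagate one input forward through every layer; return each layer's output."""
    activations = [x]
    for W, b in zip(weights, biases):
        net = activations[-1] @ W + b      # net_j = sum_k w_kj * o_k (plus a bias term)
        activations.append(sigmoid(net))   # o_j = phi(net_j)
    return activations

# Small random weights and biases, as in the initialization sketch above (illustrative sizes).
rng = np.random.default_rng(0)
layer_sizes = [4, 3, 2]
weights = [rng.normal(scale=0.1, size=(i, j)) for i, j in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(j) for j in layer_sizes[1:]]

x = np.array([0.5, -0.2, 0.1, 0.9])        # an illustrative input vector
y_hat = forward(x, weights, biases)[-1]    # the network's output for this input
print(y_hat)
```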
After doing the forward computing, we now have to update the weights. As I said, we can update the weights at the output layer in a similar fashion as we did for single layer units; for single layer sigmoid units we have already worked out how to update the weights. Let us call the set of weights into the output layer w, and the set of weights into the hidden layer v. Then w is updated using the sigmoid unit training rule which we discussed in the last class.
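For reference, for a single sigmoid unit trained with squared error that rule takes the following form, written here in the notation of the derivation below, with learning rate η and target t_j; this is a summary restatement, not a formula quoted from the earlier lecture.

```latex
\Delta w_{ij} \;=\; \eta \,(t_j - o_j)\, o_j (1 - o_j)\, o_i
```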
But how do we update v? As we said, we do not know the target values of z1, z2, z3, so what we do is back-propagate the error at the output to an error at the hidden units. That is, we propagate the errors which are visible at the output units back to the hidden units, and then, based on that error, we apply a similar training rule. And if there were more layers between this layer and the input, these errors could be further propagated backwards in the same way. So, error back propagation can be continued if the net has more than one hidden layer.
Now, we will have to see how to compute these errors, and for that let us try to do the derivation of this computation.

For the output neurons the error function is E = 1/2 sigma (y - y hat) squared, and we can take the summation over all the training examples if we wish. Now, in the network we have different units, and these units are either at the output layer or at a hidden layer; so let us take any unit j, and let the output of j be o_j. First of all, we have seen that in a unit there are two components: one is the summation component, and the second is the activation function component. First the summation is applied and then the activation function is applied. Let us call the output of the summation net_j and the final output after the activation o_j. So for unit j, o_j = phi(net_j), and net_j is the weighted sum of the inputs to unit j, that is, net_j = sigma over k from 1 to n of w_kj times o_k, where k ranges over all the inputs to unit j, w_kj is the weight from unit k to unit j, and o_k is the output of unit k; so the output of unit k is an input to unit j. For example, if j is a node with three inputs 1, 2, 3, then the incoming weights are w_1j, w_2j, w_3j, the corresponding inputs are o_1, o_2, o_3, and this is how the output of unit j is computed.
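Written out as equations, the quantities just defined are:

```latex
E = \tfrac{1}{2}\sum (y - \hat{y})^2,
\qquad
\text{net}_j = \sum_{k=1}^{n} w_{kj}\, o_k,
\qquad
o_j = \phi(\text{net}_j)
```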
Now, if we want to find the partial derivative of the error with respect to w_ij, we get that it is equal to (del E / del o_j) times (del o_j / del net_j) times (del net_j / del w_ij). That is, to simplify the computation we have used the chain rule to write the partial derivative of the error with respect to w_ij as the product of these three partial derivatives. Here j is our unit, i is one of the units feeding into it, net_j is the result of the summation inside j, o_j is the output of j, and o_i is the output of unit i.
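In symbols, the chain rule decomposition is:

```latex
\frac{\partial E}{\partial w_{ij}}
  = \frac{\partial E}{\partial o_j}\,
    \frac{\partial o_j}{\partial \text{net}_j}\,
    \frac{\partial \text{net}_j}{\partial w_{ij}}
```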
We have decomposed this partial derivative into these three components. What we will do now is keep this in mind, do the three computations separately, and then put them together.

First let us look at del net_j / del w_ij. What is net_j? It is the sum over k of w_kj times o_k. So we need the derivative with respect to w_ij of sigma over k from 1 to n of w_kj o_k, and this is simply equal to o_i, because w_ij appears only in the term o_i w_ij; the other terms o_1 w_1j, o_2 w_2j, o_3 w_3j, and so on have nothing to do with w_ij. So, del net_j / del w_ij is simply o_i; that is the first part.

Next let us take the second term, del o_j / del net_j. What is o_j? o_j = phi(net_j), so we need the derivative of phi(net_j) with respect to net_j. We have already seen that this derivative depends on the form of the activation function; if we assume a sigmoid activation function, then it is equal to phi(net_j) times (1 - phi(net_j)). So, this is the second term, which we have now computed.
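So far, then, for a sigmoid activation the first two factors are:

```latex
\frac{\partial \text{net}_j}{\partial w_{ij}} = o_i,
\qquad
\frac{\partial o_j}{\partial \text{net}_j}
  = \phi(\text{net}_j)\,\bigl(1 - \phi(\text{net}_j)\bigr)
  = o_j\,(1 - o_j)
```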
Now, let us look at the third term, del E / del o_j. If we take the derivative of the error with respect to o_j, we can get a recursive expression for it, as follows. The unit j has certain inputs, which we have already seen, but j also outputs to the next layer; let the output of j feed into a set of units, and let us call this set of nodes Z. Say Z comprises z_1, z_2, etcetera, or, just for simplicity, let me denote them for this step as 1, 2, ..., l. These are the nodes to which the output of j is feeding. During back propagation, the error at unit j comes from the errors at these nodes: the error E depends on o_j only through net_1, net_2, ..., net_l, because o_j contributes to net_1, to net_2, and so on up to net_l.

So, we can collect these contributions through net_1, net_2, ..., net_l in a summation. Using l as the index, we write del E / del o_j as the summation over l of (del E / del net_l) times (del net_l / del o_j); the error is coming from these units, and we take the summation over them, writing each term in chain-rule form. For each such node l we have net_l and then o_l, and the output o_j of unit j contributes to each net_l.

Now we can expand del E / del net_l further. The output of unit l is o_l, so del E / del net_l is del E / del o_l times del o_l / del net_l. And for the remaining factor, del net_l / del o_j: net_l depends on o_j through the term w_jl o_j in its summation, so del net_l / del o_j is equal to w_jl. So, we have this expression, and we can compute the derivative of the error with respect to o_j provided all the derivatives with respect to the outputs of the next layer are already computed.
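Putting these pieces together, the recursive expression for the third term is:

```latex
\frac{\partial E}{\partial o_j}
  = \sum_{l \in Z} \frac{\partial E}{\partial \text{net}_l}\,
                   \frac{\partial \text{net}_l}{\partial o_j}
  = \sum_{l \in Z} \frac{\partial E}{\partial o_l}\,
                   \frac{\partial o_l}{\partial \text{net}_l}\, w_{jl}
```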
Now, we can write this compactly. Suppose we write del E / del w_ij as delta_j times o_i, where o_i is the input arriving along that connection and delta_j is the error term at unit j; this delta_j corresponds to the error that is computed at the later layers and brought back to unit j. By definition, delta_j is del E / del o_j times del o_j / del net_j, and it is treated in different ways depending on whether the unit is an output unit or an intermediate unit. If it is an output unit, we can use the formula which we have already seen, so delta_j will be equal to (o_j - t_j) times o_j times (1 - o_j), where t_j is the target output at that node; this is the case if j is an output neuron. If it is an intermediate neuron, it gets the errors, or delta values, from the next layer, as we have written above, so delta_j will be the summation over the set Z of delta_l times w_jl, all multiplied by o_j times (1 - o_j), if j is an intermediate (hidden) neuron.
So, to recap: for an output neuron we have already seen that delta_j is given by the first formula, but for an intermediate neuron it comes from the second formula, where delta_j is a summation over Z, the set of those units to which this unit feeds its output: the sum of the deltas (the errors computed at those nodes) times w_jl, multiplied by o_j times (1 - o_j). This last factor, o_j times (1 - o_j), comes from the sigmoid activation function; if you change the activation function, it will be a little different.
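Collecting the results of the derivation, and adding the gradient-descent weight update with learning rate η as a summary sketch (the update step itself is taken up in the next class):

```latex
\frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i,
\qquad
\delta_j =
\begin{cases}
(o_j - t_j)\, o_j (1 - o_j) & \text{if } j \text{ is an output unit} \\[4pt]
\Bigl(\sum_{l \in Z} \delta_l\, w_{jl}\Bigr)\, o_j (1 - o_j) & \text{if } j \text{ is a hidden unit}
\end{cases}
\qquad
\Delta w_{ij} = -\eta\, \delta_j\, o_i
```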
So, we stop the class here today, and in the next class we will look at how this is incorporated into the back propagation algorithm.
Thank you very much.