Good morning. Today we will talk about multilayer neural networks and the backpropagation algorithm. To refresh your memory, let us see what we can do with a single-layer neural network. In a single-layer perceptron, which we have already seen, we have a linear summation of the input units followed by a non-linear function. In the case of the perceptron, the non-linear function is a thresholding function, but we could also use other functions such as the sigmoid function, the tanh function or the ReLU function. So, we compute sigma of sigma w i x i if we are using the sigmoid function, or in general we can apply a function phi. Now, if you have a single-layer neural network, that is a single-layer perceptron, the decision boundary it represents is a straight line.
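As an illustrative sketch, a single unit of this kind, a weighted sum followed by a non-linearity, can be written in Python as follows; the weight and input values here are made up for illustration:

```python
import math

def sigmoid(z):
    # The non-linear function phi; tanh or ReLU could be used instead.
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, bias, inputs):
    # Linear summation of the inputs followed by the non-linearity.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(net)

# Illustrative weights, not from the lecture:
print(unit_output([0.5, -0.3], 0.1, [1.0, 2.0]))  # 0.5, since the net input here is exactly 0
```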
Suppose x 1 and x 2 are the two features that you have; the decision boundary is a straight line, and such units work to represent functions where the examples belonging to the two classes can be separated by a straight line. For example, look at the slide. Here we have two classes: class 1 denoted by red plus marks, class 2 denoted by blue circles, and there is a straight line that separates them. On the right, we see two other examples, where we have class 1, which is red, and class 2, which is blue, and in each case there exists a linear decision boundary which separates class 1 from class 2.
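Classifying with a linear decision boundary of this kind amounts to checking which side of a line a point falls on; a minimal sketch, with weights chosen purely for illustration:

```python
def linear_classify(w0, w1, w2, x1, x2):
    # Class 1 if the point lies on the positive side of the line
    # w1*x1 + w2*x2 + w0 = 0, class 2 otherwise.
    return 1 if w1 * x1 + w2 * x2 + w0 > 0 else 0

# Illustrative boundary x1 + x2 - 1 = 0 (weights made up, not from the lecture):
print(linear_classify(-1.0, 1.0, 1.0, 2.0, 2.0))  # 1: the point (2, 2) is above the line
print(linear_classify(-1.0, 1.0, 1.0, 0.0, 0.0))  # 0: the point (0, 0) is below the line
```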
Now, let us look at the fourth example, at the bottom left. Here we have two red plus points and two blue circle points. For these examples we cannot have a linear boundary which separates the blue points from the red points, so such a machine learning problem cannot be represented by a single-layer perceptron.

Now, this is an example of a function which can be represented by a single-layer perceptron: the Boolean OR function, which most of you are familiar with. The Boolean OR function outputs 1 if any of the inputs is 1, and outputs 0 only if none of the inputs is 1. This function can be represented by a single-layer perceptron, and learning the function means learning the weights on the corresponding edges. So, there are three weights associated with this perceptron: w 1 from x 1 to the unit, w 2 from x 2 to the unit, and w 0, which is the bias. This particular diagram shows some specific values of w 0, w 1 and w 2 for which we can implement this function; the function can be implemented by several sets of values, and this is one example. So, this is the representation of the OR function, and given these particular values of w 0, w 1 and w 2 we get a linear decision boundary which separates the space into two regions, so that the plus points are in one region and the minus points are in the other region.

Similarly, this is another example of a function which can be represented by a single-layer perceptron: the Boolean AND function, where the output is 1 only if both the inputs are 1, and otherwise the output is 0. It is represented by this diagram, and the decision line corresponds to some specific weights. For example, if I set w 1 equal to 1, w 2 equal to 1 and w 0 equal to minus 1.5, it will denote a decision surface which works for this AND function. There could be other combinations of weights which also work for the AND function, but this is one combination which implements it.

But when we have the XOR function, which is 1 if exactly one of the inputs is 1, and 0 if both of them are 0 or both of them are 1, we can see that there is no linear decision boundary that separates the 0 points from the 1 points. So, in order to represent this function we can go for multilayer perceptrons. This is an example of an implementation of the
Boolean XOR function. So, in the first layer we have two perceptrons, h 1 and h 2, and in the second layer we have one perceptron; together these three units in two layers can represent the Boolean XOR function for certain combinations of weights. For example, we can have the left unit h 1 at the first layer represent the OR function by setting the weights as 1, 1 and minus 0.5. We can have the second node at the first layer represent the AND function by setting the weights as 1, 1 and minus 1.5, and we can have the node at the second layer implement the final XOR function by setting the weights as 1, minus 1 and minus 0.5. This is one example implementation of the XOR function using a two-layer perceptron: XOR cannot be implemented by a one-layer perceptron, but it can be represented by a two-layer perceptron.

Now, in general, if you look at multilayer neural networks, we can say the following about the representation capability of neural networks. Single-layer neural networks have limited representation power: as we have already discussed, they can represent linear decision surfaces, and therefore only if the examples of the two classes are linearly separable can they be represented by a single-layer perceptron. For non-linear functions you have to go for multiple layers, and note that if we had only linear units, a combination of linear units would be just another linear unit. So, in order for multilayer neural networks to represent non-linear functions, it is important that the functions implemented at the individual units are non-linear; that is why we go for non-linear units, whether threshold units, sigmoid units, tanh units and so on.

Now, when we go for a multilayer network, suppose we take a two-layer network: x 1, x 2 up to x n are the inputs, this is the output, and we can have one or more hidden layers. Suppose this is the first hidden layer comprising three nodes; we can have connections from the input to the first layer, and connections from the first layer to the second layer.

Such a network, with one hidden layer, is normally called a two-layer neural network. Such neural networks can represent all Boolean functions: every Boolean function can be represented by a neural network with a single hidden layer. It is easy to see that this is possible, because you may know that any Boolean function can be represented using two layers of NAND gates, and I leave it to you to figure out that we can have a single-layer perceptron represent the NAND function.
We have seen earlier that a single-layer perceptron can represent the AND function; I leave it to you as an exercise to see that it can also represent the NAND function, which is the negation of the AND function. By cascading two layers of NAND you can represent any Boolean function, and any Boolean function can therefore be represented by a neural network with a single hidden layer. So, this is quite obvious.

Secondly, we can say that every bounded continuous function can be approximated with arbitrarily small error by a neural network with one hidden layer. So, not just Boolean functions: if you take a continuous function, and that continuous function is bounded, that is, it does not go to infinity but stays within a bound, then it can be approximated with arbitrarily small error using a single-hidden-layer neural network.

But if you have a neural network with two hidden layers, like this one, where this is the first hidden layer h 1, this is the second hidden layer h 2, there are connections from the nodes in h 1 to the nodes in h 2, and then there is the output, it can be shown that any function at all can be approximated to arbitrary accuracy by such a network. A network with two hidden layers can represent any arbitrary function, which is a very powerful statement. However, where is the catch? Just because a given network can represent a function does not mean that the function will be learnable. We say that there exists a neural network which can represent the function, but that neural network comprises some number of nodes in the different layers with some set of weights, and we do not know what they are.

So, we know that a function can be represented by a two-hidden-layer neural network, but we do not know how many nodes we should put in, nor what the weights should be. Figuring out how many nodes to put and what the weights should be may turn out to be hard for different problems. And when I said that any Boolean function can be represented by a network with one hidden layer, I did not mention anything about the number of nodes required; there can be some Boolean functions for which the number of nodes required is very large. So, just because a function is representable does not mean that it is learnable.
Now, we will see how we can learn a multilayer neural network using the backpropagation algorithm. If you look at the slide, this is the schematic of a multilayer neural network: we have the inputs and the hidden layers. It shows a two-hidden-layer neural network: after the input, the yellow nodes are the first hidden layer, the blue nodes are the second hidden layer, and the green nodes are the output layer. Now, as we discussed earlier, when you give the input you observe the output, and if you have a training set, the training set tells you, for a given input, what the ideal output should be. From the training set you can find the error of a neural network unit for a particular input, and we can update the weights so that this error is reduced. However, the error is only observed at the output layer.

So, if you have a neural network where this is the input layer, this is the output layer, these are some hidden layers, and there is connectivity between consecutive layers, then for the nodes at the output layer we can find the error for a given input. If I take the first input x 1, we know the target output y 1 and we can find the output that we actually get from the neural network. So, at every output node, for that given input, we can find the actual error, and we can try to change the weights so that the error becomes small. But the error can only be observed at the output; we do not know, for a training example, what the value of a node in a hidden layer should be.
So, what we are going to do is take the error that we find at the output layer and back-propagate it, estimating the error at the hidden layers inside: we take the error observed at the output layer and propagate it back to the previous layer. We say that the error at a node is caused by the errors at the nodes feeding into it, which in turn depend on the errors at the nodes before them. So, we take this error and propagate it back to the nodes from which the unit takes input. If you look at the weights of the incoming edges, a node whose edge has a higher magnitude of weight has a higher contribution to this error, and a node whose edge has a smaller magnitude of weight has a smaller contribution.
So, when we apportion the error backwards, we apportion it in proportion to the weight: if an edge has a larger weight, we apportion a larger share of the error to that node. Backpropagation works in this way: when we apply the neural network to a particular input, the input signal propagates forward so that we can compute the output; but when we find the error, we find it at the output layer, the error is back-propagated to the previous layers, and based on the notional error obtained after backpropagation we do the weight updating at those layers.
So, at the output layer we update the weights based on the directly observed errors; after we have back-propagated the error, we find the notional error at the previous level and based on that we change those weights; then we back-propagate the error further and change the weights before that. That is why we call this method backpropagation. Backpropagation is a method to train multilayer neural networks in which the weights are updated so as to reduce the observed error: the error is directly observed only at the output layer, that error is back-propagated to the previous layers, and with the notional error that has been back-propagated we do the weight updates in the previous layers.

Now, in the last class we looked at the derivation of the update rule of a neural unit based on the error. We saw how we could use gradient descent to find the error gradient at a unit, and we could change the weights by moving along the negative of the error gradient. Just to recapitulate: if you have one output neuron, the error function is given by E equal to half (y minus o) squared for a particular input, where y is the expected or gold-standard output and o is the actual output. So, y minus o gives you the error, and half (y minus o) squared is the particular measure of error that we are using.

Now, for each unit j, the output o j is defined as o j equal to the function phi applied on net j, where net j is the weighted sum of the outputs of the units at the previous layer: net j is sigma over k of w k j o k, and phi is the non-linear function that we are using; as we have mentioned, phi could be a sigmoid function, tanh, ReLU or some such function. Here w k j corresponds to the edges coming from the previous units to this unit, so the input net j to a neuron is the weighted sum of the outputs o k of the previous n neurons which connect to this neuron. Following the method of the last class, given the formula E equal to half (y minus o) squared, and o defined as above, we can now find the derivative of this error function with respect to the different weights.
So, we have n plus 1 different weights corresponding to the previous inputs and the bias, and we can take the partial derivative of the error function with respect to each weight. The quantity del E by del w i j, the partial derivative of the error with respect to one specific weight w i j, can be rewritten by the chain rule as del E by del o j, times del o j by del net j, times del net j by del w i j, since net j equals sigma over k of w k j o k and o j equals phi of net j.

This is what we have rewritten here: this is the error function, this is o j, and this is del E by del w i j as in the previous slide. Now, del E by del o j can be written as the summation over l of del E by del o l, times del o l by del net l, times w j l, where l indexes the nodes of the next layer to which o j contributes; del o j by del net j is phi of net j times (1 minus phi of net j), which corresponds to the sigmoid function; and del net j by del w i j is simply o i. This follows from the derivation that we worked out in the last class.

So, simplifying, we find that del E by del w i j is delta j times o i, where the quantity delta j, whose detailed derivation we did in the last class, captures how the errors in the next layer are back-propagated to the previous layer. Here delta j is given by del E by del o j times del o j by del net j, which equals (o j minus y j) times o j times (1 minus o j) if j is an output neuron, that is, if we are at the last layer; otherwise it is recursively computed as (sigma over l of delta l w j l) times o j times (1 minus o j) if j is a neuron in a hidden layer.
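These delta computations can be sketched in code for a network with one hidden layer of sigmoid units. The network sizes, learning rate and XOR training data below are illustrative assumptions, not fixed by the lecture, and the code uses the convention delta = o (1 minus o) (y minus o), so the update adds eta times delta times input:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, y, W1, W2, eta=0.5):
    """One stochastic-gradient backpropagation step for a 1-hidden-layer net.
    W1[j] holds the weights of hidden unit j (bias last); W2 holds the output
    unit's weights (bias last). Returns the squared error before the update."""
    # Forward pass: o_j = phi(net_j), where net_j = sum_k w_kj * o_k (+ bias).
    h = [sigmoid(sum(w * xi for w, xi in zip(wj, x + [1.0]))) for wj in W1]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h + [1.0])))
    # Output delta: delta = o (1 - o) (y - o).
    delta_o = o * (1 - o) * (y - o)
    # Hidden deltas, apportioned in proportion to the connecting weights:
    # delta_j = o_j (1 - o_j) * w_j * delta_o.
    delta_h = [h[j] * (1 - h[j]) * W2[j] * delta_o for j in range(len(h))]
    # Weight updates: w <- w + eta * delta * input.
    ho = h + [1.0]
    for j in range(len(W2)):
        W2[j] += eta * delta_o * ho[j]
    xi = x + [1.0]
    for j in range(len(W1)):
        for i in range(len(W1[j])):
            W1[j][i] += eta * delta_h[j] * xi[i]
    return 0.5 * (y - o) ** 2

# Illustrative run: try to learn XOR with 2 hidden units (sizes assumed).
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W2 = [random.uniform(-1, 1) for _ in range(3)]
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
for epoch in range(5000):
    err = sum(train_step(x, y, W1, W2) for x, y in data)
print("total squared error after training:", err)
```

Note how the hidden delta is exactly the output delta apportioned through the connecting weight, as described above.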
So, delta j can be computed directly at the output layer, and once you have computed delta for each of the output units you can compute the deltas of the previous layer; after you have computed all the deltas of that layer, you can compute the deltas of the layer before it. That is the backpropagation: a recursive computation of delta starting from the last or output layer and going one level backward at a time, up to the beginning. Now, once you have figured this out, you know del E by del w i j, and then you change the weights using gradient descent: delta w i j equal to minus eta times del E by del w i j. Here eta is the learning rate; a small eta means slow convergence, a large eta means a faster rate, and the minus sign is because we are doing gradient descent.

Based on this, we can write the backpropagation algorithm, which is actually very simple, as given in this slide. We take a neural network; we have a structure with one hidden layer or two hidden layers, whatever you want; you decide the number of layers in the neural network and the number of units in each layer, and you make connections from the input to the first hidden layer, from the first to the second, and from the second to the output. Then you have a number of weights, and initially you initialize all the weights to small random numbers. After that, you carry out an iterative process, given here as "until satisfied": you have a set of training examples, you input the first training example to the network and compute the network output.
So, you give x 1 and find o 1, you give x 2 and find o 2, and so on. You may have only one output or multiple outputs; for each output unit k, you compute delta k at the output layer as o k times (1 minus o k) times (y k minus o k). Then you go to the previous hidden layer: for each hidden unit h, you compute delta h as o h times (1 minus o h) times sigma over k of w h k delta k, summing over all output units k that h connects to. After that, you update each network weight w i j as w i j plus delta w i j, where delta w i j is eta times delta j times x i j, as we have already seen.

So, this is the backpropagation algorithm: we have an input, we have the output that we get from the network, and we have the target output; we find the error with respect to the target output, based on that we update the weights, and we back-propagate the error to the previous layer by propagating the delta values; we continue this for all the hidden layers. This is the backpropagation algorithm, and it is very simple to implement.

In backpropagation, we do gradient descent over the network weight vector, and even though the example we have shown is for a layered graph, we can do backpropagation over any directed acyclic graph. The second thing to observe is that by doing backpropagation we are not guaranteed to find the global best: we only get a local minimum. We have a very complex error surface over the weights at all the layers; by doing backpropagation we update the weights to do better and better, and we continue until the network converges, but when it converges it will converge to a local minimum, which need not be a global minimum. There are certain tricks that one can use to avoid getting trapped in a local minimum.
For example, one such trick is to include a momentum factor alpha. The idea is that if you are heading into a local minimum, you try to avoid it by maintaining the previous general direction of movement: you do not want to deviate and get stuck. So, with momentum, when you compute delta w i j you take not only eta times delta j times x i j, which we derived earlier, but also another term which tracks the direction of weight change at the previous iteration: delta w i j at iteration n, the weight change at the nth iteration, is equal to eta times delta j times x i j, plus alpha times the weight change at the previous iteration.
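A sketch of this momentum update; the eta and alpha values are typical choices, not fixed by the lecture:

```python
def momentum_update(w, grad_term, prev_dw, eta=0.1, alpha=0.9):
    """Weight change with a momentum term:
    dw(n) = eta * delta_j * x_ij + alpha * dw(n-1).
    grad_term stands for delta_j * x_ij; eta and alpha are assumed values."""
    dw = eta * grad_term + alpha * prev_dw
    return w + dw, dw

# Usage: carry prev_dw across iterations for each weight.
w, dw = 0.5, 0.0
for g in [0.2, 0.18, 0.15]:   # illustrative delta_j * x_ij values
    w, dw = momentum_update(w, g, dw)
    # momentum keeps the update moving in the same general direction
print(w)
```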
If you apply momentum, training may be slow, but you are less likely to get stuck in a local minimum, or at least in a bad local minimum. One thing to note is that in a neural network with multiple layers, even if training is slow, applying the trained network is very fast.

Now, there are a few other observations I will make. When we do the weight update, we can do a batch update: given a particular configuration and a set of training examples, we compute the partial derivative of the error with respect to all the training examples together and make the best update accordingly; or we can do it for one input at a time. The first method is called batch gradient descent. The second method is called stochastic gradient descent: there you take one input at a time and change the weights based on it. Stochastic gradient descent is less likely to get stuck in a local minimum, because if there is a local minimum for one example, it is unlikely to be a minimum for all the examples. So, with stochastic gradient descent the neural network is more likely to move towards the global minimum. There is also something in between, called mini-batch gradient descent: instead of taking all the examples at a time or a single example at a time, you take a batch of examples at a time and do gradient descent with respect to that.

So, batch gradient descent calculates the outputs for the entire data set, accumulates the errors, then back-propagates and makes a single update; it is slow to converge and it may get stuck in local minima. Stochastic or online gradient descent, on the other hand, takes one training example at a time and finds the error gradient with respect to it; it converges to a solution faster and often helps get the system out of local minima. In between, we have mini-batch gradient descent.

Now, the entire training process can be divided into epochs. If you have a number of training examples, one epoch will look at all the training examples once; then we will have the second epoch, then the third epoch, etcetera. Each pass over the entire training set is called an epoch, and we keep training the neural network on the training set over and over again. When do we stop? We can stop when the training error saturates, or we can use cross-validation: while we are training the neural network, we keep validating it on a held-out set, and when we see that the training and validation errors are close we can stop; or we can stop when we reach a maximum number of epochs.

Now, in neural networks, as in other machine learning algorithms, overfitting can occur, and this overfitting is illustrated by this diagram. On the x-axis we have the number of iterations, and on the y-axis we have the root mean square error. As is typical of many machine learning algorithms, as we increase the number of iterations, the error on the training set keeps reducing, and it may or may not even become 0; but the error on a held-out set will typically decrease initially and then increase, which is where overfitting has occurred.
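The mini-batch and validation-based stopping ideas can be sketched together as follows; the callbacks step_fn and val_error_fn, the batch size and the patience threshold are all hypothetical choices, not from the lecture:

```python
import random

def batches(data, batch_size):
    # One epoch: shuffle, then yield successive mini-batches.
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def train(data, step_fn, val_error_fn, batch_size=32, max_epochs=100, patience=5):
    """Mini-batch gradient descent with early stopping.
    step_fn(batch) performs one weight update from a mini-batch;
    val_error_fn() returns the current error on a held-out set.
    Stops once validation error has not improved for `patience` epochs."""
    best, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        for batch in batches(list(data), batch_size):
            step_fn(batch)
        err = val_error_fn()
        if err < best:
            best, bad_epochs = err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation error stopped improving
                break
    return best
```

The returned value is the best validation error seen, corresponding to the point on the curve just before the held-out error starts to rise.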
So, this is the zone where overfitting has occurred. Ideally you should stop before the validation error starts to increase; if you keep track of the validation error, you will know the place where you must stop and not continue any more iterations, because beyond that the network is likely to overfit and its accuracy will go down.

This picture illustrates local minima. As we said, neural networks can get stuck in local minima for small networks, and we also said that with stochastic gradient descent they are less likely to get stuck. In practice, when you have a large network with many weights, local minima are not so common: because there are many weights, each updated separately, it is unlikely that the same point will be a local minimum along all the weight dimensions.

In conclusion, we can say that artificial neural networks let us use highly expressive non-linear functions that can represent almost all functions. A network comprises a parallel arrangement of logistic function units, though other types of units are also possible, and the principle works by minimizing the sum of squared training errors. There are also neural networks with other loss functions, but we will not talk about them in this class; here we have looked at neural networks that minimize the squared error. We can also add a regularization term to a neural network, which I did not talk about: what you can do is prevent the weights from becoming large by penalizing networks whose weights have large values, by adding a regularization term. Neural networks can get stuck in local minima, and they may exhibit overfitting.

With this, we come to a conclusion. In the next class, we will give a very brief introduction to deep learning.
Thank you very much.