Hello. So, in the last class we started our discussion on the radial basis function neural network. We have seen that a radial basis
function neural network consists of three layers: one is, of course, the input layer, one of the three layers is the output layer, and
in between the input layer and the output layer we have a hidden layer. So, unlike in the case of the multilayer perceptron, where we can
have one or more hidden layers, in the case of the radial basis function network we have only one hidden layer, and every neuron in the
hidden layer computes a radial basis function. So, I have the neurons in the hidden layer, and every neuron computes a radial basis
functional value for an input feature vector. Every radial basis function has got two parameters: one is called the receptor and the other one defines
the spread of the radial basis function. So, the architecture that we have is something
like this. We have one input layer; the input layer contains a number of neurons, and the number of neurons in the input layer is the same as
the dimensionality of the feature vector. So, if the feature vectors are of dimension d, I will have d number of nodes in the input layer.
When the dimensionality of the feature vector is d, in the hidden layer I will have a number of nodes, and suppose the number of nodes in
the hidden layer is, say, capital M. As we discussed in our previous class, the purpose of the hidden layer nodes is to project the
d-dimensional feature vector into a higher-dimensional feature vector. So, as I have M number of nodes in the middle layer, obviously this M,
the number of nodes in the hidden layer, is greater than the dimensionality of the feature vector, which is d. And as we said, every node in the hidden layer computes a radial
basis function. At the output layer, which contains the classifying neurons, I have a number of nodes which is the same as the number of
classes that we have. So, if I have c number of classes, then at the output layer I will have c number of neurons, where c is the number of
classes into which the pattern has to be classified. Then every node in the input layer feeds its input to every node in the hidden layer,
and the output of every node in the hidden layer is connected to every node in the output layer. So, I have the connections, which are
something like this. These are the connections
from the input layer nodes to the hidden layer nodes. And because the purpose of these connections is simply to forward the input feature
vector to the nodes in the hidden layer, we can assume that the weight of each of these connections is equal to 1, and that is a difference
with the connections from the hidden layer to the output layer nodes, because every output layer node computes a linear combination of the
outputs of the hidden layer nodes. So, the connections from the hidden layer nodes to the output layer nodes are something like this, where
we can say that the i-th node in the hidden layer is connected to the j-th node in the output layer through a connection weight, which is
equal to W i j. Because of this, every node in the output layer computes a linear combination of the outputs of the hidden layer.
vector should be classified. Now, what can be done is, these
output and nodes can also impose a non-linear function to ensure that if a particular input
feature vector belongs to class omega j. In that case only the output of the j th node
will be equal to 1 and output of all other output layer node will be equal to 0. Similarly,
if a feature vector, input feature vector belongs to say class 1 then only the output
of the first node in the output layer will equal to 1 and outputs of all other nodes
will be equal to 0. So, as we discussed in the previous class that such a radial basis
function network, an RBF network, incorporates two types of learning. One is, we have to learn, for every node in the hidden layer, because
every node in the hidden layer represents a radial basis function, what should be the receptor of that radial basis function and what should
be the spread of that radial basis function. So, if the radial basis function is a Gaussian function, that is, if it is something like
phi of x equal to e to the power minus norm of x minus t squared upon 2 sigma squared, then t is the receptor and sigma, which is the
standard deviation, decides what the spread of the radial basis function is. So, for every i-th radial basis function phi i of x, t i is the
receptor and sigma i is the spread, so I have to know what the receptor is for every radial basis function and what the spread is of every
radial basis function. So, this is one level of learning
and the second level of learning comes in once, through these radial basis functions, the d-dimensional feature vector has been projected
onto an M-dimensional feature vector.
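As a minimal sketch (assuming Gaussian basis functions with already known receptors t i and spreads sigma i; the function and variable names are mine, not from the lecture), this hidden-layer projection could be written as:

```python
import numpy as np

def rbf_layer(x, receptors, sigmas):
    """Project a d-dimensional vector x into M dimensions.

    receptors: array of shape (M, d), one receptor t_i per hidden node
    sigmas:    array of shape (M,), one spread sigma_i per hidden node
    Returns phi(x), an M-dimensional vector with
    phi_i(x) = exp(-||x - t_i||^2 / (2 * sigma_i^2)).
    """
    sq_dist = np.sum((receptors - x) ** 2, axis=1)   # ||x - t_i||^2 for each i
    return np.exp(-sq_dist / (2.0 * sigmas ** 2))
```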
So basically, what we are doing is, we are increasing the dimensionality of the feature
vector. As we indicated in our last class, the purpose of increasing the dimensionality is that if the feature vectors are linearly
non-separable in the d-dimensional space, then we cast them into a higher-dimensional space. Then the possibility that they will be linearly
separable in a higher dimensional space increases and this possibility increases with the value
of M. So, as we increase the dimensionality more and more, the possibility of linear separability
of the feature vectors also increases. So, for the feature vectors in the d-dimensional space which are not linearly separable, when I cast
them into an M-dimensional space, where M is greater than d, it is more likely that those feature vectors will be linearly separable in the
M-dimensional space. And once the feature vectors are linearly separable in the M-dimensional
space, then a linear combination of the outputs of the hidden layer nodes is likely to give me the class belongingness. And that linear
combination is decided by the connection weights from the hidden layer nodes to the output layer nodes. So, we also have to learn what
should be the connection weight W i j from, say, the i-th node in the hidden layer to the j-th node in the output layer. This is the second
level of learning. So, in the first level of learning, for every radial basis function we try to learn what the receptor is and what the
spread of the radial basis function is, and in the second level we try to learn what the connection weights from the hidden layer nodes to
the output layer nodes are. As we have discussed in the previous
class, the usual way, a common method of learning the radial basis functions, is: you are given a set of feature vectors for the training
purpose. Suppose the value of M is equal to 3; what we do is, we partition, or cluster, the set of feature vectors into M number of clusters.
So, if we have M number of nodes in the hidden layer and I have, say, N number of feature vectors which are given for the training purpose,
obviously in this case N has to be greater than M. Otherwise, clustering N number of feature vectors into M number of clusters does not make
any sense. So, I have to have more feature vectors than the number of clusters that I have to form. So, I cluster these N number of feature
vectors into M number of clusters, and I can assume that the centroid or mean of every cluster represents the corresponding receptor. So, if
I take the i-th cluster, the i-th cluster represents the receptor of the i-th radial basis function, so the situation
is something like this. If I have a set of feature vectors, say these are the feature
vectors belonging to different classes. Typically, what I do is I cluster these feature vectors into three different clusters. Every cluster
center now represents a receptor, so this is one receptor, this is one receptor, this is one receptor. So, this is the receptor t 1, this is
the receptor t 2 and this is the receptor t 3. So, the first operation we have to perform is the clustering of the feature vectors, and these
clustering operations we will discuss in detail in future lectures.
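Just to make the idea concrete, here is a minimal sketch of this first step, assuming scikit-learn's KMeans is used for the clustering (the particular clustering algorithm is left for later lectures, so this is only one possible choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_receptors(X, M):
    """X: (N, d) array of training feature vectors, M: number of hidden nodes.

    Cluster the N training vectors into M clusters; each cluster centroid
    serves as the receptor t_i of one radial basis function.
    """
    km = KMeans(n_clusters=M, n_init=10).fit(X)
    return km.cluster_centers_          # shape (M, d)
```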
Now, once I have these different receptors, to find out what should be the spread of a particular radial basis function, what we do is, for
the i-th receptor, I find out the P nearest neighbors, or P nearest receptors, and for these P nearest receptors I compute the mean distance
or the root mean square distance. So, there are different possibilities; I can choose any value of P and work with those P nearest receptors.
So, the way I compute sigma i for the i-th cluster, for the i-th radial basis function, is: I have t i, which is the receptor for the i-th
radial basis function, and then I take the P receptors which are nearest to t i. Suppose one such receptor is t k; what I do is I compute the
squared distance between t i and t k, take the summation of this for k equal to 1 to P, as I have P such receptors, multiply by 1 upon P, and
take the square root of this. So, this defines the spread of the i-th radial basis function: sigma i is the square root of 1 upon P times the
sum over k from 1 to P of the squared distance between t i and t k. So, for every i-th radial basis function I have t i and I have sigma i,
and once these two are known, then my radial basis function phi i of x is simply e to the power minus norm of x minus t i squared upon
2 sigma i squared.
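A small sketch of this spread computation, assuming the receptors are already available as a numpy array (the names are mine):

```python
import numpy as np

def learn_spreads(receptors, P):
    """sigma_i = sqrt( (1/P) * sum over P nearest receptors t_k of ||t_i - t_k||^2 )."""
    # pairwise distances between all receptors, shape (M, M)
    dists = np.linalg.norm(receptors[:, None, :] - receptors[None, :, :], axis=2)
    sigmas = []
    for i in range(len(receptors)):
        nearest = np.sort(dists[i])[1:P + 1]        # skip the zero distance to itself
        sigmas.append(np.sqrt(np.mean(nearest ** 2)))
    return np.array(sigmas)
```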
Now, let us see whether, by using this concept, I can make a linear classifier for the XOR problem; XOR is a very common problem which is
used for illustrating such operations. So, as we have said earlier, if I take the XOR function, I have a 2-dimensional binary feature vector
having components x 1 and x 2. Suppose this point represents x 1 equal to 0 and this is x 1 equal to 1; here I have x 2 equal to 0 and here I
have x 2 equal to 1. The value of the XOR function at 0 0 is 0, at 0 1 the value is 1, at 1 0 the value is 1, and at 1 1 the value is again 0.
So, you find that here I have 2-dimensional binary feature vectors. So, what I do is: this 2-dimensional
feature vector, I want to cast into a four dimensional space, by using four radial basis
functions. So, I have the radial basis functions phi 1, phi 2, phi 3 and phi 4. For phi 1, I choose t 1 equal to 0 0; that is the receptor of
the radial basis function phi 1. Similarly, for phi 2, I can choose t 2, which is 0 1; that is the receptor of the radial basis function
phi 2. Similarly, the receptors of the other radial basis functions I can choose as t 3 equal to 1 0, and for phi 4 I choose t 4 equal to
1 1. So, these are the four receptors for the four radial basis functions. Next, I have to choose the spread sigma 1 for the first radial
basis function, sigma 2 for the second radial basis function, sigma 3 for the third radial basis function and sigma 4 for the fourth radial
basis function. Now, for every receptor I have to
find out P number of nearest receptors and suppose I choose that the value of P is equal
to 2. Now, here you find that for every receptor there are three neighbors; two of the neighbors are at a distance of 1 and one of the
neighbors is at a distance of square root of 2, which is about 1.414. That is easily verifiable: from the receptor over here, which is t 1,
t 2 is at a distance of 1, t 3 is at a distance of 1, but t 4 is at a distance of square root of 2, approximately 1.414. So, when I take P
equal to
2, I have to take the 2 nearest neighbors; both of them are at distance 1, and the root mean square of these 2 distances will also be equal
to 1. So, I have spread sigma 1 equal to 1, spread sigma 2 also equal to 1, spread sigma 3 also equal to 1 and spread sigma 4 also equal to
1. So, I get phi 1 of x, which is of the form e to the power minus norm of x minus t 1 squared upon 2 sigma 1 squared, and sigma 1 being
equal to 1, the denominator is simply 2. Similarly, phi 2 of x will be e to the power minus norm of x minus t 2 squared upon 2, phi 3 of x
will be e to the power minus norm of x minus t 3 squared upon 2, and similarly phi 4 of x with receptor t 4. So, if I compute these values
for each of the feature vectors, taking 0 0 as one feature vector, 0 1 as another feature vector, 1 0 as another feature vector and 1 1 as
another feature vector, the functional values will be something like this.
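The table that follows can be reproduced with a short sketch, reusing the Gaussian form above (a rough check; the printed values are rounded to one decimal place as in the lecture):

```python
import numpy as np

receptors = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # t1..t4
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
sigma = 1.0

for x in inputs:
    # phi_i(x) = exp(-||x - t_i||^2 / (2 * sigma^2)) for i = 1..4
    phi = np.exp(-np.sum((receptors - x) ** 2, axis=1) / (2 * sigma ** 2))
    print(x, np.round(phi, 1))   # e.g. [0. 0.] -> [1.  0.6 0.6 0.4]
```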
So, I put that in the form of a table: here I have the input feature vectors 0 0, 0 1, 1 0 and 1 1, and I have the RBF functions phi 1,
phi 2, phi 3 and phi 4. So, when you input the feature vector 0 0 to phi 1, you find that your x is equal to t 1, so the exponent is equal to
0, which means that phi 1 of x will be equal to 1. So, this phi 1 of x over here will be 1.0. Similarly, for phi 2, my x is 0 0 and t 2 is
0 1, so if I compute phi 2 of x, you will find that it will be about 0.6. Similarly, I just put the values over here: phi 3 of x will also be
0.6 and phi 4 of x will be 0.4. When the input vector is 0 1, phi 1 of x will be 0.6, phi 2 of x will be 1.0, phi 3 of x will be 0.4 and
phi 4 of x will be 0.6. For 1 0 these are 0.6, 0.4, 1.0 and 0.6 again, and for the input feature vector 1 1, I have phi 1 of x equal to 0.4,
phi 2 of x 0.6, phi 3 of x 0.6 and phi 4 of x 1.0. So, you find that given a 2-dimensional feature vector
0 0, this has been cast into a 4-dimensional feature vector, where the components of this 4-dimensional feature vector are 1.0, 0.6, 0.6 and
0.4. Similarly, 0 1 is a 2-dimensional input feature vector which has been cast into a 4-dimensional feature vector, the components being
0.6, 1.0, 0.4 and 0.6. So, every 2-dimensional input feature vector is converted to a four-dimensional feature vector by using the four
radial basis functions. Now, if I take a linear combination of these, and for the linear combination
I give, say, a weight of minus 1 for phi 1, a weight of plus 1 for phi 2, a weight of plus 1 for phi 3 and a weight of minus 1 for phi 4,
then the function
that I will finally compute at the node in the output layer will be phi 2 plus phi 3 minus phi 1 minus phi 4, and if I compute this, let us
see what are the values
that I get. So, here I will write, sum of W i times phi i where i varies from 1 to 4.
So, here it will be 0.6 plus 0.6, which is 1.2, minus 1.4; this will be minus 0.2. Similarly, here it will be 1.4 minus 1.2, so it will be
plus 0.2. Here, again, it will be 1.4 minus 1.2, so this is plus 0.2, and here it will be 1.2 minus 1.4, so this is minus 0.2. And if I take
the decision that if the value is more than 0 the output will be 1, and if it is less than 0 the output will be 0, then the final output that
we have, which I write here as output, will be: this is 0, this is 1, this is 1 and this is 0, which is nothing but the XOR function output.
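Continuing the sketch from above, the output-layer computation for this example might look like this (weights taken from the lecture, threshold at zero):

```python
import numpy as np

weights = np.array([-1.0, 1.0, 1.0, -1.0])    # w1..w4 from the lecture
receptors = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def xor_output(x, receptors, sigma=1.0):
    phi = np.exp(-np.sum((receptors - x) ** 2, axis=1) / (2 * sigma ** 2))
    s = np.dot(weights, phi)                  # linear combination of the RBF outputs
    return 1 if s > 0 else 0                  # threshold at zero

for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    print(x, xor_output(np.array(x, dtype=float), receptors))   # 0, 1, 1, 0
```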
So, over here the architecture of the radial basis function network that we have used is: we had two input layer nodes, where x 1 is fed to
one node and x 2 is fed to the other node; I had 4 nodes in the hidden layer, which compute the radial basis functions; and I had 1 node in
the output layer, which I can say is a non-linear operator or a threshold operator. The connections are like this, where each of the
connections from the input layer has a connection weight equal to 1. And over here, as you can see, phi 1 to the output layer node has a
connection weight of minus 1, phi 2 to the output layer node has a connection weight of plus 1, phi 3 to the output layer node again has a
connection weight of plus 1, and phi 4 to the output layer node has a connection weight of minus 1. So, here the connections are minus 1,
plus 1, plus 1, minus 1, and this output actually gives me the XOR function. So, this example clearly shows that by casting the
two-dimensional feature vectors into four-dimensional feature vectors, I can implement the XOR function using a linear network, or a single
layer perceptron, because this part is nothing but a single layer perceptron. Now, let us theoretically try to find an expression for the
training of the output layer, that is, how do I find these connection weights. So, in general I have a network something like this. I have
a set of input layer nodes, I have a set of hidden layer nodes and I have a set of output layer nodes. The feature vector is fed to the input
layer nodes, so here I feed x, whose components are fed to the hidden layer nodes through connection weights which are all one, and the
outputs of these hidden layer nodes are my radial basis functions. The outputs of the hidden layer nodes are connected to the output layer
nodes, like this, and I take the output from every output layer node. So, if my input feature vector x belongs to, say, the i-th class, then
the output of the i-th output layer node will have a high value, likely to be 1, and the outputs of all other output layer nodes will have a
low value, likely to be 0. I assume that the i-th node in the hidden layer is connected to the j-th
node of the output layer through a connection weight, say W i j. So, given this, if I say the output of the j-th output layer node is o j, I
will have o j equal to the sum of W i j times phi i of X for an input vector X, and this summation I have to compute over all nodes in the
hidden layer; so, this summation has to be computed over i equal to 1 to M, as I have M number of nodes in the hidden layer. And naturally,
over here, I will write: if X belongs to class omega j, then the sum of W i j times phi i of X, for i equal to 1 to M, must be greater than
0, and I will put the target as plus 1; and if X does not belong to omega j, then the sum of W i j times phi i of X, for i varying from 1 to
capital M, must be equal to 0 (I could also put it as minus 1). So, let us assume that if X belongs to class omega j, the sum of W i j times
phi i of X has to be equal to plus 1, and if X does not belong to omega j, this has to be 0; and that is what has to be the output from the
j-th node in the output layer.
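Putting the two learned pieces together, a rough sketch of the forward pass through such a network could be the following (assuming Gaussian basis functions as above and a weight matrix W of shape M by c; the names are mine):

```python
import numpy as np

def classify(x, receptors, sigmas, W):
    """Forward pass of the RBF network.

    W: (M, c) matrix of connection weights; column j holds the weights
       into the j-th output node.
    Returns the index of the class with the largest output o_j.
    """
    phi = np.exp(-np.sum((receptors - x) ** 2, axis=1) / (2.0 * sigmas ** 2))
    outputs = phi @ W          # o_j = sum_i W_ij * phi_i(x), for every j
    return int(np.argmax(outputs))
```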
Now, taking this, I can go for the training of the output layer; that means I have to find out what should be the values of these W i j.
Now, if for the moment I consider only the connection weights which are connected to the j-th node in the output layer, then consider every
vector X k; suppose I have capital
N number of vectors. So, I have vectors X k for k varying from 1 to capital N; I have capital N number of input vectors which are given for
the training purpose, or for learning, as we are using supervised learning. Then, for simplicity, I will write phi i of X k as phi i k. Now,
using this, my condition is, if you remember the earlier one: the sum of W i j into phi i k, for i varying from 1 to M, has to be equal to
plus 1 if X k belongs to class omega j, and this has to be 0 if X k does not belong to omega j. So, this is the output that I expect; for
every X k I have such a linear equation, that this summation will be either plus 1 or 0, and all those capital N number of equations I can
now write in the form
of a matrix. So, in the matrix form this can be written
as follows. Let me write the matrix equation: the first row is phi 1 1, which means phi 1 of X 1, phi 2 1, that is phi 2 of X 1, phi 3 1, up
to phi M 1, which means phi M of X 1. Similarly, the next row is phi 1 2, which means phi 1 of X 2, phi 2 2, phi 3 2, up to phi M 2, and as I
have capital N number of samples for training, the last row will be phi 1 N, phi 2 N, phi 3 N, up to phi M N, which indicates phi M of X N.
This matrix multiplies the vector W 1 j, W 2 j, up to W M j. So, you find what it computes: W 1 j times phi 1 1 plus W 2 j times phi 2 1, and
continuing like this up to W M j times phi M 1; that is, for the first input vector X 1, whatever are the outputs of the individual middle
layer or hidden layer nodes, this equation simply makes a linear combination of the outputs of the hidden layer nodes for the input feature
vector X 1. So, this has to be equal to the output, which I again put in the form of a vector b 1 j, b 2 j, up to b N j, where every b i j
will be equal to 1 if the corresponding input vector X i belongs to class omega j, the j-th class, and it will be equal to 0 if X i does not
belong to omega j. So, every b i j will assume a binary value, either 0 or 1. So, this is the kind of situation that
I have. This whole expression, this matrix equation, I can write in a short form, that is, phi W j equal to b j, where this phi is the matrix
above, W j is the vector of weights which are connected to the output layer node j, and b j is the desired output of the j-th node in the
output layer, represented in the form of a vector like this for the different input vectors. So, if the network is properly trained, that is,
all the W i j have got their trained values, then this equation should be satisfied. But what we are trying to do is to train the network;
that means we are trying to set the weights W j, so you cannot expect that this equation will be satisfied initially. So, if this equality is
not satisfied, then what I can do is define an error e, which is nothing but phi W j minus b j. And now, training
involves adaptation of this weight W j, so that this error can be minimized. So, in order
to do that, as we have done earlier for the mean square error technique for classifier learning or classifier training, I can define here a
criterion function J of W j, which is given by the squared norm of phi W j minus b j, and then I take the gradient with respect to W j; so
grad of J of W j will simply be 2 phi transpose into phi W j minus b j. And by equating this to 0, what we get is W j equal to phi transpose
phi, inverse of this, into phi transpose b j. And as you have seen earlier, this phi transpose phi inverse into phi transpose is what is
called the pseudo inverse, and that is represented as phi plus. So, we
obtain W j by this pseudo inverse technique: W j equal to phi pseudo inverse into b j, where this b j is defined beforehand; every component
of b j will be either 1 or 0. It will be equal to 1 if the corresponding input feature vector belongs to class omega j, and the component
will be equal to 0 if the corresponding input feature vector does not belong to class omega j. So, I have this vector b j; phi indicates what
the outputs of the hidden layer nodes should be for every feature vector, so from that I compute my matrix phi. Once I have this matrix phi
and I have this b j, I can compute what will be the connection weights from the different nodes in the hidden layer to the j-th output layer
node. If I do this for every output layer node, I can compute the connection weights from the different outputs of the hidden layer nodes to
the different output layer nodes, and that is what completes the training of the RBF neural network; the RBF neural network will then be
ready for classification.
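As a minimal sketch of this training step (assuming numpy; Phi is the N by M matrix of hidden-layer outputs and B is an N by c matrix of 0/1 targets, one column b j per class; the names are mine):

```python
import numpy as np

def train_output_weights(Phi, B):
    """Solve Phi @ W = B in the least-squares sense via the pseudo inverse.

    Phi: (N, M) matrix with Phi[k, i] = phi_i(X_k)
    B:   (N, c) target matrix, B[k, j] = 1 if X_k belongs to class omega_j, else 0
    Returns W of shape (M, c); column j holds the weights W_ij into output node j.
    """
    return np.linalg.pinv(Phi) @ B     # W = Phi^+ B, with Phi^+ = (Phi^T Phi)^-1 Phi^T
```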
Now, if you compare this RBF neural network against, say, the multilayer perceptron, you find
that the training of the RBF neural network is faster than the training in multilayer
perceptron, because in case of multilayer perceptron the training is done by back propagation
algorithm, which takes a large number of iterations. So, the training of the RBF neural network
will be faster than the training of the multilayer perceptron. The second advantage is that I can easily interpret what the meaning, or the
function, of every node in the hidden layer is, which is difficult in the case of the multilayer perceptron; I cannot easily interpret the
role of the different nodes in the hidden layer in the case of the multilayer perceptron. And not only that, I also cannot easily decide what
should be the number of hidden layers and what should be the number of nodes in every hidden layer. So, those are the difficulties in the
case of the multilayer perceptron, which are not there in the case of the RBF network. However, the RBF network has a disadvantage: though
the training is faster, you find that the classification takes more time in the case of the RBF network than in the case of the MLP, because
in the case of the RBF network every node in the hidden layer has to compute the radial basis functional value for the input feature vector,
which is time consuming. So, the classification in the case of the RBF network takes more time than the classification in the case of the
multilayer perceptron. So, with this, we come to a conclusion on the radial basis function neural network. Now, over here I will just
briefly discuss about another kind of classifier which is called a support vector machine.
The support vector machine is another type of linear classifier. If you remember what we discussed in the case of a linear classifier: given
a two-class problem, we have said that I can define a discriminating function, say g of X, which is of the form W transpose X plus b, and we
have said, in the case of the linear discriminator, that if this g of X, that is W transpose X plus b, is greater than 0, that indicates that
the feature vector X belongs to class omega 1; if it is less than 0, then the feature vector X belongs to class omega 2. So, here we find
that for classification purposes the actual value of g of X is not really very important; what is important is the sign of g of X. If the
sign is positive, I infer that X belongs to class omega 1; if the sign is
negative, I infer that X belongs to class omega 2. So, over here, if with every X i I associate a number, say y i, where y i can be either
plus 1 or minus 1, then y i times W transpose X i plus b will always be greater than 0 if the sample X i is properly classified. This is
quite obvious, because I set y i equal to plus 1 for a sample X i which belongs to class omega 1, and for a sample which belongs to class
omega 1, this W transpose X i plus b is greater than 0 and y i is also positive, so y i times this will obviously be greater than 0. If X i
belongs to class omega 2, then W transpose X i plus b will be less than 0, and for that I have set y i equal to minus 1, so y i times
W transpose X i plus b will obviously be greater than 0.
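A tiny sketch of this check (the names and the sample data are illustrative, not from the lecture):

```python
import numpy as np

def correctly_classified(w, b, x, y):
    """True if sample x with label y (+1 for omega_1, -1 for omega_2)
    lies on the correct side of the hyperplane w^T x + b = 0."""
    return y * (np.dot(w, x) + b) > 0

w, b = np.array([1.0, -1.0]), 0.5
print(correctly_classified(w, b, np.array([2.0, 0.0]), +1))   # True: w^T x + b = 2.5 > 0
```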
And this is a concept that we actually used when we discussed the perceptron criterion for designing the linear classifier; that is, for
every feature vector
belonging to class omega 2, we negated the feature vector before we tried to design the classifier, so that for every feature vector,
irrespective of whether it belongs to class omega 1 or to class omega 2, my discriminant function value will always be positive if the
feature vector is correctly classified. That is true whether the feature vector belongs to class omega 1 or to class omega 2, because for the
feature vectors belonging to class omega 2, before trying to design the classifier, we have negated the feature vector. So, we will discuss
the support vector machine in more detail in our next class. Thank you.