Hello. In the last few classes we have talked about neural networks and their application to pattern recognition. In particular, we have talked about two types of neural networks. One was the Hopfield neural network, and as we have said, Hopfield neural networks are basically meant for binary feature vectors. So, if I have a number of binary feature vectors, the Hopfield neural network learns, or remembers, those feature vectors by updating its weights; the learning process is basically an updating of weights. In the case of the back propagation neural network, learning is also by means of updating weights. But the back propagation, or feed forward, neural network is not confined to binary feature vectors only; it also handles feature vectors with real-valued components.
Now, when we talked about the feed forward or back propagation neural network, we said that it is a multi-layered neural network. It has an input layer, which is just a buffer layer whose responsibility is simply to pass the information to the layers above it. In the simplest case of the back propagation neural network we had the single layer perceptron, which consists of only one output layer in addition to the input layer. If I draw the network architecture of the single layer perceptron, it is something like this: we have an input layer and an output layer. The number of nodes in the input layer is the same as the dimensionality of the feature vector, or, if the feature vector is of dimension d, the number of nodes in the input layer is d plus 1, where one node takes care of the bias term. The number of nodes in the output layer is the same as the number of classes. From each input layer node we have a connection to every output layer node, and so on. A feature vector is fed to the input layer nodes, and one of the nodes is kept at a constant value equal to 1; that node is meant for feeding the bias term. The outputs of the different nodes in the output layer indicate the corresponding classes Ci. Every node in the input layer is connected to every node in the output layer through a connection weight, say wij; so the ith node in the input layer is connected to the jth node in the output layer through the connection weight wij.
During training, what we did was feed in feature vectors whose classes are known. If the class of a feature vector is known, that means I know what the output corresponding to that particular feature vector should be; that desired output is an output vector, say t. When the network has not been properly trained, then for an input vector I get an output o which may not be equal to t, the target output. So I have an error e, which is the square of the difference between the target output and the actual output. During training of this back propagation neural network, we tried to minimize this error e by updating the connection weights. The error was expressed as a function of the set of connection weights wij, and then, through a gradient descent procedure, we reduced the error by updating the connection weights wij.
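To make this concrete, here is a minimal Python sketch of the kind of gradient-descent training just described for a single layer perceptron, assuming sigmoid output nodes and the squared-error criterion; the function and variable names are illustrative only, not from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_single_layer(X, T, lr=0.5, epochs=5000):
    """Single layer perceptron trained by gradient descent on the squared error.

    X: (N, d) training feature vectors; T: (N, c) target vectors t, one output
    node per class.  A constant input fixed at 1 is appended for the bias term.
    """
    N, d = X.shape
    c = T.shape[1]
    Xb = np.hstack([X, np.ones((N, 1))])            # extra input node held at 1 (bias)
    W = 0.1 * np.random.randn(d + 1, c)             # connection weights w_ij

    for _ in range(epochs):
        O = sigmoid(Xb @ W)                          # actual outputs o
        grad = Xb.T @ ((O - T) * O * (1 - O)) / N    # gradient of e = ||o - t||^2 w.r.t. W
        W -= lr * grad                               # move the weights down the gradient
    return W

def classify(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(sigmoid(Xb @ W), axis=1)        # winning output node gives the class
```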
In the case of the multilayer perceptron, we had the same input layer — the input layer is always present — and again the number of nodes in the output layer is the same as the number of classes. In addition, we had one or more hidden layers. However, the connection pattern remains the same: every node in one layer is connected to every node in the layer above it, and so on, and finally we get the output. The training of this back propagation neural network was the same as for the single layer perceptron. That is, for a training feature vector which is input to the neural network, we know what the target output is. When the neural network has not been properly trained, we get an output for that same input feature vector which is not the same as the target output. So there again we get an error, and when I express this error as a function of the weight matrix, or the connection weights, we try to reduce the error by optimizing, or adapting, the weights. When we modify the weights, the error information used for reduction of the error propagates from the output layer to the input layer; it propagates in the reverse direction. That is why it is called back propagation learning. But during classification the information moves from the input layer to the output layer — the information is fed forward from input to output — so it is a feed forward network, whereas the learning algorithm is called the back propagation learning algorithm.
Now, when we discussed this feed forward or back propagation network, we said that if the patterns, or the classes, are linearly separable, then a single layer perceptron is sufficient. But if the classes are not linearly separable, then a single layer perceptron is not sufficient; we have to have one or more hidden layers. What these hidden layers do is represent a non-linear boundary by a set of piecewise linear boundaries. To explain this, we took a few examples. We took an AND function. In the case of the AND function, I represent the input as a two-dimensional binary vector, so the input combinations are 0 0, 0 1, 1 0 and 1 1. For an AND function, only when the input is 1 1 is the output 1; in all the other cases, 0 0, 0 1 or 1 0, the output is 0. Now, if we consider this AND function as a classification problem, then you find it is quite obvious that I can draw a straight line separating the set of 0's from the set of 1's. That means the input vector 1 1 is put in one class, all the other combinations of the input components, that is 0 0, 0 1 and 1 0, are put in another class, and I can separate these two classes by a straight line. So this particular function, the AND function, is a linearly separable function. Similarly, if I take an OR function and plot it in the same way on the x1 and x2 axes, only when both components x1 and x2 are 0, that is for the input 0 0, is the output 0; in all other cases the output is 1. Here again we find that I can separate the 1's and the 0 by a straight line, which clearly indicates that this is also a linearly separable problem. But we faced difficulty in the case of the XOR function. If I put the XOR function in the same fashion on the x1 and x2 axes, then when the input is 0 0 the output is 0, when the input is 1 1 the output is also 0, and when the input is 0 1 or 1 0 the output is 1. Given this situation, you find that I can no longer separate the outputs 0 and 1 by a single straight line; rather, I need two different straight lines to separate the 0's and the 1's. So this clearly shows that the XOR function
is not linearly separable. When we discussed the back propagation neural network, we showed that the AND function and the OR function can be implemented using single layer perceptrons. But the XOR function cannot be implemented by a single layer perceptron; rather, we need at least one hidden layer to implement the XOR function with a feed forward neural network, because it is not linearly separable.
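To see this linear-separability distinction numerically, here is a small self-contained sketch of my own (using the classic perceptron learning rule, a simpler relative of the gradient-descent rule above) that tries to find a single separating line for AND, OR and XOR; it succeeds for the first two and necessarily fails for XOR.

```python
import numpy as np

def perceptron_finds_line(X, y, epochs=200):
    """Classic perceptron rule: returns True if a single separating line is found."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append the constant bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(Xb, y):
            pred = 1 if w @ xi > 0 else 0
            w += (ti - pred) * xi                # update weights only on mistakes
    return np.array_equal((Xb @ w > 0).astype(int), y)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print("AND:", perceptron_finds_line(X, np.array([0, 0, 0, 1])))   # True  (separable)
print("OR :", perceptron_finds_line(X, np.array([0, 1, 1, 1])))   # True  (separable)
print("XOR:", perceptron_finds_line(X, np.array([0, 1, 1, 0])))   # False (not separable)
```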
So, today we will discuss another kind of network, which is called the radial basis function network. What the radial basis function network does is perform a non-linear transformation on the input vectors before the input vectors are fed for classification. By using such a non-linear transformation, it is possible to convert a linearly non-separable problem into a linearly separable problem. In addition, what this RBF network, or radial basis function network, does is increase the dimensionality of the feature vectors; I will come to that later. First let us see how applying a non-linear function to the input feature vectors, before we go for classification, can convert a linearly non-separable problem into a linearly separable problem. Let us take a situation something like this: say I have a set of feature vectors of this form, where all the green crosses represent feature vectors belonging to one class and all the red crosses represent feature vectors belonging to another class. Clearly, these two sets of feature vectors cannot be separated by a straight line; what I need is a curve, at least a quadratic curve, to separate the two classes. Now, suppose I impose a non-linear transformation which expands the feature vectors in one direction and contracts them in the other direction. If I do this, one set of feature vectors will be clustered into a cluster something like this, and the other set of feature vectors will possibly be clustered into a cluster like this. Because of this non-linear transformation, it may now be possible to find a straight line which separates one set of feature vectors from the other. That is why I say that if I impose a non-linear transformation on the feature vectors before I go for classification, it may be possible that a linearly non-separable problem can be converted into a linearly separable problem.
The other thing, as I said, is that the RBF network, the radial basis function network, also increases the dimensionality of the feature vector. It has been found that if we increase the dimensionality of the feature vectors, then a classification problem which is linearly non-separable in a lower dimensional space may become linearly separable in a higher dimensional space; and the more you increase the dimension, the more the possibility of linear separability increases. It is something like this: suppose the tips of my fingers represent different feature vectors, where the tips of the fingers of the left hand represent feature vectors belonging to one class and the tips of the fingers of the right hand represent feature vectors of another class, and these feature vectors are coplanar — all of them lie in the same plane. If the situation is like this, then obviously this set of feature vectors is not linearly separable. However, if I introduce a third dimension, pushing one hand forward and the other one backward, then in this third dimension we find that they are linearly separable. So this is just a simple illustration that by increasing the dimensionality, it is possible to increase the possibility of linear separability among classes which are linearly non-separable in the lower dimensional space. The RBF network actually does both: it imposes a non-linear transformation on the feature vectors, and it also increases the dimensionality of the feature vectors. So, suppose I have a set of feature vectors whose original dimension is P, and I increase the dimension of the feature vectors to dimension M. How is this to be done? A given feature vector x is subjected to a function phi, where phi(x) is of the form phi 1(x), phi 2(x), phi 3(x), up to phi M(x). If the original feature vector x is of dimension P, then for each such feature vector x I apply this set of M phi functions, and each of these phi functions produces a real value. As I have M phi functions, by applying them to the feature vector x I create M real components, so what I get is an M-dimensional feature vector; and if M is greater than P, then obviously I am increasing the dimensionality of the feature vector from P to M. In the case of the radial basis function network, each of these phi functions — phi 1, phi 2, phi 3, up to phi M — is a radial basis function. Then what is a radial basis function?
So, let us try to see what a radial basis function is. Every radial basis function has a receptor, say t, and as I move radially away from this receptor, the value of the function either goes on increasing or goes on decreasing. The value at a point depends on the radial distance of the point from the receptor t; the value of the function is either maximum or minimum at the location t, and if I draw concentric circles around the receptor t, the value of the function phi is constant on each of these concentric circles. Now, there are different choices of radial basis functions. One of them is called the multiquadric, which is given by phi(r) = (r squared plus c squared) to the power half, for c greater than 0. Then I can have the inverse multiquadric, where phi(r) is just the inverse of the multiquadric, that is, 1 over (r squared plus c squared) to the power half, for c greater than 0. I can also use a Gaussian function, where phi(r) is given by the exponential of minus r squared upon 2 sigma squared, for sigma greater than 0. Here r is a measure of the distance of the point from what I called the receptor of the radial basis function. So if I want to compute the radial basis function at a point x, where t is the receptor of the radial basis function, then r is nothing but the norm of t minus x, or x minus t, which is simply the distance between x and t. You will find that in all these different cases, if I use the multiquadric, then at r equal to 0 the value of the radial basis function is equal to c, which is its minimum, and as r goes on increasing, the value of the radial basis function goes on increasing; so the multiquadric has an increasing profile. When I go for the inverse multiquadric, at r equal to 0 — that is, when the point x and the receptor t coincide — the value of the inverse multiquadric is maximum, which is nothing but 1 upon c, and as r goes on increasing, the value of the radial basis function goes on decreasing; the nature of the function is just the inverse of the multiquadric. In the case of the Gaussian function, you all know it has a very common bell-shaped form: at the receptor location t the value of the function phi is maximum, and as you move away from the receptor point, the value of phi(r) goes on reducing following the Gaussian pattern. It is this Gaussian function which is most commonly used as the radial basis function.
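Here, as a reference sketch of my own (not the lecturer's notation), are the three radial basis functions just listed, together with the Gaussian mapping from a P-dimensional vector x to the M values phi 1(x), ..., phi M(x) discussed earlier; the receptor array and sigma are assumed inputs.

```python
import numpy as np

def multiquadric(r, c=1.0):
    # phi(r) = (r^2 + c^2)^(1/2), c > 0: value c (its minimum) at r = 0, grows with r
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = (r^2 + c^2)^(-1/2), c > 0: value 1/c (its maximum) at r = 0, decays with r
    return 1.0 / np.sqrt(r**2 + c**2)

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2)), sigma > 0: maximum 1 at the receptor, decays radially
    return np.exp(-r**2 / (2.0 * sigma**2))

def rbf_features(x, receptors, sigma=1.0):
    """Map a P-dimensional x to the M values [phi_1(x), ..., phi_M(x)], one Gaussian
    radial basis function per receptor t_i, with r = ||x - t_i||."""
    r = np.linalg.norm(receptors - x, axis=1)    # radial distance to every receptor
    return gaussian(r, sigma)
```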
Now, let us see how, by using this radial basis function network, we can convert a linearly non-separable problem into a linearly separable problem, even keeping the dimensionality of the feature vectors the same. For doing that I will use the same XOR problem. On the x1 and x2 axes I plot the four inputs: at 0 0 the output is 0, at 0 1 the output is 1, at 1 0 the output is 1, and at 1 1 the output is 0; these are the outputs of the XOR function. What I will do is use radial basis functions to nonlinearly transform these feature vectors into another space, and the radial basis function that I use is a Gaussian basis function, so the form of the function will be phi(x) = e to the power minus (x minus t) squared, where this t is the receptor. As I said, I will not increase the dimensionality, so I will use two receptors: one receptor, t 1, I take at the location 0 0, and the other receptor, t 2, I take at the location 1 1. Using these, I compute the non-linear transformation: I take the inputs 0 0, 0 1, 1 0 and 1 1 and compute the two output functions, phi 1 and phi 2, where phi 1(x) is simply e to the power minus (x minus t 1) squared and phi 2(x) is e to the power minus (x minus t 2) squared. Here, you will find, I have assumed that 2 sigma squared is equal to 1; however, that does not matter for this non-linear transformation.
So let us compute these values. When the input is 0 0, phi 1 of 0 0 is simply equal to 1, and if you compute the second term, phi 2 of 0 0 comes out to be about 0.1. Phi 1 of 0 1 is equal to 0.4 and phi 2 of 0 1 is also equal to 0.4; phi 1 of 1 0 is 0.4 and phi 2 of 1 0 is also 0.4. For 1 1, phi 1 is 0.1 and phi 2 is equal to 1. This is the situation, because 0 0 has been used as the receptor of the first radial basis function and 1 1 has been used as the receptor of the second radial basis function. Obviously, as 0 0 coincides with my receptor t 1, the term (x minus t 1) squared is equal to 0, and e to the power 0 is equal to 1; that is why phi 1 of 0 0 is equal to 1, and similarly, if I compute phi 2 for this x, it comes out to be about 0.1. For 0 1, phi 1 of x is 0.4 and phi 2 of x is also 0.4, and for 1 0, phi 1 of x is 0.4 and phi 2 of x is also 0.4, because of these exponential functions. For 1 1, x coincides with t 2, so the distance (x minus t 2) is equal to 0 and phi 2 of x is equal to 1, while phi 1 of x, measured from the other receptor, is equal to 0.1. So you find that 0 0 has been mapped to (1, 0.1), 0 1 has been mapped to (0.4, 0.4), 1 0 has been mapped to (0.4, 0.4), and 1 1 has been mapped to (0.1, 1).
If I plot this, the axes are now phi 1 and phi 2 instead of x1 and x2, so this is phi 1 of x and this is phi 2 of x. In this plane, 0 0 is mapped to (1, 0.1), which will be somewhere over here; 0 1 is mapped to (0.4, 0.4), which will be somewhere here; 1 0 is also mapped to (0.4, 0.4), which is the same point, so both of those inputs have now been mapped to the same point in the phi 1-phi 2 domain, and that has happened because of the non-linear transformation; and 1 1 has been mapped to (0.1, 1), somewhere over here. This being the situation, you now find that I can draw a straight line separating these points. So a problem which was not linearly separable in the original space has now become linearly separable in the phi space, simply by using a non-linear transformation. And the possibility of linear separability between the classes increases further as we go on increasing the dimension of the feature vectors by using these phi functions.
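The worked XOR example above can be reproduced with a few lines of Python; this is my own sketch of the same computation, with the receptors at (0, 0) and (1, 1) and 2 sigma squared taken as 1, as in the lecture.

```python
import numpy as np

def phi(x, t):
    # Gaussian radial basis function with 2*sigma^2 = 1, as assumed above
    return np.exp(-np.sum((x - t)**2))

t1, t2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])      # the two receptors
for x, xor_out in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    x = np.array(x, dtype=float)
    print(x, "->", round(phi(x, t1), 2), round(phi(x, t2), 2), "XOR output", xor_out)

# (0,0) -> (1.00, 0.14); (0,1) and (1,0) -> (0.37, 0.37); (1,1) -> (0.14, 1.00).
# These are the (1, 0.1), (0.4, 0.4) and (0.1, 1) values quoted above (to one decimal
# place), and in the (phi1, phi2) plane a single straight line now separates the classes.
```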
Using these phi functions, the architecture of the network will now be of this form. We have the input layer as before; it is always present, because this is the layer which accepts the input vectors and feeds them to the next higher layer. We will have only one hidden layer, and the number of nodes in the hidden layer will be equal to M, where M is the dimension to which we want to increase the dimensionality of the feature vectors. So if P is the dimensionality of the input feature vectors, we have P nodes in the input layer, and if M is the increased dimensionality, then M is the number of nodes in the hidden layer. Each node in the hidden layer actually computes a radial basis function. In the output layer we will have a number of nodes which is the same as the number of classes c. From each node in the input layer I have a connection to each node in the hidden layer, and from each node in the hidden layer I have a connection to each node in the output layer. Finally, I have outputs from the classifying neurons in the output layer, and the input feature vectors are fed at the input layer; that is the architecture. So the nodes in the hidden layer perform the radial basis functions, and the nodes in the output layer perform a linear combination of the outputs of the hidden layer nodes; at the output I have a linear classifier, and the actual classification is done only at the output layer. At the hidden layer we only apply the radial basis functions, which is a transformation of the input feature vector to a hidden space — because this is the hidden layer, you can say it is a transformation to a hidden space, performed by the nodes in the hidden layer which compute the radial basis functions.
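Putting the pieces together, a forward pass through this architecture might look like the following sketch — an illustration of mine under the assumptions above: Gaussian hidden nodes defined by receptors t i and spreads sigma i, and a purely linear output layer with weights W.

```python
import numpy as np

def rbf_forward(x, receptors, sigmas, W):
    """Forward pass of the RBF network.

    Hidden layer: M Gaussian radial basis nodes, node i defined by (t_i, sigma_i).
    Output layer: c linear nodes, each a weighted combination of the hidden outputs.
    receptors: (M, P), sigmas: (M,), W: (M, c).
    """
    r2 = np.sum((receptors - x)**2, axis=1)      # squared distance to each receptor
    h = np.exp(-r2 / (2.0 * sigmas**2))          # hidden-layer outputs phi_1(x)..phi_M(x)
    return h @ W                                 # linear combination: one value per class

def rbf_classify(x, receptors, sigmas, W):
    return np.argmax(rbf_forward(x, receptors, sigmas, W))   # winning output node = class
```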
So I have two levels of training. The first level of training is the training of the hidden layer nodes. What does this training mean? As every node in the hidden layer is a radial basis function, for each radial basis function I have to have a receptor t, so one purpose of training the hidden layer nodes is to find out what that receptor t should be. And if I use the Gaussian function, then as we have said, the radial basis function is of the form e to the power minus r squared upon 2 sigma squared — that is my phi(x), where r is nothing but the distance between x and t. This sigma indicates the spread of the radial basis function, so that is also something which has to be trained. Therefore, by saying that I want to train the hidden layer nodes, I mean that for each of the nodes I have to find out the t, and for each of the nodes I have to find out the sigma. At the second level, I have another training operation, by which I have to train the weight vectors which connect the outputs of the hidden layer nodes to the output layer nodes, because it is the linear combination of the outputs of the hidden layer nodes which decides to which class a sample belongs, and that decision is available at the output of the output layer nodes.
So, let us see how we can train a node in the hidden layer, and let us first look at the situation we have. If I go for supervised classification, or supervised learning, what I have is a set of feature vectors labelled with their class membership, which are to be used for training the neural network. In the case of the multilayer perceptron, or the single layer perceptron, we did not have any non-linear transformation, so from the label of the input feature vector and the actual label that we got at the output of the output layer nodes, we defined an error function. Our aim was to reduce, or minimize, that error function by updating the connection weights; the algorithm told us how the connection weights are to be modified in order to minimize the error function, and that modification of the connection weights was carried out from the output layer towards the input layer, in the backward direction. In the case of the RBF network, on the other hand, the training consists of two parts, and the first one is training of the hidden layer nodes. When I say training of the hidden layer nodes, that means for every hidden layer node, which represents a radial basis function, I have to find out what the receptor should be, and I also have to find out the spread of that radial basis function, which is sigma.
So suppose we are given N training vectors, say x1, x2, up to xN, where each of these vectors is of dimension P, and I want to transform these vectors into vectors of dimension capital M. Naturally, at the hidden layer I have to have capital M nodes, and for each of these nodes I have to have a receptor: for the ith node I have to have a receptor t i, where i varies from 1 to M, and for the ith hidden layer node I also have to have a sigma i, again with i varying from 1 to capital M. So I have to have M receptors and M variances, or standard deviations. This can be done in various ways. The simplest way is to pick, at random from the training samples, M samples which are to be used as the receptors; but that is not very logical. What appears to be more logical is to go for clustering of these capital N training samples into M clusters, because when I go for clustering, each cluster basically represents a region of the space where a feature vector is more likely to occur.
Clustering is something we have not discussed yet, but just for the completeness of this lecture I will briefly indicate what clustering means. I have capital N samples, from which I have to form capital M clusters. It is understood here that N has to be greater than capital M: if N is less than M, that is, if the number of training samples is less than the number of clusters to be formed, it is simply not possible, because even if I assume that each cluster contains only a single feature vector, I can have at most N clusters; I cannot form more clusters than the number of feature vectors provided. So our assumption is that the number of feature vectors N is greater than the number of clusters M that have to be formed. Now, there are various clustering algorithms. One such algorithm works like this: initially, I partition the N samples into capital M clusters, so every cluster gets some number of samples. Suppose, for example, I have a set of feature vectors given like this, and I want to form three clusters. Initially, I arbitrarily, at random, partition the set of samples into three different clusters: this cluster contains these samples, this cluster contains those samples, and this cluster contains the remaining samples, and so on. Once I have these initial clusters, for each of the clusters so formed I compute the cluster center, which will be the representative of the samples belonging to that cluster. Suppose the center of the first cluster comes out to be somewhere over here, the center of the second cluster somewhere over here, and the center of the third cluster somewhere over here; these will be the cluster centers mu 1, mu 2 and mu 3.
Now, once I have formed these cluster centers, in the next iteration what I do is, for each of the samples, find out which cluster center is nearest to that particular sample. Here you find, for example, that this particular sample's distance from mu 3 may be less than its distance from mu 2; so although this sample was initially put in cluster 2, in the second iteration, because its distance from cluster center 3 is found to be less than its distance from cluster center 2, this sample will be moved from cluster 2 to cluster 3. This operation is performed for every sample: for each sample you find out what its nearest cluster center is, and you move the sample to the cluster whose center is nearest to it. If you perform this for each of the samples, then at the end of the second iteration you will find that the membership of the initial clusters has changed; I still have three clusters, but the samples they contain have changed. So at the end of the second iteration, I again compute the cluster center of each cluster with its new set of samples. After recomputing the cluster centers, I again go for reassignment of the feature vectors based on their distances from these new cluster centers. This operation is performed over a number of iterations, and finally, when it stabilizes, we say that the clustering is complete, and whatever cluster centers I have at that point are my receptors. So by this, for each of the radial basis functions, or each of the nodes in the hidden layer, I have computed the receptors; these receptors are nothing but the cluster centers.
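The iterative clustering just described is essentially the k-means procedure; a minimal sketch of my own is given below (seeded with M randomly chosen samples rather than a random initial partition, which leads to the same alternation of assignment and center updates).

```python
import numpy as np

def kmeans_receptors(X, M, iters=100, seed=0):
    """Cluster the N training vectors in X (shape (N, P)) into M clusters; the final
    cluster centers are used as the receptors t_1 .. t_M.  Requires N > M."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=M, replace=False)]      # initial centers
    for _ in range(iters):
        # assign every sample to its nearest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # recompute each center as the mean of the samples now assigned to it
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(M)])
        if np.allclose(new, centers):            # stabilized: clustering is complete
            break
        centers = new
    return centers
```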
So for every ith node I have now found the receptor t i; next I have to find out what sigma i is. For that, I can go for a nearest neighbor rule: from every t i, I can find, say, the P nearest receptors and the distances to those P nearest receptors. So I get P distances, and the root mean square of all these distances I can take as the standard deviation sigma i. That is, if I am interested in finding sigma j, what I will do is take t j minus t i, take the square of this, sum it for i equal to 1 to P — as I have the P nearest t i's — take 1 upon P of this sum, and then take the square root. This is one way in which I can compute sigma.
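In code, that nearest-neighbor rule for the spreads might look like this (again my own sketch; n_nearest plays the role of the P nearest receptors in the formula above, and is not the input dimension P).

```python
import numpy as np

def receptor_sigmas(centers, n_nearest=2):
    """sigma_j = sqrt( (1/P) * sum over the P nearest receptors t_i of ||t_j - t_i||^2 )."""
    M = len(centers)
    sigmas = np.empty(M)
    for j in range(M):
        d2 = np.sum((centers - centers[j])**2, axis=1)   # squared distance to every receptor
        nearest = np.sort(d2)[1:n_nearest + 1]           # drop the zero self-distance
        sigmas[j] = np.sqrt(nearest.mean())              # root mean square distance
    return sigmas
```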
So I have the receptors t i and I have the sigma i. Once I have these, you will find that every node in the middle, hidden, layer is defined: for every node I have a t i and for every node I also have a sigma i — so here I have, say, t j and sigma j — and this is defined for every node. Now, if I feed any feature vector x to this network, I get the corresponding phi 1 of x, phi 2 of x, up to phi M of x. So the original feature vector x is now converted to an M-dimensional feature vector, and using these transformed feature vectors I have to go for classification. For that, I have to train the output layer of this radial basis function neural network, because, if you remember the architecture of the radial basis function network, I had a set of weights W over here, and I have to compute these W's for classification purposes. Now, as we said, imposing a non-linear transformation and increasing the dimensionality, taken together, increases the possibility that the classes will be linearly separable in the higher dimensional space, and as I increase the dimension, as I increase the value of M, the possibility of linear separability goes on increasing. But I do not know how many such radial basis functions I have to use so that my original problem becomes linearly separable. So although I have increased the possibility of linear separability, I am not able to guarantee that the classes are linearly separable.
So the best way to find these connection weights at the output layer is by means of the least mean square error technique. We have already discussed the LMS technique, and there we have seen that if I have a set of N feature vectors, then by putting every feature vector in the form of a row I can form a matrix of size N by d, where d is the dimensionality of the feature vectors, and we call this matrix Y. The logic is this: when I want to train, say, the first output node of my classifier, then if I feed a sample which belongs to class one, only that output should be 1 and the rest of the outputs should be 0. So for every input feature vector I can define an output vector, and using each of these outputs I can define what, in the case of the LMS algorithm, we called the b vector. Then, by using the pseudo-inverse technique, I can find out what the connection weights should be.
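As a sketch of this last step (my own illustration of the pseudo-inverse solution, under the assumption that the matrix of hidden-layer outputs and the one-per-class target matrix are built as just described):

```python
import numpy as np

def train_output_weights(Phi, T):
    """Least mean square error solution for the output-layer weights.

    Phi: (N, M) matrix whose rows are the hidden-layer outputs for the N training
    vectors.  T: (N, c) target matrix with a 1 in the column of the true class and
    0 elsewhere.  W = pinv(Phi) @ T minimises the squared error ||Phi W - T||^2.
    """
    return np.linalg.pinv(Phi) @ T

# A new vector x would then be classified by argmax over (hidden outputs of x) @ W.
```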
So, I will stop this discussion here today, and I will continue with it in our next class. Thank you.