SVM10 The Kernel Trick (Part 1: Basis Expansion)

Captions
Welcome back. So far I have talked about two types of support vector machines: the hard version and the soft version. The hard version assumes that the data is linearly separable, that is, perfectly separable by a linear boundary, as in this case: I can create a linear boundary that separates my data perfectly, so all training samples are correctly classified and there are no mistakes on the training samples. However, there are cases where samples of different classes overlap, as in this example, so the data is not linearly separable, but we can still find a line that offers a good trade-off and separates the data with the smallest number of mistakes. This is called the soft support vector machine, and it allows me to create this kind of boundary; when the data is perfectly linearly separable, we are talking about the hard support vector machine.

However, there are cases where the data is not linearly separable and even the soft support vector machine is not a good idea. What if I have a distribution like this? In that case there is no line that would give a good separation between the training samples, and we need to find another model instead of a line; for example, a circular model would do better than a linear model. Now, the kernel trick is a method that allows us to use the hard support vector machine to create boundaries of different types: circular, quadratic, polynomial, almost any shape. To understand the kernel trick, the first idea we need to understand is something called basis expansion.

What is basis expansion? To make it clear, I will consider a one-dimensional distribution of my training samples, that is, a feature space with only one feature x, and let me assume the data is distributed in this way: positive samples here in the middle and negative samples on both sides. A linear boundary in a two-dimensional feature space is analogous to a single threshold in this one-dimensional feature space, and you can see that no threshold would be a good model to classify these samples: if I take a threshold here, I make a lot of mistakes because on one side I have both positive and negative samples, and a threshold there would be no better for the same reason. But what if I create a new feature, say x squared? Assuming the origin is here, x squared gives a curve like this, and when I map the samples onto this curve I get the following distribution: these are the positive samples and these are the negative samples. In that case I can separate my training samples with a line, thanks to the new feature I introduced. This is what basis expansion is: we create features so as to expand our feature space. We have an original feature space, say of dimension d1, and we need to find a transformation that maps the original features into a new space, call its dimension d2, such that the data can be separated with a linear hyperplane in that new feature space.
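To make the one-dimensional example concrete, here is a minimal sketch of the same idea (my own illustration, not code from the video): the data values, the labels, and the use of scikit-learn's LinearSVC with a large C as a stand-in for the hard-margin SVM are all assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 1D data: positives clustered around 0, negatives on both sides (as in the video).
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1, -1, -1, -1])

# No single threshold on x separates the classes, but after the basis
# expansion phi(x) = (x, x^2) the classes become linearly separable.
phi = np.column_stack([x, x ** 2])

# A linear boundary w1*x + w2*x^2 + b = 0 in the expanded 2D space
# corresponds to a pair of thresholds in the original 1D space.
clf = LinearSVC(C=1e6)  # large C approximates the hard-margin SVM
clf.fit(phi, y)
print(clf.score(phi, y))  # expected: 1.0, every training sample classified correctly
```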
Now, there are many ways to create a basis expansion, that is, to create new features, and in my case I will talk about quadratic features. Let's assume the original feature space x has only two features, x1 and x2. The quadratic features are then x1, x2, x1², x2², and x1·x2, so from only two features I created five features. The reason these are called quadratic features is that when we separate the training samples with a linear hyperplane in that five-dimensional feature space and transform that hyperplane back into the original feature space, it becomes a quadratic boundary. A quadratic boundary has a form like this, and of course you can control its position and orientation.

After transforming the features into the new feature space, we can apply either the hard or the soft support vector machine. To make the explanation easier, let's consider that I want to apply the hard support vector machine. If you remember, in the hard support vector machine this is the expression we maximize over the alpha_i, and optimizing it is what trains the model; let me recall it:

L(alpha) = sum_i alpha_i − (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i · x_j)

and the decision rule, if you remember, was based on w · u + b, where w = sum_i alpha_i y_i x_i. So if I substitute w, I get, for a new sample u that I want to classify,

sum_i alpha_i y_i (x_i · u) + b,

and I care about the sign of this expression. That is in the original feature space. In the transformed feature space these equations stay almost the same; the only modification is that instead of x_i · x_j I take phi(x_i) · phi(x_j), and the decision rule for a new sample u becomes

sum_i alpha_i y_i (phi(x_i) · phi(u)) + b,

where, remember, the summation runs from i = 1 to m, with m the number of training samples.

The reason I wrote these out is to tell you that the expression required for training depends on the dot product of pairs of samples, so the time required for training depends on that dot product, and the same is true for classification: the decision rule also depends on dot products involving the samples. The same holds in the transformed feature space. What I want to do now is compare the computational time required in the transformed feature space with that required in the original feature space.
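As a small illustration of the quadratic expansion and of the fact that both the training objective and the decision rule only touch the data through dot products of pairs of samples, here is a short sketch (my own, with the helper name quadratic_phi and the sample values assumed for illustration):

```python
import numpy as np

def quadratic_phi(x):
    """Map a 2-feature sample (x1, x2) to the five quadratic features."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# A few 2-feature training samples (values assumed for illustration).
X = np.array([[ 1.0,  2.0],
              [ 0.5, -1.0],
              [-2.0,  0.3]])

# Everything the training objective and the decision rule need from the data
# is the matrix of pairwise dot products phi(x_i) . phi(x_j).
Phi = np.array([quadratic_phi(x) for x in X])
gram = Phi @ Phi.T

print(Phi.shape)  # (3, 5): two original features expanded to five
print(gram)       # pairwise dot products in the transformed feature space
```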
I will talk particularly about the case of quadratic features, because the computational time depends on the type of transformation we are doing. Say the original feature space has dimension d1. If I create quadratic features, then the dimension of the new feature space, d2, is

d2 = 1.5·d1 + 0.5·d1²

and the bad news here is the squared term; you will see in a moment why. Let me take the example d1 = 100. If I compute the number of dimensions in the transformed feature space, I get d2 = 5150 dimensions. What does that mean? It means that the number of operations required to perform the dot product of a pair of samples, on which both training and classification depend, grows a lot: I said that training depends on dot products of pairs of samples, and so does classification, so if the dot product needs more operations, both training and classification need more time. In this case the dot product takes 5150 multiplications plus 5149 additions. Now compare that with the original features, before the transformation, where the dot product of a pair of samples takes 100 multiplications plus 99 additions. You can see there is a huge difference between the number of operations required in the transformed feature space and in the original feature space, and this creates a problem for basis expansion: when the original feature space has a large number of dimensions, both classification and training take a long time.

This is where the kernel trick comes in to solve the problem. I will talk about it in more detail in the next video, but briefly: the kernel trick is a function K(x_i, x_j) of the original samples that is equal to the dot product phi(x_i) · phi(x_j) in the transformed feature space, and what we want is to simplify this dot product so that the number of operations required is significantly smaller. We work on the expression of the dot product and simplify it to get a shorter expression whose number of operations is very small compared with the numbers above. That is the role of the kernel trick, and I will talk about it in more detail in the next video.
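As a preview of that last point, here is a small numerical check (my own sketch, not the derivation from the video, which comes in the next part): if the quadratic expansion is written with suitable constant scalings, plus one extra constant feature, then phi(x) · phi(z) equals (x · z + 1)², so the dot product in the roughly 0.5·d1²-dimensional transformed space can be computed with only about d1 multiplications and d1 additions.

```python
import numpy as np
from itertools import combinations

def scaled_quadratic_phi(x):
    # Quadratic expansion with scaling constants chosen so that
    # phi(x) . phi(z) == (x . z + 1) ** 2 exactly.
    d = len(x)
    feats = [1.0]                                      # constant feature
    feats += [np.sqrt(2) * xi for xi in x]             # linear terms
    feats += [xi ** 2 for xi in x]                     # squared terms
    feats += [np.sqrt(2) * x[i] * x[j]                 # cross terms
              for i, j in combinations(range(d), 2)]
    return np.array(feats)

rng = np.random.default_rng(0)
x = rng.normal(size=100)  # d1 = 100, as in the example above
z = rng.normal(size=100)

explicit = scaled_quadratic_phi(x) @ scaled_quadratic_phi(z)  # 5151-dimensional dot product
kernel = (x @ z + 1) ** 2                                     # about 100 multiplications
print(np.isclose(explicit, kernel))  # True: same value, far fewer operations
```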
Info
Channel: Zardoua Yassir
Views: 7,200
Id: as-uwuiu5H8
Length: 16min 5sec (965 seconds)
Published: Thu Oct 15 2020