SVM Kernels: Data Science Concepts

Video Statistics and Information

Captions
Hey everyone, welcome back. I'm excited to finally talk about the very cool concept of kernels in SVM. As I mentioned in the last video on the SVM dual, which I'd highly recommend watching first so you fully understand this one, kernels sit at the intersection of topics that are extremely useful in machine learning but also kind of difficult to understand at first. I personally learned kernels several independent times over the years and they never really stuck until now, so I want to make sure you don't have that experience and that you understand kernels much better from the get-go.

Let's start with a motivating example: a very simple-looking 2D case with two classes of data. The triangles have relatively low values for both coordinates x1 and x2, and the x's have relatively higher values of x1 and x2. The question is: can we build an SVM classifier? Let's say we're using hard-margin SVM for this video, so the question becomes: can we draw a line that perfectly separates the two classes? Stare at it for a second and the answer is no. Even though there seems to be such a natural way to separate the classes, hard-margin SVM only asks whether there is a line we can draw, and there isn't one.

But let's see if there's something simple we can do with the existing data: we'll engineer a new feature from the existing ones. Specifically, the new feature will be x1 squared plus x2 squared; we take the first coordinate, square it, and add the square of the second coordinate. Now we can represent the data in three dimensions. The first two dimensions are unchanged, and the third dimension is the new feature we constructed. Looking at this crude picture in three dimensions, the SVM problem is suddenly very easy to solve. The new feature has low values for the triangle class, because both x1 and x2 are small in absolute value, and relatively high values for the x's, which means we can cleanly separate the two classes in three dimensions using a plane. I won't show it here, but very interestingly, a plane in this three-dimensional feature space corresponds to a circle back in the original two-dimensional space, which is exactly the shape we wanted in order to separate these two classes.

The main point I want to get across is that with SVM it sometimes helps to project our data into a higher-dimensional space. Another way of saying that: it helps to take our original variables (here just two) and do some very clever engineering to add new variables, so that the problem becomes easy, or even trivial, in that higher-dimensional space.

Mathematically, what we did is take the original data, which had just two numbers per point, x_i1 and x_i2, and project it into a higher-dimensional space. If we wanted to consider all two-way interactions between these variables, we would get a vector of length six: a constant 1, the original x_i1 and x_i2, the pure interaction term x_i1 * x_i2, and the two pure quadratic terms x_i1^2 and x_i2^2, which are exactly the terms we used to construct that third feature.
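To make that concrete, here's a minimal sketch in Python with NumPy, using made-up toy points in the spirit of the picture (the specific numbers are my own illustration, not from the video). It computes the engineered radius feature and the explicit six-dimensional feature map described above.

```python
import numpy as np

# Toy 2D data in the spirit of the example: triangles near the origin,
# x's farther out (made-up numbers, purely for illustration).
triangles = np.array([[0.3, 0.2], [-0.4, 0.5], [0.1, -0.3]])
crosses   = np.array([[2.0, 1.5], [-1.8, 2.2], [1.6, -2.0]])

def radius_feature(X):
    # The engineered feature from the video: x1^2 + x2^2.
    return X[:, 0]**2 + X[:, 1]**2

def T(x):
    # Explicit degree-2 feature map: constant, originals, interaction term,
    # and the two pure quadratic terms (2 dimensions -> 6 dimensions).
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

print(radius_feature(triangles))   # small values for the triangle class
print(radius_feature(crosses))     # much larger values for the x class
print(T(np.array([2.0, 1.5])))     # one point sent to the 6-dimensional space
```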
So we went from a two-dimensional space to a six-dimensional space, and if we wanted to consider all three-way and four-way interactions, the space would just keep growing and growing. We can write this compactly, whatever the transformation is: we take a vector x_i from our original data and apply some transformation T that sends it to the higher-dimensional space. So T(x_i) is the six-dimensional vector, and x_i is the original two-dimensional vector.

Now think back to the dual-formulation SVM video; this is why I highly recommend watching it, because it makes it much clearer why kernels are so powerful. We said in that video that if we represent the SVM problem in its dual form, we only need to care about the inner products between our data points, that is, x_i · x_j for all i and j in the data. Equivalently, now that we've transformed our data using T, we still only care about the inner products between the transformed points, T(x_i) · T(x_j), again for all i and j. Once we have those inner products between transformed vectors, we can run the SVM problem as usual and get an answer. So the main point of what I just said is that after we transform the data, all we actually care about is the inner products between the transformed vectors.

Let's look at a diagram to better understand, at a high level, what we're doing. We have our original data, and we're trying to get to the lower right-hand box, the transformed inner products. The most straightforward way, especially given everything I've just shown you, is to first run the transformation T to get the transformed data (going from the two-dimensional space to the six-dimensional space, for example) and then simply take the inner products to arrive at the transformed inner products.

Can we think of any issues with this approach? I've put a red exclamation point on the transformation step T to draw your attention to it. In our case, going from two to six variables doesn't seem like a huge deal, but as you consider more and more interactions, or more complex ones, the size of the new space can blow up pretty fast, and that causes inefficiencies down the road, at the very least in terms of data storage: where before you had to store two columns, now you have to store who knows how many. So explicitly applying T to get the transformed data can be rather expensive in both time and space.

So we go back to the drawing board: is there a different path from the original data to the transformed inner products? Looking at the diagram, there actually is; we can alternatively go down the other path. Let's explore that option and see if there's anything there. Instead of making our first step the transformation, which could leave a pretty large, inefficient matrix sitting in memory, let's just get the original inner products of the data. Let me put a couple of dimensions on the boxes so we understand why something is inefficient versus efficient. The original data is n by p: n observations and p variables, and in our case p is two. If we did the transformation, the transformed data would be n by P, with capital P being some large number of dimensions, and that is what causes the inefficiency.
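To make the shapes concrete, here's a minimal NumPy sketch of the two routes (again with the toy points from above; the numbers are my own). The explicit route materializes an n-by-P matrix of transformed data before taking inner products, while the alternative route only ever deals with n-by-n matrices of inner products.

```python
import numpy as np

# Toy example again: n = 6 points in p = 2 dimensions.
X = np.array([[0.3, 0.2], [-0.4, 0.5], [0.1, -0.3],
              [2.0, 1.5], [-1.8, 2.2], [1.6, -2.0]])

def T(x):
    # Same explicit degree-2 feature map as above (2 dimensions -> 6).
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

# Route 1: explicitly transform, then take inner products.
TX = np.vstack([T(x) for x in X])   # transformed data: shape (n, P) = (6, 6)
gram_transformed = TX @ TX.T        # transformed inner products: (n, n)

# Route 2, first step only: inner products of the original data.
gram_original = X @ X.T             # original inner products: (n, n)

print(X.shape, TX.shape, gram_original.shape, gram_transformed.shape)
```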
If instead we take the original inner products first, and to be clear, by original inner products I mean all of the x_i · x_j before we've done any transformation, how many of those are there? There are n times n of them, because we need the inner product of every one of the n vectors with every other one, so that box is n squared.

Then comes the step that's a little bit unclear: suppose there is some kind of function, some kind of operation, that allows us to go from the original inner products, x_i · x_j, directly to the transformed inner products, T(x_i) · T(x_j). The crucial thing to note is that we would be going directly from the original inner products to the transformed inner products, skipping the explicit transformation altogether. To close the story, the transformed inner products are also n squared in number: transforming the data doesn't add any new samples, so there are still n samples and still n squared transformed inner products. So instead of going down the route that puts a very high-dimensional matrix in memory, which could cause inefficiencies, we go down this other route, which in certain cases is much more efficient. Of course, this only works if there is some kind of operation that takes us from the original inner products to the transformed inner products, and it's not at all clear yet that one exists, so let's do a little experiment to see if we can get there.

Let's start from what we want: the transformed inner product, T(x_i) · T(x_j). If we were to actually compute it, which we can here because there are only six components, we would take the six-dimensional vector for point i and take its inner product with the six-dimensional vector for point j, and ask: what sort of terms come out of that operation? You get six terms. It can be a little confusing to look at, I admit, with subscripts and superscripts floating around, but the main thing to keep in mind is that you get six distinct terms when you do the transformation explicitly.

So we ask: is there another way to get these same six terms, which give us the power of working in the high-dimensional space, without explicitly doing the transformation? Consider this magical function I've written up here, K_poly2(x_i, x_j). The first thing to note, even before I define it, is that it is purely a function of the inner product of the original data; it never uses the transformation T itself. That's the biggest thing to keep in mind as I talk here. The function is defined as (1 + x_i · x_j)^2. We can open that up just a little, because we know the form of x_i · x_j: it's x_i1 x_j1 + x_i2 x_j2, and we add one to that and square the whole thing. I won't actually do the algebra here; it gets a little messy, but not too bad. The question we want to ask again is: what are the unique terms that come out of this operation? You'll find that it's six terms. Now compare this list of six terms with the earlier list from the explicit transformation, and if you stare at it for a moment, or pause the video, you'll find that they are the same six terms (up to constant scaling factors, which don't change the story).
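Here's a small numeric check of that claim, assuming NumPy. One detail worth flagging: for the two sides to agree exactly as numbers, not just term by term up to constants, the feature map needs sqrt(2) scalings on the linear and cross terms, because (1 + x_i · x_j)^2 expands to 1 + 2 x_i1 x_j1 + 2 x_i2 x_j2 + 2 x_i1 x_i2 x_j1 x_j2 + x_i1^2 x_j1^2 + x_i2^2 x_j2^2.

```python
import numpy as np

def k_poly2(xi, xj):
    # The kernel function: only uses the inner product of the original vectors.
    return (1.0 + xi @ xj) ** 2

def T_scaled(x):
    # Degree-2 feature map with sqrt(2) scalings so that
    # T_scaled(xi) . T_scaled(xj) matches k_poly2(xi, xj) exactly.
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1**2, x2**2])

xi = np.array([2.0, 1.5])
xj = np.array([-1.8, 2.2])

lhs = k_poly2(xi, xj)                 # never leaves the original 2D space
rhs = T_scaled(xi) @ T_scaled(xj)     # explicitly visits the 6D space
print(lhs, rhs, np.isclose(lhs, rhs))   # the two values agree
```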
Now our gears start turning. Let's backtrack through the logic and think about what this means for our problem. We're able to get these six terms using an operation that only depends on the inner products of the original data, and those are, not at all coincidentally, exactly the terms we need to get the transformed inner products. That means we can go about it either way. The benefit of going about it the kernel way is that we never actually have to run the transformation T; we never have to send our data to the high-dimensional space, but we reap all of the benefits of sending it there.

Looking at our diagram again (and if it's still a little confusing, I don't blame you, because this took me years to grasp): we start from our original data and take the original data's inner products, so we have a bunch of terms that look like x_i · x_j. Then we run this new function, which I'll now call K, K for kernel; this is the first time I'm calling it a kernel function. We run the kernel function on the original inner products, and we get the same terms we would have gotten if we had explicitly computed the transformed inner products themselves. If it sounds a little crazy or magical, I agree, I still think it's crazy and magical, but hopefully you now see why it works and how it works. Why it's important is, again, that we avoid ever sending our data to the high-dimensional space, but we still get all of the benefits.

The final thing I'll say in this video is that this is called a kernel function. In general, a kernel function, in the context of SVM at least, is any function that uses only the inner products of the original data and maps them to the inner products of the transformed data, without ever visiting the transformed space itself. I hope this gave you a good idea of why kernels work, how they work, and why they're important in the context of SVM. Any comments are of course welcome below. If you liked this video, please like and subscribe for more just like it, and I will see you next time.
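If you want to use this kind of kernel in practice without writing the dual optimization yourself, here's a rough sketch with scikit-learn's SVC (my own usage example, not from the video). With gamma=1, coef0=1, and degree=2, its polynomial kernel reduces to the (1 + x_i · x_j)^2 form discussed above, and a very large C roughly approximates the hard-margin setting.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data from the motivating example: triangles (class 0) near the origin,
# x's (class 1) farther out; not separable by a straight line in 2D.
X = np.array([[0.3, 0.2], [-0.4, 0.5], [0.1, -0.3],
              [2.0, 1.5], [-1.8, 2.2], [1.6, -2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# kernel='poly', degree=2, gamma=1, coef0=1 gives (1 + x_i . x_j)^2;
# a large C approximates hard-margin SVM.
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=1e6)
clf.fit(X, y)

# For this toy data we expect the first point to land in class 0
# and the second in class 1.
print(clf.predict([[0.0, 0.1], [2.5, -2.5]]))
```

The point of using the kernel here is that the six-dimensional features are never explicitly built; the solver only evaluates the kernel on pairs of original points.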
Info
Channel: ritvikmath
Views: 9,810
Id: OKFMZQyDROI
Length: 12min 1sec (721 seconds)
Published: Wed Feb 24 2021