Supervised Learning and Support Vector Machines

Captions
We're going to move on now to talk about supervised learning again, and in the next two lectures I want to introduce two of the most important workhorses of industry, or at least they were the workhorses of industry until deep neural nets came along: the support vector machine and classification trees. Up until deep neural networks arrived, these were in some sense the two dominant paradigms for taking large-scale data and getting the very best classification results possible; they are great supervised algorithms. I want to walk through some of their mathematical detail, which is of course based on optimization procedures. Most of the details here are from the book Data-Driven Science and Engineering at databookuw.com.

The support vector machine is one of the really important canonical supervised learning techniques, so I want to explain what the terminology means, talk about the optimization procedure, and then show you how you can execute this in MATLAB very easily; there is a very simple equivalent in scikit-learn in Python.

The idea is the following. I have data, which is what we've been playing around with: the magenta and the green balls, which typically stand in for something like dogs versus cats. The goal is to find a hyperplane, or in this two-dimensional case just a line, that separates the green balls from the magenta balls. What's interesting about the support vector machine is that the vector w, which characterizes this line, is essentially the support vector, and in fact I want not only to find this line but also to look at the margin around it: I want to penalize points so that the balls stay out of that margin, and my goal is to maximize the margin. In other words, given this data I'd like to figure out how to construct the line so that the margins are as big as possible. Here is a good representation of the SVM and here is a poor one. Notice that in training both of them give perfect separation of the data: the data is perfectly separated by this line as well as by that line. But compare the margin here with the margin there. My goal, then, is to find a line, a w vector and a b, that lets me optimize that margin.

That is the goal, so we're going to try to train this and figure out how to set it up as an optimization routine. Notice, by the way, that in this picture all the balls lie to one side of the margin or the other, but in practice we've already seen data where the magenta and green balls overlap onto each other's sides. What you do is penalize every single ball that ends up across the line: obviously we'd like no green balls on this side and no magenta balls on that side, but the penalization you impose in the optimization handles the overlap and does the best job it can at separating them. So we're going to start with the idea of a linear support vector machine, which is exactly what I just showed you: I'm going to try to find the hyperplane separating the two parts of the data that gives me the maximal cushion on each side, so that I really maximize that margin.
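As a concrete companion to this picture, here is a minimal scikit-learn sketch, not from the lecture, with made-up two-cluster data, that fits a linear SVM and reads off the separating line w, b and the margin width 2/||w||:

import numpy as np
from sklearn.svm import SVC

# Two synthetic, well-separated clusters standing in for "dogs" vs "cats"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, (50, 2)),   # magenta balls
               rng.normal([ 2,  2], 0.5, (50, 2))])  # green balls
y = np.hstack([-np.ones(50), np.ones(50)])           # labels are -1 / +1

# Linear SVM; a large C approximates the hard-margin problem
clf = SVC(kernel='linear', C=1e3).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("separating line: w =", w, ", b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("predicted labels = sign(w.x + b):", np.sign(X @ w + b)[:5])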
With this linear hyperplane separation, the important thing is that once I define the plane, my labels y are a function of w·x + b: in fact they are just the sign of w·x + b, so plus one or minus one. If a point is on the hyperplane this quantity is zero; on one side it is positive and on the other side it is negative, and I simply take those positives and negatives and map them to +1 or -1, in other words to the labels of the dogs and the cats. So the output that comes from knowing the hyperplane is an integer, plus one or minus one, or however you choose to frame your classification.

This lets us formulate the loss function we need for the optimization procedure. The loss function compares y_j with ȳ_j: when I train, y_j is the truth, since the training data comes with labels, so I know what the y_j's are supposed to be, and ȳ_j is my approximation to the label given the hyperplane, i.e. which side of the hyperplane the point falls on. So ȳ_j is what I think the label is and y_j is what it actually is, and the loss is 0 if I labeled the point correctly and +1 if I labeled it incorrectly. Every point you get wrong costs a penalty of +1, so the more you get wrong, the higher your loss, and we want to minimize that loss. The linear SVM optimization you can write down is: minimize this loss function plus the norm of the w vector, so you also keep w as small as possible, subject to a constraint normalizing the x_j · w so that the data points nearest the plane sit at length one. That is the basic formulation of the support vector machine as a constrained optimization. One difficulty, especially with high-dimensional data, is that you would typically want to solve a problem like this with gradient descent, but this loss function is discrete and non-smooth, since each point contributes either a zero or a one, and it is very hard to compute gradients of non-smooth functions. So in practice this is often not how the optimization is solved; instead one introduces what is called a hinge loss. The hinge loss smooths the loss function: instead of taking discrete jumps between 0 and 1, you get a smooth surrogate that is easy to differentiate, which is exactly what gradient descent needs. That is the SVM architecture for optimizing to find the hyperplane that best separates the data.
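Written out explicitly, in the standard notation (the lecture's slides may use slightly different symbols), the label, the 0-1 loss, and the hinge-loss optimization are:

\[
\bar{y}_j = \operatorname{sign}\big(\mathbf{w}\cdot\mathbf{x}_j + b\big),
\qquad
\ell(y_j,\bar{y}_j) =
\begin{cases}
0, & y_j = \bar{y}_j \\
1, & y_j \neq \bar{y}_j
\end{cases}
\]

\[
\min_{\mathbf{w},\,b}\;\sum_j \max\big(0,\; 1 - y_j\,(\mathbf{w}\cdot\mathbf{x}_j + b)\big) \;+\; \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^2,
\]

where the max term is the hinge loss that replaces the discrete 0-1 count with something gradient descent can handle, and λ weights the margin term against the misclassification penalty.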
But that is not what makes the SVM special. What makes the SVM special is something people realized early on: I don't have to work with just the data x, I can work with functions of x. I can lift the data into a higher-dimensional representation space through a feature map φ, which frees me from living in the original linear space: I can lift into higher-dimensional spaces built out of other versions of the variables. I'm going to show you concrete examples of this in a minute, but the point is that if I can move into an alternative representation that is much richer and much higher dimensional, in other words project into a higher-dimensional space, then I do my linear hyperplane separation in those new variables. So the goal of the SVM is to take your data into a higher-dimensional representation space and then draw hyperplanes separating the data there.

Let me show you one example that will hopefully make this concrete. I have data x1, x2, and from it I move into a new space z1, z2, z3, so I go from two dimensions to three, and the third coordinate I define as z3 = x1² + x2². So I went from two-dimensional data to three-dimensional data; that is exactly the lifting process, and I can define projections into all kinds of directions this way. This is exactly what I want to do, because most data is not easily separated with a line; however, if I go to higher-dimensional spaces, you'll see in a moment that all of a sudden lines, or rather hyperplanes, separate the data amazingly well. So the goal is to project into higher-dimensional spaces where hyperplanes can very easily separate your data and give you good margins between the different clusters.

What does going from x to z buy you? Let me give you a really challenging data set that becomes trivial with support vector machines. In the original x1–x2 plane I have the magenta balls out here on a ring and the green balls on the interior. The optimal separation between them would be a circle, and of course linear hyperplanes are lines, not circles, so there is no easy way to separate this data with a line. However, if I move to the new variables z1, z2, z3, where the z3 direction is x1² + x2², the magenta balls now sit above the green balls, and I can very easily draw a hyperplane separating the magenta from the green. This is the canonical picture of what going to a higher dimension buys you in discrimination and classification tasks, and it is why the SVM became so popular: data lifted into high dimensions can be separated very effectively with linear discriminators.
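Here is a small numerical sketch of exactly that lift, again my own illustration rather than the lecture's code, with a made-up ring-versus-cluster data set:

import numpy as np
from sklearn.svm import SVC

# Ring of magenta balls around a cluster of green balls: not linearly separable in (x1, x2)
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 100)
outer = np.c_[3 * np.cos(theta), 3 * np.sin(theta)] + rng.normal(0, 0.2, (100, 2))
inner = rng.normal(0, 0.7, (100, 2))
X = np.vstack([outer, inner])
y = np.hstack([-np.ones(100), np.ones(100)])

# Lift to (z1, z2, z3) = (x1, x2, x1^2 + x2^2)
Z = np.c_[X, (X**2).sum(axis=1)]

print(SVC(kernel='linear').fit(X, y).score(X, y))  # poor: no separating line exists in 2-D
print(SVC(kernel='linear').fit(Z, y).score(Z, y))  # essentially 1.0: a plane separates the lifted data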
The cost of this is that if you have high-dimensional data to begin with, you go to an even higher-dimensional representation, so the SVM can be very expensive in many applications: you take an already high-dimensional state space and keep pushing it higher by making up these additional variable directions. That is an important caveat, but you can see how powerful the idea is when you can afford it.

There is one other interesting development in the SVM story. By committing to very high-dimensional spaces you end up with a very large optimization problem, which can in fact be intractable. However, people realized early on what are now called kernel methods for support vector machines. The idea is that if I project onto these feature functionals, a generic representation of the lifting, then I can expand w in those spaces, and my decision function becomes a sum of α_j times φ(x_j) dotted with φ(x), plus b. So what I have to compute when I go to very high dimensions is the dot product between all of these high-dimensional representations of the data. What kernel methods say is: since all I ever need is that dot product, why not define it directly as a kernel function? The thing that is killing me computationally is projecting everything into a very high-dimensional space and taking dot products there, so instead I represent that kernel in a much more compact way. By the way, the optimization itself doesn't change much, it is still a hinge loss, now written with these functionals, but it is being carried out on a much bigger scale because of all the functions you are projecting into. With the kernel approach you assume the kernel takes a simple form: for instance a radial basis function kernel, which is just a Gaussian exponential between data points, or a polynomial kernel. This is a very compact representation of something that would normally expand into a very large space: much like a Taylor series, which can often be represented very compactly by exponentials, sines, and cosines, the kernel stands in for the whole expansion, and that compact representation gives you a tractable way to compute what you need. These kernels become really important; this is called the kernel trick, because you get around a very expensive optimization by choosing a kernel that effectively projects you into an exceptionally high-dimensional space while remaining cheap to evaluate on the data. That was an important step forward for SVM optimization, because the SVM succeeds precisely when you project data to high dimensions, as the example showed, and draw hyperplanes between the classes; you just need a tractable way to do that at scale, and the kernel method provides it. There are many more details and innovations in the optimization procedures; this is just to highlight the key themes of the SVM. To bring it back to two key points: one, project into higher-dimensional spaces, in other words make up a functional representation in which linear discriminators work; and two, use something like the kernel trick to make that a tractable process.
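The kernel trick can be checked numerically. For a degree-2 polynomial kernel, k(x, x') = (x · x')², the kernel value equals the inner product of the explicit lifted features φ(x) = (x1², √2·x1·x2, x2²), so you never have to form φ at all; the RBF kernel plays the same game with an implicitly infinite-dimensional feature map. A tiny sketch of the equivalence, my own illustration:

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_poly(x, xp):
    # Degree-2 polynomial kernel (no constant term): k(x, x') = (x . x')^2
    return (x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))   # 1.0 -- inner product computed in the lifted space
print(k_poly(x, xp))      # 1.0 -- same number, without ever forming the lift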
Now let me show you how to run an SVM inside MATLAB. Here's the code. All I'm doing up top is loading the cat–dog data we've been playing around with for a long time; in the last lecture I took the dog–cat data and showed how to classify it with linear discriminants, but now I'm going to do it with an SVM. So load the cat and dog data, stack it up, pull out the features and the labels, and run an SVM on it. This should be fairly standard by now: fitcsvm takes the training data and labels and gives you back a model object, Mdl. So I fit the model on the training data, which is the actual data plus labels, and if I want to see how well it works I use the predict command: predict on the model with the test data, and it gives me the test labels. That's it. If you want to use one of the kernel tricks, fitcsvm has an option, 'KernelFunction','RBF', the radial basis function, and you can pick different kernel functions as you wish. So this is a very powerful algorithm available to you, with very quick access to kernel methods, and you have the ability to very simply train one of these high-end methods. Again, what it is doing is projecting to higher-dimensional spaces and building linear classifiers, linear hyperplanes, up there to separate the data; when you give it a test data set it also projects that into the higher-dimensional space and gives you back your results. You can even get very nice diagnostics out of this: crossval(Mdl) cross-validates the model and gives you cross-validation statistics, and kfoldLoss gives you the classification loss. Let's just run this piece of code, and there it is: the class loss here is about 5% (I think that number was carried over from a previous run, but that is what comes out of the classification loss). So there are lots of good features and diagnostics in the code, it is very simple to execute, and now you have the power of an SVM engine at your fingertips. Python has something similar, and you should keep in mind that if you're going to use supervised algorithms that are not deep neural nets, SVM and classification trees are two of the most powerful methods available, and they are very simple to use and implement. Again, everything can be found at databookuw.com: the code base, examples, more lectures on clustering and classification with different techniques, all the code in Python and MATLAB plus the data, and all the notes are there in the databook PDF.
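Since the lecture notes that Python has an equivalent, here is a rough scikit-learn analogue of that MATLAB workflow (fitcsvm → SVC, predict → predict, crossval/kfoldLoss → cross_val_score); the feature matrices below are random placeholders standing in for the dog/cat features, so the printed numbers are meaningless, only the API pattern matters:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Random placeholders standing in for the dog/cat training and test features
rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(160, 20)), np.hstack([np.zeros(80), np.ones(80)])
X_test = rng.normal(size=(40, 20))

mdl = SVC(kernel='rbf').fit(X_train, y_train)   # analogue of fitcsvm(..., 'KernelFunction', 'RBF')
test_labels = mdl.predict(X_test)               # analogue of predict(Mdl, test)

# Cross-validation: analogue of crossval(Mdl) followed by kfoldLoss
acc = cross_val_score(SVC(kernel='rbf'), X_train, y_train, cv=10)
print("cross-validated class loss:", 1 - acc.mean())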
Info
Channel: Nathan Kutz
Views: 2,356
Keywords: supervised learning, machine learning, support vector machine, SVM, hyperplanes, kernel methods, kernel trick, kutz, brunton
Id: NOKOJWQ2_iE
Length: 19min 53sec (1193 seconds)
Published: Fri Apr 24 2020